* Caching/buffers become useless after some time
@ 2018-07-11 13:18 Marinko Catovic
  2018-07-12 11:34 ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread

From: Marinko Catovic @ 2018-07-11 13:18 UTC (permalink / raw)
To: linux-mm

hello guys

I tried a few IRC channels; people told me to ask here, so I'll give it a try.

I have a very weird issue with mm on several hosts. The systems are used for shared hosting, so there are lots of users with lots of files: maybe 5TB of files per host, several million at least. There is lots of I/O, which is normally handled perfectly fine by buffers/cache.

The kernel version is the latest stable, 4.17.4. I had 3.x before and did not notice any issues until now; the same goes for 4.16, which was in use before.

The hosts each have 64G of RAM and operate with SSD+HDD. The HDDs are the issue here, since those 5TB of data are stored on them, so that is where the high I/O goes. Running applications need about 15GB, so say 40GB of RAM are left for buffers/caching.

Usually this works perfectly fine. The buffers take about 1-3G of RAM and the cache the rest, say 35GB as an example. But every now and then, maybe every 2 days, both drop to really low values, say 100MB of buffers and 3GB of cache, and the rest of the RAM is not in use, so there are about 35GB+ of totally free RAM.

The performance of the host then goes down significantly, and at some point it becomes unusable, since it behaves as if the buffers/cache were totally useless. After lots and lots of playing around I noticed that shutting down all services that access the HDDs and restarting them does *not* make any difference. But what did make a difference was stopping the services, unmounting the fs, mounting it again and starting the services again. Then the buffers+cache built up to 5GB/35GB as usual after a while and everything was perfectly fine again!
I noticed that when umount is called, the caches are dropped. So I gave it a try:

sync; echo 2 > /proc/sys/vm/drop_caches

has exactly the same effect. Note that echo 1 > .. does not.

So when that low usage like 100MB/3GB occurs, I have to drop the caches by echoing 2 to drop_caches. The 3GB then become even lower, which is expected, but at least the buffers/cache then build up again to ordinary values and the usual performance is restored after a few minutes. I have never seen this before; it started after I switched the systems to newer ones. The old ones ran kernel 3.x, where this behavior was never observed.

Do you have *any idea* at all what could be causing this? This issue has been bugging me for over a month and seriously disturbs everything I'm doing: a lot of people access that data, and all of them start to complain at the point where I see that the caches have become useless, so I have to drop them to rebuild again.

Some guys on IRC suggested that this could be a fragmentation problem, or something about slab shrinking. The problem is that I cannot reproduce this; I have to wait a while, maybe 2 days, until the buffers/caches are fully in use and at some point decrease within a few hours to those useless values. Sadly this is a production system and I cannot play around much, since dropping caches already causes downtime (repopulating the caches takes maybe 5-10 minutes until performance is ok again).

Please tell me whatever info you need me to pastebin, and when (before/after what event). Any hints are appreciated a lot; this really gives me lots of headache, since I am really busy with other things. Thank you very much!

Marinko
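The numbers discussed in this report (buffers, cache, free RAM) come straight from /proc/meminfo, the same source that top and free read. A minimal sketch for logging them, assuming a standard Linux /proc layout:

```shell
# Print free memory, buffers, and page cache -- the counters quoted in
# this thread -- from /proc/meminfo (values are in kB).
awk '/^MemFree:/ {free=$2}
     /^Buffers:/ {buf=$2}
     /^Cached:/  {cached=$2}
     END {printf "free=%d kB buffers=%d kB cached=%d kB\n", free, buf, cached}' /proc/meminfo
```

Running this before and after echoing 2 to drop_caches shows the drop and the subsequent rebuild described above.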
* Re: Caching/buffers become useless after some time
  2018-07-11 13:18 Caching/buffers become useless after some time Marinko Catovic
@ 2018-07-12 11:34 ` Michal Hocko
  2018-07-13 15:48   ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread

From: Michal Hocko @ 2018-07-12 11:34 UTC (permalink / raw)
To: Marinko Catovic; +Cc: linux-mm

On Wed 11-07-18 15:18:30, Marinko Catovic wrote:
> [...]
> Some guys in IRC suggested that his could be a fragmentation problem or
> something, or about slab shrinking.

Well, the page cache shouldn't really care about fragmentation because
single pages are used. Btw. what is the filesystem that you are using?

> The problem is that I can not reproduce this, I have to wait a while, maybe
> 2 days to observe that, until that the buffers/caches are fully in use and
> at some point they decrease within a few hours to those useless values.
> Sadly this is a production system and I can not play that much around,
> already causing downtime when dropping caches (populating caches needs
> maybe 5-10 minutes until the performance is ok again).

This doesn't really ring bells for me.

> Please tell me whatever info you need me to pastebin and when (before/after
> what event).
> Any hints are appreciated a lot, it really gives me lots of headache, since
> I am really busy with other things. Thank you very much!

Could you collect /proc/vmstat every few seconds over that time period?
Maybe it will tell us more.
--
Michal Hocko
SUSE Labs
* Re: Caching/buffers become useless after some time
  2018-07-12 11:34 ` Michal Hocko
@ 2018-07-13 15:48   ` Marinko Catovic
  2018-07-16 15:53     ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread

From: Marinko Catovic @ 2018-07-13 15:48 UTC (permalink / raw)
To: linux-mm

hello Michal

well, these hints were just ideas mentioned by some people. It took me weeks just to figure out that 2>drop_caches helps, and I still don't know why this happens.

Right now I am observing ~18GB of unused RAM, since yesterday, so this is not always about 100MB/3.5GB; right now it may be in the process of shrinking. I really cannot tell for sure, this is so nondeterministic. I just wish I could reproduce it for better testing.

Right now top shows:

KiB Mem : 65892044 total, 18169232 free, 11879604 used, 35843208 buff/cache

where 1GB goes to buffers and the rest to cache. The host *is* busy, and buff/cache had consumed all RAM by yesterday; I had done 2>drop_caches about one day before.

Another host (still) shows full usage. That other one is 1:1 the same in software and config, but with different data/users; the use-cases and load are pretty much similar.

Affected host at this time:
https://pastebin.com/fRQMPuwb
https://pastebin.com/tagXJRi1 .. 3 minutes later
https://pastebin.com/8YNFfKXf .. 3 minutes later
https://pastebin.com/UEq7NKR4 .. 3 minutes later

To compare, this is the other host, which is still showing full buffers/cache usage by now:
https://pastebin.com/Jraux2gy

Usually both show this more or less at the same time; sometimes it is the one, sometimes the other. My other hosts are currently not under similarly high load, which makes it even harder to compare. However, right now I cannot observe the drop towards really low values, but I am sure it will come.

fs is ext4; mount options are auto,rw,data=writeback,noatime,nodiratime,nodev,nosuid,async
Previous mount options with the same behavior also had max_dir_size_kb, quotas and the default for data=, so I have also played around with these, but that made no difference.

---------

follow up (sorry, I messed up the reply-to for this mailing list):

https://pastebin.com/0v4ZFNCv .. one hour later, right after my last report, 22GB free
https://pastebin.com/rReWnHtE .. one day later, 28GB free

It is interesting to see, however, that this did not get as low as mentioned before. So I am not sure where this is going right now, but nevertheless the RAM is not fully occupied; there should be no reason to allow 28GB to be free at all. There is still lots of I/O, and I am 100% positive that if I did echo 2 > drop_caches, this would fill up the entire RAM again.

What I can see is that the buffers are around 500-700MB; the values increase and decrease all the time, really "oscillating" around 600. Afaik this should get as high as possible as long as there is free RAM; the other host, which is still healthy, has about 2GB/48GB, fully occupying RAM.

Currently I have set vm.dirty_ratio = 15, vm.dirty_background_ratio = 3, vm.vfs_cache_pressure = 1, and the low usage occurred 3 days before. Other values, like the defaults or when I was playing around with vm.dirty_ratio = 90, vm.dirty_background_ratio = 80 and whatever cache_pressure, showed similar results.

2018-07-12 13:34 GMT+02:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 11-07-18 15:18:30, Marinko Catovic wrote:
> [full quote of the previous exchange snipped]
* Re: Caching/buffers become useless after some time
  2018-07-13 15:48 ` Marinko Catovic
@ 2018-07-16 15:53   ` Marinko Catovic
  2018-07-16 16:23     ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread

From: Marinko Catovic @ 2018-07-16 15:53 UTC (permalink / raw)
To: linux-mm

I can provide further data now, monitoring vmstat:

https://pastebin.com/j0dMGBe4 .. 1 day later, 600MB/13GB in use, 35GB free
https://pastebin.com/N011kYyd .. 1 day later, 300MB/10GB in use, 40GB free, performance becomes even worse

The issue has come up again; I would have to drop caches by now to restore normal usage for another day or two.

Afaik there should be no reason at all not to have the buffers/cache fill up the entire memory, isn't that true? To my knowledge there is almost no O_DIRECT involved. Also, as mentioned before: after dropping caches, the buffers/cache usage eats up all RAM within the hour, as usual, for 1-2 days, until it starts to go crazy again. As mentioned, the usage oscillates up and down instead of growing until all RAM is consumed.

Please tell me if there is anything else I can do to help investigate this.

2018-07-13 17:48 GMT+02:00 Marinko Catovic <marinko.catovic@gmail.com>:
> [full quote of the previous message snipped]
* Re: Caching/buffers become useless after some time
  2018-07-16 15:53 ` Marinko Catovic
@ 2018-07-16 16:23   ` Michal Hocko
  2018-07-16 16:33     ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread

From: Michal Hocko @ 2018-07-16 16:23 UTC (permalink / raw)
To: Marinko Catovic; +Cc: linux-mm

On Mon 16-07-18 17:53:42, Marinko Catovic wrote:
> [...]
> Please tell me if there is anything else I can do to help investigate this.

Do you have the periodic /proc/vmstat snapshots I asked for before?
--
Michal Hocko
SUSE Labs
* Re: Caching/buffers become useless after some time
  2018-07-16 16:23 ` Michal Hocko
@ 2018-07-16 16:33   ` Marinko Catovic
  2018-07-16 16:45     ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread

From: Marinko Catovic @ 2018-07-16 16:33 UTC (permalink / raw)
To: linux-mm

How periodically do you want them? I assumed snapshots every few hours or days would be sufficient. Any particular command, with or without grep perhaps?

I just had to drop caches, right before your response; the performance was simply too bad. For your information, this is how it looked right after dropping and 0+5+25 minutes later:

https://pastebin.com/LcjKgQkg .. right after sync; echo 2 > /proc/sys/vm/drop_caches
https://pastebin.com/ZCeFCKrb .. 5 minutes later, when performance starts to get better again
https://pastebin.com/8hij8Lid .. 20 minutes after that; you can expect this to consume all the available RAM within 1-2 hours

2018-07-16 18:23 GMT+02:00 Michal Hocko <mhocko@kernel.org>:
> [full quote of the previous message snipped]
* Re: Caching/buffers become useless after some time
  2018-07-16 16:33 ` Marinko Catovic
@ 2018-07-16 16:45   ` Michal Hocko
  2018-07-20 22:03     ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread

From: Michal Hocko @ 2018-07-16 16:45 UTC (permalink / raw)
To: Marinko Catovic; +Cc: linux-mm

On Mon 16-07-18 18:33:57, Marinko Catovic wrote:
> how periodically do you want them? I assumed this some-hours and days
> snapshots would be sufficient.

Every 10s should be reasonable even for a long term monitoring.

> any particular command with or without grep perhaps?

while true
do
	cp /proc/vmstat vmstat.$(date +%s)
	sleep 10s
done
--
Michal Hocko
SUSE Labs
* Re: Caching/buffers become useless after some time
  2018-07-16 16:45 ` Michal Hocko
@ 2018-07-20 22:03   ` Marinko Catovic
  2018-07-27 11:15     ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread

From: Marinko Catovic @ 2018-07-20 22:03 UTC (permalink / raw)
To: linux-mm

I let this run for 3 days now, so it is quite a lot; there you go:
https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz

There is one thing I forgot to mention: the hosts run find and du (I mean the commands, for finding files and measuring disk usage) on the HDDs every night, from 00:20 AM until 07:45 AM, for maintenance and stats. During this period the buffers/caches rise again, as you may see from the logs, so find/du do fill them. Nevertheless, as the day passes, both decrease again until the low values are reached. I disabled find/du for the night of 19->20 July to compare.

I have to say that the really low usage (300MB/xGB) has occurred just once since I upgraded from 4.16 to 4.17, not sure why; one can still see from the logs that the buffers/cache are not using up the entire available RAM. That low usage last occurred on the one host when, as mentioned in my previous message, I had to 2>drop_caches again, so this is still an issue even on the latest kernel.

The other host (the one the vmstat logs were not taken on) currently has 600MB/14GB, with 34GB of free RAM. Both were reset with drop_caches at the same time. From the looks of this, the really low usage will occur again fairly soon; it just did not come up during the measurement. However, the RAM should be full anyway, true?

2018-07-16 18:45 GMT+02:00 Michal Hocko <mhocko@kernel.org>:
> [full quote of the previous message snipped]
* Re: Caching/buffers become useless after some time 2018-07-20 22:03 ` Marinko Catovic @ 2018-07-27 11:15 ` Vlastimil Babka 2018-07-30 14:40 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-07-27 11:15 UTC (permalink / raw) To: Marinko Catovic, linux-mm On 07/21/2018 12:03 AM, Marinko Catovic wrote: > I let this run for 3 days now, so it is quite a lot, there you go: > https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz The stats show that compaction has very bad results. Between first and last snapshot, compact_fail grew by 80k and compact_success by 1300. High-order allocations will thus cycle between (failing) compaction and reclaim that removes the buffer/caches from memory. Since dropping slab caches helps, I suspect it's either the slab pages (which cannot be migrated for compaction) being spread over all memory, making it impossible to assemble high-order pages, or some slab objects are pinning file pages making them also impossible to be migrated. > There is one thing I forgot to mention: the hosts perform find and du (I > mean the commands, finding files and disk usage) > on the HDDs every night, starting from 00:20 AM up until in the morning > 07:45 AM, for maintenance and stats. > > During this period the buffers/caches raise again as you may see from > the logs, so find/du do fill them. > Nevertheless as the day passes both decrease again until low values are > reached. > I disabled find/du for the night on 19->20th July to compare. > > I have to say that this really low usage (300MB/xGB) occured just once > after I upgraded from 4.16 to 4.17, not sure > why, where one can still see from the logs that the buffers/cache is not > using up the entire available RAM. > > This low usage occured the last time on that one host when I mentioned > that I had to 2>drop_caches again in my > previous message, so this is still an issue even on the latest kernel. 
> > The other host (the one that was not measured with the vmstat logs) has > currently 600MB/14GB, 34GB of free RAM. > Both were reset with drop_caches at the same time. From the looks of > this the really low usage will occur again > somewhat shortly, it just did not come up during measurement. However, > the RAM should be full anyway, true? Can you provide (a single snapshot) /proc/pagetypeinfo and /proc/slabinfo from a system that's currently experiencing the issue, also with /proc/vmstat and /proc/zoneinfo to verify? Thanks. ^ permalink raw reply [flat|nested] 66+ messages in thread
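The "very bad results" above can be quantified with a quick back-of-envelope check. This is a sketch using the approximate deltas quoted from the vmstat snapshots (80k failures vs. 1300 successes), not exact counter values; integer per-mille arithmetic is used to stay within plain sh:

```shell
# Direct compaction success rate from the approximate deltas quoted above.
compact_fail_delta=80000
compact_success_delta=1300

attempts=$((compact_fail_delta + compact_success_delta))
# integer per-mille to avoid floating point in sh
permille=$((compact_success_delta * 1000 / attempts))
echo "${permille} per mille"   # -> 15 per mille, i.e. ~1.6% of compaction attempts succeed
```

With roughly 98% of attempts failing, each failed compaction cycle falls back to reclaim, which is consistent with the cache shrinkage described in the thread.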
* Re: Caching/buffers become useless after some time 2018-07-27 11:15 ` Vlastimil Babka @ 2018-07-30 14:40 ` Michal Hocko 2018-07-30 22:08 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-07-30 14:40 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Marinko Catovic, linux-mm On Fri 27-07-18 13:15:33, Vlastimil Babka wrote: > On 07/21/2018 12:03 AM, Marinko Catovic wrote: > > I let this run for 3 days now, so it is quite a lot, there you go: > > https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz > > The stats show that compaction has very bad results. Between first and > last snapshot, compact_fail grew by 80k and compact_success by 1300. > High-order allocations will thus cycle between (failing) compaction and > reclaim that removes the buffer/caches from memory. I guess you are right. I've just looked at random large direct reclaim activity $ grep -w pgscan_direct vmstat*| awk '{diff=$2-old; if (old && diff > 100000) printf "%s %d\n", $1, diff; old=$2}' vmstat.1531957422:pgscan_direct 114334 vmstat.1532047588:pgscan_direct 111796 $ paste-with-diff.sh vmstat.1532047578 vmstat.1532047588 | grep "pgscan\|pgsteal\|compact\|pgalloc" | sort # counter value1 value2-value1 compact_daemon_free_scanned 2628160139 0 compact_daemon_migrate_scanned 797948703 0 compact_daemon_wake 23634 0 compact_fail 124806 108 compact_free_scanned 226181616304 295560271 compact_isolated 2881602028 480577 compact_migrate_scanned 147900786550 27834455 compact_stall 146749 108 compact_success 21943 0 pgalloc_dma 0 0 pgalloc_dma32 1577060946 10752 pgalloc_movable 0 0 pgalloc_normal 29389246430 343249 pgscan_direct 737335028 111796 pgscan_direct_throttle 0 0 pgscan_kswapd 1177909394 0 pgsteal_direct 704542843 111784 pgsteal_kswapd 898170720 0 There is zero kswapd activity so this must have been higher order allocation activity and all the direct compaction failed so we keep reclaiming. 
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
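The `paste-with-diff.sh` helper used above is Michal's own script and is not shown in the thread. A plausible minimal reconstruction, assuming two `/proc/vmstat`-style snapshots (`name value` per line) collected as in the earlier `cp /proc/vmstat vmstat.$(date +%s)` loop; it relies on both snapshots listing counters in the same order, which holds for /proc/vmstat:

```shell
# Hypothetical paste-with-diff: print "counter value1 value2-value1"
# for two /proc/vmstat snapshots with identical counter ordering.
paste_with_diff() {
    # paste puts matching lines side by side; awk prints name, first
    # value, and the delta between the two snapshots
    paste "$1" "$2" | awk '{ printf "%s %s %d\n", $1, $2, $4 - $2 }'
}
```

Usage then mirrors the thread: `paste_with_diff vmstat.A vmstat.B | grep "pgscan\|pgsteal\|compact\|pgalloc" | sort`.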
* Re: Caching/buffers become useless after some time 2018-07-30 14:40 ` Michal Hocko @ 2018-07-30 22:08 ` Marinko Catovic 2018-08-02 16:15 ` Vlastimil Babka 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-07-30 22:08 UTC (permalink / raw) To: linux-mm [-- Attachment #1: Type: text/plain, Size: 2926 bytes --] > Can you provide (a single snapshot) /proc/pagetypeinfo and > /proc/slabinfo from a system that's currently experiencing the issue, > also with /proc/vmstat and /proc/zoneinfo to verify? Thanks. your request came in just one day after I 2>drop_caches again when the ram usage was really really low again. Up until now it did not reoccur on any of the 2 hosts, where one shows 550MB/11G with 37G of totally free ram for now - so not that low like last time when I dropped it, I think it was like 300M/8G or so, but I hope it helps: /proc/pagetypeinfo https://pastebin.com/6QWEZagL /proc/slabinfo https://pastebin.com/81QAFgke /proc/vmstat https://pastebin.com/S7mrQx1s /proc/zoneinfo https://pastebin.com/csGeqNyX also please note - whether this makes any difference: there is no swap file/partition I am using this without swap space. imho this should not be necessary since applications running on the hosts would not consume more than 20GB, the rest should be used by buffers/cache. 2018-07-30 16:40 GMT+02:00 Michal Hocko <mhocko@suse.com>: > On Fri 27-07-18 13:15:33, Vlastimil Babka wrote: > > On 07/21/2018 12:03 AM, Marinko Catovic wrote: > > > I let this run for 3 days now, so it is quite a lot, there you go: > > > https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz > > > > The stats show that compaction has very bad results. Between first and > > last snapshot, compact_fail grew by 80k and compact_success by 1300. > > High-order allocations will thus cycle between (failing) compaction and > > reclaim that removes the buffer/caches from memory. > > I guess you are right. 
I've just looked at random large direct reclaim > activity > $ grep -w pgscan_direct vmstat*| awk '{diff=$2-old; if (old && diff > > 100000) printf "%s %d\n", $1, diff; old=$2}' > vmstat.1531957422:pgscan_direct 114334 > vmstat.1532047588:pgscan_direct 111796 > > $ paste-with-diff.sh vmstat.1532047578 vmstat.1532047588 | grep > "pgscan\|pgsteal\|compact\|pgalloc" | sort > # counter value1 value2-value1 > compact_daemon_free_scanned 2628160139 0 > compact_daemon_migrate_scanned 797948703 0 > compact_daemon_wake 23634 0 > compact_fail 124806 108 > compact_free_scanned 226181616304 295560271 > compact_isolated 2881602028 480577 > compact_migrate_scanned 147900786550 27834455 > compact_stall 146749 108 > compact_success 21943 0 > pgalloc_dma 0 0 > pgalloc_dma32 1577060946 10752 > pgalloc_movable 0 0 > pgalloc_normal 29389246430 343249 > pgscan_direct 737335028 111796 > pgscan_direct_throttle 0 0 > pgscan_kswapd 1177909394 0 > pgsteal_direct 704542843 111784 > pgsteal_kswapd 898170720 0 > > There is zero kswapd activity so this must have been higher order > allocation activity and all the direct compaction failed so we keep > reclaiming. > -- > Michal Hocko > SUSE Labs > [-- Attachment #2: Type: text/html, Size: 4218 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-07-30 22:08 ` Marinko Catovic @ 2018-08-02 16:15 ` Vlastimil Babka 2018-08-03 14:13 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-08-02 16:15 UTC (permalink / raw) To: Marinko Catovic, linux-mm, Michal Hocko On 07/31/2018 12:08 AM, Marinko Catovic wrote: > >> Can you provide (a single snapshot) /proc/pagetypeinfo and >> /proc/slabinfo from a system that's currently experiencing the issue, >> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks. > > your request came in just one day after I 2>drop_caches again when the > ram usage > was really really low again. Up until now it did not reoccur on any of > the 2 hosts, > where one shows 550MB/11G with 37G of totally free ram for now - so not > that low > like last time when I dropped it, I think it was like 300M/8G or so, but > I hope it helps: Thanks. > /proc/pagetypeinfo https://pastebin.com/6QWEZagL Yep, looks like fragmented by reclaimable slabs: Node 0, zone Normal, type Unmovable 29101 32754 8372 2790 1334 354 23 3 4 0 0 Node 0, zone Normal, type Movable 142449 83386 99426 69177 36761 12931 1378 24 0 0 0 Node 0, zone Normal, type Reclaimable 467195 530638 355045 192638 80358 15627 2029 231 18 0 0 Number of blocks type Unmovable Movable Reclaimable HighAtomic Isolate Node 0, zone DMA 1 7 0 0 0 Node 0, zone DMA32 34 703 375 0 0 Node 0, zone Normal 1672 14276 15659 1 0 Half of the memory is marked as reclaimable (2 megabyte) pageblocks. zoneinfo has nr_slab_reclaimable 1679817 so the reclaimable slabs occupy only 3280 (6G) pageblocks, yet they are spread over 5 times as much. It's also possible they pollute the Movable pageblocks as well, but the stats can't tell us.
Either the page grouping mobility heuristics are broken here, or the worst case scenario happened - memory was at some point really wholly filled with reclaimable slabs, and the rather random reclaim did not result in whole pageblocks being freed. > /proc/slabinfo https://pastebin.com/81QAFgke Largest caches seem to be: # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> ext4_inode_cache 3107754 3759573 1080 3 1 : tunables 24 12 8 : slabdata 1253191 1253191 0 dentry 2840237 7328181 192 21 1 : tunables 120 60 8 : slabdata 348961 348961 120 The internal fragmentation of dentry cache is significant as well. Dunno if some of those objects pin movable pages as well... So looks like there's insufficient slab reclaim (shrinker activity), and possibly problems with page grouping by mobility heuristics as well... > /proc/vmstat https://pastebin.com/S7mrQx1s > /proc/zoneinfo https://pastebin.com/csGeqNyX > > also please note - whether this makes any difference: there is no swap > file/partition > I am using this without swap space. imho this should not be necessary since > applications running on the hosts would not consume more than 20GB, the rest > should be used by buffers/cache. > ^ permalink raw reply [flat|nested] 66+ messages in thread
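The "6G" and "spread over 5 times as much" figures above follow from simple arithmetic. A sketch of the back-of-envelope check, assuming x86-64 defaults (4 KiB pages, 2 MiB pageblocks of 512 pages) and the numbers quoted from zoneinfo and pagetypeinfo:

```shell
# Reproduce the fragmentation arithmetic from the analysis above.
nr_slab_reclaimable=1679817      # pages, from /proc/zoneinfo
reclaimable_pageblocks=15659     # "Reclaimable" blocks in zone Normal, from /proc/pagetypeinfo

pages_per_block=512              # 2 MiB pageblock / 4 KiB page (x86-64)
# fewest pageblocks that could hold the slab pages if perfectly packed
min_blocks=$(( (nr_slab_reclaimable + pages_per_block - 1) / pages_per_block ))
slab_mib=$(( nr_slab_reclaimable * 4 / 1024 ))        # 4 KiB pages -> MiB
spread_x10=$(( reclaimable_pageblocks * 10 / min_blocks ))

echo "$min_blocks blocks minimum, ${slab_mib} MiB of slab, spread ${spread_x10}/10"
# -> 3281 blocks minimum, 6561 MiB of slab, spread 47/10
```

That is roughly 6.4 GiB of reclaimable slab spread over about 4.8x the minimum number of pageblocks, matching Vlastimil's "5 times as much".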
* Re: Caching/buffers become useless after some time 2018-08-02 16:15 ` Vlastimil Babka @ 2018-08-03 14:13 ` Marinko Catovic 2018-08-06 9:40 ` Vlastimil Babka 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-08-03 14:13 UTC (permalink / raw) To: Vlastimil Babka; +Cc: linux-mm, Michal Hocko [-- Attachment #1: Type: text/plain, Size: 3743 bytes --] Thanks for the analysis. So since I am no mem management dev, what exactly does this mean? Is there any way of workaround or quickfix or something that can/will be fixed at some point in time? I can not imagine that I am the only one who is affected by this, nor do I know why my use case would be so much different from any other. Most 'cloud' services should be affected as well. Tell me if you need any other snapshots or whatever info. 2018-08-02 18:15 GMT+02:00 Vlastimil Babka <vbabka@suse.cz>: > On 07/31/2018 12:08 AM, Marinko Catovic wrote: > > > >> Can you provide (a single snapshot) /proc/pagetypeinfo and > >> /proc/slabinfo from a system that's currently experiencing the issue, > >> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks. > > > > your request came in just one day after I 2>drop_caches again when the > > ram usage > > was really really low again. Up until now it did not reoccur on any of > > the 2 hosts, > > where one shows 550MB/11G with 37G of totally free ram for now - so not > > that low > > like last time when I dropped it, I think it was like 300M/8G or so, but > > I hope it helps: > > Thanks. 
> > > /proc/pagetypeinfo https://pastebin.com/6QWEZagL > > Yep, looks like fragmented by reclaimable slabs: > > Node 0, zone Normal, type Unmovable 29101 32754 8372 2790 > 1334 354 23 3 4 0 0 > Node 0, zone Normal, type Movable 142449 83386 99426 69177 > 36761 12931 1378 24 0 0 0 > Node 0, zone Normal, type Reclaimable 467195 530638 355045 192638 > 80358 15627 2029 231 18 0 0 > > Number of blocks type Unmovable Movable Reclaimable > HighAtomic Isolate > Node 0, zone DMA 1 7 0 0 > 0 > Node 0, zone DMA32 34 703 375 0 > 0 > Node 0, zone Normal 1672 14276 15659 1 > 0 > > Half of the memory is marked as reclaimable (2 megabyte) pageblocks. > zoneinfo has nr_slab_reclaimable 1679817 so the reclaimable slabs occupy > only 3280 (6G) pageblocks, yet they are spread over 5 times as much. > It's also possible they pollute the Movable pageblocks as well, but the > stats can't tell us. Either the page grouping mobility heuristics are > broken here, or the worst case scenario happened - memory was at some point > really wholly filled with reclaimable slabs, and the rather random reclaim > did not result in whole pageblocks being freed. > > > /proc/slabinfo https://pastebin.com/81QAFgke > > Largest caches seem to be: > # name <active_objs> <num_objs> <objsize> <objperslab> > <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata > <active_slabs> <num_slabs> <sharedavail> > ext4_inode_cache 3107754 3759573 1080 3 1 : tunables 24 12 > 8 : slabdata 1253191 1253191 0 > dentry 2840237 7328181 192 21 1 : tunables 120 60 > 8 : slabdata 348961 348961 120 > > The internal fragmentation of dentry cache is significant as well. > Dunno if some of those objects pin movable pages as well... > > So looks like there's insufficient slab reclaim (shrinker activity), and > possibly problems with page grouping by mobility heuristics as well...
> > > /proc/vmstat https://pastebin.com/S7mrQx1s > > /proc/zoneinfo https://pastebin.com/csGeqNyX > > > > also please note - whether this makes any difference: there is no swap > > file/partition > > I am using this without swap space. imho this should not be necessary > since > > applications running on the hosts would not consume more than 20GB, the > rest > > should be used by buffers/cache. > > > [-- Attachment #2: Type: text/html, Size: 5128 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-03 14:13 ` Marinko Catovic @ 2018-08-06 9:40 ` Vlastimil Babka 2018-08-06 10:29 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-08-06 9:40 UTC (permalink / raw) To: Marinko Catovic; +Cc: linux-mm, Michal Hocko On 08/03/2018 04:13 PM, Marinko Catovic wrote: > Thanks for the analysis. > > So since I am no mem management dev, what exactly does this mean? > Is there any way of workaround or quickfix or something that can/will > be fixed at some point in time? Workaround would be the manual / periodic cache flushing, unfortunately. Maybe a memcg with kmemcg limit? Michal could know more. A long-term generic solution will be much harder to find :( > I can not imagine that I am the only one who is affected by this, nor do I > know why my use case would be so much different from any other. > Most 'cloud' services should be affected as well. Hmm, either your workload is specific in being hungry for fs metadata and not much data (page cache). And/Or there's some source of the high-order allocations that others don't have, possibly related to some piece of hardware? > Tell me if you need any other snapshots or whatever info. > > 2018-08-02 18:15 GMT+02:00 Vlastimil Babka <vbabka@suse.cz > <mailto:vbabka@suse.cz>>: > > On 07/31/2018 12:08 AM, Marinko Catovic wrote: > > > >> Can you provide (a single snapshot) /proc/pagetypeinfo and > >> /proc/slabinfo from a system that's currently experiencing the issue, > >> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks. > > > > your request came in just one day after I 2>drop_caches again when the > > ram usage > > was really really low again. Up until now it did not reoccur on any of > > the 2 hosts, > > where one shows 550MB/11G with 37G of totally free ram for now - so not > > that low > > like last time when I dropped it, I think it was like 300M/8G or so, but > > I hope it helps: > > Thanks. 
> > > /proc/pagetypeinfo https://pastebin.com/6QWEZagL > > Yep, looks like fragmented by reclaimable slabs: > > Node 0, zone Normal, type Unmovable 29101 32754 8372 2790 1334 354 23 3 4 0 0 > > Node 0, zone Normal, type Movable 142449 83386 99426 69177 36761 12931 1378 24 0 0 0 > > Node 0, zone Normal, type Reclaimable 467195 530638 355045 192638 80358 15627 2029 231 18 0 0 > > Number of blocks type Unmovable Movable Reclaimable HighAtomic Isolate > > Node 0, zone DMA 1 7 0 0 0 > > Node 0, zone DMA32 34 703 375 0 0 > > Node 0, zone Normal 1672 14276 15659 1 0 > > Half of the memory is marked as reclaimable (2 megabyte) pageblocks. > > zoneinfo has nr_slab_reclaimable 1679817 so the reclaimable slabs occupy > > only 3280 (6G) pageblocks, yet they are spread over 5 times as much. > > It's also possible they pollute the Movable pageblocks as well, but the > > stats can't tell us. Either the page grouping mobility heuristics are > > broken here, or the worst case scenario happened - memory was at some point > > really wholly filled with reclaimable slabs, and the rather random reclaim > > did not result in whole pageblocks being freed.
> > > /proc/slabinfo https://pastebin.com/81QAFgke > > Largest caches seem to be: > > # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> > > ext4_inode_cache 3107754 3759573 1080 3 1 : tunables 24 12 8 : slabdata 1253191 1253191 0 > > dentry 2840237 7328181 192 21 1 : tunables 120 60 8 : slabdata 348961 348961 120 > > The internal fragmentation of dentry cache is significant as well. > > Dunno if some of those objects pin movable pages as well... > > So looks like there's insufficient slab reclaim (shrinker activity), and > > possibly problems with page grouping by mobility heuristics as well... > > > /proc/vmstat https://pastebin.com/S7mrQx1s > > > /proc/zoneinfo https://pastebin.com/csGeqNyX > > > > also please note - whether this makes any difference: there is no swap > > > file/partition > > > I am using this without swap space. imho this should not be necessary since > > > applications running on the hosts would not consume more than 20GB, the rest > > > should be used by buffers/cache. > > > > ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-06 9:40 ` Vlastimil Babka @ 2018-08-06 10:29 ` Marinko Catovic 2018-08-06 12:00 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-08-06 10:29 UTC (permalink / raw) To: Vlastimil Babka; +Cc: linux-mm, Michal Hocko [-- Attachment #1: Type: text/plain, Size: 5362 bytes --] > Maybe a memcg with kmemcg limit? Michal could know more. Could you/Michal explain this perhaps? The hardware is pretty much high end datacenter grade, I really would not know how this is to be related with the hardware :( I do not understand why apparently the caching is working very much fine for the beginning after a drop_caches, then degrades to low usage somewhat later. I can not possibly drop caches automatically, since this requires monitoring for overload with temporary dropping traffic on specific ports until the writes/reads cool down. 2018-08-06 11:40 GMT+02:00 Vlastimil Babka <vbabka@suse.cz>: > On 08/03/2018 04:13 PM, Marinko Catovic wrote: > > Thanks for the analysis. > > > > So since I am no mem management dev, what exactly does this mean? > > Is there any way of workaround or quickfix or something that can/will > > be fixed at some point in time? > > Workaround would be the manual / periodic cache flushing, unfortunately. > > Maybe a memcg with kmemcg limit? Michal could know more. > > A long-term generic solution will be much harder to find :( > > > I can not imagine that I am the only one who is affected by this, nor do > I > > know why my use case would be so much different from any other. > > Most 'cloud' services should be affected as well. > > Hmm, either your workload is specific in being hungry for fs metadata > and not much data (page cache). And/Or there's some source of the > high-order allocations that others don't have, possibly related to some > piece of hardware? > > > Tell me if you need any other snapshots or whatever info.
> > > > 2018-08-02 18:15 GMT+02:00 Vlastimil Babka <vbabka@suse.cz > > <mailto:vbabka@suse.cz>>: > > > > On 07/31/2018 12:08 AM, Marinko Catovic wrote: > > > > > >> Can you provide (a single snapshot) /proc/pagetypeinfo and > > >> /proc/slabinfo from a system that's currently experiencing the > issue, > > >> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks. > > > > > > your request came in just one day after I 2>drop_caches again when > the > > > ram usage > > > was really really low again. Up until now it did not reoccur on > any of > > > the 2 hosts, > > > where one shows 550MB/11G with 37G of totally free ram for now - > so not > > > that low > > > like last time when I dropped it, I think it was like 300M/8G or > so, but > > > I hope it helps: > > > > Thanks. > > > > > /proc/pagetypeinfo https://pastebin.com/6QWEZagL > > > > Yep, looks like fragmented by reclaimable slabs: > > > > Node 0, zone Normal, type Unmovable 29101 32754 8372 > > 2790 1334 354 23 3 4 0 0 > > Node 0, zone Normal, type Movable 142449 83386 99426 > > 69177 36761 12931 1378 24 0 0 0 > > Node 0, zone Normal, type Reclaimable 467195 530638 355045 > > 192638 80358 15627 2029 231 18 0 0 > > > > Number of blocks type Unmovable Movable Reclaimable > > HighAtomic Isolate > > Node 0, zone DMA 1 7 0 > > 0 0 > > Node 0, zone DMA32 34 703 375 > > 0 0 > > Node 0, zone Normal 1672 14276 15659 > > 1 0 > > > > Half of the memory is marked as reclaimable (2 megabyte) pageblocks. > > zoneinfo has nr_slab_reclaimable 1679817 so the reclaimable slabs > occupy > > only 3280 (6G) pageblocks, yet they are spread over 5 times as much. > > It's also possible they pollute the Movable pageblocks as well, but > the > > stats can't tell us. Either the page grouping mobility heuristics are > > broken here, or the worst case scenario happened - memory was at > > some point > > really wholly filled with reclaimable slabs, and the rather random > > reclaim > > did not result in whole pageblocks being freed. 
> > > /proc/slabinfo https://pastebin.com/81QAFgke > > Largest caches seem to be: > > # name <active_objs> <num_objs> <objsize> <objperslab> > > <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : > > slabdata <active_slabs> <num_slabs> <sharedavail> > > ext4_inode_cache 3107754 3759573 1080 3 1 : tunables 24 > > 12 8 : slabdata 1253191 1253191 0 > > dentry 2840237 7328181 192 21 1 : tunables 120 > > 60 8 : slabdata 348961 348961 120 > > > > The internal fragmentation of dentry cache is significant as well. > > Dunno if some of those objects pin movable pages as well... > > > > So looks like there's insufficient slab reclaim (shrinker activity), > and > > possibly problems with page grouping by mobility heuristics as > well... > > > > > /proc/vmstat https://pastebin.com/S7mrQx1s > > > /proc/zoneinfo https://pastebin.com/csGeqNyX > > > > > > also please note - whether this makes any difference: there is no > swap > > > file/partition > > > I am using this without swap space. imho this should not be > > necessary since > > > applications running on the hosts would not consume more than > > 20GB, the rest > > > should be used by buffers/cache. > > > > > > > > > [-- Attachment #2: Type: text/html, Size: 7521 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-06 10:29 ` Marinko Catovic @ 2018-08-06 12:00 ` Michal Hocko 2018-08-06 15:37 ` Christopher Lameter 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-08-06 12:00 UTC (permalink / raw) To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm [Please do not top-post] On Mon 06-08-18 12:29:43, Marinko Catovic wrote: > > Maybe a memcg with kmemcg limit? Michal could know more. > > Could you/Michal explain this perhaps? The only way how kmemcg limit could help I can think of would be to enforce metadata reclaim much more often. But that is rather a bad workaround. > The hardware is pretty much high end datacenter grade, I really would > not know how this is to be related with the hardware :( Well, there are some drivers (mostly out-of-tree) which are high order hungry. You can try to trace all allocations with order > 0 and see who that might be. # mount -t tracefs none /debug/trace/ # echo stacktrace > /debug/trace/trace_options # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable # cat /debug/trace/trace_pipe And later this to disable tracing. # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable > I do not understand why apparently the caching is working very much > fine for the beginning after a drop_caches, then degrades to low usage > somewhat later. Because a lot of FS metadata is fragmenting the memory and a large number of high order allocations which want to be served reclaim a lot of memory to achieve their goal. Considering a large part of memory is fragmented by unmovable objects there is no other way than to use reclaim to release that memory. > I can not possibly drop caches automatically, since > this requires monitoring for overload with temporary dropping traffic > on specific ports until the writes/reads cool down. You do not have to drop all caches.
echo 2 > /proc/sys/vm/drop_caches should be sufficient to drop metadata only. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
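Once a trace has been captured from the recipe above (e.g. `cat trace_pipe > alloc-trace.txt` for a while), a rough first pass is to count events per allocation order. This sketch assumes the default `mm_page_alloc` event format, which prints an `order=N` field for each allocation:

```shell
# Count mm_page_alloc events per allocation order in a captured trace,
# highest counts first. Orders > 0 are the ones that need contiguous
# pages and can trigger compaction/reclaim cycles.
summarize_orders() {
    grep -o 'order=[0-9]*' "$1" | sort | uniq -c | sort -rn
}
```

Per-stack aggregation (to find the actual caller) would need to group the stacktrace lines that follow each event, but the order histogram alone already shows whether high-order allocations dominate.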
* Re: Caching/buffers become useless after some time 2018-08-06 12:00 ` Michal Hocko @ 2018-08-06 15:37 ` Christopher Lameter 2018-08-06 18:16 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Christopher Lameter @ 2018-08-06 15:37 UTC (permalink / raw) To: Michal Hocko; +Cc: Marinko Catovic, Vlastimil Babka, linux-mm On Mon, 6 Aug 2018, Michal Hocko wrote: > Because a lot of FS metadata is fragmenting the memory and a large > number of high order allocations which want to be served reclaim a lot > of memory to achieve their goal. Considering a large part of memory is > fragmented by unmovable objects there is no other way than to use > reclaim to release that memory. Well it looks like the fragmentation issue gets worse. Is that enough to consider merging the slab defrag patchset and get some work done on inodes and dentries to make them movable (or use targeted reclaim)? ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-06 15:37 ` Christopher Lameter @ 2018-08-06 18:16 ` Michal Hocko 2018-08-09 8:29 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-08-06 18:16 UTC (permalink / raw) To: Christopher Lameter; +Cc: Marinko Catovic, Vlastimil Babka, linux-mm On Mon 06-08-18 15:37:14, Christopher Lameter wrote: > On Mon, 6 Aug 2018, Michal Hocko wrote: > > > Because a lot of FS metadata is fragmenting the memory and a large > > number of high order allocations which want to be served reclaim a lot > > of memory to achieve their goal. Considering a large part of memory is > > fragmented by unmovable objects there is no other way than to use > > reclaim to release that memory. > > Well it looks like the fragmentation issue gets worse. Is that enough to > consider merging the slab defrag patchset and get some work done on inodes > and dentries to make them movable (or use targeted reclaim)? Is there anything to test? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-06 18:16 ` Michal Hocko @ 2018-08-09 8:29 ` Marinko Catovic 2018-08-21 0:36 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-08-09 8:29 UTC (permalink / raw) Cc: Christopher Lameter, Vlastimil Babka, linux-mm [-- Attachment #1: Type: text/plain, Size: 2996 bytes --] On Mon 06-08-18 15:37:14, Christopher Lameter wrote: > > On Mon, 6 Aug 2018, Michal Hocko wrote: > > > > > Because a lot of FS metadata is fragmenting the memory and a large > > > number of high order allocations which want to be served reclaim a lot > > > of memory to achieve their goal. Considering a large part of memory is > > > fragmented by unmovable objects there is no other way than to use > > > reclaim to release that memory. > > > > Well it looks like the fragmentation issue gets worse. Is that enough to > > consider merging the slab defrag patchset and get some work done on > inodes > > and dentries to make them movable (or use targeted reclaim)? > > Is there anything to test? > -- > Michal Hocko > SUSE Labs > > [Please do not top-post] like this? > The only way how kmemcg limit could help I can think of would be to > enforce metadata reclaim much more often. But that is rather a bad > workaround. would that have some significant performance impact? I would be willing to try if you think the idea is not thaaat bad. If so, could you please explain what to do? > > > Because a lot of FS metadata is fragmenting the memory and a large > > > number of high order allocations which want to be served reclaim a lot > > > of memory to achieve their goal. Considering a large part of memory is > > > fragmented by unmovable objects there is no other way than to use > > > reclaim to release that memory. > > > > Well it looks like the fragmentation issue gets worse.
Is that enough to > > consider merging the slab defrag patchset and get some work done on inodes > > and dentries to make them movable (or use targeted reclaim)? > Is there anything to test? Are you referring to some known issue there, possibly directly related to mine? If so, I would be willing to test that patchset, if it makes it into the kernel.org sources, or if I'd have to patch that manually. > Well, there are some drivers (mostly out-of-tree) which are high order > hungry. You can try to trace all allocations with order > 0 and > see who that might be. > # mount -t tracefs none /debug/trace/ > # echo stacktrace > /debug/trace/trace_options > # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter > # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable > # cat /debug/trace/trace_pipe > > And later this to disable tracing. > # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable I just had a major cache-useless situation, with like 100M/8G usage only and horrible performance. There you go: https://nofile.io/f/mmwVedaTFsd I think mysql occurs mostly; regardless of the binary name, this is actually mariadb in version 10.1. > You do not have to drop all caches. echo 2 > /proc/sys/vm/drop_caches > should be sufficient to drop metadata only. that is exactly what I am doing, I already mentioned that 1> does not make any difference at all 2> is the only way that helps. just 5 minutes after doing that the usage grew to 2GB/10GB and is steadily going up, as usual. [-- Attachment #2: Type: text/html, Size: 5068 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
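The periodic `echo 2 > drop_caches` workaround discussed in the thread could be automated along these lines. This is only a sketch: the function names and the 4 GiB / 16 GiB thresholds are illustrative placeholders, not tuned recommendations, and the decision helper is separated out so the logic can be tested without touching /proc:

```shell
# Drop reclaimable metadata (drop_caches=2) when the page cache has
# collapsed while most RAM sits idle - the symptom described in this
# thread. Thresholds below are hypothetical examples in KiB.
LOW_CACHED_KB=$((4 * 1024 * 1024))    # cache considered "collapsed" below 4 GiB
HIGH_FREE_KB=$((16 * 1024 * 1024))    # while more than 16 GiB of RAM is unused

caches_collapsed() {  # args: cached_kb free_kb; returns 0 if a drop looks warranted
    [ "$1" -lt "$LOW_CACHED_KB" ] && [ "$2" -gt "$HIGH_FREE_KB" ]
}

maybe_drop() {        # run periodically, e.g. from cron; needs root
    cached=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
    free=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    if caches_collapsed "$cached" "$free"; then
        sync
        echo 2 > /proc/sys/vm/drop_caches
    fi
}
```

As the thread notes, this treats the symptom only; the caches still take minutes to repopulate after each drop.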
* Re: Caching/buffers become useless after some time 2018-08-09 8:29 ` Marinko Catovic @ 2018-08-21 0:36 ` Marinko Catovic 2018-08-21 6:49 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-08-21 0:36 UTC (permalink / raw) To: Marinko Catovic; +Cc: Christopher Lameter, Vlastimil Babka, linux-mm [-- Attachment #1: Type: text/plain, Size: 6465 bytes --] > The only way how kmemcg limit could help I can think of would be to >> enforce metadata reclaim much more often. But that is rather a bad >> workaround. > >would that have some significant performance impact? >I would be willing to try if you think the idea is not thaaat bad. >If so, could you please explain what to do? > >> > > Because a lot of FS metadata is fragmenting the memory and a large >> > > number of high order allocations which want to be served reclaim a lot >> > > of memory to achieve their goal. Considering a large part of memory is >> > > fragmented by unmovable objects there is no other way than to use >> > > reclaim to release that memory. >> > >> > Well it looks like the fragmentation issue gets worse. Is that enough to >> > consider merging the slab defrag patchset and get some work done on inodes >> > and dentries to make them movable (or use targeted reclaim)? > >> Is there anything to test? > >Are you referring to some known issue there, possibly directly related to mine? >If so, I would be willing to test that patchset, if it makes it into the kernel.org sources, >or if I'd have to patch that manually. > > >> Well, there are some drivers (mostly out-of-tree) which are high order >> hungry. You can try to trace all allocations with order > 0 and >> see who that might be.
>> # mount -t tracefs none /debug/trace/ >> # echo stacktrace > /debug/trace/trace_options >> # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter >> # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable >> # cat /debug/trace/trace_pipe >> >> And later this to disable tracing. >> # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable > >I just had a major cache-useless situation, with like 100M/8G usage only >and horrible performance. There you go: > >https://nofile.io/f/mmwVedaTFsd > >I think mysql occurs mostly, regardless of the binary name this is actually >mariadb in version 10.1. > >> You do not have to drop all caches. echo 2 > /proc/sys/vm/drop_caches >> should be sufficient to drop metadata only. > >that is exactly what I am doing, I already mentioned that 1> does not >make any difference at all 2> is the only way that helps. >just 5 minutes after doing that the usage grew to 2GB/10GB and is steadily >going up, as usual. > > >2018-08-09 10:29 GMT+02:00 Marinko Catovic <marinko.catovic@gmail.com>: > > > > On Mon 06-08-18 15:37:14, Cristopher Lameter wrote: > > On Mon, 6 Aug 2018, Michal Hocko wrote: > > > > > Because a lot of FS metadata is fragmenting the memory and a large > > > number of high order allocations which want to be served reclaim a lot > > > of memory to achieve their gol. Considering a large part of memory is > > > fragmented by unmovable objects there is no other way than to use > > > reclaim to release that memory. > > > > Well it looks like the fragmentation issue gets worse. Is that enough to > > consider merging the slab defrag patchset and get some work done on inodes > > and dentries to make them movable (or use targetd reclaim)? > > Is there anything to test? > -- > Michal Hocko > SUSE Labs > > > > [Please do not top-post] > > like this? > > > The only way how kmemcg limit could help I can think of would be to > > enforce metadata reclaim much more often. But that is rather a bad > > workaround. 
> > would that have some significant performance impact? > I would be willing to try if you think the idea is not thaaat bad. > If so, could you please explain what to do? > > > > > Because a lot of FS metadata is fragmenting the memory and a large > > > > number of high order allocations which want to be served reclaim a lot > > > > of memory to achieve their gol. Considering a large part of memory is > > > > fragmented by unmovable objects there is no other way than to use > > > > reclaim to release that memory. > > > > > > Well it looks like the fragmentation issue gets worse. Is that enough to > > > consider merging the slab defrag patchset and get some work done on inodes > > > and dentries to make them movable (or use targetd reclaim)? > > > Is there anything to test? > > Are you referring to some known issue there, possibly directly related to mine? > If so, I would be willing to test that patchset, if it makes into the kernel.org sources, > or if I'd have to patch that manually. > > > > Well, there are some drivers (mostly out-of-tree) which are high order > > hungry. You can try to trace all allocations which with order > 0 and > > see who that might be. > > # mount -t tracefs none /debug/trace/ > > # echo stacktrace > /debug/trace/trace_options > > # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter > > # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable > > # cat /debug/trace/trace_pipe > > > > And later this to disable tracing. > > # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable > > I just had a major cache-useless situation, with like 100M/8G usage only > and horrible performance. There you go: > > https://nofile.io/f/mmwVedaTFsd > > I think mysql occurs mostly, regardless of the binary name this is actually > mariadb in version 10.1. > > > You do not have to drop all caches. echo 2 > /proc/sys/vm/drop_caches > > should be sufficient to drop metadata only. 
> > that is exactly what I am doing, I already mentioned that 1> does not > make any difference at all 2> is the only way that helps. > just 5 minutes after doing that the usage grew to 2GB/10GB and is steadily > going up, as usual. Is there anything you can read from these results? The issue keeps occurring; the latest one was even totally unexpected in the morning hours, causing downtime the entire morning until noon, when I could check and drop the caches again. I also reset O_DIRECT from mariadb to `fsync`, the new default in their latest release, hoping that this would help, but it did not. Before giving up entirely I'd like to know whether there is any solution for this, where again I can not believe that I am the only one affected. This *has* to affect anyone with a similar use case; I do not see what is so special about mine. This is simply many users with many files; every larger shared hosting provider should experience the exact same behaviour with the 4.x kernel branch. [-- Attachment #2: Type: text/html, Size: 8266 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
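Since the whole thread revolves around watching the Buffers/Cached pair shrink, the following is a small helper sketch for tracking those two numbers. The function name and the MB formatting are my own invention; the only assumption is the standard `Buffers:`/`Cached:` lines of /proc/meminfo (values in kB). Reading from stdin keeps it usable against saved snapshots as well as the live file.

```shell
# Hypothetical helper: print the Buffers/Cached pair (the "100MB/3GB"
# numbers discussed in this thread) from /proc/meminfo-style input.
bufcache() {
    awk '/^Buffers:/ {b=$2} /^Cached:/ {c=$2}
         END { printf "buffers: %d MB, cache: %d MB\n", b/1024, c/1024 }'
}
# usage on a live host, e.g. once per second:
#   while :; do bufcache < /proc/meminfo; sleep 1; done
```

Note that `/^Cached:/` is anchored so it does not also match the `SwapCached:` line.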
* Re: Caching/buffers become useless after some time 2018-08-21 0:36 ` Marinko Catovic @ 2018-08-21 6:49 ` Michal Hocko 2018-08-21 7:19 ` Vlastimil Babka 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-08-21 6:49 UTC (permalink / raw) To: Marinko Catovic; +Cc: Christopher Lameter, Vlastimil Babka, linux-mm On Tue 21-08-18 02:36:05, Marinko Catovic wrote: [...] > > > Well, there are some drivers (mostly out-of-tree) which are high order > > > hungry. You can try to trace all allocations which with order > 0 and > > > see who that might be. > > > # mount -t tracefs none /debug/trace/ > > > # echo stacktrace > /debug/trace/trace_options > > > # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter > > > # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable > > > # cat /debug/trace/trace_pipe > > > > > > And later this to disable tracing. > > > # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable > > > > I just had a major cache-useless situation, with like 100M/8G usage only > > and horrible performance. 
There you go: > > > > https://nofile.io/f/mmwVedaTFsd $ grep mm_page_alloc: trace_pipe | sed 's@.*order=\([0-9]*\) .*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c 428 1 __GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE 10 1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE 6 1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE 3061 1 GFP_KERNEL_ACCOUNT|__GFP_ZERO 8672 1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT 2547 1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE 4 2 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE 5 2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE 20030 2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT 1528 3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC 2476 3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP 6512 3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT 277 9 GFP_TRANSHUGE|__GFP_THISNODE This only covers ~90s of the allocator activity. Most of those requests are not triggering any reclaim (GFP_NOWAIT/ATOMIC). Vlastimil will know better but this might mean that we are not invoking kcompactd enough. But considering that we have suspected that an overly eager reclaim triggers the page cache reduction I am not really sure I see the above matching that theory. Btw. I was probably not specific enough. This data should be collected _during_ the time when the page cache is disappearing. I suspect you have started collecting after the fact. Btw. the vast majority of order-3 requests come from the network layer. Are you using a large MTU (jumbo packets)? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
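For readers who want to reuse the summarization one-liner above, here it is exercised against a synthetic trace line. The sample line and its field values are made-up approximations of the mm_page_alloc tracepoint format, not output from the affected hosts; the pipeline itself is exactly the one quoted in the message.

```shell
# Synthetic mm_page_alloc trace line (hypothetical values) to sanity-check
# the grep|sed|sort|uniq pipeline that summarizes allocations by
# order and gfp_flags.
sample='kworker/0:1-42 [000] .... 123.456: mm_page_alloc: page=00000000deadbeef pfn=1234 order=3 migratetype=0 gfp_flags=GFP_NOWAIT|__GFP_COMP'

# Same pipeline as above, fed the sample instead of trace_pipe;
# it should report one order-3 allocation with those flags.
printf '%s\n' "$sample" \
    | grep mm_page_alloc: \
    | sed 's@.*order=\([0-9]*\) .*gfp_flags=\(.*\)@\1 \2@' \
    | sort | uniq -c
```

Running this against a real `trace_pipe` capture works the same way, just with `cat trace_pipe` (or `zcat trace.gz`) as the input.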
* Re: Caching/buffers become useless after some time 2018-08-21 6:49 ` Michal Hocko @ 2018-08-21 7:19 ` Vlastimil Babka 2018-08-22 20:02 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-08-21 7:19 UTC (permalink / raw) To: Michal Hocko, Marinko Catovic; +Cc: Christopher Lameter, linux-mm On 8/21/18 8:49 AM, Michal Hocko wrote: > On Tue 21-08-18 02:36:05, Marinko Catovic wrote: > [...] >>>> Well, there are some drivers (mostly out-of-tree) which are high order >>>> hungry. You can try to trace all allocations which with order > 0 and >>>> see who that might be. >>>> # mount -t tracefs none /debug/trace/ >>>> # echo stacktrace > /debug/trace/trace_options >>>> # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter >>>> # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable >>>> # cat /debug/trace/trace_pipe >>>> >>>> And later this to disable tracing. >>>> # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable >>> >>> I just had a major cache-useless situation, with like 100M/8G usage only >>> and horrible performance. 
There you go: >>> >>> https://nofile.io/f/mmwVedaTFsd > > $ grep mm_page_alloc: trace_pipe | sed 's@.*order=\([0-9]*\) .*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c > 428 1 __GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE > 10 1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > 6 1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > 3061 1 GFP_KERNEL_ACCOUNT|__GFP_ZERO > 8672 1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT > 2547 1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE > 4 2 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > 5 2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > 20030 2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT > 1528 3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC > 2476 3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP > 6512 3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT > 277 9 GFP_TRANSHUGE|__GFP_THISNODE > > This only covers ~90s of the allocator activity. Most of those requests > are not troggering any reclaim (GFP_NOWAIT/ATOMIC). Vlastimil will > know better but this might mean that we are not envoking kcompactd > enough. Earlier vmstat data showed that it's invoked but responsible for less than 1% of compaction activity. > But considering that we have suspected that an overly eager > reclaim triggers the page cache reduction I am not really sure I see the > above to match that theory. Yeah, the GFP_NOWAIT/GFP_ATOMIC above shouldn't be responsible for such overreclaim? > Btw. I was probably not specific enough. This data should be collected > _during_ the time when the page cache is disappearing. I suspect you > have started collecting after the fact. 
It might be also interesting to do in the problematic state, instead of dropping caches: - save snapshot of /proc/vmstat and /proc/pagetypeinfo - echo 1 > /proc/sys/vm/compact_memory - save new snapshot of /proc/vmstat and /proc/pagetypeinfo That would show if compaction is able to help at all. > Btw. vast majority of order-3 requests come from the network layer. Are > you using a large MTU (jumbo packets)? > ^ permalink raw reply [flat|nested] 66+ messages in thread
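The before/after vmstat snapshots suggested above are easiest to compare with a small diff helper. This is a sketch of my own (function name and snapshot file names are placeholders, not part of the thread's procedure); it prints only the counters whose values changed between the two snapshots.

```shell
# Hypothetical helper: diff two /proc/vmstat snapshots (taken before and
# after `echo 1 > /proc/sys/vm/compact_memory`), printing changed counters.
diff_vmstat() {
    awk 'NR==FNR { before[$1]=$2; next }
         $1 in before && $2 != before[$1] {
             printf "%s: %s -> %s (%+d)\n", $1, before[$1], $2, $2-before[$1]
         }' "$1" "$2"
}
# usage: cp /proc/vmstat vmstat.before
#        echo 1 > /proc/sys/vm/compact_memory
#        cp /proc/vmstat vmstat.after
#        diff_vmstat vmstat.before vmstat.after
```

Counters such as `compact_stall`, `compact_success`, and `compact_fail` moving (or not moving) between the snapshots would show directly whether compaction did any work.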
* Re: Caching/buffers become useless after some time 2018-08-21 7:19 ` Vlastimil Babka @ 2018-08-22 20:02 ` Marinko Catovic 2018-08-23 12:10 ` Vlastimil Babka 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-08-22 20:02 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Michal Hocko, Christopher Lameter, linux-mm [-- Attachment #1: Type: text/plain, Size: 1763 bytes --] > It might be also interesting to do in the problematic state, instead of > dropping caches: > > - save snapshot of /proc/vmstat and /proc/pagetypeinfo > - echo 1 > /proc/sys/vm/compact_memory > - save new snapshot of /proc/vmstat and /proc/pagetypeinfo There was just a worst case in progress, about 100MB/10GB were used, super-low performance, but I could not see any improvement there after echo 1; I watched this for about 3 minutes, and the cache usage did not change. pagetypeinfo before echo https://pastebin.com/MjSgiMRL pagetypeinfo 3min after echo https://pastebin.com/uWM6xGDd vmstat before echo https://pastebin.com/TjYSKNdE vmstat 3min after echo https://pastebin.com/MqTibEKi > Btw. vast majority of order-3 requests come from the network layer. Are > you using a large MTU (jumbo packets)? Not that I know of; how would I figure that out? I have not touched sysctl net.* besides a few values not related to MTU, afaik. > Btw. I was probably not specific enough. This data should be collected > _during_ the time when the page cache is disappearing. I suspect you > have started collecting after the fact. Meh, I just messed up that output with the latest drop_caches, but I am pretty sure that the one you see is from while the usage was like 300MB/10GB, before dropping caches. I was thinking maybe it would really help if one of you guys connects to the hosts in that state so that you can see for yourself. Due to privacy issues (GDPR and such) I'd like to monitor this, so the ssh login would have to go over something like TeamViewer on my host or whatever.
Please let me know if anyone is willing, since I have seen no improvement with anything I have tried for 3 months now. Thanks for the efforts; surely any diagnosis would be easier this way. [-- Attachment #2: Type: text/html, Size: 3043 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-22 20:02 ` Marinko Catovic @ 2018-08-23 12:10 ` Vlastimil Babka 2018-08-23 12:21 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-08-23 12:10 UTC (permalink / raw) To: Marinko Catovic; +Cc: Michal Hocko, Christopher Lameter, linux-mm On 08/22/2018 10:02 PM, Marinko Catovic wrote: >> It might be also interesting to do in the problematic state, instead of >> dropping caches: >> >> - save snapshot of /proc/vmstat and /proc/pagetypeinfo >> - echo 1 > /proc/sys/vm/compact_memory >> - save new snapshot of /proc/vmstat and /proc/pagetypeinfo > > There was just a worstcase in progress, about 100MB/10GB were used, > super-low perfomance, but could not see any improvement there after echo 1, > I watches this for about 3 minutes, the cache usage did not change. > > pagetypeinfo before echo https://pastebin.com/MjSgiMRL > pagetypeinfo 3min after echo https://pastebin.com/uWM6xGDd > > vmstat before echo https://pastebin.com/TjYSKNdE > vmstat 3min after echo https://pastebin.com/MqTibEKi OK, that confirms compaction is useless here. Thanks. It also shows that all orders except order-9 are in fact plentiful. Michal's earlier summary of the trace shows that most allocations are up to order-3 and should be fine, the exception is THP: 277 9 GFP_TRANSHUGE|__GFP_THISNODE Hmm it's actually interesting to see GFP_TRANSHUGE there and not GFP_TRANSHUGE_LIGHT. What's your thp defrag setting? (cat /sys/kernel/mm/transparent_hugepage/enabled). Maybe it's set to "always", or there's a heavily faulting process that's using madvise(MADV_HUGEPAGE). If that's the case, setting it to "defer" or even "never" could be a workaround. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-23 12:10 ` Vlastimil Babka @ 2018-08-23 12:21 ` Michal Hocko 2018-08-24 0:11 ` Marinko Catovic 2018-08-24 6:24 ` Vlastimil Babka 0 siblings, 2 replies; 66+ messages in thread From: Michal Hocko @ 2018-08-23 12:21 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Marinko Catovic, Christopher Lameter, linux-mm On Thu 23-08-18 14:10:28, Vlastimil Babka wrote: > On 08/22/2018 10:02 PM, Marinko Catovic wrote: > >> It might be also interesting to do in the problematic state, instead of > >> dropping caches: > >> > >> - save snapshot of /proc/vmstat and /proc/pagetypeinfo > >> - echo 1 > /proc/sys/vm/compact_memory > >> - save new snapshot of /proc/vmstat and /proc/pagetypeinfo > > > > There was just a worstcase in progress, about 100MB/10GB were used, > > super-low perfomance, but could not see any improvement there after echo 1, > > I watches this for about 3 minutes, the cache usage did not change. > > > > pagetypeinfo before echo https://pastebin.com/MjSgiMRL > > pagetypeinfo 3min after echo https://pastebin.com/uWM6xGDd > > > > vmstat before echo https://pastebin.com/TjYSKNdE > > vmstat 3min after echo https://pastebin.com/MqTibEKi > > OK, that confirms compaction is useless here. Thanks. > > It also shows that all orders except order-9 are in fact plentiful. > Michal's earlier summary of the trace shows that most allocations are up > to order-3 and should be fine, the exception is THP: > > 277 9 GFP_TRANSHUGE|__GFP_THISNODE But please note that this is not from the time when the page cache dropped to the observed values. So we do not know what happened at the time. Anyway 277 THP pages paging out such a large page cache amount would be more than unexpected even for explicitly costly THP fault in methods. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-23 12:21 ` Michal Hocko @ 2018-08-24 0:11 ` Marinko Catovic 2018-08-24 6:34 ` Vlastimil Babka 2018-08-24 6:24 ` Vlastimil Babka 1 sibling, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-08-24 0:11 UTC (permalink / raw) To: Michal Hocko; +Cc: Vlastimil Babka, Christopher Lameter, linux-mm [-- Attachment #1: Type: text/plain, Size: 3349 bytes --] > Hmm it's actually interesting to see GFP_TRANSHUGE there and not > GFP_TRANSHUGE_LIGHT. What's your thp defrag setting? (cat > /sys/kernel/mm/transparent_hugepage/enabled). Maybe it's set to > "always", or there's a heavily faulting process that's using > madvise(MADV_HUGEPAGE). If that's the case, setting it to "defer" or > even "never" could be a workaround. cat /sys/kernel/mm/transparent_hugepage/enabled always [madvise] never according to the docs this is the default > "madvise" will enter direct reclaim like "always" but only for regions > that are have used madvise(MADV_HUGEPAGE). This is the default behaviour. would any change there kick in immediately, even when in the 100M/10G case? > or there's a heavily faulting process that's using madvise(MADV_HUGEPAGE) are you suggesting that a/one process can cause this? how would one be able to identify it..? should killing it allow the cache to be populated again instantly? if yes, then I could start killing all processes on the host until there is improvement to observe. so far I can tell that it is not the database server, since restarting it did not help at all. Please remember that, suggesting this, I can see how buffers (the 100MB value) are `oscillating`. When in the cache-useless state it jumps around literally every second from e.g. 100 to 102, then 99, 104, 85, 101, 105, 98, .. and so on, where it always gets closer from well-populated several GB in the beginning to those 100MB over the days. 
so doing anything that should cause an effect would be easily measurable instantly, which is to date only achieved by dropping caches. Please tell me if you need any measurements again, when or at what state, with code snippets perhaps to fit your needs. On Thu, 23 Aug 2018 at 14:21, Michal Hocko <mhocko@suse.com> wrote: > On Thu 23-08-18 14:10:28, Vlastimil Babka wrote: > > On 08/22/2018 10:02 PM, Marinko Catovic wrote: > > >> It might be also interesting to do in the problematic state, instead > of > > >> dropping caches: > > >> > > >> - save snapshot of /proc/vmstat and /proc/pagetypeinfo > > >> - echo 1 > /proc/sys/vm/compact_memory > > >> - save new snapshot of /proc/vmstat and /proc/pagetypeinfo > > > > > > There was just a worstcase in progress, about 100MB/10GB were used, > > > super-low perfomance, but could not see any improvement there after > echo 1, > > > I watches this for about 3 minutes, the cache usage did not change. > > > > > > pagetypeinfo before echo https://pastebin.com/MjSgiMRL > > > pagetypeinfo 3min after echo https://pastebin.com/uWM6xGDd > > > > > > vmstat before echo https://pastebin.com/TjYSKNdE > > > vmstat 3min after echo https://pastebin.com/MqTibEKi > > > > OK, that confirms compaction is useless here. Thanks. > > > > It also shows that all orders except order-9 are in fact plentiful. > > Michal's earlier summary of the trace shows that most allocations are up > > to order-3 and should be fine, the exception is THP: > > > > 277 9 GFP_TRANSHUGE|__GFP_THISNODE > > But please note that this is not from the time when the page cache > dropped to the observed values. So we do not know what happened at the > time. > > Anyway 277 THP pages paging out such a large page cache amount would be > more than unexpected even for explicitly costly THP fault in methods. > -- > Michal Hocko > SUSE Labs > [-- Attachment #2: Type: text/html, Size: 4620 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-24 0:11 ` Marinko Catovic @ 2018-08-24 6:34 ` Vlastimil Babka 2018-08-24 8:11 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-08-24 6:34 UTC (permalink / raw) To: Marinko Catovic, Michal Hocko; +Cc: Christopher Lameter, linux-mm On 08/24/2018 02:11 AM, Marinko Catovic wrote: >> Hmm it's actually interesting to see GFP_TRANSHUGE there and not >> GFP_TRANSHUGE_LIGHT. What's your thp defrag setting? (cat >> /sys/kernel/mm/transparent_hugepage/enabled). Maybe it's set to >> "always", or there's a heavily faulting process that's using >> madvise(MADV_HUGEPAGE). If that's the case, setting it to "defer" or >> even "never" could be a workaround. > > cat /sys/kernel/mm/transparent_hugepage/enabled > always [madvise] never Hmm my mistake. I was actually interested in /sys/kernel/mm/transparent_hugepage/defrag > according to the docs this is the default >> "madvise" will enter direct reclaim like "always" but only for regions >> that are have used madvise(MADV_HUGEPAGE). This is the default behaviour. Yeah but that's about 'defrag'. For 'enabled', the default should be always. But it's a kernel config option I think? Let's see what you have for 'defrag'... > would any change there kick in immediately, even when in the 100M/10G case? If it's indeed preventing the cache from growing back, changing that should result in gradual increase. Note that it doesn't look probable that THP is the cause, but the trace didn't contain any other allocations that could be responsible for high-order direct reclaim. >> or there's a heavily faulting process that's using madvise(MADV_HUGEPAGE) > > are you suggesting that a/one process can cause this? > how would one be able to identify it..? should killing it allow the > cache to be > populated again instantly? if yes, then I could start killing all > processes on the > host until there is improvement to observe. 
It's not the process' fault, and killing it might disrupt the observation in unexpected ways. It's simpler to change the global setting to "never" to confirm or rule out this. Ah, checked the trace and it seems to be "php-cgi". Interesting that they use madvise(MADV_HUGEPAGE). Anyway the above still applies. > so far I can tell that it is not the database server, since restarting > it did not help at all. > > Please remember that, suggesting this, I can see how buffers (the 100MB > value) > are `oscillating`. When in the cache-useless state it jumps around > literally every second > from e.g. 100 to 102, then 99, 104, 85, 101, 105, 98, .. and so on, > where it always gets > closer from well-populated several GB in the beginning to those 100MB > over the days. > so doing anything that should cause an effect would be easily measurable > instantly, > which is to date only achieved by dropping caches. > > Please tell me if you need any measurements again, when or at what > state, with code > snippets perhaps to fit your needs. 1. Send the current value of /sys/kernel/mm/transparent_hugepage/defrag 2. Unless it's 'defer' or 'never' already, try changing it to 'defer'. Thanks. ^ permalink raw reply [flat|nested] 66+ messages in thread
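For step 1 above, the THP sysfs files report the active setting in brackets, e.g. `always defer defer+madvise [madvise] never`. Here is a small sketch that extracts just the bracketed value; the function name is mine, and the fallback sample string is a hypothetical value used only when the sysfs file is absent (e.g. on kernels without THP), so the sketch stays runnable anywhere.

```shell
# Sketch: print the active (bracketed) value of a THP sysfs file,
# e.g. thp_active defrag or thp_active enabled.
thp_active() {
    f="/sys/kernel/mm/transparent_hugepage/$1"
    if [ -r "$f" ]; then
        line=$(cat "$f")
    else
        # hypothetical fallback when the sysfs file does not exist
        line='always defer defer+madvise [madvise] never'
    fi
    # one token per line, then strip the brackets from the marked one
    printf '%s\n' "$line" | tr ' ' '\n' | sed -n 's/^\[\(.*\)\]$/\1/p'
}
```

With this, confirming the change asked for above is a one-liner: `echo defer > /sys/kernel/mm/transparent_hugepage/defrag && thp_active defrag` should print `defer`.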
* Re: Caching/buffers become useless after some time 2018-08-24 6:34 ` Vlastimil Babka @ 2018-08-24 8:11 ` Marinko Catovic 2018-08-24 8:36 ` Vlastimil Babka 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-08-24 8:11 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Michal Hocko, Christopher Lameter, linux-mm [-- Attachment #1: Type: text/plain, Size: 1891 bytes --] > > 1. Send the current value of /sys/kernel/mm/transparent_hugepage/defrag > 2. Unless it's 'defer' or 'never' already, try changing it to 'defer'. > /sys/kernel/mm/transparent_hugepage/defrag is always defer defer+madvise [madvise] never I *think* I already played around with these values; as far as I remember, `never` almost caused the system to hang, or at least it did until I switched back to madvise. Shall I switch it to defer and observe (all hosts are running fine just now), or switch to defer while it is in the bad state? And when doing this, should improvement be measurable immediately? I need to know how long to hold this before dropping caches becomes necessary. > Ah, checked the trace and it seems to be "php-cgi". Interesting that > they use madvise(MADV_HUGEPAGE). Anyway the above still applies. You know, that's at least an interesting hint. Look at this: https://ckon.wordpress.com/2015/09/18/php7-opcache-performance/ This was experimental there, but a more recent version seems to have it on by default, since I need to disable it on request (which implies to me that it is on by default). It is, however, *disabled* in the runtime configuration (and not in effect; I just confirmed that). It would be interesting to know whether madvise(MADV_HUGEPAGE) is then active somewhere else, since it is in the dump as you observed. Please note that `killing` php-cgi would not make any difference then, since these processes are started by request for every user and killed after whatever script is finished.
this may invoke about 10-50 forks, depending on load, (with different system users) every second. That also *may* explain why it is not so much deterministic (sometimes earlier/sooner, sometimes on one host and not on the other), since there are multiple php-cgi versions available and not everyone is using the same version - most people stick to legacy versions. [-- Attachment #2: Type: text/html, Size: 3153 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-24 8:11 ` Marinko Catovic @ 2018-08-24 8:36 ` Vlastimil Babka 2018-08-29 14:54 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-08-24 8:36 UTC (permalink / raw) To: Marinko Catovic; +Cc: Michal Hocko, Christopher Lameter, linux-mm On 08/24/2018 10:11 AM, Marinko Catovic wrote: > 1. Send the current value of /sys/kernel/mm/transparent_hugepage/defrag > 2. Unless it's 'defer' or 'never' already, try changing it to 'defer'. > > > A /sys/kernel/mm/transparent_hugepage/defrag is > always defer defer+madvise [madvise] never Yeah that's the default. > I *think* I already played around with these values, as far as I > remember `never` > almost caused the system to hang, or at least while I switched back to > madvise. That would be unexpected for the 'defrag' file, but maybe possible for 'enabled' file where mm structs are put on/removed from a list system-wide, AFAIK. > shall I switch it to defer and observe (all hosts are running fine by > just now) or > switch to defer while it is in the bad state? You could do it immediately and see if no problems appear for long enough, OTOH... > and when doing this, should improvement be measurable immediately? I would expect that. It would be a more direct proof that that was the cause. > I need to know how long to hold this, before dropping caches becomes > necessary. If it keeps oscillating and doesn't start growing, it means it didn't help. Few minutes should be enough. >> Ah, checked the trace and it seems to be "php-cgi". Interesting that >> they use madvise(MADV_HUGEPAGE). Anyway the above still applies. > > you know, that's at least an interesting hint. look at this: > https://ckon.wordpress.com/2015/09/18/php7-opcache-performance/ > > this was experimental there, but a more recent version seems to have it on > by default, since I need to disable it on request (implies to me that it > is on by default). 
> it is however *disabled* in the runtime configuration (and not in > effect, I just confirmed that) > > It would be interesting to know whether madvise(MADV_HUGEPAGE) is then > active > somewhere else, since it is in the dump as you observed. The trace points to php-cgi so either disabling it doesn't work, or they started using the madvise also for other stuff than opcache. But that doesn't matter, it would be kernel's fault if a program using the madvise would effectively kill the system like this. Let's just stick with the global 'defrag'='defer' change and not tweak several things at once. > Please note that `killing` php-cgi would not make any difference then, > since these processes > are started by request for every user and killed after whatever script > is finished. this may > invoke about 10-50 forks, depending on load, (with different system > users) every second. Yep. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-24 8:36 ` Vlastimil Babka @ 2018-08-29 14:54 ` Marinko Catovic 2018-08-29 15:01 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-08-29 14:54 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Michal Hocko, Christopher Lameter, linux-mm [-- Attachment #1: Type: text/plain, Size: 1034 bytes --] > > shall I switch it to defer and observe (all hosts are running fine by > > just now) or > > switch to defer while it is in the bad state? > > You could do it immediately and see if no problems appear for long > enough, OTOH... > Well, cat /sys/kernel/mm/transparent_hugepage/defrag always [defer] defer+madvise madvise never has been active since your reply; however, I can not tell that it helped. This was set on 2 hosts; one has 20GB of unused RAM now. Yesterday there was a similar picture for both, with several GB unused, one with up to 10GB; I just checked once, this is what I recall. Tell me if someone would like to log in remotely; I can set up TeamViewer or something for this at any time, just drop a message here and I'll contact you. I have hopes that one can investigate things even on the host that has 20GB unused; it's just a matter of time until this gets to the low values, and surely the problem has already kicked in there. Also, if the remote login is not an option, I'm always happy to provide whatever info you need. [-- Attachment #2: Type: text/html, Size: 1468 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-29 14:54 ` Marinko Catovic @ 2018-08-29 15:01 ` Michal Hocko 2018-08-29 15:13 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-08-29 15:01 UTC (permalink / raw) To: Marinko Catovic; +Cc: Vlastimil Babka, Christopher Lameter, linux-mm On Wed 29-08-18 16:54:32, Marinko Catovic wrote: [...] > Also if the remote login is not an option, I'm always happy to provide > whatever info you need. trace data which starts _before_ the cache dropdown starts and while it is decreasing should be the first step. Ideally along with /proc/vmstat gathered at the same time. I am pretty sure you have some high order memory consumer which forces the reclaim and we over reclaim. Last data was not really conclusive as it didn't really captured the dropdown IIRC. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-29 15:01 ` Michal Hocko @ 2018-08-29 15:13 ` Marinko Catovic 2018-08-29 15:27 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-08-29 15:13 UTC (permalink / raw) To: Michal Hocko; +Cc: Vlastimil Babka, Christopher Lameter, linux-mm [-- Attachment #1: Type: text/plain, Size: 755 bytes --] > > trace data which starts _before_ the cache dropdown starts and while it > is decreasing should be the first step. Ideally along with /proc/vmstat > gathered at the same time. I am pretty sure you have some high order > memory consumer which forces the reclaim and we over reclaim. Last data > was not really conclusive as it didn't really captured the dropdown > IIRC. > By 'before' do you mean in a totally healthy state? As I cannot tell when the decrease starts, this would mean collecting data over days, perhaps; however, I have no issue with that. As I do not want to miss anything that might help you, could you please provide the commands for all the data you require? One host is in a healthy state right now; I'd start that over there immediately. [-- Attachment #2: Type: text/html, Size: 1060 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-29 15:13 ` Marinko Catovic @ 2018-08-29 15:27 ` Michal Hocko 2018-08-29 16:44 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-08-29 15:27 UTC (permalink / raw) To: Marinko Catovic; +Cc: Vlastimil Babka, Christopher Lameter, linux-mm On Wed 29-08-18 17:13:59, Marinko Catovic wrote: > > > > trace data which starts _before_ the cache dropdown starts and while it > > is decreasing should be the first step. Ideally along with /proc/vmstat > > gathered at the same time. I am pretty sure you have some high order > > memory consumer which forces the reclaim and we over reclaim. Last data > > was not really conclusive as it didn't really captured the dropdown > > IIRC. > > > > with before you mean in a totally healthy state? yep > as I can not tell when decreasing starts this would mean collecting data > over days perhaps. however, I have no issue with that. yeah, you can pipe the trace buffer to gzip and reduce the output considerably. > As I do not want to miss anything that might help you, could you please > provide the commands for all the data you require? Use the same set of commands for tracing I have provided earlier, plus the compression:

cat /debug/trace/trace_pipe | gzip > file.gz

plus the loop to gather vmstat:

while true
do
	cp /proc/vmstat vmstat.$(date +%s)
	sleep 5s
done

> one host is at a healthy state right now, I'd run that over there immediately. Let's see what we can get from here. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
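[Editor's sketch of the gathering loop described above, wrapped as a function. The name `gather_vmstat` and the COUNT/INTERVAL knobs are made up for illustration; the thread runs the loop unbounded with a 5s sleep, and the trace_pipe path may differ depending on where tracefs is mounted.]

```shell
# gather_vmstat DIR COUNT INTERVAL -- snapshot /proc/vmstat COUNT times,
# INTERVAL seconds apart, into DIR. The thread runs this unbounded with
# a 5s interval; COUNT exists here only to make the sketch finite.
gather_vmstat() {
    dir=$1; count=$2; interval=$3
    mkdir -p "$dir"
    i=0
    while [ "$i" -lt "$count" ]; do
        cp /proc/vmstat "$dir/vmstat.$(date +%s).$i"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Alongside it, stream the compressed trace buffer (root required;
# tracefs is commonly mounted under /sys/kernel/debug/tracing):
#   cat /sys/kernel/debug/tracing/trace_pipe | gzip > trace_pipe.gz &
```

Something like `gather_vmstat /root/mm-debug 99999 5` approximates the unbounded loop from the thread.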
* Re: Caching/buffers become useless after some time 2018-08-29 15:27 ` Michal Hocko @ 2018-08-29 16:44 ` Marinko Catovic 2018-10-22 1:19 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-08-29 16:44 UTC (permalink / raw) To: Michal Hocko; +Cc: Vlastimil Babka, Christopher Lameter, linux-mm [-- Attachment #1: Type: text/plain, Size: 1029 bytes --] > > one host is at a healthy state right now, I'd run that over there > immediately. > > Let's see what we can get from here. > Oh well, that went fast. Actually, even with low values for buffers (around 100MB) and caches around 20G or so, the performance was nevertheless super low; I really had to drop the caches just now. This is the first time I have seen it happen with caches >10G, but hopefully this also provides a clue for you. Just after starting the stats I reset defrag from defer back to madvise - I suspect this somehow triggered the rapid reaction, since a few minutes later I saw the free RAM jump from 5GB to 10GB. After that I went afk, returning to the PC when my monitoring systems went crazy telling me about downtime. If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to its default, after it had been on defer for days, was a mistake, then please tell me. here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz [-- Attachment #2: Type: text/html, Size: 1702 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-08-29 16:44 ` Marinko Catovic @ 2018-10-22 1:19 ` Marinko Catovic 2018-10-23 17:41 ` Marinko Catovic ` (3 more replies) 0 siblings, 4 replies; 66+ messages in thread From: Marinko Catovic @ 2018-10-22 1:19 UTC (permalink / raw) To: Michal Hocko, linux-mm, Vlastimil Babka, Christopher Lameter On Wed, 29 Aug 2018 at 18:44, Marinko Catovic <marinko.catovic@gmail.com> wrote: > > >> > one host is at a healthy state right now, I'd run that over there immediately. >> >> Let's see what we can get from here. > > > oh well, that went fast. actually with having low values for buffers (around 100MB) with caches > around 20G or so, the performance was nevertheless super-low, I really had to drop > the caches right now. This is the first time I see it with caches >10G happening, but hopefully > this also provides a clue for you. > > Just after starting the stats I reset from previously defer to madvise - I suspect that this somehow > caused the rapid reaction, since a few minutes later I saw that the free RAM jumped from 5GB to 10GB, > after that I went afk, returning to the pc since my monitoring systems went crazy telling me about downtime. > > If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to its default, while it was > on defer now for days, was a mistake, then please tell me. > > here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz > trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz > There we go again. First of all, I had set up this monitoring on just 1 host; as a matter of fact it did not occur on that single one for days and weeks, so I set this up again on all the hosts, and it just happened again on another one. 
This issue is far from over, even after upgrading to the latest 4.18.12:

https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz

Please note: the trace_pipe is quite big, but it covers a full-RAM to unused-RAM transition within just ~24 hours; the measurements were initiated right after echo 3 > drop_caches and stopped when the RAM was unused aka re-used, after another echo 3 in the end. This issue has been alive for about half a year now; any suggestions, hints or solutions are greatly appreciated. Again, I cannot possibly be the only one experiencing this; I may just be among the few who actually notice it and are indeed suffering from very poor performance with lots of I/O on cache/buffers.

Also, I'd like to ask for a workaround until this is fixed someday: echo 3 > drop_caches can take a very long time when the host is busy with I/O in the background. According to some resources on the net, dropping caches keeps iterating until some lower threshold is reached, which becomes less and less likely when the host is really busy. Could someone point out which threshold this is, perhaps? I was thinking of e.g. mm/vmscan.c:

549 void drop_slab_node(int nid)
550 {
551         unsigned long freed;
552
553         do {
554                 struct mem_cgroup *memcg = NULL;
555
556                 freed = 0;
557                 do {
558                         freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
559                 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
560         } while (freed > 10);
561 }

..would it make sense to replace > 10 here with, for example, > 100? I could easily adjust this, or any other relevant threshold, since I compile the kernel in use. I'd just like drop_caches to be able to finish, as a workaround until this issue is fixed; as mentioned, it can take hours on a busy host, effectively hanging the host (with low performance), since buffers/caches are not used while drop_caches is set to 3, until the freeing is finished. 
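[Editor's note: the exit condition above can be sanity-checked with a toy shell model — purely illustrative, not kernel code; `drop_passes`, `refill` and the safety cap are made up. Each pass "frees" `refill` objects that concurrent I/O immediately recreates, and the loop exits once a pass frees no more than `threshold`.]

```shell
# Toy model of drop_slab_node()'s retry loop -- illustrative only, not
# kernel code. On a busy host every pass keeps finding ~refill freeable
# objects, so the loop only terminates once refill <= threshold (or a
# safety cap fires, standing in for "spins indefinitely").
drop_passes() {
    refill=$1; threshold=$2; cap=$3
    passes=0
    while :; do
        freed=$refill                            # each pass frees ~refill objects
        passes=$((passes + 1))
        [ "$freed" -le "$threshold" ] && break   # the (freed > N) exit condition
        [ "$passes" -ge "$cap" ] && break        # safety cap for the model
    done
    echo "$passes"
}
```

With refill=50, a threshold of 10 never lets the loop finish (the cap fires), while a threshold of 100 exits after a single pass — matching the intuition that raising the constant lets drop_caches terminate on a busy host, at the cost of leaving a bit more slab behind.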
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-10-22 1:19 ` Marinko Catovic @ 2018-10-23 17:41 ` Marinko Catovic 2018-10-26 5:48 ` Marinko Catovic 2018-10-26 8:01 ` Michal Hocko ` (2 subsequent siblings) 3 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-10-23 17:41 UTC (permalink / raw) To: Michal Hocko, linux-mm, Vlastimil Babka, Christopher Lameter Am Mo., 22. Okt. 2018 um 03:19 Uhr schrieb Marinko Catovic <marinko.catovic@gmail.com>: > > Am Mi., 29. Aug. 2018 um 18:44 Uhr schrieb Marinko Catovic > <marinko.catovic@gmail.com>: > > > > > >> > one host is at a healthy state right now, I'd run that over there immediately. > >> > >> Let's see what we can get from here. > > > > > > oh well, that went fast. actually with having low values for buffers (around 100MB) with caches > > around 20G or so, the performance was nevertheless super-low, I really had to drop > > the caches right now. This is the first time I see it with caches >10G happening, but hopefully > > this also provides a clue for you. > > > > Just after starting the stats I reset from previously defer to madvise - I suspect that this somehow > > caused the rapid reaction, since a few minutes later I saw that the free RAM jumped from 5GB to 10GB, > > after that I went afk, returning to the pc since my monitoring systems went crazy telling me about downtime. > > > > If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to its default, while it was > > on defer now for days, was a mistake, then please tell me. > > > > here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz > > trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz > > > > There we go again. > > First of all, I have set up this monitoring on 1 host, as a matter of > fact it did not occur on that single > one for days and weeks now, so I set this up again on all the hosts > and it just happened again on another one. 
> > This issue is far from over, even when upgrading to the latest 4.18.12 > > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz > > Please note: the trace_pipe is quite big in size, but it covers a > full-RAM to unused-RAM within just ~24 hours, > the measurements were initiated right after echo 3 > drop_caches and > stopped when the RAM was unused > aka re-used after another echo 3 in the end. > > This issue is alive for about half a year now, any suggestions, hints > or solutions are greatly appreciated, > again, I can not possibly be the only one experiencing this, I just > may be among the few ones who actually > notice this and are indeed suffering from very poor performance with > lots of I/O on cache/buffers. > > Also, I'd like to ask for a workaround until this is fixed someday: > echo 3 > drop_caches can take a very > long time when the host is busy with I/O in the background. According > to some resources in the net I discovered > that dropping caches operates until some lower threshold is reached, > which is less and less likely, when the > host is really busy. Could one point out what threshold this is perhaps? > I was thinking of e.g. mm/vmscan.c > > 549 void drop_slab_node(int nid) > 550 { > 551 unsigned long freed; > 552 > 553 do { > 554 struct mem_cgroup *memcg = NULL; > 555 > 556 freed = 0; > 557 do { > 558 freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); > 559 } while ((memcg = mem_cgroup_iter(NULL, memcg, > NULL)) != NULL); > 560 } while (freed > 10); > 561 } > > ..would it make sense to increase > 10 here with, for example, > 100 ? > I could easily adjust this, or any other relevant threshold, since I > am compiling the kernel in use. 
> > I'd just like it to be able to finish dropping caches to achieve the > workaround here until this issue is fixed, > which as mentioned, can take hours on a busy host, causing the host to > hang (having low performance) since > buffers/caches are not used at that time while drop_caches is being > set to 3, until that freeing up is finished. by the way, it seems to happen on the one mentioned host on a daily basis now, like dropping to 100M/10G every 24 hours, so it is actually a lot easier now to capture relevant data/stats, since it occurs again and again right now. strangely, other hosts are currently not affected for days. So if there is anything you need to know, beside the vmstat and trace_pipe files, please let me know. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-10-23 17:41 ` Marinko Catovic @ 2018-10-26 5:48 ` Marinko Catovic 0 siblings, 0 replies; 66+ messages in thread From: Marinko Catovic @ 2018-10-26 5:48 UTC (permalink / raw) To: Michal Hocko, linux-mm, Vlastimil Babka, Christopher Lameter Am Di., 23. Okt. 2018 um 19:41 Uhr schrieb Marinko Catovic <marinko.catovic@gmail.com>: > > Am Mo., 22. Okt. 2018 um 03:19 Uhr schrieb Marinko Catovic > <marinko.catovic@gmail.com>: > > > > Am Mi., 29. Aug. 2018 um 18:44 Uhr schrieb Marinko Catovic > > <marinko.catovic@gmail.com>: > > > > > > > > >> > one host is at a healthy state right now, I'd run that over there immediately. > > >> > > >> Let's see what we can get from here. > > > > > > > > > oh well, that went fast. actually with having low values for buffers (around 100MB) with caches > > > around 20G or so, the performance was nevertheless super-low, I really had to drop > > > the caches right now. This is the first time I see it with caches >10G happening, but hopefully > > > this also provides a clue for you. > > > > > > Just after starting the stats I reset from previously defer to madvise - I suspect that this somehow > > > caused the rapid reaction, since a few minutes later I saw that the free RAM jumped from 5GB to 10GB, > > > after that I went afk, returning to the pc since my monitoring systems went crazy telling me about downtime. > > > > > > If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to its default, while it was > > > on defer now for days, was a mistake, then please tell me. > > > > > > here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz > > > trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz > > > > > > > There we go again. > > > > First of all, I have set up this monitoring on 1 host, as a matter of > > fact it did not occur on that single > > one for days and weeks now, so I set this up again on all the hosts > > and it just happened again on another one. 
> > > > This issue is far from over, even when upgrading to the latest 4.18.12 > > > > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip > > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz > > > > Please note: the trace_pipe is quite big in size, but it covers a > > full-RAM to unused-RAM within just ~24 hours, > > the measurements were initiated right after echo 3 > drop_caches and > > stopped when the RAM was unused > > aka re-used after another echo 3 in the end. > > > > This issue is alive for about half a year now, any suggestions, hints > > or solutions are greatly appreciated, > > again, I can not possibly be the only one experiencing this, I just > > may be among the few ones who actually > > notice this and are indeed suffering from very poor performance with > > lots of I/O on cache/buffers. > > > > Also, I'd like to ask for a workaround until this is fixed someday: > > echo 3 > drop_caches can take a very > > long time when the host is busy with I/O in the background. According > > to some resources in the net I discovered > > that dropping caches operates until some lower threshold is reached, > > which is less and less likely, when the > > host is really busy. Could one point out what threshold this is perhaps? > > I was thinking of e.g. mm/vmscan.c > > > > 549 void drop_slab_node(int nid) > > 550 { > > 551 unsigned long freed; > > 552 > > 553 do { > > 554 struct mem_cgroup *memcg = NULL; > > 555 > > 556 freed = 0; > > 557 do { > > 558 freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); > > 559 } while ((memcg = mem_cgroup_iter(NULL, memcg, > > NULL)) != NULL); > > 560 } while (freed > 10); > > 561 } > > > > ..would it make sense to increase > 10 here with, for example, > 100 ? > > I could easily adjust this, or any other relevant threshold, since I > > am compiling the kernel in use. 
> > I'd just like it to be able to finish dropping caches to achieve the > > workaround here until this issue is fixed, > > which as mentioned, can take hours on a busy host, causing the host to > > hang (having low performance) since > > buffers/caches are not used at that time while drop_caches is being > > set to 3, until that freeing up is finished. > > by the way, it seems to happen on the one mentioned host on a daily > basis now, like dropping > to 100M/10G every 24 hours, so it is actually a lot easier now to > capture relevant data/stats, since > it occurs again and again right now. > > strangely, other hosts are currently not affected for days. > So if there is anything you need to know, beside the vmstat and > trace_pipe files, please let me know. As it happened again now for the 2nd time within 2 days, mainly on the very same host I mentioned before and covered by the reports in my previous reply, I just wanted to point out something I observed: earlier I stated that the buffers were really low and the caches as well - however, I have now seen for the second or third time that this applies to buffers far more significantly than to caches. As an example: 50MB of buffers were in use, yet 10GB for caches, still leaving around 20GB of RAM totally unused. Note: buffers/caches were surely around 5GB/35GB in the healthy state before, so both are still getting lower. The performance dropped so much that all services on the host basically stopped working, since there was so much I/O wait, again. I tried to summarize the file contents people asked me to post, so besides the trace_pipe and vmstat folder from my previous post, here is another set, taken while in the 50MB-buffers state:

cat /proc/pagetypeinfo https://pastebin.com/W1sJscsZ
cat /proc/slabinfo https://pastebin.com/9ZPU3q7X
cat /proc/zoneinfo https://pastebin.com/RMTwtXGr

Hopefully you can read something from this. As always, feel free to ask for whatever info you'd like me to share. 
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-10-22 1:19 ` Marinko Catovic 2018-10-23 17:41 ` Marinko Catovic @ 2018-10-26 8:01 ` Michal Hocko 2018-10-26 23:31 ` Marinko Catovic [not found] ` <6e3a9434-32f2-0388-e0c7-2bd1c2ebc8b1@suse.cz> 2018-10-31 13:12 ` Vlastimil Babka 3 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-10-26 8:01 UTC (permalink / raw) To: Marinko Catovic; +Cc: linux-mm, Vlastimil Babka, Christopher Lameter Sorry for late reply. Busy as always... On Mon 22-10-18 03:19:57, Marinko Catovic wrote: [...] > There we go again. > > First of all, I have set up this monitoring on 1 host, as a matter of > fact it did not occur on that single > one for days and weeks now, so I set this up again on all the hosts > and it just happened again on another one. > > This issue is far from over, even when upgrading to the latest 4.18.12 > > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz I cannot download these. I am getting an invalid certificate and 403 when ignoring it [...] > Also, I'd like to ask for a workaround until this is fixed someday: > echo 3 > drop_caches can take a very > long time when the host is busy with I/O in the background. According > to some resources in the net I discovered > that dropping caches operates until some lower threshold is reached, > which is less and less likely, when the > host is really busy. Could one point out what threshold this is perhaps? > I was thinking of e.g. mm/vmscan.c > > 549 void drop_slab_node(int nid) > 550 { > 551 unsigned long freed; > 552 > 553 do { > 554 struct mem_cgroup *memcg = NULL; > 555 > 556 freed = 0; > 557 do { > 558 freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); > 559 } while ((memcg = mem_cgroup_iter(NULL, memcg, > NULL)) != NULL); > 560 } while (freed > 10); > 561 } > > ..would it make sense to increase > 10 here with, for example, > 100 ? 
> I could easily adjust this, or any other relevant threshold, since I > am compiling the kernel in use. > > I'd just like it to be able to finish dropping caches to achieve the > workaround here until this issue is fixed, > which as mentioned, can take hours on a busy host, causing the host to > hang (having low performance) since > buffers/caches are not used at that time while drop_caches is being > set to 3, until that freeing up is finished. This is worth a separate discussion. Please start a new email thread. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-10-26 8:01 ` Michal Hocko @ 2018-10-26 23:31 ` Marinko Catovic 2018-10-27 6:42 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-10-26 23:31 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, Vlastimil Babka, Christopher Lameter On Fri, 26 Oct 2018 at 10:02, Michal Hocko <mhocko@suse.com> wrote: > > Sorry for late reply. Busy as always... > > On Mon 22-10-18 03:19:57, Marinko Catovic wrote: > [...] > > There we go again. > > > > First of all, I have set up this monitoring on 1 host, as a matter of > > fact it did not occur on that single > > one for days and weeks now, so I set this up again on all the hosts > > and it just happened again on another one. > > > > This issue is far from over, even when upgrading to the latest 4.18.12 > > > > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip > > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz > > I cannot download these. I am getting an invalid certificate and > 403 when ignoring it Are you sure about that? I can download both just fine, with different browsers; the cert seems fine, no 403 there. > This is worth a separate discussion. Please start a new email thread. I was merely looking for a quick hotfix in the meantime, and also wondering why '10' is hardcoded ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-10-26 23:31 ` Marinko Catovic @ 2018-10-27 6:42 ` Michal Hocko 0 siblings, 0 replies; 66+ messages in thread From: Michal Hocko @ 2018-10-27 6:42 UTC (permalink / raw) To: Marinko Catovic; +Cc: linux-mm, Vlastimil Babka, Christopher Lameter On Sat 27-10-18 01:31:05, Marinko Catovic wrote: > On Fri, 26 Oct 2018 at 10:02, Michal Hocko <mhocko@suse.com> wrote: > > > > Sorry for late reply. Busy as always... > > > > On Mon 22-10-18 03:19:57, Marinko Catovic wrote: > > [...] > > > There we go again. > > > > > > First of all, I have set up this monitoring on 1 host, as a matter of > > > fact it did not occur on that single > > > one for days and weeks now, so I set this up again on all the hosts > > > and it just happened again on another one. > > > > > > This issue is far from over, even when upgrading to the latest 4.18.12 > > > > > > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip > > > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz > > > > I cannot download these. I am getting an invalid certificate and > > 403 when ignoring it > > are you sure about that? I can download both just fine, different > browsers, the cert seems fine, no 403 there. Interesting. It works now from my home network. Something must have been fishy in the office network when I tried the same thing. I have it now. Will have a look on Monday at the earliest. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <6e3a9434-32f2-0388-e0c7-2bd1c2ebc8b1@suse.cz>]
* Re: Caching/buffers become useless after some time [not found] ` <6e3a9434-32f2-0388-e0c7-2bd1c2ebc8b1@suse.cz> @ 2018-10-30 15:30 ` Michal Hocko 2018-10-30 16:08 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-10-30 15:30 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Marinko Catovic, linux-mm, Christopher Lameter On Tue 30-10-18 14:44:27, Vlastimil Babka wrote: > On 10/22/18 3:19 AM, Marinko Catovic wrote: > > Am Mi., 29. Aug. 2018 um 18:44 Uhr schrieb Marinko Catovic [...] > >> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz > >> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz > >> > > > > There we go again. > > > > First of all, I have set up this monitoring on 1 host, as a matter of > > fact it did not occur on that single > > one for days and weeks now, so I set this up again on all the hosts > > and it just happened again on another one. > > > > This issue is far from over, even when upgrading to the latest 4.18.12 > > > > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip > > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz > > I have plotted the vmstat using the attached script, and got the attached > plots. The X axis shows the vmstat snapshots, almost 14k of them, each for 5 > seconds, so almost 19 hours. I can see the following phases: Thanks a lot. I like the script very much! [...] > 12000 - end: > - free pages growing sharply > - page cache declining sharply > - slab still slowly declining

$ cat filter
pgfree
pgsteal_
pgscan_
compact
nr_free_pages

$ grep -f filter -h vmstat.1539866837 vmstat.1539874353 | awk '{if (c[$1]) {printf "%s %d\n", $1, $2-c[$1]}; c[$1]=$2}'
nr_free_pages 4216371
pgfree 267884025
pgsteal_kswapd 0
pgsteal_direct 11890416
pgscan_kswapd 0
pgscan_direct 11937805
compact_migrate_scanned 2197060121
compact_free_scanned 4747491606
compact_isolated 54281848
compact_stall 1797
compact_fail 1721
compact_success 76

So we have ended up with 16G of freed pages in that last time period. 
Kswapd was sleeping throughout the time but direct reclaim was quite active. ~46GB of pages recycled. Note that many more pages were freed, which suggests there was quite a lot of memory allocation/free activity.

One notable thing here is that there shouldn't be any reason to do direct reclaim when kswapd itself doesn't do anything. Either it could be blocked on something - though I find it quite surprising to see it in that state for the whole 1500s time period - or we are simply not low on free memory at all. That would point towards compaction-triggered memory reclaim, which accounts as direct reclaim as well. Direct compaction triggered more than once a second on average. We shouldn't really reclaim unless we are low on memory, but repeatedly failing compaction could just add up and reclaim a lot in the end. There seem to be quite a lot of low-order requests as per your trace buffer:

$ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
   1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
  93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

We can safely rule out NOWAIT and ATOMIC because those do not reclaim. That leaves us with:

   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

By and large, the kernel stack allocations are in the lead. You can get some relief by enabling CONFIG_VMAP_STACK. There is also a notable number of THP page allocations. Just curious: are you running on a NUMA machine? If yes, [1] might be relevant. Other than that nothing really jumped out at me. [1] http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
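[Editor's sketch: the snapshot-diffing one-liner above can be wrapped into a small helper. `vmstat_delta` is a made-up name, and unlike the one-liner in the message it diffs exactly two snapshot files rather than streaming.]

```shell
# vmstat_delta FIRST SECOND -- for every counter present in both
# /proc/vmstat snapshots, print the difference SECOND - FIRST.
vmstat_delta() {
    awk 'NR == FNR { first[$1] = $2; next }                 # remember first snapshot
         $1 in first { printf "%s %d\n", $1, $2 - first[$1] }' "$1" "$2"
}
```

It pairs with the filter file from the message, e.g. `vmstat_delta vmstat.1539866837 vmstat.1539874353 | grep -f filter`.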
* Re: Caching/buffers become useless after some time 2018-10-30 15:30 ` Michal Hocko @ 2018-10-30 16:08 ` Marinko Catovic 2018-10-30 17:00 ` Vlastimil Babka 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-10-30 16:08 UTC (permalink / raw) To: Michal Hocko; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter Am Di., 30. Okt. 2018 um 16:30 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > On Tue 30-10-18 14:44:27, Vlastimil Babka wrote: > > On 10/22/18 3:19 AM, Marinko Catovic wrote: > > > Am Mi., 29. Aug. 2018 um 18:44 Uhr schrieb Marinko Catovic > [...] > > >> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz > > >> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz > > >> > > > > > > There we go again. > > > > > > First of all, I have set up this monitoring on 1 host, as a matter of > > > fact it did not occur on that single > > > one for days and weeks now, so I set this up again on all the hosts > > > and it just happened again on another one. > > > > > > This issue is far from over, even when upgrading to the latest 4.18.12 > > > > > > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip > > > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz > > > > I have plot the vmstat using the attached script, and got the attached > > plots. X axis are the vmstat snapshots, almost 14k of them, each for 5 > > seconds, so almost 19 hours. I can see the following phases: > > Thanks a lot. I like the script much! > > [...] 
> > > 12000 - end: > > - free pages growing sharply > > - page cache declining sharply > > - slab still slowly declining > > $ cat filter > pgfree > pgsteal_ > pgscan_ > compact > nr_free_pages > > $ grep -f filter -h vmstat.1539866837 vmstat.1539874353 | awk '{if (c[$1]) {printf "%s %d\n", $1, $2-c[$1]}; c[$1]=$2}' > nr_free_pages 4216371 > pgfree 267884025 > pgsteal_kswapd 0 > pgsteal_direct 11890416 > pgscan_kswapd 0 > pgscan_direct 11937805 > compact_migrate_scanned 2197060121 > compact_free_scanned 4747491606 > compact_isolated 54281848 > compact_stall 1797 > compact_fail 1721 > compact_success 76 > > So we have ended up with 16G freed pages in that last time period. > Kswapd was sleeping throughout the time but direct reclaim was quite > active. ~46GB pages recycled. Note that much more pages were freed which > suggests there was quite a large memory allocation/free activity. > > One notable thing here is that there shouldn't be any reason to do the > direct reclaim when kswapd itself doesn't do anything. It could be > either blocked on something but I find it quite surprising to see it in > that state for the whole 1500s time period or we are simply not low on > free memory at all. That would point towards compaction triggered memory > reclaim which account as the direct reclaim as well. The direct > compaction triggered more than once a second in average. We shouldn't > really reclaim unless we are low on memory but repeatedly failing > compaction could just add up and reclaim a lot in the end. 
There seem to > be quite a lot of low-order requests as per your trace buffer > > $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c > 1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE > 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO > 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT > 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE > 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT > 93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC > 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT > 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP > 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE > > We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
> That leaves us with > 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE > 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO > 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE > 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE > > By and large, the kernel stack allocations are in the lead. You can get some > relief by enabling CONFIG_VMAP_STACK. There is also a notable number of > THP page allocations. Just curious, are you running on a NUMA machine? > If yes, [1] might be relevant. Other than that nothing really jumped at > me. > > [1] http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org > -- > Michal Hocko > SUSE Labs thanks a lot Vlastimil! I would not really know whether this is a NUMA machine; it is some usual server running with an i7-8700 and ECC RAM. How would I find out? So I should set CONFIG_VMAP_STACK=y and try that..? ^ permalink raw reply [flat|nested] 66+ messages in thread
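For readers who want to repeat the snapshot-delta computation from the grep/awk pipeline above on their own data, the same thing can be expressed as a small standalone script. This is a minimal sketch, assuming plain `name value` snapshot files in the /proc/vmstat format; the counter prefixes and filenames from the thread are just examples, not a fixed interface:

```python
# Sketch: diff two /proc/vmstat-style snapshots, like the awk one-liner above.
PREFIXES = ("pgfree", "pgsteal_", "pgscan_", "compact", "nr_free_pages")

def parse_vmstat(text):
    """Parse 'name value' lines into a dict of ints, skipping malformed lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def vmstat_delta(old_text, new_text, prefixes=PREFIXES):
    """Return counter deltas (new - old) for counters matching any prefix."""
    old, new = parse_vmstat(old_text), parse_vmstat(new_text)
    return {k: new[k] - old[k] for k in new
            if k in old and k.startswith(prefixes)}
```

Against real files this would be called as `vmstat_delta(open("vmstat.1539866837").read(), open("vmstat.1539874353").read())`.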
* Re: Caching/buffers become useless after some time 2018-10-30 16:08 ` Marinko Catovic @ 2018-10-30 17:00 ` Vlastimil Babka 2018-10-30 18:26 ` Marinko Catovic ` (2 more replies) 0 siblings, 3 replies; 66+ messages in thread From: Vlastimil Babka @ 2018-10-30 17:00 UTC (permalink / raw) To: Marinko Catovic, Michal Hocko; +Cc: linux-mm, Christopher Lameter On 10/30/18 5:08 PM, Marinko Catovic wrote: >> One notable thing here is that there shouldn't be any reason to do >> direct reclaim when kswapd itself doesn't do anything. It could be >> blocked on something, but I find it quite surprising to see it in >> that state for the whole 1500s time period; or we are simply not low on >> free memory at all. That would point towards compaction-triggered memory >> reclaim, which accounts as direct reclaim as well. Direct >> compaction triggered more than once a second on average. We shouldn't >> really reclaim unless we are low on memory, but repeatedly failing >> compaction could just add up and reclaim a lot in the end. There seem to >> be quite a lot of low-order requests as per your trace buffer I realized that the fact that slabs grew so large might be very relevant. It means a lot of unmovable pages, and while they are slowly being freed, the remaining ones are scattered all over the memory, making it impossible to successfully compact until the slabs are almost *completely* freed. It's in fact the theoretical worst case scenario for compaction and fragmentation avoidance. Next time it would be nice to also gather /proc/pagetypeinfo and /proc/slabinfo to see what grew so much there (probably dentries and inodes). The question is why the problems happened only some time after the unmovable pollution. The trace showed me that the structure of allocations wrt order+flags, as Michal breaks them down below, is not significantly different in the last phase than in the whole trace.
Possibly the state of memory gradually changed so that the various heuristics (fragindex, pageblock skip bits etc) resulted in compaction being tried more than initially, eventually hitting a very bad corner case. >> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c >> 1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE >> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE >> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE >> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO >> 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT >> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE >> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE >> 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT >> 93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC >> 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT >> 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP >> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE >> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE >> 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE >> >> We can safely rule out NOWAIT and ATOMIC because those do not reclaim. 
>> That leaves us with >> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE >> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE >> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO I suspect there are lots of short-lived processes, so these are probably rapidly recycled and not causing compaction. It also seems to be pgd allocation (2 pages due to PTI) not kernel stack? >> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE >> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE >> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE >> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE >> 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE I would again suspect those. IIRC we already confirmed earlier that THP defrag setting is madvise or madvise+defer, and there are madvise(MADV_HUGEPAGE) using processes? Did you ever try changing defrag to plain 'defer'? >> >> by large the kernel stack allocations are in lead. You can put some >> relief by enabling CONFIG_VMAP_STACK. There is alos a notable number of >> THP pages allocations. Just curious are you running on a NUMA machine? >> If yes [1] might be relevant. Other than that nothing really jumped at >> me. > thanks a lot Vlastimil! And Michal :) > I would not really know whether this is a NUMA, it is some usual > server running with a i7-8700 > and ECC RAM. How would I find out? Please provide /proc/zoneinfo and we'll see. > So I should do CONFIG_VMAP_STACK=y and try that..? I suspect you already have it. ^ permalink raw reply [flat|nested] 66+ messages in thread
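Since the suggestion above is to capture /proc/slabinfo next time to see which caches grew (probably dentries and inodes), a quick way to rank them is to multiply object count by object size. A rough sketch, assuming the slabinfo 2.x column layout (`name <active_objs> <num_objs> <objsize> ...`); the helper name is made up for illustration:

```python
# Sketch: rank /proc/slabinfo caches by approximate memory use.
def top_slabs(slabinfo_text, n=5):
    """Return [(cache_name, approx_bytes)] for the n largest caches."""
    rows = []
    for line in slabinfo_text.splitlines():
        if line.startswith(("slabinfo", "#")):
            continue  # skip the version header and the column legend
        f = line.split()
        if len(f) < 4:
            continue
        # f[2] = num_objs (allocated slots), f[3] = objsize in bytes
        name, num_objs, objsize = f[0], int(f[2]), int(f[3])
        rows.append((name, num_objs * objsize))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]
```

Note this counts allocated slots, so it includes the internal fragmentation mentioned above (active_objs in f[1] can be much lower than num_objs).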
* Re: Caching/buffers become useless after some time 2018-10-30 17:00 ` Vlastimil Babka @ 2018-10-30 18:26 ` Marinko Catovic 2018-10-31 7:34 ` Michal Hocko 2018-10-31 7:32 ` Michal Hocko 2018-10-31 13:40 ` Vlastimil Babka 2 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-10-30 18:26 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter Am Di., 30. Okt. 2018 um 18:03 Uhr schrieb Vlastimil Babka <vbabka@suse.cz>: > > On 10/30/18 5:08 PM, Marinko Catovic wrote: > >> One notable thing here is that there shouldn't be any reason to do the > >> direct reclaim when kswapd itself doesn't do anything. It could be > >> either blocked on something but I find it quite surprising to see it in > >> that state for the whole 1500s time period or we are simply not low on > >> free memory at all. That would point towards compaction triggered memory > >> reclaim which account as the direct reclaim as well. The direct > >> compaction triggered more than once a second in average. We shouldn't > >> really reclaim unless we are low on memory but repeatedly failing > >> compaction could just add up and reclaim a lot in the end. There seem to > >> be quite a lot of low order request as per your trace buffer > > I realized that the fact that slabs grew so large might be very > relevant. It means a lot of unmovable pages, and while they are slowly > being freed, the remaining are scattered all over the memory, making it > impossible to successfully compact, until the slabs are almost > *completely* freed. It's in fact the theoretical worst case scenario for > compaction and fragmentation avoidance. Next time it would be nice to > also gather /proc/pagetypeinfo, and /proc/slabinfo to see what grew so > much there (probably dentries and inodes). how would you like the results? as a job collecting those from 3 > drop_caches until worst case, which may be 24 hours every 5 seconds, or at what point in time? 
Please note that I already provided them (see my response before) as a one-time snapshot while being in the worst case; cat /proc/pagetypeinfo https://pastebin.com/W1sJscsZ cat /proc/slabinfo https://pastebin.com/9ZPU3q7X > The question is why the problems happened some time later after the > unmovable pollution. The trace showed me that the structure of > allocations wrt order+flags as Michal breaks them down below, is not > significanly different in the last phase than in the whole trace. > Possibly the state of memory gradually changed so that the various > heuristics (fragindex, pageblock skip bits etc) resulted in compaction > being tried more than initially, eventually hitting a very bad corner case. > > >> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c > >> 1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > >> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > >> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE > >> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO > >> 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT > >> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE > >> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > >> 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT > >> 93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC > >> 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT > >> 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP > >> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > >> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > >> 
67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE > >> > >> We can safely rule out NOWAIT and ATOMIC because those do not reclaim. > >> That leaves us with > >> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > >> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE > >> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO > > I suspect there are lots of short-lived processes, so these are probably > rapidly recycled and not causing compaction. Well yes, since it is about shared hosting there are lots of users, running lots of scripts, perhaps 5-50 new forks and kills every second, depending on load, hard to tell. > It also seems to be pgd allocation (2 pages due to PTI) not kernel stack? plain english, please? :) > >> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE > >> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE > >> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > >> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE > >> 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE > > I would again suspect those. IIRC we already confirmed earlier that THP > defrag setting is madvise or madvise+defer, and there are > madvise(MADV_HUGEPAGE) using processes? Did you ever try changing defrag > to plain 'defer'? Yes, I think I mentioned this before. AFAIK it did not make (immediate) changes, madvise is the current type. > and there are madvise(MADV_HUGEPAGE) using processes? Can't tell you that.. > >> > >> by large the kernel stack allocations are in lead. You can put some > >> relief by enabling CONFIG_VMAP_STACK. There is alos a notable number of > >> THP pages allocations. Just curious are you running on a NUMA machine? > >> If yes [1] might be relevant. Other than that nothing really jumped at > >> me. 
> > > > thanks a lot Vlastimil! > > And Michal :) > > > I would not really know whether this is a NUMA, it is some usual > > server running with a i7-8700 > > and ECC RAM. How would I find out? > > Please provide /proc/zoneinfo and we'll see. there you go: cat /proc/zoneinfo https://pastebin.com/RMTwtXGr > > So I should do CONFIG_VMAP_STACK=y and try that..? > > I suspect you already have it. Yes true, the currently loaded kernel is with =y there. ^ permalink raw reply [flat|nested] 66+ messages in thread
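As an aside on the NUMA question: /proc/zoneinfo itself answers it, since every zone is announced with a `Node N, zone NAME` header, so counting the distinct node IDs is enough. A minimal sketch under that assumption:

```python
import re

def numa_nodes(zoneinfo_text):
    """Count distinct NUMA nodes from 'Node N, zone NAME' headers."""
    return len(set(re.findall(r"^Node\s+(\d+),", zoneinfo_text, re.M)))
```

A result of 1 means a single-node (non-NUMA) machine, which is what Michal reads off the pastebin in the next message.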
* Re: Caching/buffers become useless after some time 2018-10-30 18:26 ` Marinko Catovic @ 2018-10-31 7:34 ` Michal Hocko 0 siblings, 0 replies; 66+ messages in thread From: Michal Hocko @ 2018-10-31 7:34 UTC (permalink / raw) To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter On Tue 30-10-18 19:26:32, Marinko Catovic wrote: [...] > > > I would not really know whether this is a NUMA, it is some usual > > > server running with a i7-8700 > > > and ECC RAM. How would I find out? > > > > Please provide /proc/zoneinfo and we'll see. > > there you go: cat /proc/zoneinfo https://pastebin.com/RMTwtXGr Nope, a single node machine so no NUMA. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-10-30 17:00 ` Vlastimil Babka 2018-10-30 18:26 ` Marinko Catovic @ 2018-10-31 7:32 ` Michal Hocko 2018-10-31 13:40 ` Vlastimil Babka 2 siblings, 0 replies; 66+ messages in thread From: Michal Hocko @ 2018-10-31 7:32 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Marinko Catovic, linux-mm, Christopher Lameter On Tue 30-10-18 18:00:23, Vlastimil Babka wrote: [...] > I suspect there are lots of short-lived processes, so these are probably > rapidly recycled and not causing compaction. It also seems to be pgd > allocation (2 pages due to PTI) not kernel stack? I guess you are right. I have misread order=2 yesterday. order=1 stack would be quite unexpected. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-10-30 17:00 ` Vlastimil Babka 2018-10-30 18:26 ` Marinko Catovic 2018-10-31 7:32 ` Michal Hocko @ 2018-10-31 13:40 ` Vlastimil Babka 2018-10-31 14:53 ` Marinko Catovic 2 siblings, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-10-31 13:40 UTC (permalink / raw) To: Marinko Catovic, Michal Hocko; +Cc: linux-mm, Christopher Lameter On 10/30/18 6:00 PM, Vlastimil Babka wrote: > On 10/30/18 5:08 PM, Marinko Catovic wrote: >>> One notable thing here is that there shouldn't be any reason to do the >>> direct reclaim when kswapd itself doesn't do anything. It could be >>> either blocked on something but I find it quite surprising to see it in >>> that state for the whole 1500s time period or we are simply not low on >>> free memory at all. That would point towards compaction triggered memory >>> reclaim which account as the direct reclaim as well. The direct >>> compaction triggered more than once a second in average. We shouldn't >>> really reclaim unless we are low on memory but repeatedly failing >>> compaction could just add up and reclaim a lot in the end. There seem to >>> be quite a lot of low order request as per your trace buffer > > I realized that the fact that slabs grew so large might be very > relevant. It means a lot of unmovable pages, and while they are slowly > being freed, the remaining are scattered all over the memory, making it > impossible to successfully compact, until the slabs are almost > *completely* freed. It's in fact the theoretical worst case scenario for > compaction and fragmentation avoidance. Next time it would be nice to > also gather /proc/pagetypeinfo, and /proc/slabinfo to see what grew so > much there (probably dentries and inodes). I went through the whole thread again as it was spread over months, and finally connected some dots. 
In one mail you said: > There is one thing I forgot to mention: the hosts perform find and du (I mean the commands, finding files and disk usage) > on the HDDs every night, starting from 00:20 AM up until in the morning 07:45 AM, for maintenance and stats. The timespan above roughly matches the phase where reclaimable slab grows (samples 2000-6000, at 5 seconds each, is roughly 5.5 hours). The find will fetch a lot of metadata in dentries, inodes etc. which are part of reclaimable slabs. In another mail you posted a slabinfo https://pastebin.com/81QAFgke in the phase where it's already being slowly reclaimed, but still occupies 6.5GB, and mostly it's ext4_inode_cache and dentry cache (also very much internally fragmented). In another mail I suggested that maybe fragmentation happened because the slab filled up much more at some point, and I think we now have that solidly confirmed from the vmstat plots. I think one workaround is for you to perform echo 2 > drop_caches (not 3) right after the find/du maintenance finishes. At that point you don't have too much page cache anyway, since the slabs have pushed it out. It's also overnight so there are not many users yet? Alternatively the find/du could run in a memcg limiting its slab use. Michal would know the details. Long term we should do something about these slab objects that are only used briefly (once?) so there's no point in caching them and letting the cache grow like this.
Why do we overreclaim that much? If we can trust one of the older pagetypeinfo snapshots https://pastebin.com/6QWEZagL then of those below, only the THP allocations should need reclaim/compaction. Maybe the order-7 ones as well, but there are just a few of those and they are __GFP_NORETRY. Maybe enable also tracing events (in addition to page alloc) compaction/mm_compaction_try_to_compact_pages and compaction/mm_compaction_suitable? >>> We can safely rule out NOWAIT and ATOMIC because those do not reclaim. >>> That leaves us with >>> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE >>> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE >>> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE >>> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO > > I suspect there are lots of short-lived processes, so these are probably > rapidly recycled and not causing compaction. It also seems to be pgd > allocation (2 pages due to PTI) not kernel stack? > >>> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE >>> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE >>> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE >>> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE >>> 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE > > I would again suspect those. IIRC we already confirmed earlier that THP > defrag setting is madvise or madvise+defer, and there are > madvise(MADV_HUGEPAGE) using processes? Did you ever try changing defrag > to plain 'defer'? > >>> >>> by large the kernel stack allocations are in lead. You can put some >>> relief by enabling CONFIG_VMAP_STACK. There is alos a notable number of >>> THP pages allocations. Just curious are you running on a NUMA machine? >>> If yes [1] might be relevant. Other than that nothing really jumped at >>> me. 
> > >> thanks a lot Vlastimil! > > And Michal :) > >> I would not really know whether this is a NUMA, it is some usual >> server running with a i7-8700 >> and ECC RAM. How would I find out? > > Please provide /proc/zoneinfo and we'll see. > >> So I should do CONFIG_VMAP_STACK=y and try that..? > > I suspect you already have it. > ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-10-31 13:40 ` Vlastimil Babka @ 2018-10-31 14:53 ` Marinko Catovic 2018-10-31 17:01 ` Michal Hocko 2018-11-02 14:59 ` Vlastimil Babka 0 siblings, 2 replies; 66+ messages in thread From: Marinko Catovic @ 2018-10-31 14:53 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter > I went through the whole thread again as it was spread over months, and > finally connected some dots. In one mail you said: > > > There is one thing I forgot to mention: the hosts perform find and du (I mean the commands, finding files and disk usage) > > on the HDDs every night, starting from 00:20 AM up until in the morning 07:45 AM, for maintenance and stats. > > The timespan above roughly matches the phase where reclaimable slab grow > (samples 2000-6000 over 5 seconds is roughly 5.5 hours). The find will > fetch a lots of metadata in dentries, inodes etc. which are part of > reclaimable slabs. In other mail you posted a slabinfo > https://pastebin.com/81QAFgke in the phase where it's already being > slowly reclaimed, but still occupies 6.5GB, and mostly it's > ext4_inode_cache, and dentry cache (also very much internally fragmented). > In another mail I suggest that maybe fragmentation happened because the > slab filled up much more at some point, and I think we now have that > solidly confirmed from the vmstat plots. > I think one workaround is for you to perform echo 2 > drop_caches (not > 3) right after the find/du maintenance finishes. At that point you don't > have too much page cache anyway, since the slabs have pushed it out. > It's also overnight so there are not many users yet? > Alternatively the find/du could run in a memcg limiting its slab use. > Michal would know the details. > > Long term we should do something about these slab objects that are only > used briefly (once?) so there's no point in caching them and letting the > cache grow like this. 
> Well caching of any operations with find/du is not necessary imho anyway, since walking over all these millions of files in that time period is really not worth caching at all - if there is a way you mentioned to limit the commands there, that would be great. Also I want to mention that these operations were in use with 3.x kernels as well, for years, with absolutely zero issues. 2 > drop_caches right after that is something I considered, I just had some bad experience with this, since I tried it around 5:00 AM in the first place to give it enough spare time to finish, since sync; echo 2 > drop_caches can take some time, hence my question about lowering the limits in mm/vmscan.c, void drop_slab_node(int nid) I could do this effectively right after find/du at 07:45, just hoping that this is finished soon enough - in one worst case it took over 2 hours (from 05:00 AM to 07:00 AM), since the host was busy during that time with find/du, never having freed enough caches to continue, hence my question to let it stop earlier with the modification of drop_slab_node ... it was just an idea, nevermind if you believe that it was a bad one :) ^ permalink raw reply [flat|nested] 66+ messages in thread
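The workaround discussed above (dropping reclaimable slab right after the nightly find/du pass finishes, rather than at a fixed early hour) can be wrapped in a small script run at the end of the maintenance job. A hedged sketch; the DRY_RUN knob is a made-up convenience so it can be exercised without root, while the drop_caches interface itself is the one already used in this thread:

```shell
# Sketch: drop slab caches (dentries/inodes) after the nightly find/du.
DRY_RUN=${DRY_RUN:-1}

drop_slab_caches() {
    sync   # flush dirty data first so more slab objects become freeable
    if [ "$DRY_RUN" = 1 ]; then
        echo "would write 2 > /proc/sys/vm/drop_caches"
    else
        # 2 = free reclaimable slab only; the page cache is left intact
        echo 2 > /proc/sys/vm/drop_caches
    fi
}

drop_slab_caches
```

Appending this call to the end of the existing maintenance script (rather than scheduling it at a fixed time) avoids the race where the drop starts while find/du is still running.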
* Re: Caching/buffers become useless after some time 2018-10-31 14:53 ` Marinko Catovic @ 2018-10-31 17:01 ` Michal Hocko 2018-10-31 19:21 ` Marinko Catovic 2018-11-02 14:59 ` Vlastimil Babka 1 sibling, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-10-31 17:01 UTC (permalink / raw) To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter On Wed 31-10-18 15:53:44, Marinko Catovic wrote: [...] > Well caching of any operations with find/du is not necessary imho > anyway, since walking over all these millions of files in that time > period is really not worth caching at all - if there is a way you > mentioned to limit the commands there, that would be great. One possible way would be to run this find/du workload inside a memory cgroup with high limit set to something reasonable (that will likely require some tuning). I am not 100% sure that will behave for metadata mostly workload without almost any pagecache to reclaim so it might turn out this will result in other issues. But it is definitely worth trying. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
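To make the memcg suggestion concrete, here is one possible shape of it using the cgroup-v1 memory controller. Everything here is a sketch to be tuned: the group name, the limit value, and the stats-script path are placeholders, and the commands are only echoed by default (DRY_RUN) since the real writes require root:

```shell
# Sketch: run the nightly find/du stats job inside a memory cgroup (v1).
CG=${CG:-/sys/fs/cgroup/memory/maintenance}   # example group name
LIMIT=${LIMIT:-2G}                            # example limit, needs tuning
DRY_RUN=${DRY_RUN:-1}

run() {
    # echo the command in dry-run mode, execute it otherwise
    if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

run mkdir -p "$CG"
run sh -c "echo $LIMIT > $CG/memory.limit_in_bytes"
# move this shell into the group, then exec the job so the find/du
# children inherit the limit (script path is hypothetical):
run sh -c "echo \$\$ > $CG/cgroup.procs && exec /usr/local/sbin/nightly-stats.sh"
```

The limit bounds how much reclaimable slab the job can pin; too low a limit will just make the job thrash, which is the tuning Michal cautions about.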
* Re: Caching/buffers become useless after some time 2018-10-31 17:01 ` Michal Hocko @ 2018-10-31 19:21 ` Marinko Catovic 2018-11-01 13:23 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-10-31 19:21 UTC (permalink / raw) To: Michal Hocko; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > On Wed 31-10-18 15:53:44, Marinko Catovic wrote: > [...] > > Well caching of any operations with find/du is not necessary imho > > anyway, since walking over all these millions of files in that time > > period is really not worth caching at all - if there is a way you > > mentioned to limit the commands there, that would be great. > > One possible way would be to run this find/du workload inside a memory > cgroup with high limit set to something reasonable (that will likely > require some tuning). I am not 100% sure that will behave for metadata > mostly workload without almost any pagecache to reclaim so it might turn > out this will result in other issues. But it is definitely worth trying. hm, how would that be possible..? every user has its UID, the group can also not be a factor, since this memory restriction would apply to all users then, find/du are running as UID 0 to have access to everyone's data. so what is the conclusion from this issue now btw? is it something that will be changed/fixed at any time? As I understand everyone would have this issue when extensive walking over files is performed, basically any `cloud`, shared hosting or storage systems should experience it, true? ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-10-31 19:21 ` Marinko Catovic @ 2018-11-01 13:23 ` Michal Hocko 2018-11-01 22:46 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-11-01 13:23 UTC (permalink / raw) To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter On Wed 31-10-18 20:21:42, Marinko Catovic wrote: > Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > > > On Wed 31-10-18 15:53:44, Marinko Catovic wrote: > > [...] > > > Well caching of any operations with find/du is not necessary imho > > > anyway, since walking over all these millions of files in that time > > > period is really not worth caching at all - if there is a way you > > > mentioned to limit the commands there, that would be great. > > > > One possible way would be to run this find/du workload inside a memory > > cgroup with high limit set to something reasonable (that will likely > > require some tuning). I am not 100% sure that will behave for metadata > > mostly workload without almost any pagecache to reclaim so it might turn > > out this will result in other issues. But it is definitely worth trying. > > hm, how would that be possible..? every user has its UID, the group > can also not be a factor, since this memory restriction would apply to > all users then, find/du are running as UID 0 to have access to > everyone's data. I thought you have a dedicated script(s) to do all the stats. All you need is to run that particular script(s) within a memory cgroup > so what is the conclusion from this issue now btw? is it something > that will be changed/fixed at any time? It is likely that you are triggering a pathological memory fragmentation with a lot of unmovable objects that prevent it to get resolved. That leads to memory over reclaim to make a forward progress. A hard nut to resolve but something that is definitely on radar to be solved eventually. 
So far we have been quite lucky to not trigger it that badly. > As I understand everyone would have this issue when extensive walking > over files is performed, basically any `cloud`, shared hosting or > storage systems should experience it, true? Not really. You also need a high demand for high-order allocations, which require contiguous physical memory. Maybe there is something in your workload triggering this particular pattern. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
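As background on "high demand for high-order allocations": /proc/buddyinfo lists, per zone, the number of free blocks of each order 0..10, so when the high-order columns reach zero, an order-9 request (a THP) can only be satisfied via compaction or reclaim. A small sketch for summarizing it, assuming the usual buddyinfo line layout and keying the result by zone name (fine on a single-node box like this one):

```python
def free_blocks_by_order(buddyinfo_text, min_order=4):
    """Sum free blocks of order >= min_order for each zone.

    Lines look like: 'Node 0, zone   Normal   2040 1533 ...'
    with 11 counts for orders 0..10."""
    result = {}
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        if not parts or parts[0] != "Node":
            continue
        zone = parts[3]
        counts = [int(x) for x in parts[4:]]
        result[zone] = sum(counts[min_order:])
    return result
```

A Normal zone showing 0 here while order-9 allocation attempts keep arriving is exactly the pattern that forces the compaction/reclaim loop discussed in this thread.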
* Re: Caching/buffers become useless after some time 2018-11-01 13:23 ` Michal Hocko @ 2018-11-01 22:46 ` Marinko Catovic 2018-11-02 8:05 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-11-01 22:46 UTC (permalink / raw) To: Michal Hocko; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter Am Do., 1. Nov. 2018 um 14:23 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > On Wed 31-10-18 20:21:42, Marinko Catovic wrote: > > Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > > > > > On Wed 31-10-18 15:53:44, Marinko Catovic wrote: > > > [...] > > > > Well caching of any operations with find/du is not necessary imho > > > > anyway, since walking over all these millions of files in that time > > > > period is really not worth caching at all - if there is a way you > > > > mentioned to limit the commands there, that would be great. > > > > > > One possible way would be to run this find/du workload inside a memory > > > cgroup with high limit set to something reasonable (that will likely > > > require some tuning). I am not 100% sure that will behave for metadata > > > mostly workload without almost any pagecache to reclaim so it might turn > > > out this will result in other issues. But it is definitely worth trying. > > > > hm, how would that be possible..? every user has its UID, the group > > can also not be a factor, since this memory restriction would apply to > > all users then, find/du are running as UID 0 to have access to > > everyone's data. > > I thought you have a dedicated script(s) to do all the stats. All you > need is to run that particular script(s) within a memory cgroup yes, that is the case - the scripts are running as root, since as mentioned all users have own UIDs and specific groups, so to have access one would need root privileges. My question was how to limit this using cgroups, since afaik limits there apply to given UIDs/GIDs > > so what is the conclusion from this issue now btw? 
is it something > > that will be changed/fixed at any time? > > It is likely that you are triggering a pathological memory fragmentation > with a lot of unmovable objects that prevent it to get resolved. That > leads to memory over reclaim to make a forward progress. A hard nut to > resolve but something that is definitely on radar to be solved > eventually. So far we have been quite lucky to not trigger it that > badly. good to hear :) > > As I understand everyone would have this issue when extensive walking > > over files is performed, basically any `cloud`, shared hosting or > > storage systems should experience it, true? > > Not really. You need also a high demand for high order allocations to > require contiguous physical memory. Maybe there is something in your > workload triggering this particular pattern. I would not even know what triggers it, nor what it has to do with high order, I'm just running find/du, nothing special I'd say. ^ permalink raw reply [flat|nested] 66+ messages in thread
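The high-order situation Michal describes can at least be observed from userspace via the buddy allocator statistics. A minimal sketch (Linux-specific /proc files, not something suggested in the thread itself; readable without root except where guarded):

```shell
# Each column of /proc/buddyinfo counts free blocks of order 0, 1, 2, ...
# (runs of 1, 2, 4, ... contiguous pages). A badly fragmented zone shows
# large counts on the left and zeros in the high-order columns.
cat /proc/buddyinfo
# /proc/pagetypeinfo breaks the same counts down by migrate type
# (often readable by root only, hence the guard):
[ -r /proc/pagetypeinfo ] && cat /proc/pagetypeinfo || true
```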
* Re: Caching/buffers become useless after some time 2018-11-01 22:46 ` Marinko Catovic @ 2018-11-02 8:05 ` Michal Hocko 2018-11-02 11:31 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-11-02 8:05 UTC (permalink / raw) To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter On Thu 01-11-18 23:46:27, Marinko Catovic wrote: > Am Do., 1. Nov. 2018 um 14:23 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > > > On Wed 31-10-18 20:21:42, Marinko Catovic wrote: > > > Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > > > > > > > On Wed 31-10-18 15:53:44, Marinko Catovic wrote: > > > > [...] > > > > > Well caching of any operations with find/du is not necessary imho > > > > > anyway, since walking over all these millions of files in that time > > > > > period is really not worth caching at all - if there is a way you > > > > > mentioned to limit the commands there, that would be great. > > > > > > > > One possible way would be to run this find/du workload inside a memory > > > > cgroup with high limit set to something reasonable (that will likely > > > > require some tuning). I am not 100% sure that will behave for metadata > > > > mostly workload without almost any pagecache to reclaim so it might turn > > > > out this will result in other issues. But it is definitely worth trying. > > > > > > hm, how would that be possible..? every user has its UID, the group > > > can also not be a factor, since this memory restriction would apply to > > > all users then, find/du are running as UID 0 to have access to > > > everyone's data. > > > > I thought you have a dedicated script(s) to do all the stats. All you > > need is to run that particular script(s) within a memory cgroup > > yes, that is the case - the scripts are running as root, since as > mentioned all users have own UIDs and specific groups, so to have > access one would need root privileges. 
> My question was how to limit this using cgroups, since afaik limits
> there apply to given UIDs/GIDs

No. Limits apply to a specific memory cgroup and all tasks which are
associated with it. There are many tutorials on how to configure/use
memory cgroups or cgroups in general. If I were you I would simply do
this:

mount -t cgroup -o memory none $SOME_MOUNTPOINT
mkdir $SOME_MOUNTPOINT/A
echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes

Your script then just does:

echo $$ > $SOME_MOUNTPOINT/A/tasks
# rest of your script
echo 1 > $SOME_MOUNTPOINT/A/memory.force_empty

That should drop the memory cached on behalf of the memcg A, including
the metadata.

[...]
> > > As I understand everyone would have this issue when extensive walking
> > > over files is performed, basically any `cloud`, shared hosting or
> > > storage systems should experience it, true?
> >
> > Not really. You need also a high demand for high order allocations to
> > require contiguous physical memory. Maybe there is something in your
> > workload triggering this particular pattern.
>
> I would not even know what triggers it, nor what it has to do with
> high order, I'm just running find/du, nothing special I'd say.

Please note that find/du is mostly a fragmentation generator. It seems
there is other system activity which requires those high order
allocations.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread
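Putting the commands from this message together, a nightly find/du job could be wrapped roughly like this. This is only a sketch of the cgroup-v1 recipe discussed above: the mountpoint path, the group name "A", the 500M value, the /home target and the output paths are all illustrative assumptions, and it must run as root on a kernel with the v1 memory controller:

```shell
#!/bin/sh
# Sketch only: paths, group name and 500M limit are assumptions.
# Requires root and a cgroup-v1 memory controller.
CG=/sys/fs/cgroup/memory
mountpoint -q "$CG" || mount -t cgroup -o memory none "$CG"
mkdir -p "$CG/A"
echo 500M > "$CG/A/memory.limit_in_bytes"

# Move this shell (and everything it forks) into the limited group.
echo $$ > "$CG/A/tasks"

# The metadata-heavy nightly work; hypothetical stand-ins for the
# real accounting scripts.
du -s /home/* > /var/tmp/du.out
find /home -type f -mtime -1 > /var/tmp/find.out

# Drop what the group cached (pagecache and slab) now that it is useless.
echo 1 > "$CG/A/memory.force_empty"
```

Because membership is inherited on fork, everything the script spawns stays under the same 500M cap, which is the property being exploited here.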
* Re: Caching/buffers become useless after some time 2018-11-02 8:05 ` Michal Hocko @ 2018-11-02 11:31 ` Marinko Catovic 2018-11-02 11:49 ` Michal Hocko 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-11-02 11:31 UTC (permalink / raw) To: Michal Hocko; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter Am Fr., 2. Nov. 2018 um 09:05 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > On Thu 01-11-18 23:46:27, Marinko Catovic wrote: > > Am Do., 1. Nov. 2018 um 14:23 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > > > > > On Wed 31-10-18 20:21:42, Marinko Catovic wrote: > > > > Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > > > > > > > > > On Wed 31-10-18 15:53:44, Marinko Catovic wrote: > > > > > [...] > > > > > > Well caching of any operations with find/du is not necessary imho > > > > > > anyway, since walking over all these millions of files in that time > > > > > > period is really not worth caching at all - if there is a way you > > > > > > mentioned to limit the commands there, that would be great. > > > > > > > > > > One possible way would be to run this find/du workload inside a memory > > > > > cgroup with high limit set to something reasonable (that will likely > > > > > require some tuning). I am not 100% sure that will behave for metadata > > > > > mostly workload without almost any pagecache to reclaim so it might turn > > > > > out this will result in other issues. But it is definitely worth trying. > > > > > > > > hm, how would that be possible..? every user has its UID, the group > > > > can also not be a factor, since this memory restriction would apply to > > > > all users then, find/du are running as UID 0 to have access to > > > > everyone's data. > > > > > > I thought you have a dedicated script(s) to do all the stats. 
All you > > > need is to run that particular script(s) within a memory cgroup > > > > yes, that is the case - the scripts are running as root, since as > > mentioned all users have own UIDs and specific groups, so to have > > access one would need root privileges. > > My question was how to limit this using cgroups, since afaik limits > > there apply to given UIDs/GIDs > > No. Limits apply to a specific memory cgroup and all tasks which are > associated with it. There are many tutorials on how to configure/use > memory cgroups or cgroups in general. If I were you I would simply do > this > > mount -t cgroup -o memory none $SOME_MOUNTPOINT > mkdir $SOME_MOUNTPOINT/A > echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes > > Your script then just do > echo $$ > $SOME_MOUNTPOINT/A/tasks > # rest of your script > echo 1 > $SOME_MOUNTPOINT/A/memory.force_empty > > That should drop the memory cached on behalf of the memcg A including the > metadata. well, that's an interesting approach, I did not know that this was possible to assign cgroups to PIDs, without additionally explicitly defining UID/GID. This way memory.force_empty basically acts like echo 3 > drop_caches, but only for the memory affected by the PIDs and its children/forks from the A/tasks-list, true? I'll give it a try with the nightly du/find jobs, thank you! > > > [...] > > > > As I understand everyone would have this issue when extensive walking > > > > over files is performed, basically any `cloud`, shared hosting or > > > > storage systems should experience it, true? > > > > > > Not really. You need also a high demand for high order allocations to > > > require contiguous physical memory. Maybe there is something in your > > > workload triggering this particular pattern. > > > > I would not even know what triggers it, nor what it has to do with > > high order, I'm just running find/du, nothing special I'd say. > > Please note that find/du is mostly a fragmentation generator. 
It > seems there is other system activity which requires those high order > allocations. any idea how to find out what that might be? I'd really have no idea, I also wonder why this never was an issue with 3.x find uses regex patterns, that's the only thing that may be unusual. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-11-02 11:31 ` Marinko Catovic @ 2018-11-02 11:49 ` Michal Hocko 2018-11-02 12:22 ` Vlastimil Babka 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2018-11-02 11:49 UTC (permalink / raw) To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter On Fri 02-11-18 12:31:09, Marinko Catovic wrote: > Am Fr., 2. Nov. 2018 um 09:05 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > > > On Thu 01-11-18 23:46:27, Marinko Catovic wrote: > > > Am Do., 1. Nov. 2018 um 14:23 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > > > > > > > On Wed 31-10-18 20:21:42, Marinko Catovic wrote: > > > > > Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>: > > > > > > > > > > > > On Wed 31-10-18 15:53:44, Marinko Catovic wrote: > > > > > > [...] > > > > > > > Well caching of any operations with find/du is not necessary imho > > > > > > > anyway, since walking over all these millions of files in that time > > > > > > > period is really not worth caching at all - if there is a way you > > > > > > > mentioned to limit the commands there, that would be great. > > > > > > > > > > > > One possible way would be to run this find/du workload inside a memory > > > > > > cgroup with high limit set to something reasonable (that will likely > > > > > > require some tuning). I am not 100% sure that will behave for metadata > > > > > > mostly workload without almost any pagecache to reclaim so it might turn > > > > > > out this will result in other issues. But it is definitely worth trying. > > > > > > > > > > hm, how would that be possible..? every user has its UID, the group > > > > > can also not be a factor, since this memory restriction would apply to > > > > > all users then, find/du are running as UID 0 to have access to > > > > > everyone's data. > > > > > > > > I thought you have a dedicated script(s) to do all the stats. 
All you > > > > need is to run that particular script(s) within a memory cgroup > > > > > > yes, that is the case - the scripts are running as root, since as > > > mentioned all users have own UIDs and specific groups, so to have > > > access one would need root privileges. > > > My question was how to limit this using cgroups, since afaik limits > > > there apply to given UIDs/GIDs > > > > No. Limits apply to a specific memory cgroup and all tasks which are > > associated with it. There are many tutorials on how to configure/use > > memory cgroups or cgroups in general. If I were you I would simply do > > this > > > > mount -t cgroup -o memory none $SOME_MOUNTPOINT > > mkdir $SOME_MOUNTPOINT/A > > echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes > > > > Your script then just do > > echo $$ > $SOME_MOUNTPOINT/A/tasks > > # rest of your script > > echo 1 > $SOME_MOUNTPOINT/A/memory.force_empty > > > > That should drop the memory cached on behalf of the memcg A including the > > metadata. > > well, that's an interesting approach, I did not know that this was > possible to assign cgroups to PIDs, without additionally explicitly > defining UID/GID. This way memory.force_empty basically acts like echo > 3 > drop_caches, but only for the memory affected by the PIDs and its > children/forks from the A/tasks-list, true? Yup > I'll give it a try with the nightly du/find jobs, thank you! I am still a bit curious how that will work out on metadata mostly workload because we usually have quite a lot of memory on normal LRUs to reclaim (page cache, anonymous memory) and slab reclaim is just to balance kmem. But let's see. Watch for memcg OOM killer invocations if the reclaim is not sufficient. > > [...] > > > > > As I understand everyone would have this issue when extensive walking > > > > > over files is performed, basically any `cloud`, shared hosting or > > > > > storage systems should experience it, true? > > > > > > > > Not really. 
You need also a high demand for high order allocations to
> > > > require contiguous physical memory. Maybe there is something in your
> > > > workload triggering this particular pattern.
> > >
> > > I would not even know what triggers it, nor what it has to do with
> > > high order, I'm just running find/du, nothing special I'd say.
> >
> > Please note that find/du is mostly a fragmentation generator. It
> > seems there is other system activity which requires those high order
> > allocations.
>
> any idea how to find out what that might be? I'd really have no idea,
> I also wonder why this never was an issue with 3.x
> find uses regex patterns, that's the only thing that may be unusual.

The allocation tracepoint has the stack trace, so that might help. This
is quite a lot of work to pinpoint and find a pattern though, and well
beyond the time I can devote to this, unfortunately. This might be some
driver asking for more, or even the core kernel being more high-order
memory hungry.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-11-02 11:49 ` Michal Hocko @ 2018-11-02 12:22 ` Vlastimil Babka 2018-11-02 12:41 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-11-02 12:22 UTC (permalink / raw) To: Michal Hocko, Marinko Catovic; +Cc: linux-mm, Christopher Lameter On 11/2/18 12:49 PM, Michal Hocko wrote: > On Fri 02-11-18 12:31:09, Marinko Catovic wrote: >> Am Fr., 2. Nov. 2018 um 09:05 Uhr schrieb Michal Hocko <mhocko@suse.com>: >>> >>> On Thu 01-11-18 23:46:27, Marinko Catovic wrote: >>>> Am Do., 1. Nov. 2018 um 14:23 Uhr schrieb Michal Hocko <mhocko@suse.com>: >>>>> >>>>> On Wed 31-10-18 20:21:42, Marinko Catovic wrote: >>>>>> Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>: >>>>>>> >>>>>>> On Wed 31-10-18 15:53:44, Marinko Catovic wrote: >>>>>>> [...] >>>>>>>> Well caching of any operations with find/du is not necessary imho >>>>>>>> anyway, since walking over all these millions of files in that time >>>>>>>> period is really not worth caching at all - if there is a way you >>>>>>>> mentioned to limit the commands there, that would be great. >>>>>>> >>>>>>> One possible way would be to run this find/du workload inside a memory >>>>>>> cgroup with high limit set to something reasonable (that will likely >>>>>>> require some tuning). I am not 100% sure that will behave for metadata >>>>>>> mostly workload without almost any pagecache to reclaim so it might turn >>>>>>> out this will result in other issues. But it is definitely worth trying. >>>>>> >>>>>> hm, how would that be possible..? every user has its UID, the group >>>>>> can also not be a factor, since this memory restriction would apply to >>>>>> all users then, find/du are running as UID 0 to have access to >>>>>> everyone's data. >>>>> >>>>> I thought you have a dedicated script(s) to do all the stats. 
All you >>>>> need is to run that particular script(s) within a memory cgroup >>>> >>>> yes, that is the case - the scripts are running as root, since as >>>> mentioned all users have own UIDs and specific groups, so to have >>>> access one would need root privileges. >>>> My question was how to limit this using cgroups, since afaik limits >>>> there apply to given UIDs/GIDs >>> >>> No. Limits apply to a specific memory cgroup and all tasks which are >>> associated with it. There are many tutorials on how to configure/use >>> memory cgroups or cgroups in general. If I were you I would simply do >>> this >>> >>> mount -t cgroup -o memory none $SOME_MOUNTPOINT >>> mkdir $SOME_MOUNTPOINT/A >>> echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes >>> >>> Your script then just do >>> echo $$ > $SOME_MOUNTPOINT/A/tasks >>> # rest of your script >>> echo 1 > $SOME_MOUNTPOINT/A/memory.force_empty >>> >>> That should drop the memory cached on behalf of the memcg A including the >>> metadata. >> >> well, that's an interesting approach, I did not know that this was >> possible to assign cgroups to PIDs, without additionally explicitly >> defining UID/GID. This way memory.force_empty basically acts like echo >> 3 > drop_caches, but only for the memory affected by the PIDs and its >> children/forks from the A/tasks-list, true? > > Yup > >> I'll give it a try with the nightly du/find jobs, thank you! > > I am still a bit curious how that will work out on metadata mostly > workload because we usually have quite a lot of memory on normal LRUs to > reclaim (page cache, anonymous memory) and slab reclaim is just to > balance kmem. But let's see. Watch for memcg OOM killer invocations if > the reclaim is not sufficient. > >>> [...] >>>>>> As I understand everyone would have this issue when extensive walking >>>>>> over files is performed, basically any `cloud`, shared hosting or >>>>>> storage systems should experience it, true? >>>>> >>>>> Not really. 
You need also a high demand for high order allocations to >>>>> require contiguous physical memory. Maybe there is something in your >>>>> workload triggering this particular pattern. >>>> >>>> I would not even know what triggers it, nor what it has to do with >>>> high order, I'm just running find/du, nothing special I'd say. >>> >>> Please note that find/du is mostly a fragmentation generator. It >>> seems there is other system activity which requires those high order >>> allocations. >> >> any idea how to find out what that might be? I'd really have no idea, >> I also wonder why this never was an issue with 3.x >> find uses regex patterns, that's the only thing that may be unusual. > > The allocation tracepoint has the stack trace so that might help. This Well we already checked the mm_page_alloc traces and it seemed that only THP allocations could be the culprit. But apparently defrag=defer made no difference. I would still recommend it so we can see the effects on the traces. And adding tracepoints compaction/mm_compaction_try_to_compact_pages and compaction/mm_compaction_suitable as I suggested should show which high-order allocations actually invoke the compaction. > is quite a lot of work to pin point and find a pattern though. This is > way out the time scope I can devote to this unfortunately. This might be > some driver asking for more, or even the core kernel being more high > order memory hungry. > ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-11-02 12:22 ` Vlastimil Babka @ 2018-11-02 12:41 ` Marinko Catovic 2018-11-02 13:13 ` Vlastimil Babka 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-11-02 12:41 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter > >> any idea how to find out what that might be? I'd really have no idea, > >> I also wonder why this never was an issue with 3.x > >> find uses regex patterns, that's the only thing that may be unusual. > > > > The allocation tracepoint has the stack trace so that might help. This > > Well we already checked the mm_page_alloc traces and it seemed that only > THP allocations could be the culprit. But apparently defrag=defer made > no difference. I would still recommend it so we can see the effects on > the traces. And adding tracepoints > compaction/mm_compaction_try_to_compact_pages and > compaction/mm_compaction_suitable as I suggested should show which > high-order allocations actually invoke the compaction. Anything in particular I should do to figure this out? ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-11-02 12:41 ` Marinko Catovic @ 2018-11-02 13:13 ` Vlastimil Babka 2018-11-02 13:50 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-11-02 13:13 UTC (permalink / raw) To: Marinko Catovic; +Cc: Michal Hocko, linux-mm, Christopher Lameter On 11/2/18 1:41 PM, Marinko Catovic wrote: >>>> any idea how to find out what that might be? I'd really have no idea, >>>> I also wonder why this never was an issue with 3.x >>>> find uses regex patterns, that's the only thing that may be unusual. >>> >>> The allocation tracepoint has the stack trace so that might help. This >> >> Well we already checked the mm_page_alloc traces and it seemed that only >> THP allocations could be the culprit. But apparently defrag=defer made >> no difference. I would still recommend it so we can see the effects on >> the traces. And adding tracepoints >> compaction/mm_compaction_try_to_compact_pages and >> compaction/mm_compaction_suitable as I suggested should show which >> high-order allocations actually invoke the compaction. > > Anything in particular I should do to figure this out? Setup the same monitoring as before, but with two additional tracepoints (echo 1 > .../enable) and once the problem appears, provide the tracing output. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-11-02 13:13 ` Vlastimil Babka @ 2018-11-02 13:50 ` Marinko Catovic 2018-11-02 14:49 ` Vlastimil Babka 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-11-02 13:50 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter Am Fr., 2. Nov. 2018 um 14:13 Uhr schrieb Vlastimil Babka <vbabka@suse.cz>: > > On 11/2/18 1:41 PM, Marinko Catovic wrote: > >>>> any idea how to find out what that might be? I'd really have no idea, > >>>> I also wonder why this never was an issue with 3.x > >>>> find uses regex patterns, that's the only thing that may be unusual. > >>> > >>> The allocation tracepoint has the stack trace so that might help. This > >> > >> Well we already checked the mm_page_alloc traces and it seemed that only > >> THP allocations could be the culprit. But apparently defrag=defer made > >> no difference. I would still recommend it so we can see the effects on > >> the traces. And adding tracepoints > >> compaction/mm_compaction_try_to_compact_pages and > >> compaction/mm_compaction_suitable as I suggested should show which > >> high-order allocations actually invoke the compaction. > > > > Anything in particular I should do to figure this out? > > Setup the same monitoring as before, but with two additional tracepoints > (echo 1 > .../enable) and once the problem appears, provide the tracing > output. I think I'll need more details about that setup :) also, do you want the tracing output every 5sec or just once when it is around the worst case? what files exactly? ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-11-02 13:50 ` Marinko Catovic @ 2018-11-02 14:49 ` Vlastimil Babka 0 siblings, 0 replies; 66+ messages in thread From: Vlastimil Babka @ 2018-11-02 14:49 UTC (permalink / raw) To: Marinko Catovic; +Cc: Michal Hocko, linux-mm, Christopher Lameter On 11/2/18 2:50 PM, Marinko Catovic wrote: > Am Fr., 2. Nov. 2018 um 14:13 Uhr schrieb Vlastimil Babka <vbabka@suse.cz>: >> >> On 11/2/18 1:41 PM, Marinko Catovic wrote: >>>>>> any idea how to find out what that might be? I'd really have no idea, >>>>>> I also wonder why this never was an issue with 3.x >>>>>> find uses regex patterns, that's the only thing that may be unusual. >>>>> >>>>> The allocation tracepoint has the stack trace so that might help. This >>>> >>>> Well we already checked the mm_page_alloc traces and it seemed that only >>>> THP allocations could be the culprit. But apparently defrag=defer made >>>> no difference. I would still recommend it so we can see the effects on >>>> the traces. And adding tracepoints >>>> compaction/mm_compaction_try_to_compact_pages and >>>> compaction/mm_compaction_suitable as I suggested should show which >>>> high-order allocations actually invoke the compaction. >>> >>> Anything in particular I should do to figure this out? >> >> Setup the same monitoring as before, but with two additional tracepoints >> (echo 1 > .../enable) and once the problem appears, provide the tracing >> output. 
>
> I think I'll need more details about that setup :)

It's like what you already did based on the suggestion from Michal Hocko:

# mount -t tracefs none /debug/trace/
# echo stacktrace > /debug/trace/trace_options
# echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
# echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
# echo 1 > /debug/trace/events/compaction/mm_compaction_try_to_compact_pages/enable
# echo 1 > /debug/trace/events/compaction/mm_compaction_suitable/enable
# cat /debug/trace/trace_pipe | gzip > /path/to/trace_pipe.txt.gz

And later this to disable tracing:

# echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable

> also, do you want the tracing output every 5sec or just once when it
> is around the worst case? what files exactly?

Collect vmstat periodically every 5 secs as you already did. Tracing is
continuous and results in the single trace_pipe.txt.gz file. The trace
should cover at least some time while you're experiencing the too much
free memory/too little pagecache phase. Might be enough to enable the
collection only after you detect the situation, and before you e.g.
drop caches to restore the system.

To remove THP allocations from the picture, it would be nice if the
system was configured with:

echo defer > /sys/kernel/mm/transparent_hugepage/defrag

Again you can do that only after detecting the problematic situation,
before starting to collect the trace.

^ permalink raw reply	[flat|nested] 66+ messages in thread
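If driving tracefs by hand gets unwieldy, the same events can be captured with trace-cmd, which manages the ring buffer and the stack traces itself. A hedged sketch (assumes the trace-cmd package is installed and, if I recall the options correctly, -f filters the previously listed event and -T records stack traces; stop the recording with Ctrl-C):

```shell
# Record the same three events, keeping only order>0 page allocations,
# with stack traces; writes trace.dat in the current directory.
trace-cmd record -T \
    -e kmem:mm_page_alloc -f 'order > 0' \
    -e compaction:mm_compaction_try_to_compact_pages \
    -e compaction:mm_compaction_suitable
# After stopping with Ctrl-C, turn the binary log into text:
trace-cmd report | gzip > trace_pipe.txt.gz
```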
* Re: Caching/buffers become useless after some time 2018-10-31 14:53 ` Marinko Catovic 2018-10-31 17:01 ` Michal Hocko @ 2018-11-02 14:59 ` Vlastimil Babka 2018-11-30 12:01 ` Marinko Catovic 1 sibling, 1 reply; 66+ messages in thread From: Vlastimil Babka @ 2018-11-02 14:59 UTC (permalink / raw) To: Marinko Catovic; +Cc: Michal Hocko, linux-mm, Christopher Lameter Forgot to answer this: On 10/31/18 3:53 PM, Marinko Catovic wrote: > Well caching of any operations with find/du is not necessary imho > anyway, since walking over all these millions of files in that time > period is really not worth caching at all - if there is a way you > mentioned to limit the commands there, that would be great. > Also I want to mention that these operations were in use with 3.x > kernels as well, for years, with absolutely zero issues. Yep, something had to change at some point. Possibly the reclaim/compaction loop. Probably not the way dentries/inodes are being cached though. > 2 > drop_caches right after that is something I considered, I just had > some bad experience with this, since I tried it around 5:00 AM in the > first place to give it enough spare time to finish, since sync; echo 2 >> drop_caches can take some time, hence my question about lowering the > limits in mm/vmscan.c, void drop_slab_node(int nid) > > I could do this effectively right after find/du at 07:45, just hoping > that this is finished soon enough - in one worst case it took over 2 > hours (from 05:00 AM to 07:00 AM), since the host was busy during that > time with find/du, never having freed enough caches to continue, hence Dropping caches while find/du is still running would be counter-productive. If done after it's already finished, it shouldn't be so disruptive. > my question to let it stop earlier with the modification of > drop_slab_node ... it was just an idea, nevermind if you believe that > it was a bad one :) Finding a universally "correct" threshold could easily be impossible. 
I guess the proper solution would be to drop the while loop and restructure the shrinking so that it would do a single pass through all objects. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time 2018-11-02 14:59 ` Vlastimil Babka @ 2018-11-30 12:01 ` Marinko Catovic 2018-12-10 21:30 ` Marinko Catovic 0 siblings, 1 reply; 66+ messages in thread From: Marinko Catovic @ 2018-11-30 12:01 UTC (permalink / raw) To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter Am Fr., 2. Nov. 2018 um 15:59 Uhr schrieb Vlastimil Babka <vbabka@suse.cz>: > > Forgot to answer this: > > On 10/31/18 3:53 PM, Marinko Catovic wrote: > > Well caching of any operations with find/du is not necessary imho > > anyway, since walking over all these millions of files in that time > > period is really not worth caching at all - if there is a way you > > mentioned to limit the commands there, that would be great. > > Also I want to mention that these operations were in use with 3.x > > kernels as well, for years, with absolutely zero issues. > > Yep, something had to change at some point. Possibly the > reclaim/compaction loop. Probably not the way dentries/inodes are being > cached though. > > > 2 > drop_caches right after that is something I considered, I just had > > some bad experience with this, since I tried it around 5:00 AM in the > > first place to give it enough spare time to finish, since sync; echo 2 > >> drop_caches can take some time, hence my question about lowering the > > limits in mm/vmscan.c, void drop_slab_node(int nid) > > > > I could do this effectively right after find/du at 07:45, just hoping > > that this is finished soon enough - in one worst case it took over 2 > > hours (from 05:00 AM to 07:00 AM), since the host was busy during that > > time with find/du, never having freed enough caches to continue, hence > > Dropping caches while find/du is still running would be > counter-productive. If done after it's already finished, it shouldn't be > so disruptive. > > > my question to let it stop earlier with the modification of > > drop_slab_node ... 
it was just an idea, nevermind if you believe that
> > it was a bad one :)
>
> Finding a universally "correct" threshold could easily be impossible. I
> guess the proper solution would be to drop the while loop and
> restructure the shrinking so that it would do a single pass through all
> objects.

well after a few weeks to make sure, the results seem very promising.
There were no issues any more after setting up the cgroup with the
limit.

This workaround is anyway a good idea to prevent the nightly processes
from eating up all the caching/buffers, which become useless anyway in
the morning, so performance got even better - although the issue itself
is not fixed by that workaround. Since other people will be affected
sooner or later as well imho, hopefully you'll figure out a fix soon.

Nevertheless I also ran into a new problem there. While writing the PID
into the tasks-file (echo $$ > ../tasks, or directly with
fprintf(tasks_fp, "%d", getpid())) works very well, I also had problems
with daemons that I wanted to start (e.g. a SQL server) from within
that cgroup-controlled binary. This results in the SQL server's task
being killed, since the memory limit is exceeded.

I would not like to set the memory.limit_in_bytes to something that
huge, such as 30G, just to be safe; I'd rather use a wrapper script to
handle this, for example:

1) the cgroup-controlled instance starts the wrapper script
2) which excludes itself from the tasks-PID-list (hence the wrapper
   script is not controlled any more)
3) it starts or does whatever is necessary that should continue
   normally without the memory restriction

Currently I fail to manage this, since I do not know how to do step 2.
echo $PID > tasks writes into it and adds the PID, but how would one
remove the wrapper script's PID from there?
I came up with:

cat /cgpath/A/tasks | sed "/$$/d" | cat > /cgpath/A/tasks

...which results in a list without the current PID; however, it fails
to write to tasks with "cat: write error: Invalid argument", since this
is not a regular file.

^ permalink raw reply	[flat|nested] 66+ messages in thread
* Re: Caching/buffers become useless after some time
  2018-11-30 12:01 ` Marinko Catovic
@ 2018-12-10 21:30 ` Marinko Catovic
  2018-12-10 21:47 ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-12-10 21:30 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter

> Currently I fail to manage this, since I do not know how to do step 2.
> echo $PID > tasks writes into it and adds the PID, but how would one
> remove the wrapper script's PID from there?

any ideas on this perhaps? The workaround, otherwise working perfectly fine, causes huge problems there, since I have to exclude certain processes from that task list.

Basically I'd need to know how to remove a specific PID from the cgroup created by

mount -t cgroup -o memory none $SOME_MOUNTPOINT
mkdir $SOME_MOUNTPOINT/A
echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes

i.e. how to remove that PID from $SOME_MOUNTPOINT/A/tasks.
* Re: Caching/buffers become useless after some time
  2018-12-10 21:30 ` Marinko Catovic
@ 2018-12-10 21:47 ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2018-12-10 21:47 UTC (permalink / raw)
To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

On Mon 10-12-18 22:30:40, Marinko Catovic wrote:
> > Currently I fail to manage this, since I do not know how to do step 2.
> > echo $PID > tasks writes into it and adds the PID, but how would one
> > remove the wrapper script's PID from there?
>
> any ideas on this perhaps?
> The workaround, otherwise working perfectly fine, causes huge problems there
> since I have to exclude certain processes from that tasklist.

I am sorry, I didn't get to your previous email. But this is quite simple: you just echo those pids to a different cgroup, e.g. the root one at the top of the mounted hierarchy. There are also wrappers in the libcgroup package to execute a task in a specific cgroup, and I am pretty sure systemd has its own mechanisms to achieve the same.
-- 
Michal Hocko
SUSE Labs
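Michal's suggestion - moving the PID into a different cgroup rather than deleting it from the current tasks file - could be sketched as follows. This is a hedged sketch, not tested code from the thread: it assumes the cgroup v1 memory hierarchy mounted as in the commands quoted above, and relies on the v1 rule that a task belongs to exactly one cgroup per hierarchy, so writing its PID into another cgroup's tasks file implicitly removes it from A/tasks:

```shell
#!/bin/sh
# Sketch of "step 2" of the wrapper idea: escape the limited cgroup A.
# SOME_MOUNTPOINT is a placeholder for the cgroup v1 memory mountpoint
# from the thread (mount -t cgroup -o memory none $SOME_MOUNTPOINT).
SOME_MOUNTPOINT=/cg

# Writing this shell's PID into the root cgroup's tasks file moves the
# task there, removing it from $SOME_MOUNTPOINT/A/tasks as a side effect.
# Guarded so the sketch is a no-op on hosts without this mount.
if [ -w "$SOME_MOUNTPOINT/tasks" ]; then
    echo $$ > "$SOME_MOUNTPOINT/tasks"
fi

# Everything exec'd from here on runs without A's 500M limit, e.g.:
# exec /usr/sbin/mysqld
```

For launching a daemon straight into a chosen cgroup, libcgroup's cgexec does the equivalent in one step.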
* Caching/buffers become useless after some time
  2018-10-22 1:19 ` Marinko Catovic
  ` (2 preceding siblings ...)
  [not found] ` <6e3a9434-32f2-0388-e0c7-2bd1c2ebc8b1@suse.cz>
@ 2018-10-31 13:12 ` Vlastimil Babka
  3 siblings, 0 replies; 66+ messages in thread
From: Vlastimil Babka @ 2018-10-31 13:12 UTC (permalink / raw)
To: Marinko Catovic, Michal Hocko, linux-mm, Christopher Lameter

Resending for the lists, which dropped my mail due to attachments. Sorry.

plots: https://nofile.io/f/ogwbrwhwBU7/plots.tar.bz2

R script:

files <- Sys.glob("vmstat.1*")
results <- read.table(files[1], row.names=1)

for (file in files[-1]) {
    tmp2 <- read.table(file)$V2
    results <- cbind(results, tmp2)
}

for (row in row.names(results)) {
    png(paste("plots/", row, ".png", sep=""), width=1900, height=1150)
    plot(t(as.vector(results[row,])), main=row)
    dev.off()
}

On 10/22/18 3:19 AM, Marinko Catovic wrote:
> On Wed, Aug 29, 2018 at 6:44 PM Marinko Catovic
> <marinko.catovic@gmail.com> wrote:
>>
>>>> one host is at a healthy state right now, I'd run that over there immediately.
>>>
>>> Let's see what we can get from here.
>>
>> oh well, that went fast. Actually, with low values for buffers (around 100MB) and caches
>> around 20G or so, the performance was nevertheless super-low; I really had to drop
>> the caches right now. This is the first time I see it happening with caches >10G, but hopefully
>> this also provides a clue for you.
>>
>> Just after starting the stats I reset from the previous defer to madvise - I suspect that this somehow
>> caused the rapid reaction, since a few minutes later I saw that the free RAM jumped from 5GB to 10GB.
>> After that I went afk, returning to the PC once my monitoring systems went crazy telling me about downtime.
>>
>> If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to its default, while it had
>> been on defer for days, was a mistake, then please tell me.
>>
>> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
>> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
>
> There we go again.
>
> First of all, I had set up this monitoring on 1 host; as a matter of
> fact it did not occur on that single one for days and weeks now, so I
> set this up again on all the hosts and it just happened again on
> another one.
>
> This issue is far from over, even after upgrading to the latest 4.18.12.
>
> https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz

I have plotted the vmstat files using the attached script and got the attached plots. The X axis are the vmstat snapshots, almost 14k of them, one every 5 seconds, so almost 19 hours. I can see the following phases:

0 - 2000:
- free memory (nr_free_pages) dropping from 48GB to the minimum allowed by watermarks
- page cache (nr_file_pages) grows correspondingly

2000 - 6000:
- reclaimable slab (nr_slab_reclaimable) grows up to 40GB; unreclaimable slab shows the same trend but much lower
- page cache is shrunk correspondingly
- free memory remains at minimum

6000 - 12000:
- slab usage is slowly declining
- page cache slowly growing, but there are hiccups
- free pages at minimum, growing after 9000, oscillating between 10000 and 12000

12000 - end:
- free pages growing sharply
- page cache declining sharply
- slab still slowly declining

I guess the original problem is manifested in the last phase. There might be a secondary issue with the slab usage between 2000 and 6000, but it doesn't seem immediately connected (?). I can see compaction activity (but not success) increased a lot in the last phase, while direct reclaim is steady from 2000 onwards. This would again suggest high-order allocations. THP doesn't seem to be the cause.

Vlastimil
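For reference, snapshots in the form the R script expects (files named vmstat.<epoch seconds>, one copy of /proc/vmstat per interval) can be gathered with a small loop like the following. This is a sketch of an assumed collection method, not the exact script used in the thread:

```shell
#!/bin/sh
# Take N snapshots of /proc/vmstat, one every SECS seconds.
# Epoch-second file names currently start with 1, matching the
# "vmstat.1*" glob used by the plotting script, and sort in time order.
collect_vmstat() {
    outdir=$1; secs=$2; n=$3
    i=0
    while [ "$i" -lt "$n" ]; do
        cp /proc/vmstat "$outdir/vmstat.$(date +%s)"
        i=$((i + 1))
        if [ "$i" -lt "$n" ]; then
            sleep "$secs"
        fi
    done
}

# Example: two snapshots, one second apart, into a demo directory.
mkdir -p /tmp/vmstat-demo
collect_vmstat /tmp/vmstat-demo 1 2
ls /tmp/vmstat-demo   # list the collected snapshot files
```

For the 5-second cadence seen in the plots, one would call it as e.g. collect_vmstat /var/log/vmstat 5 14000 (or loop indefinitely from a service).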
* Re: Caching/buffers become useless after some time
  2018-08-23 12:21 ` Michal Hocko
  2018-08-24 0:11 ` Marinko Catovic
@ 2018-08-24 6:24 ` Vlastimil Babka
  1 sibling, 0 replies; 66+ messages in thread
From: Vlastimil Babka @ 2018-08-24 6:24 UTC (permalink / raw)
To: Michal Hocko; +Cc: Marinko Catovic, Christopher Lameter, linux-mm

On 08/23/2018 02:21 PM, Michal Hocko wrote:
> On Thu 23-08-18 14:10:28, Vlastimil Babka wrote:
>> It also shows that all orders except order-9 are in fact plentiful.
>> Michal's earlier summary of the trace shows that most allocations are up
>> to order-3 and should be fine, the exception is THP:
>>
>> 277 9 GFP_TRANSHUGE|__GFP_THISNODE
>
> But please note that this is not from the time when the page cache
> dropped to the observed values. So we do not know what happened at the
> time.

Okay, we didn't observe it drop, but there must still be something going on that keeps it from growing back?

> Anyway 277 THP pages paging out such a large page cache amount would be
> more than unexpected even for explicitly costly THP fault in methods.

It's 277 in 90 seconds. But it seems no reclaim should happen there anyway, because shrink_zones() should evaluate compaction_ready() as true and skip the zones. Unless there is some kind of bug, maybe e.g. ZONE_DMA returning compaction_ready() as false, causing the whole node to be reclaimed? Hmm.
end of thread, other threads:[~2018-12-10 21:47 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-11 13:18 Caching/buffers become useless after some time Marinko Catovic
2018-07-12 11:34 ` Michal Hocko
2018-07-13 15:48 ` Marinko Catovic
2018-07-16 15:53 ` Marinko Catovic
2018-07-16 16:23 ` Michal Hocko
2018-07-16 16:33 ` Marinko Catovic
2018-07-16 16:45 ` Michal Hocko
2018-07-20 22:03 ` Marinko Catovic
2018-07-27 11:15 ` Vlastimil Babka
2018-07-30 14:40 ` Michal Hocko
2018-07-30 22:08 ` Marinko Catovic
2018-08-02 16:15 ` Vlastimil Babka
2018-08-03 14:13 ` Marinko Catovic
2018-08-06 9:40 ` Vlastimil Babka
2018-08-06 10:29 ` Marinko Catovic
2018-08-06 12:00 ` Michal Hocko
2018-08-06 15:37 ` Christopher Lameter
2018-08-06 18:16 ` Michal Hocko
2018-08-09 8:29 ` Marinko Catovic
2018-08-21 0:36 ` Marinko Catovic
2018-08-21 6:49 ` Michal Hocko
2018-08-21 7:19 ` Vlastimil Babka
2018-08-22 20:02 ` Marinko Catovic
2018-08-23 12:10 ` Vlastimil Babka
2018-08-23 12:21 ` Michal Hocko
2018-08-24 0:11 ` Marinko Catovic
2018-08-24 6:34 ` Vlastimil Babka
2018-08-24 8:11 ` Marinko Catovic
2018-08-24 8:36 ` Vlastimil Babka
2018-08-29 14:54 ` Marinko Catovic
2018-08-29 15:01 ` Michal Hocko
2018-08-29 15:13 ` Marinko Catovic
2018-08-29 15:27 ` Michal Hocko
2018-08-29 16:44 ` Marinko Catovic
2018-10-22 1:19 ` Marinko Catovic
2018-10-23 17:41 ` Marinko Catovic
2018-10-26 5:48 ` Marinko Catovic
2018-10-26 8:01 ` Michal Hocko
2018-10-26 23:31 ` Marinko Catovic
2018-10-27 6:42 ` Michal Hocko
[not found] ` <6e3a9434-32f2-0388-e0c7-2bd1c2ebc8b1@suse.cz>
2018-10-30 15:30 ` Michal Hocko
2018-10-30 16:08 ` Marinko Catovic
2018-10-30 17:00 ` Vlastimil Babka
2018-10-30 18:26 ` Marinko Catovic
2018-10-31 7:34 ` Michal Hocko
2018-10-31 7:32 ` Michal Hocko
2018-10-31 13:40 ` Vlastimil Babka
2018-10-31 14:53 ` Marinko Catovic
2018-10-31 17:01 ` Michal Hocko
2018-10-31 19:21 ` Marinko Catovic
2018-11-01 13:23 ` Michal Hocko
2018-11-01 22:46 ` Marinko Catovic
2018-11-02 8:05 ` Michal Hocko
2018-11-02 11:31 ` Marinko Catovic
2018-11-02 11:49 ` Michal Hocko
2018-11-02 12:22 ` Vlastimil Babka
2018-11-02 12:41 ` Marinko Catovic
2018-11-02 13:13 ` Vlastimil Babka
2018-11-02 13:50 ` Marinko Catovic
2018-11-02 14:49 ` Vlastimil Babka
2018-11-02 14:59 ` Vlastimil Babka
2018-11-30 12:01 ` Marinko Catovic
2018-12-10 21:30 ` Marinko Catovic
2018-12-10 21:47 ` Michal Hocko
2018-10-31 13:12 ` Vlastimil Babka
2018-08-24 6:24 ` Vlastimil Babka