hello Michal
well, these hints were just ideas mentioned by some people; it took me weeks just to figure
out that echo 2 > /proc/sys/vm/drop_caches helps, and I still don't know why this happens.
Right now I am observing ~18GB of unused RAM, since yesterday, so this is not always
about 100MB/3.5GB, but right now it may be in the process of shrinking.
I really cannot tell for sure, this is so nondeterministic - I just wish I could reproduce it for better testing.
Right now top shows:
KiB Mem : 65892044 total, 18169232 free, 11879604 used, 35843208 buff/cache
About 1GB of that goes to buffers, the rest to cache. The host *is* busy, and buff/cache consumed
all RAM yesterday; I had done echo 2 > /proc/sys/vm/drop_caches about one day before that.
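As a quick sanity check on the top(1) line quoted above: roughly a quarter of total RAM is sitting free while about half is buff/cache. A one-liner to do the arithmetic (the numbers are just the KiB values from that top output):

```shell
# Quick arithmetic on the top(1) line quoted above (values in KiB):
# how much of total RAM is free vs. buff/cache.
total=65892044; free=18169232; buffcache=35843208
awk -v t="$total" -v f="$free" -v b="$buffcache" \
    'BEGIN { printf "free: %.1f%%  buff/cache: %.1f%%\n", 100*f/t, 100*b/t }'
# prints: free: 27.6%  buff/cache: 54.4%
```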
Another host (still) shows full usage. That other one is identical in software and config,
but with different data/users; the use cases and load are pretty much similar.
Affected host at this time:
To compare, this is the other host, which is still showing full buffers/cache usage by now:
Usually both show this more or less at the same time, sometimes it is the one, sometimes
the other. Other hosts I have are currently not under similar high load, making it even harder
to compare.
However, right now I cannot observe this dropping towards really low values, but I am sure it will come.
fs is ext4, mount options are auto,rw,data=writeback,noatime,nodiratime,nodev,nosuid,async
The previous mount options (with the same behavior) also had max_dir_size_kb, quotas, and the default for data=,
so I also played around with these, but that made no difference.
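For reference, a sketch of what these options look like as an /etc/fstab entry; the device and mount point here are placeholders, not the actual ones:

```
# /etc/fstab sketch of the options listed above (device/mountpoint are placeholders)
/dev/sdb1  /data  ext4  auto,rw,data=writeback,noatime,nodiratime,nodev,nosuid,async  0 2
```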
---------
follow-up (sorry, I messed up the reply-to for this mailing list):
https://pastebin.com/0v4ZFNCv .. one hour later, right after my last report, 22GB free
https://pastebin.com/rReWnHtE .. one day later, 28GB free
It is interesting to see, however, that this did not get as low as mentioned before.
So I am not sure where this is going right now, but nevertheless the RAM is not fully occupied;
there is no reason to leave 28GB free at all.
There is still lots of I/O, and I am 100% positive that if I did echo 2 > /proc/sys/vm/drop_caches,
the caches would fill up the entire RAM again.
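For completeness, the workaround as I run it, as a small sketch; it assumes a Linux /proc, the write itself needs root (the non-root branch is just a guard):

```shell
# Workaround described above: drop reclaimable slab objects (dentries/inodes).
# "2" targets slab caches only; "3" would also drop the clean page cache.
sync                                   # flush dirty pages so more becomes reclaimable
if [ "$(id -u)" -eq 0 ]; then
    echo 2 > /proc/sys/vm/drop_caches  # requires root
else
    echo "need root to write /proc/sys/vm/drop_caches"
fi
```

After this, buff/cache grows back from I/O, which is the behavior I would expect all the time, not only after a manual drop.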
What I can see is that buffers are around 500-700MB; the values increase and decrease
all the time, really "oscillating" around 600MB. AFAIK this should grow as high as possible, as long
as there is free RAM - the other, still-healthy host has about 2GB buffers / 48GB cache, fully occupying RAM.
Currently I have set vm.dirty_ratio = 15, vm.dirty_background_ratio = 3, vm.vfs_cache_pressure = 1,
and the low usage occurred 3 days ago. Other values, like the defaults, or when I was playing
around with vm.dirty_ratio = 90, vm.dirty_background_ratio = 80 and various cache_pressure settings,
showed similar results.
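For reference, the currently-set values as a sysctl fragment; the file name is just my choice, not significant:

```
# /etc/sysctl.d/99-vm-tuning.conf (hypothetical file name); apply with: sysctl --system
vm.dirty_ratio = 15
vm.dirty_background_ratio = 3
vm.vfs_cache_pressure = 1
```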