> > 1. Send the current value of /sys/kernel/mm/transparent_hugepage/defrag
> 2. Unless it's 'defer' or 'never' already, try changing it to 'defer'.
>

/sys/kernel/mm/transparent_hugepage/defrag is
always defer defer+madvise [madvise] never

I *think* I already played around with these values. As far as I
remember, `never` almost caused the system to hang, or at least it did
while I was switching back to madvise.

Shall I switch it to defer now and observe (all hosts are running fine
right now), or switch to defer while the system is in the bad state?
And when doing this, should the improvement be measurable immediately?
I need to know how long to keep it that way before dropping caches
becomes necessary.

> Ah, checked the trace and it seems to be "php-cgi". Interesting that
> they use madvise(MADV_HUGEPAGE). Anyway the above still applies.

You know, that's at least an interesting hint. Look at this:
https://ckon.wordpress.com/2015/09/18/php7-opcache-performance/

This was experimental there, but a more recent version seems to have
it on by default, since it has to be disabled explicitly (which implies
to me that it is on by default). It is, however, *disabled* in the
runtime configuration (and not in effect, I just confirmed that).

It would be interesting to know whether madvise(MADV_HUGEPAGE) is then
active somewhere else, since it is in the dump, as you observed.

Please note that killing php-cgi would not make any difference then,
since these processes are started on request for every user and killed
once the script has finished. Depending on load, this can mean about
10-50 forks every second (with different system users).

That also *may* explain why it is not very deterministic (sometimes
earlier, sometimes later, sometimes on one host and not on the other),
since there are multiple php-cgi versions available and not everyone is
using the same version; most people stick to legacy versions.
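To check both points above from a script, here is a minimal sketch, not a tested tool. It assumes Python 3 on the affected host and root privileges (needed to read other users' /proc/<pid>/smaps). The function names are made up for illustration. It relies on two documented kernel interfaces: the active defrag mode is the bracketed word in the sysfs file, and a mapping that received madvise(MADV_HUGEPAGE) carries the `hg` flag on its `VmFlags:` line in smaps.

```python
import re
from pathlib import Path

# Path taken from the thread above.
DEFRAG = Path("/sys/kernel/mm/transparent_hugepage/defrag")


def active_setting(text):
    """Return the bracketed (currently active) value, or None if absent."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def has_hugepage_advice(smaps_text):
    """True if any VmFlags line carries 'hg' (set by MADV_HUGEPAGE)."""
    for line in smaps_text.splitlines():
        if line.startswith("VmFlags:") and "hg" in line.split()[1:]:
            return True
    return False


def pids_with_hugepage_advice(proc=Path("/proc")):
    """List (pid, comm) for processes with at least one 'hg' mapping."""
    found = []
    if not proc.is_dir():
        return found
    for entry in proc.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            smaps = (entry / "smaps").read_text(errors="replace")
            comm = (entry / "comm").read_text(errors="replace").strip()
        except OSError:
            # Process exited in the meantime, or we lack permission.
            continue
        if has_hugepage_advice(smaps):
            found.append((int(entry.name), comm))
    return found


if __name__ == "__main__":
    if DEFRAG.exists():
        print("defrag:", active_setting(DEFRAG.read_text()))
    for pid, comm in pids_with_hugepage_advice():
        print(pid, comm)
```

Since the php-cgi processes live only for the duration of one request, a single pass can easily miss them. The scan would have to run in a loop while the host is under load to have a chance of catching one.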