> Can you provide (a single snapshot) /proc/pagetypeinfo and
> /proc/slabinfo from a system that's currently experiencing the issue,
> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks.

Your request came in just one day after I did 2>drop_caches again, when the
RAM usage was really low again. Up until now it has not reoccurred on either
of the 2 hosts. One of them currently shows 550MB/11G with 37G of totally
free RAM, so not as low as last time when I dropped caches (I think it was
around 300M/8G), but I hope it helps:

/proc/pagetypeinfo  https://pastebin.com/6QWEZagL
/proc/slabinfo  https://pastebin.com/81QAFgke
/proc/vmstat  https://pastebin.com/S7mrQx1s
/proc/zoneinfo  https://pastebin.com/csGeqNyX

Also please note, in case it makes any difference: there is no swap file or
partition, I am running these hosts without swap space. IMHO swap should not
be necessary, since the applications running on the hosts would not consume
more than 20GB; the rest should be used by buffers/cache.

2018-07-30 16:40 GMT+02:00 Michal Hocko <mhocko@suse.com>:

> On Fri 27-07-18 13:15:33, Vlastimil Babka wrote:
> > On 07/21/2018 12:03 AM, Marinko Catovic wrote:
> > > I let this run for 3 days now, so it is quite a lot, there you go:
> > > https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz
> >
> > The stats show that compaction has very bad results. Between first and
> > last snapshot, compact_fail grew by 80k and compact_success by 1300.
> > High-order allocations will thus cycle between (failing) compaction and
> > reclaim that removes the buffer/caches from memory.
>
> I guess you are right. I've just looked at random large direct reclaim
> activity
> $ grep -w pgscan_direct  vmstat*| awk  '{diff=$2-old; if (old && diff > 100000) printf "%s %d\n", $1, diff; old=$2}'
> vmstat.1531957422:pgscan_direct 114334
> vmstat.1532047588:pgscan_direct 111796
>
> $ paste-with-diff.sh vmstat.1532047578 vmstat.1532047588 | grep "pgscan\|pgsteal\|compact\|pgalloc" | sort
> # counter                       value1          value2-value1
> compact_daemon_free_scanned     2628160139      0
> compact_daemon_migrate_scanned  797948703       0
> compact_daemon_wake     23634   0
> compact_fail    124806  108
> compact_free_scanned    226181616304    295560271
> compact_isolated        2881602028      480577
> compact_migrate_scanned 147900786550    27834455
> compact_stall   146749  108
> compact_success 21943   0
> pgalloc_dma     0       0
> pgalloc_dma32   1577060946      10752
> pgalloc_movable 0       0
> pgalloc_normal  29389246430     343249
> pgscan_direct   737335028       111796
> pgscan_direct_throttle  0       0
> pgscan_kswapd   1177909394      0
> pgsteal_direct  704542843       111784
> pgsteal_kswapd  898170720       0
>
> There is zero kswapd activity so this must have been higher order
> allocation activity and all the direct compaction failed so we keep
> reclaiming.
> --
> Michal Hocko
> SUSE Labs
>
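
The paste-with-diff.sh script used in the quoted mail is not included in this
thread, so its exact behaviour is unknown. As a rough sketch of the same kind
of comparison, assuming each snapshot is a plain copy of /proc/vmstat (one
"name value" pair per line), something like the following could produce a
similar three-column listing. The function names parse_vmstat, diff_counters
and print_diff are hypothetical, and the default filter simply mirrors the
grep pattern above.

```python
import re
import sys


def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly "name <unsigned integer>".
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def diff_counters(before, after, pattern=r"pgscan|pgsteal|compact|pgalloc"):
    """Return (name, value1, value2 - value1) rows, sorted by counter name.

    Only counters present in both snapshots whose name matches `pattern`
    are reported. The pattern is an assumption copied from the grep above.
    """
    wanted = re.compile(pattern)
    rows = []
    for name in sorted(before):
        if name in after and wanted.search(name):
            rows.append((name, before[name], after[name] - before[name]))
    return rows


def print_diff(rows):
    """Print the rows in a layout similar to the quoted output."""
    print("# counter                       value1          value2-value1")
    for name, value1, delta in rows:
        print(f"{name:<31} {value1:<15} {delta}")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): vmstat_diff.py SNAPSHOT1 SNAPSHOT2
    with open(sys.argv[1]) as first, open(sys.argv[2]) as second:
        print_diff(diff_counters(parse_vmstat(first.read()),
                                 parse_vmstat(second.read())))
```

Run against two consecutive snapshots, a large delta in pgscan_direct next to
a zero delta in pgscan_kswapd and compact_success would correspond to the
pattern described above: direct reclaim driven by failing compaction rather
than by kswapd.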