> Can you provide (a single snapshot) /proc/pagetypeinfo and
> /proc/slabinfo from a system that's currently experiencing the issue,
> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks.

Your request came in just one day after I ran echo 2 > drop_caches again,
because the RAM usage was really low again. So far the problem has not
recurred on either of the 2 hosts. One of them currently shows 550MB/11G
with 37G of totally free RAM, so not as low as last time when I dropped
the caches (I think it was around 300M/8G), but I hope it helps:

/proc/pagetypeinfo https://pastebin.com/6QWEZagL
/proc/slabinfo https://pastebin.com/81QAFgke
/proc/vmstat https://pastebin.com/S7mrQx1s
/proc/zoneinfo https://pastebin.com/csGeqNyX

Also please note, in case it makes any difference: there is no swap file
or partition, I am running these hosts without swap space. IMHO swap
should not be necessary, since the applications running on the hosts
would not consume more than 20GB; the rest should be used by
buffers/cache.

2018-07-30 16:40 GMT+02:00 Michal Hocko <mhocko@suse.com>:

On Fri 27-07-18 13:15:33, Vlastimil Babka wrote:
> On 07/21/2018 12:03 AM, Marinko Catovic wrote:
> > I let this run for 3 days now, so it is quite a lot, there you go:
> > https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz
>
> The stats show that compaction has very bad results. Between first and
> last snapshot, compact_fail grew by 80k and compact_success by 1300.
> High-order allocations will thus cycle between (failing) compaction and
> reclaim that removes the buffer/caches from memory.
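
To put the compaction numbers above in proportion, here is a minimal sketch that computes the share of successful compactions between two /proc/vmstat snapshots. The function name compact_success_pct is made up for illustration; it only assumes the usual "name value" layout of /proc/vmstat and the compact_success and compact_fail counters quoted above.

```shell
#!/bin/sh
# Hypothetical helper, not part of the thread's tooling.
# Usage: compact_success_pct FIRST_SNAPSHOT LAST_SNAPSHOT
# Prints the percentage of compaction attempts that succeeded between
# the two snapshots, or "n/a" if no attempt finished in that interval.
compact_success_pct() {
    awk '
        $1 == "compact_success" || $1 == "compact_fail" {
            if (NR == FNR) first[$1] = $2          # first snapshot: remember value
            else delta[$1] = $2 - first[$1]        # last snapshot: growth since then
        }
        END {
            total = delta["compact_success"] + delta["compact_fail"]
            if (total > 0)
                printf "%.1f\n", 100 * delta["compact_success"] / total
            else
                print "n/a"
        }
    ' "$1" "$2"
}
```

With the growth quoted above (about 1300 successes against 80k failures), this works out to roughly 1.6% of compaction attempts succeeding.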

I guess you are right. I've just looked at a few random periods of large direct reclaim activity:
$ grep -w pgscan_direct vmstat*| awk '{diff=$2-old; if (old && diff > 100000) printf "%s %d\n", $1, diff; old=$2}'
vmstat.1531957422:pgscan_direct 114334
vmstat.1532047588:pgscan_direct 111796

$ paste-with-diff.sh vmstat.1532047578 vmstat.1532047588 | grep "pgscan\|pgsteal\|compact\|pgalloc" | sort
# counter value1 value2-value1
compact_daemon_free_scanned 2628160139 0
compact_daemon_migrate_scanned 797948703 0
compact_daemon_wake 23634 0
compact_fail 124806 108
compact_free_scanned 226181616304 295560271
compact_isolated 2881602028 480577
compact_migrate_scanned 147900786550 27834455
compact_stall 146749 108
compact_success 21943 0
pgalloc_dma 0 0
pgalloc_dma32 1577060946 10752
pgalloc_movable 0 0
pgalloc_normal 29389246430 343249
pgscan_direct 737335028 111796
pgscan_direct_throttle 0 0
pgscan_kswapd 1177909394 0
pgsteal_direct 704542843 111784
pgsteal_kswapd 898170720 0

There is zero kswapd activity, so this must have been higher-order
allocation activity, and all the direct compaction failed, so we keep
reclaiming.
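
The paste-with-diff.sh script used above is not included in this thread. For anyone who wants to reproduce that comparison, a rough stand-in could look like the sketch below. The function name vmstat_diff and its exact output layout are assumptions, chosen to mimic the "counter value1 value2-value1" columns shown above.

```shell
#!/bin/sh
# Hypothetical stand-in for paste-with-diff.sh (the real script is not shown).
# Usage: vmstat_diff OLD_SNAPSHOT NEW_SNAPSHOT
# Prints one line per counter present in both files:
#   counter old_value (new_value - old_value)
vmstat_diff() {
    awk '
        NR == FNR { old[$1] = $2; next }    # first file: remember each counter
        ($1 in old) {
            # %.0f keeps large counters exact on awk variants with 32-bit %d
            printf "%s %s %.0f\n", $1, old[$1], $2 - old[$1]
        }
    ' "$1" "$2"
}
```

It would be used the same way as in the command above, for example: vmstat_diff vmstat.1532047578 vmstat.1532047588 | grep "pgscan\|pgsteal\|compact\|pgalloc" | sort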

--
Michal Hocko
SUSE Labs