From: Marinko Catovic
Date: Tue, 31 Jul 2018 00:08:08 +0200
To: linux-mm@kvack.org
Subject: Re: Caching/buffers become useless after some time
In-Reply-To: <20180730144048.GW24267@dhcp22.suse.cz>

> Can you provide (a single snapshot) /proc/pagetypeinfo and
> /proc/slabinfo from a system that's currently experiencing the issue,
> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks.

Your request came in just one day after I had dropped the caches again
(see the exact command below), because RAM usage had become really low
once more. The issue has not reoccurred on either of the two hosts since
then; one currently shows 550MB/11G with 37G of entirely free RAM - not
as low as last time I dropped the caches (around 300M/8G, if I remember
correctly), but I hope the snapshots still help:

/proc/pagetypeinfo  https://pastebin.com/6QWEZagL
/proc/slabinfo      https://pastebin.com/81QAFgke
/proc/vmstat        https://pastebin.com/S7mrQx1s
/proc/zoneinfo      https://pastebin.com/csGeqNyX

Also please note, in case it makes any difference: there is no swap file
or partition; I am running these hosts without swap space. IMHO swap
should not be necessary here, since the applications running on these
hosts never consume more than 20GB; the rest of the RAM should be
available for buffers/cache.
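(For reference, "dropping the caches" above refers to the standard procfs
knob; writing 2 frees reclaimable slab objects such as dentries and
inodes, 1 would drop the page cache, and 3 does both:

  sync                                # write back dirty pages first, so more memory is freeable
  echo 2 > /proc/sys/vm/drop_caches   # free reclaimable slab objects (dentries, inodes); needs root
)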
2018-07-30 16:40 GMT+02:00 Michal Hocko <mhocko@suse.com>:
> On Fri 27-07-18 13:15:33, Vlastimil Babka wrote:
> > On 07/21/2018 12:03 AM, Marinko Catovic wrote:
> > > I let this run for 3 days now, so it is quite a lot, there you go:
> > > https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz
> >
> > The stats show that compaction has very bad results. Between the first
> > and last snapshot, compact_fail grew by 80k but compact_success only
> > by 1300. High-order allocations will thus cycle between (failing)
> > compaction and reclaim that removes the buffers/caches from memory.
>
> I guess you are right. I've just looked at random large direct reclaim
> activity:
>
> $ grep -w pgscan_direct vmstat* | awk '{diff=$2-old; if (old && diff > 100000) printf "%s %d\n", $1, diff; old=$2}'
> vmstat.1531957422:pgscan_direct 114334
> vmstat.1532047588:pgscan_direct 111796
>
> $ paste-with-diff.sh vmstat.1532047578 vmstat.1532047588 | grep "pgscan\|pgsteal\|compact\|pgalloc" | sort
> # counter                         value1          value2-value1
> compact_daemon_free_scanned       2628160139      0
> compact_daemon_migrate_scanned    797948703       0
> compact_daemon_wake               23634           0
> compact_fail                      124806          108
> compact_free_scanned              226181616304    295560271
> compact_isolated                  2881602028      480577
> compact_migrate_scanned           147900786550    27834455
> compact_stall                     146749          108
> compact_success                   21943           0
> pgalloc_dma                       0               0
> pgalloc_dma32                     1577060946      10752
> pgalloc_movable                   0               0
> pgalloc_normal                    29389246430     343249
> pgscan_direct                     737335028       111796
> pgscan_direct_throttle            0               0
> pgscan_kswapd                     1177909394      0
> pgsteal_direct                    704542843       111784
> pgsteal_kswapd                    898170720       0
>
> There is zero kswapd activity, so this must have been higher-order
> allocation activity, and since all of the direct compaction attempts
> failed, we keep reclaiming.
> --
> Michal Hocko
> SUSE Labs
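(Note: paste-with-diff.sh above is a local helper that was not posted in
the thread. Assuming both arguments are plain /proc/vmstat snapshots,
i.e. "name value" pairs one per line, a minimal sketch reproducing the
"counter value1 value2-value1" output would be:

  #!/bin/sh
  # paste-with-diff.sh (hypothetical reconstruction): read the first
  # snapshot's counters into an array, then print each counter from the
  # second snapshot together with its old value and the delta.
  awk 'NR==FNR { old[$1] = $2; next }
       $1 in old { printf "%-31s %-15d %d\n", $1, old[$1], $2 - old[$1] }' "$1" "$2"

invoked as shown above, e.g.
./paste-with-diff.sh vmstat.<t1> vmstat.<t2> | grep "pgscan\|pgsteal\|compact\|pgalloc" | sort)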
