From: Marinko Catovic
Date: Tue, 30 Oct 2018 19:26:32 +0100
Subject: Re: Caching/buffers become useless after some time
To: Vlastimil Babka
Cc: Michal Hocko, linux-mm@kvack.org, Christopher Lameter

On Tue, 30 Oct 2018 at 18:03, Vlastimil Babka wrote:
>
> On 10/30/18 5:08 PM, Marinko Catovic wrote:
> >> One notable thing here is that there shouldn't be any reason to do
> >> direct reclaim when kswapd itself doesn't do anything. It could either
> >> be blocked on something (though I find it quite surprising to see it in
> >> that state for the whole 1500s period), or we are simply not low on
> >> free memory at all. That would point towards compaction-triggered
> >> memory reclaim, which is accounted as direct reclaim as well. Direct
> >> compaction triggered more than once a second on average. We shouldn't
> >> really reclaim unless we are low on memory, but repeatedly failing
> >> compaction could add up and reclaim a lot in the end. There seem to
> >> be quite a lot of low-order requests as per your trace buffer.
>
> I realized that the fact that slabs grew so large might be very
> relevant. It means a lot of unmovable pages, and while they are slowly
> being freed, the remaining ones are scattered all over the memory,
> making it impossible to compact successfully until the slabs are almost
> *completely* freed. It's in fact the theoretical worst case scenario for
> compaction and fragmentation avoidance. Next time it would be nice to
> also gather /proc/pagetypeinfo and /proc/slabinfo to see what grew so
> much there (probably dentries and inodes).

How would you like the results? As a job collecting those every 5
seconds, from the 3 > drop_caches until the worst case (which may take
24 hours)? Or at some specific point in time? (A sketch of what such a
collection job could look like follows below, after the quoted part.)
Please note that I already provided them (see my response before) as a
one-time snapshot while being in the worst case:

cat /proc/pagetypeinfo  https://pastebin.com/W1sJscsZ
cat /proc/slabinfo      https://pastebin.com/9ZPU3q7X

> The question is why the problems happened some time later, after the
> unmovable pollution. The trace showed me that the structure of
> allocations wrt order+flags, as Michal breaks them down below, is not
> significantly different in the last phase than in the whole trace.
> Possibly the state of memory gradually changed so that the various
> heuristics (fragindex, pageblock skip bits etc.) resulted in compaction
> being tried more than initially, eventually hitting a very bad corner
> case.
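Coming back to collecting /proc/pagetypeinfo and /proc/slabinfo: a
minimal sketch of such a job could look like the following. The
5-second interval is just the value mentioned above; the output
directory name is an arbitrary example. At that rate, a full 24 hours
produces roughly 17,000 snapshot pairs, so the directory would need
some pruning or compression afterwards.

#!/bin/sh
# Sketch of a collection job: snapshot /proc/pagetypeinfo and
# /proc/slabinfo every INTERVAL seconds, one timestamped file each,
# so the progression towards the worst case can be reconstructed later.
OUTDIR=/root/mm-snapshots    # example path, anything with enough space
INTERVAL=5                   # seconds, as discussed above
mkdir -p "$OUTDIR"
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    cat /proc/pagetypeinfo > "$OUTDIR/pagetypeinfo-$ts"
    cat /proc/slabinfo     > "$OUTDIR/slabinfo-$ts"
    sleep "$INTERVAL"
done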
> >> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
> >> 1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
> >> 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >> 93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
> >> 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >> 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
> >> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
> >>
> >> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
> >> That leaves us with
> >> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>
> I suspect there are lots of short-lived processes, so these are probably
> rapidly recycled and not causing compaction.

Well yes, since this is shared hosting there are lots of users running
lots of scripts, perhaps 5-50 new forks and kills every second,
depending on load; hard to tell.

> It also seems to be pgd allocation (2 pages due to PTI), not kernel stack?

Plain English, please? :)

> >> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>
> I would again suspect those. IIRC we already confirmed earlier that the
> THP defrag setting is madvise or madvise+defer, and there are
> madvise(MADV_HUGEPAGE) using processes? Did you ever try changing defrag
> to plain 'defer'?

Yes, I think I mentioned this before. AFAIK it did not make any
(immediate) difference; madvise is the current setting.

> and there are madvise(MADV_HUGEPAGE) using processes?

Can't tell you that...

> >>
> >> By far the kernel stack allocations are in the lead. You can get some
> >> relief by enabling CONFIG_VMAP_STACK. There is also a notable number of
> >> THP page allocations. Just curious, are you running on a NUMA machine?
> >> If yes, [1] might be relevant. Other than that nothing really jumped at
> >> me.
> >
> > Thanks a lot, Vlastimil!
>
> And Michal :)
>
> > I would not really know whether this is NUMA; it is some usual server
> > running with an i7-8700 and ECC RAM. How would I find out?
>
> Please provide /proc/zoneinfo and we'll see.
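As an aside, for checking the current defrag setting and the node count
directly, something like the following should work; these are just the
standard sysfs interfaces, nothing specific to this machine, and only
meant as a sketch:

# Current THP defrag mode; the active value is shown in [brackets].
cat /sys/kernel/mm/transparent_hugepage/defrag

# Switching to plain 'defer' at runtime, if we want to try that:
echo defer > /sys/kernel/mm/transparent_hugepage/defrag

# Number of NUMA nodes; a single-socket i7 is normally just one node.
ls -d /sys/devices/system/node/node* | wc -l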
There you go:

cat /proc/zoneinfo  https://pastebin.com/RMTwtXGr

> > So I should do CONFIG_VMAP_STACK=y and try that..?
>
> I suspect you already have it.

Yes, true, the currently loaded kernel is built with =y there.
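For reference, that can be verified against the running kernel with
something like the following; this assumes the distro ships its kernel
config in the usual locations:

# Kernel config as shipped by the distribution:
grep CONFIG_VMAP_STACK /boot/config-$(uname -r)

# Or, if the kernel exposes its own config:
zgrep CONFIG_VMAP_STACK /proc/config.gz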