All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@suse.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Alex Williamson <alex.williamson@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
Date: Thu, 23 Aug 2018 12:50:57 +0200	[thread overview]
Message-ID: <20180823105057.GA29735@dhcp22.suse.cz> (raw)
In-Reply-To: <20180822152402.GO13047@redhat.com>

On Wed 22-08-18 11:24:02, Andrea Arcangeli wrote:
> On Wed, Aug 22, 2018 at 04:45:17PM +0200, Michal Hocko wrote:
> > Now I am confused. How can compaction help at all then? I mean  if the
> > node is full of GUP pins then you can hardly do anything but fallback to
> > other node. Or how come your new GFP flag makes any difference?
> 
> It helps until the node is full.
> 
> If you don't call compaction you will get zero THP even when you've
> plenty of free memory.
> 
> So the free memory goes down and down as more and more THP are
> generated y compaction until compaction then fails with
> COMPACT_SKIPPED, there's not enough free memory to relocate an "order
> 9" amount of physically contiguous PAGE_SIZEd fragments.
> 
> At that point the code calls reclaim to make space for a new
> compaction run. Then if that fails again it's not because there's no
> enough free memory.
> 
> Problem is if you ever call reclaim when compaction fails, what
> happens is you free an "order 9" and then it gets eaten up by the app
> so then next compaction call, calls COMPACT_SKIPPED again.
> 
> This is how compaction works since day zero it was introduced in
> kernel 2.6.x something, if you don't have crystal clear the inner
> workings of compaction you have an hard time to review this. So hope
> the above shed some light of how this plays out.
> 
> So in general calling reclaim is ok because compaction fails more
> often than not in such case because it can't compact memory not
> because there aren't at least 2m free in any node. However when you use
> __GFP_THISNODE combined with reclaim that changes the whole angle and
> behavior of compaction if reclaim is still active.
> 
> Not calling compaction in MADV_HUGEPAGE means you can drop
> MADV_HUGEPAGE as a whole. There's no point to ever set it unless we
> call compaction. And if you don't call at least once compaction you
> have near zero chances to get gigabytes of THP even if it's all
> compactable memory and there are gigabytes of free memory in the node,
> after some runtime that shakes the fragments in the buddy.
> 
> To make it even more clear why compaction has to run once at least
> when MADV_HUGEPAGE is set, just check the second last column of your
> /proc/buddyinfo before and after "echo 3 >/proc/sys/vm/drop_caches;
> echo >/proc/sys/vm/compact_memory". Try to allocate memory without
> MADV_HUGEPAGE and without running the "echo 3; echo" and see how much
> THP you'll get. I've plenty of workloads that use MADV_HUGEPAGE not
> just qemu and that totally benefit immediately from THP and there's no
> point to ever defer compaction to khugepaged when userland says "this
> is a long lived allocation".
> 
> Compaction is only potentially wasteful for short lived allocation, so
> MADV_HUGEPAGE has to call compaction.

I guess you have missed my point. I was not suggesting compaction is
pointless. I meant to say, how can be compaction useful in the scenario
you were suggesting when the node is full of pinned pages.

> > It would still try to reclaim easy target as compaction requires. If you
> > do not reclaim at all you can make the current implementation of the
> > compaction noop due to its own watermark checks IIRC.
> 
> That's the feature, if you don't make it a noop when watermark checks
> trigger, it'll end up wasting CPU and breaking vfio.
> 
> The point is that we want compaction to run when there's free memory
> and compaction keeps succeeding.
> 
> So when compaction fails, if it's because we finished all free memory
> in the node, we should just remove __GFP_THISNODE and allocate without
> it (i.e. the optimization). If compaction fails because the memory is
> fragmented but here's still free memory we should fail the allocation
> and trigger the THP fallback to PAGE_SIZE fault.
> 
> Overall removing __GFP_THISNODE unconditionally would simply
> prioritize THP over NUMA locality which is the point of this special
> logic for THP. I can't blame the logic because it certainly helps NUMA
> balancing a lot in letting the memory be in the right place from the
> start. This is why __GFP_COMPACT_ONLY makes sense, to be able to
> retain the logic but still preventing the corner case of such
> __GFP_THISNODE that breaks the VM with MADV_HUGEPAGE.

But __GFP_COMPACT_ONLY is a layering violation because you are
compaction does depend on the reclaim right now.
 
> > yeah, I agree about PAGE_ALLOC_COSTLY_ORDER being an arbitrary limit for
> > a different behavior. But we already do handle those specially so it
> > kind of makes sense to me to expand on that.
> 
> It's still a sign of one more place that needs magic for whatever
> reason. So unless it can be justified by some runtime tests I wouldn't
> make such change by just thinking about it. Reclaim is called if
> there's no free memory left anywhere for compaction to run (i.e. if
> __GFP_THISNODE is not set, if __GPF_THISNODE is set then the caller
> better use __GFP_COMPACT_ONLY).

I am not insisting on the hack I have proposed mostly for the sake of
discussion. But I _strongly_ believe that __GFP_COMPACT_ONLY is the
wrong way around the issue. We are revolving around __GFP_THISNODE
having negative side effect and that is exactly an example of a gfp flag
abuse for internal MM stuff which just happens to be a complete PITA for
a long time.
 
> Now we could also get away without __GFP_COMPACT_ONLY, we could check
> __GFP_THISNODE and make it behave exactly like __GFP_COMPACT_ONLY
> whenever __GFP_DIRECT_RECLAIM was also set in addition of
> __GFP_THISNODE, but then you couldn't use __GFP_THISNODE as a mbind
> anymore and it would have more obscure semantics than a new flag I
> think.

Or simply do not play tricks with __GFP_THISNODE.
-- 
Michal Hocko
SUSE Labs

  reply	other threads:[~2018-08-23 10:51 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-20  3:22 [PATCH 0/2] fix for "pathological THP behavior" Andrea Arcangeli
2018-08-20  3:22 ` [PATCH 1/2] mm: thp: consolidate policy_nodemask call Andrea Arcangeli
2018-08-20  3:22 ` [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always Andrea Arcangeli
2018-08-20  3:26   ` [PATCH 0/1] fix for "pathological THP behavior" v2 Andrea Arcangeli
2018-08-20  3:26     ` [PATCH 1/1] mm: thp: fix transparent_hugepage/defrag = madvise || always Andrea Arcangeli
2018-08-20 12:35   ` [PATCH 2/2] " Zi Yan
2018-08-20 15:32     ` Andrea Arcangeli
2018-08-21 11:50   ` Michal Hocko
2018-08-21 21:40     ` Andrea Arcangeli
2018-08-22  9:02       ` Michal Hocko
2018-08-22 11:07         ` Michal Hocko
2018-08-22 14:24           ` Andrea Arcangeli
2018-08-22 14:45             ` Michal Hocko
2018-08-22 15:24               ` Andrea Arcangeli
2018-08-23 10:50                 ` Michal Hocko [this message]
2018-08-22 15:52         ` Andrea Arcangeli
2018-08-23 10:52           ` Michal Hocko
2018-08-28  7:53             ` Michal Hocko
2018-08-28  8:18               ` Michal Hocko
2018-08-28  8:54                 ` Stefan Priebe - Profihost AG
2018-08-29 11:11                   ` Stefan Priebe - Profihost AG
     [not found]                 ` <D5F4A33C-0A37-495C-9468-D6866A862097@cs.rutgers.edu>
2018-08-29 14:28                   ` Michal Hocko
2018-08-29 14:35                     ` Michal Hocko
2018-08-29 15:22                       ` Zi Yan
2018-08-29 15:47                         ` Michal Hocko
2018-08-29 16:06                           ` Zi Yan
2018-08-29 16:25                             ` Michal Hocko
2018-08-29 19:24                               ` [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
2018-08-29 22:54                                 ` Zi Yan
2018-08-30  7:00                                   ` Michal Hocko
2018-08-30 13:22                                     ` Zi Yan
2018-08-30 13:45                                       ` Michal Hocko
2018-08-30 14:02                                         ` Zi Yan
2018-08-30 16:19                                           ` Stefan Priebe - Profihost AG
2018-08-30 16:40                                           ` Michal Hocko
2018-09-05  3:44                                             ` Andrea Arcangeli
2018-09-05  7:08                                               ` Michal Hocko
2018-09-06 11:10                                                 ` Vlastimil Babka
2018-09-06 11:16                                                   ` Vlastimil Babka
2018-09-06 11:25                                                     ` Michal Hocko
2018-09-06 12:35                                                       ` Zi Yan
2018-09-06 10:59                                   ` Vlastimil Babka
2018-09-06 11:17                                     ` Zi Yan
2018-08-30  6:47                                 ` Michal Hocko
2018-09-06 11:18                                   ` Vlastimil Babka
2018-09-06 11:27                                     ` Michal Hocko
2018-09-12 17:29                                 ` Mel Gorman
2018-09-17  6:11                                   ` Michal Hocko
2018-09-17  7:04                                     ` Stefan Priebe - Profihost AG
2018-09-17  9:32                                       ` Stefan Priebe - Profihost AG
2018-09-17 11:27                                       ` Michal Hocko
2018-08-20 11:58 ` [PATCH 0/2] fix for "pathological THP behavior" Kirill A. Shutemov
2018-08-20 15:19   ` Andrea Arcangeli
2018-08-21 15:30     ` Vlastimil Babka
2018-08-21 17:26       ` David Rientjes
2018-08-21 22:18         ` Andrea Arcangeli
2018-08-21 22:05       ` Andrea Arcangeli
2018-08-22  9:24       ` Michal Hocko
2018-08-22 15:56         ` Andrea Arcangeli
2018-08-20 19:06   ` Yang Shi
2018-08-20 23:24     ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180823105057.GA29735@dhcp22.suse.cz \
    --to=mhocko@suse.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=linux-mm@kvack.org \
    --cc=rientjes@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.