linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Alex Williamson <alex.williamson@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always
Date: Tue, 21 Aug 2018 17:40:49 -0400	[thread overview]
Message-ID: <20180821214049.GG13047@redhat.com> (raw)
In-Reply-To: <20180821115057.GY29735@dhcp22.suse.cz>

On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
> So does reverting 5265047ac301 ("mm, thp: really limit transparent
> hugepage allocation to local node") help?

That won't revert clean, to be sure you'd need to bisect around it
which I haven't tried because I don't think there was doubt around it
(and only the part in mempolicy.c is relevant at it).

It's not so important to focus on that commit the code changed again
later, I'm focusing on the code as a whole.

It's not just that commit that is the problem here, the problem starts
in the previous commits that adds the NUMA locality logic, the
addition of __GFP_THISNODE is the icing on the cake that makes things
fall apart.

> I really detest a new gfp flag for one time semantic that is muddy as
> hell.

Well there's no way to fix this other than to prevent reclaim to run,
if you still want to give a chance to page faults to obtain THP under
MADV_HUGEPAGE in the page fault without waiting minutes or hours for
khugpaged to catch up with it.

> This is simply incomprehensible. How can anybody who is not deeply
> familiar with the allocator/reclaim internals know when to use it.

Nobody should use this in drivers, it's a __GFP flag.

Note:

	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {

#define alloc_hugepage_vma(gfp_mask, vma, addr, order)	\
	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)

Only THP is ever affected by the BUG so nothing else will ever need to
call __GFP_COMPACT_ONLY. It is a VM internal flag, I wish there was a
way to make the build fail if a driver would use it but there isn't
right now.

All other order 9 or 10 allocations from drivers won't call
alloc_hugepage_vma. Only mm/huge_memory.c ever calls that which is why
the regression only happens with MADV_HUGEPAGE (i.e. qemu is being
bitten badly on NUMA hosts).

> If this is really a regression then we should start by pinpointing the

You can check yourself, create a 2 node vnuma guest or pick any host
with more than one node. Set defrag=always and run "memhog -r11111111
18g" if host has 16g per node. Add some swap and notice the swap storm
while all ram is left free in the other node.

> real culprit and go from there. If this is really 5265047ac301 then just

In my view there's no single culprit, but it was easy to identify the
last drop that made the MADV_HUGEPAGE glass overflow, and it's that
commit that adds __GFP_THISNODE. The combination of the previous code
that prioritized NUMA over THP and then the MADV_HUGEPAGE logic that
still uses compaction (and in turn reclaim if compaction fails with
COMPACT_SKIPPED because there's no 4k page in the local node) just
falls apart with __GFP_THISNODE set as well on top of it and it
doesn't do the expected thing either without it (i.e. THP gets
priority over NUMA locality without such flag).

__GFP_THISNODE and the logic there, only works ok when
__GFP_DIRECT_RECLAIM is not set, i.e. MADV_HUGEPAGE not set.

We don't want to wait hours for khugepaged to catch up in qemu to get
THP. Compaction is certainly worth it to run if the userland
explicitly gives the hint to the kernel the allocations are long
lived.

> start by reverting it. I strongly suspect there is some mismatch in
> expectations here. What others consider acceptable seems to be a problem
> for others. I believe that was one of the reasons why we have changed
> the default THP direct compaction behavior, no?

This is not about the "default" though, I'm not changing the default
either, this is about MADV_HUGEPAGE behavior and when you change from
the defrag "default" value to "always" (which is equivalent than
having have all vmas set with MADV_HUGEPAGE).

QEMU is being optimal in setting MADV_HUGEPAGE, and rightfully so, but
it's getting punished badly because of this kernel bug.

Thanks,
Andrea

  reply	other threads:[~2018-08-21 21:40 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-20  3:22 [PATCH 0/2] fix for "pathological THP behavior" Andrea Arcangeli
2018-08-20  3:22 ` [PATCH 1/2] mm: thp: consolidate policy_nodemask call Andrea Arcangeli
2018-08-20  3:22 ` [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always Andrea Arcangeli
2018-08-20  3:26   ` [PATCH 0/1] fix for "pathological THP behavior" v2 Andrea Arcangeli
2018-08-20  3:26     ` [PATCH 1/1] mm: thp: fix transparent_hugepage/defrag = madvise || always Andrea Arcangeli
2018-08-20 12:35   ` [PATCH 2/2] " Zi Yan
2018-08-20 15:32     ` Andrea Arcangeli
2018-08-21 11:50   ` Michal Hocko
2018-08-21 21:40     ` Andrea Arcangeli [this message]
2018-08-22  9:02       ` Michal Hocko
2018-08-22 11:07         ` Michal Hocko
2018-08-22 14:24           ` Andrea Arcangeli
2018-08-22 14:45             ` Michal Hocko
2018-08-22 15:24               ` Andrea Arcangeli
2018-08-23 10:50                 ` Michal Hocko
2018-08-22 15:52         ` Andrea Arcangeli
2018-08-23 10:52           ` Michal Hocko
2018-08-28  7:53             ` Michal Hocko
2018-08-28  8:18               ` Michal Hocko
2018-08-28  8:54                 ` Stefan Priebe - Profihost AG
2018-08-29 11:11                   ` Stefan Priebe - Profihost AG
     [not found]                 ` <D5F4A33C-0A37-495C-9468-D6866A862097@cs.rutgers.edu>
2018-08-29 14:28                   ` Michal Hocko
2018-08-29 14:35                     ` Michal Hocko
2018-08-29 15:22                       ` Zi Yan
2018-08-29 15:47                         ` Michal Hocko
2018-08-29 16:06                           ` Zi Yan
2018-08-29 16:25                             ` Michal Hocko
2018-08-29 19:24                               ` [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
2018-08-29 22:54                                 ` Zi Yan
2018-08-30  7:00                                   ` Michal Hocko
2018-08-30 13:22                                     ` Zi Yan
2018-08-30 13:45                                       ` Michal Hocko
2018-08-30 14:02                                         ` Zi Yan
2018-08-30 16:19                                           ` Stefan Priebe - Profihost AG
2018-08-30 16:40                                           ` Michal Hocko
2018-09-05  3:44                                             ` Andrea Arcangeli
2018-09-05  7:08                                               ` Michal Hocko
2018-09-06 11:10                                                 ` Vlastimil Babka
2018-09-06 11:16                                                   ` Vlastimil Babka
2018-09-06 11:25                                                     ` Michal Hocko
2018-09-06 12:35                                                       ` Zi Yan
2018-09-06 10:59                                   ` Vlastimil Babka
2018-09-06 11:17                                     ` Zi Yan
2018-08-30  6:47                                 ` Michal Hocko
2018-09-06 11:18                                   ` Vlastimil Babka
2018-09-06 11:27                                     ` Michal Hocko
2018-09-12 17:29                                 ` Mel Gorman
2018-09-17  6:11                                   ` Michal Hocko
2018-09-17  7:04                                     ` Stefan Priebe - Profihost AG
2018-09-17  9:32                                       ` Stefan Priebe - Profihost AG
2018-09-17 11:27                                       ` Michal Hocko
2018-08-20 11:58 ` [PATCH 0/2] fix for "pathological THP behavior" Kirill A. Shutemov
2018-08-20 15:19   ` Andrea Arcangeli
2018-08-21 15:30     ` Vlastimil Babka
2018-08-21 17:26       ` David Rientjes
2018-08-21 22:18         ` Andrea Arcangeli
2018-08-21 22:05       ` Andrea Arcangeli
2018-08-22  9:24       ` Michal Hocko
2018-08-22 15:56         ` Andrea Arcangeli
2018-08-20 19:06   ` Yang Shi
2018-08-20 23:24     ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180821214049.GG13047@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=rientjes@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).