All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
	Andrea Argangeli <andrea@kernel.org>,
	Zi Yan <zi.yan@cs.rutgers.edu>,
	Stefan Priebe - Profihost AG <s.priebe@profihost.ag>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	Stable tree <stable@vger.kernel.org>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: [PATCH 1/2] mm: thp:  relax __GFP_THISNODE for MADV_HUGEPAGE mappings
Date: Thu, 4 Oct 2018 17:10:29 -0400	[thread overview]
Message-ID: <20181004211029.GE7344@redhat.com> (raw)
In-Reply-To: <alpine.DEB.2.21.1810041302330.16935@chino.kir.corp.google.com>

Hello David,

On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> There are ways to address this without introducing regressions for 
> existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept 
> remote thp allocations, which users of this library would never set, or 
> fix memory compaction so that it does not incur substantial allocation 
> latency when it will likely fail.

These librarians needs to call a new MADV_ and the current
MADV_HUGEPAGE should not be affected because the new MADV_ will
require some capbility (i.e. root privilege).

qemu was the first user of MADV_HUGEPAGE and I don't think it's fair
to break it and require change to it to run at higher privilege to
retain the direct compaction behavior of MADV_HUGEPAGE.

The new behavior you ask to retain in MADV_HUGEPAGE, generated the
same misbehavior to VM as mlock could have done too, so it can't just
be given by default without any privilege whatsoever.

Ok you could mitigate the breakage that MADV_HUGEPAGE could have
generated (before the recent fix) by isolating malicious or
inefficient programs with memcg, but by default in a multiuser system
without cgroups the global disruption provided before the fix
(i.e. the pathological THP behavior) is not warranted. memcg shouldn't
be mandatory to avoid a process to affect the VM in such a strong way
(i.e. all other processes who happened to be allocated in the node
where the THP allocation triggered, being trashed in swap like if all
memory of all other nodes was not completely free).

Not only that, it's not only about malicious processes it's also
excessively inefficient for processes that just don't fit in a local
node and use MADV_HUGEPAGE. Your processes all fit in the local node
for sure if they're happy about it. This was reported as a
"pathological THP regression" after all in a workload that couldn't
swap at all because of the iommu gup persistent refcount pins.

The alternative patch I posted which still invoked direct reclaim
locally, and falled back to NUMA-local PAGE_SIZEd allocations instead
of falling back to NUMA-remote THP, could have been given without
privilege, but it still won't swapout as this patch would have done,
so it still won't go as far as the MADV_HUGEPAGE behavior had before
the fix (for the lib users that strongly needed local memory).

Overall I think the call about the default behavior of MADV_HUGEPAGE
is still between removing __GFP_THISNODE if gfp_flags can reclaim (the
fix in -mm), or by changing direct compaction to only call compaction
and not reclaim (i.e. __GFP_COMPACT_ONLY) when __GFP_THISNODE is set.

To go beyond that some privilege is needed and a new MADV_ flag can
require privilege or return error if there's not enough privilege. So
the lib with 100's users can try to use that new flag first, show an
error in stderr (maybe under debug), and fallback to MADV_HUGEPAGE if
the app hasn't enough privilege. The alternative is to add a new mem
policy less strict than MPOL_BIND to achieve what you need on top of
MADV_HUGEPAGE (which also would require some privilege of course as
all mbinds). I assume you already evaluated the preferred and local
mbinds and it's not a perfect fit?

If we keep this as a new MADV_HUGEPAGE_FORCE_LOCAL flag, you could
still add a THP sysfs/sysctl control to lift the privilege requirement
marking it as insecure setting in docs
(mm/transparent_hugepage/madv_hugepage_force_local=0|1 forced to 0 by
default). This would be on the same lines of other sysctl that
increase the max number of files open and such things (perhaps a
sysctl would be better in fact for tuning in /etc/sysctl.conf).

Note there was still some improvement left possible in my
__GFP_COMPACT_ONLY patch alternative. Notably if the watermarks for
the local node shown the local node not to have enough real "free"
PAGE_SIZEd pages to succeed the PAGE_SIZEd local THP allocation if
compaction failed, we should have relaxed __GFP_THISNODE and tried to
allocate THP from the NUMA-remote nodes before falling back to
PAGE_SIZEd allocations. That also won't require any new privilege.

To get the same behavior as before though you would need to drop all
caches with echo 3 >drop_caches before the app starts (and it still
won't swap anon memory which previous MADV_HUGEPAGE behavior would
have, but the whole point is that the previous MADV_HUGEPAGE behavior
would have backfired for the lib too if the app was allocating so much
RAM from the local node that it required swapouts of local anon
memory).

Thanks,
Andrea

  reply	other threads:[~2018-10-04 21:10 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-25 12:03 [PATCH 0/2] thp nodereclaim fixes Michal Hocko
2018-09-25 12:03 ` Michal Hocko
2018-09-25 12:03 ` [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
2018-09-25 12:03   ` Michal Hocko
2018-09-25 12:20   ` Mel Gorman
2018-09-25 12:30     ` Michal Hocko
2018-10-04 20:16   ` David Rientjes
2018-10-04 21:10     ` Andrea Arcangeli [this message]
2018-10-04 23:05       ` David Rientjes
2018-10-06  3:19         ` Andrea Arcangeli
2018-10-05  7:38     ` Mel Gorman
2018-10-05 20:35       ` David Rientjes
2018-10-05 23:21         ` Andrea Arcangeli
2018-10-08 20:41           ` David Rientjes
2018-10-09  9:48             ` Mel Gorman
2018-10-09 12:27               ` Michal Hocko
2018-10-09 13:00                 ` Mel Gorman
2018-10-09 14:25                   ` Michal Hocko
2018-10-09 15:16                     ` Mel Gorman
2018-10-09 23:03                     ` Andrea Arcangeli
2018-10-10 21:19                       ` David Rientjes
2018-10-15 22:30                         ` David Rientjes
2018-10-15 22:44                           ` Andrew Morton
2018-10-15 23:19                             ` Andrea Arcangeli
2018-10-22 20:54                               ` David Rientjes
2018-10-16  7:46                             ` Mel Gorman
2018-10-16 22:37                               ` Andrew Morton
2018-10-16 23:11                                 ` Andrea Arcangeli
2018-10-16 23:16                                   ` Andrew Morton
2018-10-17  7:08                                     ` Michal Hocko
2018-10-17  9:00                                 ` Mel Gorman
2018-10-22 21:04                               ` David Rientjes
2018-10-23  1:27                                 ` Zi Yan
2018-10-23  1:27                                   ` Zi Yan
2018-10-28 21:45                                   ` David Rientjes
2018-10-23  7:57                                 ` Mel Gorman
2018-10-23  8:38                                   ` Mel Gorman
2018-10-15 22:57                           ` Andrea Arcangeli
2018-10-22 20:45                             ` David Rientjes
2018-10-09 22:17               ` David Rientjes
2018-10-09 22:51                 ` Andrea Arcangeli
2018-10-10  7:54                   ` Vlastimil Babka
2018-10-10 21:00                   ` David Rientjes
2018-10-09 13:08             ` Vlastimil Babka
2018-10-09 22:21             ` Andrea Arcangeli
2018-10-29  5:17   ` Balbir Singh
2018-10-29  9:00     ` Michal Hocko
2018-10-29  9:42       ` Balbir Singh
2018-10-29 10:08         ` Michal Hocko
2018-10-29 10:56           ` Andrea Arcangeli
2018-09-25 12:03 ` [PATCH 2/2] mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask Michal Hocko
2018-09-25 12:03   ` Michal Hocko
2018-09-26 13:30   ` Kirill A. Shutemov
2018-09-26 14:17     ` Michal Hocko
2018-09-26 14:22       ` Michal Hocko
2018-10-19  2:11         ` Andrew Morton
2018-10-19  8:06           ` Michal Hocko
2018-10-22 13:27             ` Vlastimil Babka
2018-10-24 23:17               ` Andrew Morton
2018-10-25  4:56                 ` Vlastimil Babka
2018-10-25 16:14                   ` Michal Hocko
2018-10-25 16:18                     ` Andrew Morton
2018-10-25 16:45                       ` Michal Hocko
2018-10-22 13:15         ` Vlastimil Babka
2018-10-22 13:30           ` Michal Hocko
2018-10-22 13:35             ` Vlastimil Babka
2018-10-22 13:46               ` Michal Hocko
2018-10-22 13:53                 ` Vlastimil Babka
2018-10-04 20:17     ` David Rientjes
2018-10-04 21:49       ` Zi Yan
2018-10-09 12:36       ` Michal Hocko
2018-09-26 13:08 ` linux-mm@ archive on lore.kernel.org (Was: [PATCH 0/2] thp nodereclaim fixes) Kirill A. Shutemov
2018-09-26 13:14   ` Michal Hocko
2018-09-26 22:22     ` Andrew Morton
2018-09-26 23:08       ` Mel Gorman
2018-09-27  0:47         ` Konstantin Ryabitsev
2018-09-26 15:25   ` Konstantin Ryabitsev
2018-09-27 11:30     ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181004211029.GE7344@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andrea@kernel.org \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@kernel.org \
    --cc=mhocko@suse.com \
    --cc=rientjes@google.com \
    --cc=s.priebe@profihost.ag \
    --cc=stable@vger.kernel.org \
    --cc=vbabka@suse.cz \
    --cc=zi.yan@cs.rutgers.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.