From: "Liang, Kan" <kan.liang@intel.com>
To: Mel Gorman <mgorman@techsingularity.net>,
Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@elte.hu>, Andi Kleen <ak@linux.intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
linux-mm <linux-mm@kvack.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk
Date: Fri, 18 Aug 2017 14:20:38 +0000 [thread overview]
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F077537879BB@SHSMSX103.ccr.corp.intel.com> (raw)
In-Reply-To: <20170818122339.24grcbzyhnzmr4qw@techsingularity.net>
> On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote:
> > On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan <kan.liang@intel.com> wrote:
> > >
> > > Here is the call stack of wait_on_page_bit_common when the queue is
> > > long (entries >1000).
> > >
> > > # Overhead Trace output
> > > # ........ ..................
> > > #
> > > 100.00% (ffffffff931aefca)
> > > |
> > > ---wait_on_page_bit
> > > __migration_entry_wait
> > > migration_entry_wait
> > > do_swap_page
> > > __handle_mm_fault
> > > handle_mm_fault
> > > __do_page_fault
> > > do_page_fault
> > > page_fault
> >
> > Hmm. Ok, so it does seem to very much be related to migration. Your
> > wake_up_page_bit() profile made me suspect that, but this one seems to
> > pretty much confirm it.
> >
> > So it looks like that wait_on_page_locked() thing in
> > __migration_entry_wait(), and what probably happens is that your load
> > ends up triggering a lot of migration (or just migration of a very hot
> > page), and then *every* thread ends up waiting for whatever page that
> > ended up getting migrated.
> >
>
> Agreed.
>
> > And so the wait queue for that page grows hugely long.
> >
>
> It's basically only bounded by the maximum number of threads that can exist.
>
> > Looking at the other profile, the thing that is locking the page (that
> > everybody then ends up waiting on) would seem to be
> > migrate_misplaced_transhuge_page(), so this is _presumably_ due to
> > NUMA balancing.
> >
>
> Yes, migrate_misplaced_transhuge_page requires NUMA balancing to be
> part of the picture.
>
> > Does the problem go away if you disable the NUMA balancing code?
> >
> > Adding Mel and Kirill to the participants, just to make them aware of
> > the issue, and just because their names show up when I look at blame.
> >
>
> I'm not imagining a way of dealing with this that would reliably detect when
> there are a large number of waiters without adding a mess. We could adjust
> the scanning rate to reduce the problem but it would be difficult to target
> properly and wouldn't prevent the problem occurring with the added hassle
> that it would now be intermittent.
>
> Assuming the problem goes away by disabling NUMA then it would be nice if
> it could be determined that the page lock holder is trying to allocate a page
> when the queue is huge. That is part of the operation that potentially takes a
> long time and may be why so many callers are stacking up. If so, I would
> suggest clearing __GFP_DIRECT_RECLAIM from the GFP flags in
> migrate_misplaced_transhuge_page and assume that a remote hit is always
> going to be cheaper than compacting memory to successfully allocate a THP.
> That may be worth doing unconditionally because we'd have to save a
> *lot* of remote misses to offset compaction cost.
>
> Nothing fancy other than needing a comment if it works.
>
No, the patch below doesn't work.
Thanks,
Kan
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 627671551873..87b0275ddcdb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1926,7 +1926,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  		goto out_dropref;
>  
>  	new_page = alloc_pages_node(node,
> -		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
> +		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) & ~__GFP_DIRECT_RECLAIM,
>  		HPAGE_PMD_ORDER);
>  	if (!new_page)
>  		goto out_fail;
>
> --
> Mel Gorman
> SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org