From: "Liang, Kan"
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk
Date: Fri, 18 Aug 2017 14:20:38 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F077537879BB@SHSMSX103.ccr.corp.intel.com>
References: <84c7f26182b7f4723c0fe3b34ba912a9de92b8b7.1502758114.git.tim.c.chen@linux.intel.com> <37D7C6CF3E00A74B8858931C1DB2F07753786CE9@SHSMSX103.ccr.corp.intel.com> <37D7C6CF3E00A74B8858931C1DB2F0775378761B@SHSMSX103.ccr.corp.intel.com> <20170818122339.24grcbzyhnzmr4qw@techsingularity.net>
In-Reply-To: <20170818122339.24grcbzyhnzmr4qw@techsingularity.net>
To: Mel Gorman, Linus Torvalds
Cc: Mel Gorman, "Kirill A. Shutemov", Tim Chen, Peter Zijlstra, Ingo Molnar, Andi Kleen, Andrew Morton, Johannes Weiner, Jan Kara, linux-mm, Linux Kernel Mailing List

> On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote:
> > On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote:
> > >
> > > Here is the call stack of wait_on_page_bit_common when the queue is
> > > long (entries >1000).
> > >
> > > # Overhead  Trace output
> > > # ........  ..................
> > > #
> > >    100.00%  (ffffffff931aefca)
> > >             |
> > >             ---wait_on_page_bit
> > >                __migration_entry_wait
> > >                migration_entry_wait
> > >                do_swap_page
> > >                __handle_mm_fault
> > >                handle_mm_fault
> > >                __do_page_fault
> > >                do_page_fault
> > >                page_fault
> >
> > Hmm. Ok, so it does seem to very much be related to migration. Your
> > wake_up_page_bit() profile made me suspect that, but this one seems to
> > pretty much confirm it.
> >
> > So it looks like that wait_on_page_locked() thing in
> > __migration_entry_wait(), and what probably happens is that your load
> > ends up triggering a lot of migration (or just migration of a very hot
> > page), and then *every* thread ends up waiting for whatever page that
> > ended up getting migrated.
> >
>
> Agreed.
>
> > And so the wait queue for that page grows hugely long.
> >
>
> It's basically only bounded by the maximum number of threads that can exist.
>
> > Looking at the other profile, the thing that is locking the page (that
> > everybody then ends up waiting on) would seem to be
> > migrate_misplaced_transhuge_page(), so this is _presumably_ due to
> > NUMA balancing.
> >
>
> Yes, migrate_misplaced_transhuge_page requires NUMA balancing to be
> part of the picture.
>
> > Does the problem go away if you disable the NUMA balancing code?
> >
> > Adding Mel and Kirill to the participants, just to make them aware of
> > the issue, and just because their names show up when I look at blame.
> >
>
> I'm not imagining a way of dealing with this that would reliably detect when
> there are a large number of waiters without adding a mess. We could adjust
> the scanning rate to reduce the problem but it would be difficult to target
> properly and wouldn't prevent the problem occurring with the added hassle
> that it would now be intermittent.
>
> Assuming the problem goes away by disabling NUMA then it would be nice if
> it could be determined that the page lock holder is trying to allocate a page
> when the queue is huge. That is part of the operation that potentially takes a
> long time and may be why so many callers are stacking up. If so, I would
> suggest clearing __GFP_DIRECT_RECLAIM from the GFP flags in
> migrate_misplaced_transhuge_page and assume that a remote hit is always
> going to be cheaper than compacting memory to successfully allocate a THP.
> That may be worth doing unconditionally because we'd have to save a
> *lot* of remote misses to offset compaction cost.
>
> Nothing fancy other than needing a comment if it works.
>

No, the patch doesn't work.

Thanks,
Kan

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 627671551873..87b0275ddcdb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1926,7 +1926,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> 		goto out_dropref;
>
> 	new_page = alloc_pages_node(node,
> -		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
> +		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) & ~__GFP_DIRECT_RECLAIM,
> 		HPAGE_PMD_ORDER);
> 	if (!new_page)
> 		goto out_fail;
>
> --
> Mel Gorman
> SUSE Labs
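
For anyone wanting to see the shape of the problem outside the kernel, here is a minimal userspace C sketch (hypothetical file and function names, not kernel code) of the pattern under discussion: every faulting thread parks itself on one shared wait queue while the "page" is locked for migration, and the waker then walks the whole list under a single lock, so the wake-up walk grows linearly with the >1000 waiters seen in the trace above. Build with something like "cc -pthread toy_waitqueue.c".

/* toy_waitqueue.c - hypothetical sketch, not kernel code */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct waiter {
	pthread_cond_t cond;
	int woken;
	struct waiter *next;
};

static struct {
	pthread_mutex_t lock;		/* stands in for the page wait-queue spinlock */
	struct waiter *head;
	int page_locked;		/* the "page" everyone is faulting on */
} wq = { PTHREAD_MUTEX_INITIALIZER, NULL, 1 };

/* Each faulting thread queues itself and sleeps, analogous to wait_on_page_locked(). */
static void *wait_for_page(void *arg)
{
	struct waiter self = { PTHREAD_COND_INITIALIZER, 0, NULL };

	(void)arg;
	pthread_mutex_lock(&wq.lock);
	if (wq.page_locked) {
		self.next = wq.head;
		wq.head = &self;
		while (!self.woken)
			pthread_cond_wait(&self.cond, &wq.lock);
	}
	pthread_mutex_unlock(&wq.lock);
	return NULL;
}

/* The migration side: "unlock the page", then walk and wake every waiter. */
static void wake_all(void)
{
	struct waiter *w;

	pthread_mutex_lock(&wq.lock);
	wq.page_locked = 0;
	for (w = wq.head; w; w = w->next) {	/* O(waiters) walk under one lock */
		w->woken = 1;
		pthread_cond_signal(&w->cond);
	}
	wq.head = NULL;
	pthread_mutex_unlock(&wq.lock);
}

int main(void)
{
	enum { NR_WAITERS = 1000 };		/* comparable to the >1000 queue entries in the trace */
	pthread_t tid[NR_WAITERS];
	int i;

	for (i = 0; i < NR_WAITERS; i++)
		pthread_create(&tid[i], NULL, wait_for_page, NULL);
	sleep(1);				/* let the queue build up while the "page" stays locked */
	wake_all();				/* this walk is the part that gets expensive */
	for (i = 0; i < NR_WAITERS; i++)
		pthread_join(tid[i], NULL);
	puts("all waiters woken");
	return 0;
}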