From: "Huang, Ying"
To: Mel Gorman
Cc: linux-mm@kvack.org, Andrew Morton, Peter Zijlstra, Peter Xu,
 Johannes Weiner, Vlastimil Babka, Matthew Wilcox, Will Deacon,
 Michel Lespinasse, Arjun Roy, Kirill A. Shutemov
Shutemov" Subject: Re: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault References: <20210329062651.2487905-1-ying.huang@intel.com> <20210330133310.GT15768@suse.de> <87a6qj8t92.fsf@yhuang6-desk1.ccr.corp.intel.com> <20210331131658.GV15768@suse.de> Date: Thu, 01 Apr 2021 08:52:29 +0800 In-Reply-To: <20210331131658.GV15768@suse.de> (Mel Gorman's message of "Wed, 31 Mar 2021 14:16:58 +0100") Message-ID: <875z16967m.fsf@yhuang6-desk1.ccr.corp.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=ascii X-Stat-Signature: efe89a1ja6qrywkwz4m3bwajqhpmcbkj X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: E41BB135 Received-SPF: none (intel.com>: No applicable sender policy available) receiver=imf04; identity=mailfrom; envelope-from=""; helo=mga14.intel.com; client-ip=192.55.52.115 X-HE-DKIM-Result: none/none X-HE-Tag: 1617238355-972982 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Mel Gorman writes: > On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote: >> Mel Gorman writes: >> >> > On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote: >> >> For NUMA balancing, in hint page fault handler, the faulting page will >> >> be migrated to the accessing node if necessary. During the migration, >> >> TLB will be shot down on all CPUs that the process has run on >> >> recently. Because in the hint page fault handler, the PTE will be >> >> made accessible before the migration is tried. The overhead of TLB >> >> shooting down is high, so it's better to be avoided if possible. In >> >> fact, if we delay mapping the page in PTE until migration, that can be >> >> avoided. This is what this patch doing. >> >> >> > >> > Why would the overhead be high? It was previously inaccessibly so it's >> > only parallel accesses making forward progress that trigger the need >> > for a flush. >> >> Sorry, I don't understand this. Although the page is inaccessible, the >> threads may access other pages, so TLB flushing is still necessary. >> > > You assert the overhead of TLB shootdown is high and yes, it can be > very high but you also said "the benchmark score has no visible changes" > indicating the TLB shootdown cost is not a major problem for the workload. > It does not mean we should ignore it though. > >> > >> > >> > If migration is attempted, then the time until the migration PTE is >> > created is variable. The page has to be isolated from the LRU so there >> > could be contention on the LRU lock, a new page has to be allocated and >> > that allocation potentially has to enter the page allocator slow path >> > etc. During that time, parallel threads make forward progress but with >> > the patch, multiple threads potentially attempt the allocation and fail >> > instead of doing real work. >> >> If my understanding of the code were correct, only the first thread will >> attempt the isolation and allocation. Because TestClearPageLRU() is >> called in >> >> migrate_misplaced_page() >> numamigrate_isolate_page() >> isolate_lru_page() >> >> And migrate_misplaced_page() will return 0 immediately if >> TestClearPageLRU() returns false. Then the second thread will make the >> page accessible and make forward progress. >> > > Ok, that's true. While additional work is done, the cost is reasonably > low -- lower than I initially imagined and with fewer side-effects. 
>
>> But there's still some timing difference between the original and
>> patched kernels.  We have several choices to reduce the difference.
>>
>> 1. Check PageLRU() with the PTL held in do_numa_page()
>>
>> If PageLRU() returns false, do_numa_page() can make the page
>> accessible first, so the second thread will make the page accessible
>> earlier.
>>
>> 2. Try to lock the page with the PTL held in do_numa_page()
>>
>> If the try-lock succeeds, it's the first thread, so it can delay
>> mapping.  If the try-lock fails, it may be the second thread, so it
>> will make the page accessible first.  We would need to teach
>> migrate_misplaced_page() to work with the page locked, which
>> lengthens the time the page is held locked.  Is that a problem?
>>
>> 3. Check page_count() with the PTL held in do_numa_page()
>>
>> The first thread will call get_page() in numa_migrate_prep(), so if
>> the second thread can detect that, it can make the page accessible
>> first.  The difficulty is that it appears hard to identify the
>> expected page_count() for file pages.  For anonymous pages that is
>> much easier, so at least if a page passes the following test, we can
>> delay mapping:
>>
>>   PageAnon(page) && page_count(page) == page_mapcount(page) + !!PageSwapCache(page)
>>
>> This disables the optimization for file pages, but it may be good
>> enough?
>>
>> Which one do you think is better?  Maybe the first one is good
>> enough?
>>
>
> The first one is probably the most straightforward, but it's more
> important to figure out why interrupts were higher with at least one
> workload when the exact opposite is expected.  Investigating which of
> options 1-3 is best and whether it's worth the duplicated check could
> be done as a separate patch.
>
>> > You should consider the following question -- is the potential
>> > saving of an IPI transmission enough to offset the cost of
>> > parallel accesses not making forward progress while one migration
>> > is set up and having different migration attempts collide?
>> >
>> > I have tests running just in case but I think the answer may be
>> > "no".  So far only one useful test has completed (specjbb2005 with
>> > one VM per NUMA node) and it showed a mix of small gains and
>> > losses but with *higher* interrupts, contrary to what was expected
>> > from the changelog.
>>
>> That is hard to understand.  Could it be caused by the bug you
>> pointed out (about was_writable)?
>>
>
> It's possible and I could not figure out what the rationale behind
> the change was :/
>
> Fix it and run it through your tests to make sure it works as you
> expect.  Assuming it passes your tests and it's posted, I'll read it
> again and run it through a battery of tests.  If it shows that
> interrupts are lower and it is either neutral or improves performance
> in enough cases, then I think it'll be ok.  Even if it's only neutral
> in terms of performance but interrupts are lower, it'll be
> acceptable.

Will do it.  Thanks a lot for your help!

Best Regards,
Huang, Ying
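
PS: To make option 1 above concrete, here is an untested sketch of
the idea for do_numa_page(), with the PTL held.  The label and the
surrounding details are illustrative only, not a real patch:

	page = vm_normal_page(vma, vmf->address, pte);
	...
	/*
	 * If the page is already off the LRU, another thread has most
	 * likely isolated it for migration.  Map the page immediately
	 * so this thread can make forward progress instead of
	 * attempting (and failing) the migration itself.
	 */
	if (!PageLRU(page))
		goto out_map;

	/* Otherwise, delay mapping until migrate_misplaced_page(). */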