From: "Huang, Ying"
To: Mel Gorman
Cc: linux-mm@kvack.org, Andrew Morton, Peter Zijlstra, Peter Xu,
 Johannes Weiner, Vlastimil Babka, Matthew Wilcox, Will Deacon,
 Michel Lespinasse, Arjun Roy, Kirill A. Shutemov
Shutemov" Subject: Re: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault References: <20210329062651.2487905-1-ying.huang@intel.com> <20210330133310.GT15768@suse.de> <87a6qj8t92.fsf@yhuang6-desk1.ccr.corp.intel.com> <20210331131658.GV15768@suse.de> Date: Thu, 01 Apr 2021 08:52:29 +0800 In-Reply-To: <20210331131658.GV15768@suse.de> (Mel Gorman's message of "Wed, 31 Mar 2021 14:16:58 +0100") Message-ID: <875z16967m.fsf@yhuang6-desk1.ccr.corp.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=ascii X-Stat-Signature: efe89a1ja6qrywkwz4m3bwajqhpmcbkj X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: E41BB135 Received-SPF: none (intel.com>: No applicable sender policy available) receiver=imf04; identity=mailfrom; envelope-from=""; helo=mga14.intel.com; client-ip=192.55.52.115 X-HE-DKIM-Result: none/none X-HE-Tag: 1617238355-972982 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Mel Gorman writes: > On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote: >> Mel Gorman writes: >> >> > On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote: >> >> For NUMA balancing, in hint page fault handler, the faulting page will >> >> be migrated to the accessing node if necessary. During the migration, >> >> TLB will be shot down on all CPUs that the process has run on >> >> recently. Because in the hint page fault handler, the PTE will be >> >> made accessible before the migration is tried. The overhead of TLB >> >> shooting down is high, so it's better to be avoided if possible. In >> >> fact, if we delay mapping the page in PTE until migration, that can be >> >> avoided. This is what this patch doing. >> >> >> > >> > Why would the overhead be high? It was previously inaccessibly so it's >> > only parallel accesses making forward progress that trigger the need >> > for a flush. >> >> Sorry, I don't understand this. Although the page is inaccessible, the >> threads may access other pages, so TLB flushing is still necessary. >> > > You assert the overhead of TLB shootdown is high and yes, it can be > very high but you also said "the benchmark score has no visible changes" > indicating the TLB shootdown cost is not a major problem for the workload. > It does not mean we should ignore it though. > >> > >> > >> > If migration is attempted, then the time until the migration PTE is >> > created is variable. The page has to be isolated from the LRU so there >> > could be contention on the LRU lock, a new page has to be allocated and >> > that allocation potentially has to enter the page allocator slow path >> > etc. During that time, parallel threads make forward progress but with >> > the patch, multiple threads potentially attempt the allocation and fail >> > instead of doing real work. >> >> If my understanding of the code were correct, only the first thread will >> attempt the isolation and allocation. Because TestClearPageLRU() is >> called in >> >> migrate_misplaced_page() >> numamigrate_isolate_page() >> isolate_lru_page() >> >> And migrate_misplaced_page() will return 0 immediately if >> TestClearPageLRU() returns false. Then the second thread will make the >> page accessible and make forward progress. >> > > Ok, that's true. While additional work is done, the cost is reasonably > low -- lower than I initially imagined and with fewer side-effects. 
>
>> But there's still some timing difference between the original and
>> patched kernels.  We have several choices to reduce the difference.
>>
>> 1. Check PageLRU() with the PTL held in do_numa_page()
>>
>> If PageLRU() returns false, do_numa_page() can make the page
>> accessible first, so the second thread will make the page accessible
>> earlier.
>>
>> 2. Try to lock the page with the PTL held in do_numa_page()
>>
>> If the try-lock succeeds, it's the first thread, so it can delay
>> mapping.  If the try-lock fails, it may be the second thread, so it
>> will make the page accessible first.  We would need to teach
>> migrate_misplaced_page() to work with the page locked, which
>> lengthens the time the page is held locked.  Is that a problem?
>>
>> 3. Check page_count() with the PTL held in do_numa_page()
>>
>> The first thread will call get_page() in numa_migrate_prep(), so if
>> the second thread can detect that, it can make the page accessible
>> first.  The difficulty is that it appears hard to identify the
>> expected page_count() for file pages.  For anonymous pages that is
>> much easier, so at least if a page passes the following test, we can
>> delay mapping:
>>
>>   PageAnon(page) && page_count(page) == page_mapcount(page) + !!PageSwapCache(page)
>>
>> This disables the optimization for file pages, but it may be good
>> enough?
>>
>> Which one do you think is better?  Maybe the first one is good
>> enough?
>>
>
> The first one is probably the most straightforward, but it's more
> important to figure out why interrupts were higher with at least one
> workload when the exact opposite is expected.  Investigating which of
> options 1-3 is best and whether it's worth the duplicated check could
> be done as a separate patch.
>
>> > You should consider the following question -- is the potential
>> > saving of an IPI transmission enough to offset the cost of
>> > parallel accesses not making forward progress while one migration
>> > is set up and having different migration attempts collide?
>> >
>> > I have tests running just in case but I think the answer may be
>> > "no".  So far only one useful test has completed (specjbb2005 with
>> > one VM per NUMA node) and it showed a mix of small gains and
>> > losses but with *higher* interrupts, contrary to what was expected
>> > from the changelog.
>>
>> That is hard to understand.  Could it be caused by the bug you
>> pointed out (about was_writable)?
>>
>
> It's possible and I could not figure out what the rationale behind
> the change was :/
>
> Fix it and run it through your tests to make sure it works as you
> expect.  Assuming it passes your tests and it's posted, I'll read it
> again and run it through a battery of tests.  If it shows that
> interrupts are lower and it is either neutral or improves performance
> in enough cases, then I think it'll be ok.  Even if it's only neutral
> in terms of performance but interrupts are lower, it'll be
> acceptable.

Will do it.  Thanks a lot for your help!

Best Regards,
Huang, Ying
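
PS: To make option 1 above concrete, here is an untested sketch of
the idea for do_numa_page(), with the PTL held.  The label and the
surrounding details are illustrative only, not a real patch:

	page = vm_normal_page(vma, vmf->address, pte);
	...
	/*
	 * If the page is already off the LRU, another thread has most
	 * likely isolated it for migration.  Map the page immediately
	 * so this thread can make forward progress instead of
	 * attempting (and failing) the migration itself.
	 */
	if (!PageLRU(page))
		goto out_map;

	/* Otherwise, delay mapping until migrate_misplaced_page(). */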