From: Huang Ying
To: linux-mm@kvack.org
Cc: Andrew Morton, linux-kernel@vger.kernel.org, Huang Ying,
    Peter Zijlstra, Mel Gorman, Peter Xu, Johannes Weiner,
    Vlastimil Babka, "Matthew Wilcox", Will Deacon, Michel Lespinasse,
    Arjun Roy, "Kirill A. Shutemov"
Subject: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault
Date: Mon, 29 Mar 2021 14:26:51 +0800
Message-Id: <20210329062651.2487905-1-ying.huang@intel.com>

For NUMA balancing, the hint page fault handler migrates the faulting
page to the accessing node if necessary.  During the migration, the TLB
is shot down on all CPUs that the process has run on recently.
This is because the hint page fault handler makes the PTE accessible
before the migration is attempted.  The overhead of a TLB shootdown is
high, so it is better avoided if possible.  In fact, if we delay mapping
the page in the PTE until migration, the shootdown can be avoided.  That
is what this patch does.

We have tested the patch with the pmbench memory accessing benchmark on
a 2-socket Intel server, and found that the number of TLB shootdown IPIs
is reduced by up to 99% (from ~6.0e6 to ~2.3e4) when NUMA balancing is
triggered (~8.8e6 pages migrated).  The benchmark score shows no visible
change.

Known issues:

For multi-threaded applications, it's possible that the page is accessed
by 2 threads at almost the same time.  In the original implementation,
the second thread may go on to access the page directly because the
first thread has installed the accessible PTE.  With this patch, there
is a window in which the second thread will find that the PTE is still
inaccessible.  But the difference in the accessible window is small,
because the page will be made inaccessible soon for migration anyway.

Signed-off-by: "Huang, Ying"
Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: Peter Xu
Cc: Johannes Weiner
Cc: Vlastimil Babka
Cc: "Matthew Wilcox"
Cc: Will Deacon
Cc: Michel Lespinasse
Cc: Arjun Roy
Cc: "Kirill A. Shutemov"
---
 mm/memory.c | 54 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d3273bd69dbb..a9a8ed1ac06c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4148,29 +4148,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		goto out;
 	}
 
-	/*
-	 * Make it present again, Depending on how arch implementes non
-	 * accessible ptes, some can allow access by kernel mode.
-	 */
-	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	/* Get the normal PTE */
+	old_pte = ptep_get(vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
-	pte = pte_mkyoung(pte);
-	if (was_writable)
-		pte = pte_mkwrite(pte);
-	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
-	update_mmu_cache(vma, vmf->address, vmf->pte);
 
 	page = vm_normal_page(vma, vmf->address, pte);
-	if (!page) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (!page)
+		goto out_map;
 
 	/* TODO: handle PTE-mapped THP */
-	if (PageCompound(page)) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (PageCompound(page))
+		goto out_map;
 
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -4180,7 +4168,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 * pte_dirty has unpredictable behaviour between PTE scan updates,
 	 * background writeback, dirty balancing and application behaviour.
 	 */
-	if (!pte_write(pte))
+	if (!was_writable)
 		flags |= TNF_NO_GROUP;
 
 	/*
@@ -4194,23 +4182,45 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
 			&flags);
-	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	if (target_nid == NUMA_NO_NODE) {
 		put_page(page);
-		goto out;
+		goto out_map;
 	}
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
 
 	/* Migrate to the requested node */
 	if (migrate_misplaced_page(page, vma, target_nid)) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
-	} else
+	} else {
 		flags |= TNF_MIGRATE_FAIL;
+		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		spin_lock(vmf->ptl);
+		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			goto out;
+		}
+		goto out_map;
+	}
 
 out:
 	if (page_nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, page_nid, 1, flags);
 	return 0;
+out_map:
+	/*
+	 * Make it present again, Depending on how arch implementes non
+	 * accessible ptes, some can allow access by kernel mode.
+	 */
+	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	pte = pte_modify(old_pte, vma->vm_page_prot);
+	pte = pte_mkyoung(pte);
+	if (was_writable)
+		pte = pte_mkwrite(pte);
+	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
+	update_mmu_cache(vma, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	goto out;
 }
 
 static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
-- 
2.30.2
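
The ordering argument in the changelog can be illustrated with a toy
user-space model.  This is only a sketch, not kernel code: every name in
it (toy_mm, toy_unmap_for_migration, toy_hint_fault) is hypothetical,
and it assumes that an inaccessible hint PTE is never cached in any TLB,
which may not hold on every architecture.  It counts the shootdown IPIs
that one hint fault would cause when the PTE is mapped before migration
(old order) versus left inaccessible until migration (new order).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; all names here are hypothetical, not kernel APIs. */
#define TOY_MAX_CPUS 64

struct toy_mm {
	bool pte_accessible;		/* false models a PROT_NONE hint PTE */
	bool ran_on[TOY_MAX_CPUS];	/* CPUs the process ran on recently */
	size_t ncpus;
	unsigned long flush_ipis;	/* shootdown IPIs sent so far */
};

/*
 * Migration has to unmap the page.  If the PTE is accessible, every
 * other CPU the process ran on may hold a stale TLB entry and gets an
 * IPI.  Assumption: an inaccessible PTE is cached nowhere.
 */
static void toy_unmap_for_migration(struct toy_mm *mm, size_t faulting_cpu)
{
	if (!mm->pte_accessible)
		return;
	for (size_t cpu = 0; cpu < mm->ncpus; cpu++)
		if (mm->ran_on[cpu] && cpu != faulting_cpu)
			mm->flush_ipis++;
	mm->pte_accessible = false;
}

/* Handle one hint fault and return the number of IPIs it caused. */
static unsigned long toy_hint_fault(struct toy_mm *mm, size_t cpu,
				    bool map_before_migrate)
{
	unsigned long before = mm->flush_ipis;

	assert(mm->ncpus <= TOY_MAX_CPUS && cpu < mm->ncpus);

	mm->pte_accessible = false;		/* entry state: hint PTE */
	if (map_before_migrate)
		mm->pte_accessible = true;	/* old order: map first */
	toy_unmap_for_migration(mm, cpu);
	mm->pte_accessible = true;		/* mapped once migration is done */

	return mm->flush_ipis - before;
}
```

With three recently used CPUs and the fault taken on one of them, the
old order sends an IPI to each of the other two CPUs, while the new
order sends none.  In both cases the PTE ends up accessible.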