From: Zi Yan <zi.yan@cs.rutgers.edu>
To: Naoya Horiguchi, kirill.shutemov@linux.intel.com, Andrea Arcangeli
CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org,
    minchan@kernel.org, vbabka@suse.cz, mgorman@techsingularity.net,
    khandual@linux.vnet.ibm.com
Subject: Re: [PATCH v3 09/14] mm: thp: check pmd migration entry in common path
Date: Thu, 9 Feb 2017 11:36:47 -0600
Message-ID: <30979A4A-4DFA-42B4-AD63-89261650544D@cs.rutgers.edu>
In-Reply-To: <20170209091616.GA15890@hori1.linux.bs1.fc.nec.co.jp>
References: <20170205161252.85004-1-zi.yan@sent.com>
 <20170205161252.85004-10-zi.yan@sent.com>
 <20170209091616.GA15890@hori1.linux.bs1.fc.nec.co.jp>
X-Mailing-List: linux-kernel@vger.kernel.org

On 9 Feb 2017, at 3:16, Naoya Horiguchi wrote:

> On Sun, Feb 05, 2017 at 11:12:47AM -0500, Zi Yan wrote:
>> From: Naoya Horiguchi
>>
>> If one of the callers of page migration starts to handle thp,
>> memory management code starts to see pmd migration entries, so we need
>> to prepare for it before enabling. This patch changes various code
>> points which check the status of given pmds in order to prevent races
>> between thp migration and the pmd-related works.
>>
>> ChangeLog v1 -> v2:
>> - introduce pmd_related() (I know the naming is not good, but can't
>>   think up a better name. Any suggestion is welcomed.)
>>
>> Signed-off-by: Naoya Horiguchi
>>
>> ChangeLog v2 -> v3:
>> - add is_swap_pmd()
>> - a pmd entry should be is_swap_pmd(), pmd_trans_huge(), pmd_devmap(),
>>   or pmd_none()
>
> (nitpick) ... or normal pmd pointing to pte pages?

Sure, I will add it.

>
>> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear()
>> - flush_cache_range() while set_pmd_migration_entry()
>> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
>>   true on pmd_migration_entry, so that migration entries are not
>>   treated as pmd page table entries.
>>
>> Signed-off-by: Zi Yan
>> ---
>>  arch/x86/mm/gup.c             |  4 +--
>>  fs/proc/task_mmu.c            | 22 ++++++++-----
>>  include/asm-generic/pgtable.h | 71 -----------------------------------
>>  include/linux/huge_mm.h       | 21 ++++++++++--
>>  include/linux/swapops.h       | 74 +++++++++++++++++++++++++++++++++++
>>  mm/gup.c                      | 20 ++++++++++--
>>  mm/huge_memory.c              | 76 ++++++++++++++++++++++++++++-------
>>  mm/madvise.c                  |  2 ++
>>  mm/memcontrol.c               |  2 ++
>>  mm/memory.c                   |  9 +++--
>>  mm/memory_hotplug.c           | 13 +++++++-
>>  mm/mempolicy.c                |  1 +
>>  mm/mprotect.c                 |  6 ++--
>>  mm/mremap.c                   |  2 +-
>>  mm/pagewalk.c                 |  2 ++
>>  15 files changed, 221 insertions(+), 104 deletions(-)
>>
>> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
>> index 0d4fb3ebbbac..78a153d90064 100644
>> --- a/arch/x86/mm/gup.c
>> +++ b/arch/x86/mm/gup.c
>> @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>>  		pmd_t pmd = *pmdp;
>>
>>  		next = pmd_addr_end(addr, end);
>> -		if (pmd_none(pmd))
>> +		if (!pmd_present(pmd))
>>  			return 0;
>> -		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
>> +		if (unlikely(pmd_large(pmd))) {
>>  			/*
>>  			 * NUMA hinting faults need to be handled in the GUP
>>  			 * slowpath for accounting purposes and so that they
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 6c07c7813b26..1e64d6898c68 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>
>>  	ptl = pmd_trans_huge_lock(pmd, vma);
>>  	if (ptl) {
>> -		smaps_pmd_entry(pmd, addr, walk);
>> +		if (pmd_present(*pmd))
>> +			smaps_pmd_entry(pmd, addr, walk);
>>  		spin_unlock(ptl);
>>  		return 0;
>>  	}
>> @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>>  		goto out;
>>  	}
>>
>> +	if (!pmd_present(*pmd))
>> +		goto out;
>> +
>>  	page = pmd_page(*pmd);
>>
>>  	/* Clear accessed and referenced bits. */
>> @@ -1208,19 +1212,19 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>>  	if (ptl) {
>>  		u64 flags = 0, frame = 0;
>>  		pmd_t pmd = *pmdp;
>> +		struct page *page;
>>
>>  		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
>>  			flags |= PM_SOFT_DIRTY;
>>
>> -		/*
>> -		 * Currently pmd for thp is always present because thp
>> -		 * can not be swapped-out, migrated, or HWPOISONed
>> -		 * (split in such cases instead.)
>> -		 * This if-check is just to prepare for future implementation.
>> -		 */
>> -		if (pmd_present(pmd)) {
>> -			struct page *page = pmd_page(pmd);
>> +		if (is_pmd_migration_entry(pmd)) {
>> +			swp_entry_t entry = pmd_to_swp_entry(pmd);
>>
>> +			frame = swp_type(entry) |
>> +				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
>> +			page = migration_entry_to_page(entry);
>> +		} else if (pmd_present(pmd)) {
>> +			page = pmd_page(pmd);
>>  			if (page_mapcount(page) == 1)
>>  				flags |= PM_MMAP_EXCLUSIVE;
>>
>> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>> index b71a431ed649..6cf9e9b5a7be 100644
>> --- a/include/asm-generic/pgtable.h
>> +++ b/include/asm-generic/pgtable.h
>> @@ -726,77 +726,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>>  #ifndef arch_needs_pgtable_deposit
>>  #define arch_needs_pgtable_deposit() (false)
>>  #endif
>> -/*
>> - * This function is meant to be used by sites walking pagetables with
>> - * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
>> - * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
>> - * into a null pmd and the transhuge page fault can convert a null pmd
>> - * into an hugepmd or into a regular pmd (if the hugepage allocation
>> - * fails). While holding the mmap_sem in read mode the pmd becomes
>> - * stable and stops changing under us only if it's not null and not a
>> - * transhuge pmd. When those races occurs and this function makes a
>> - * difference vs the standard pmd_none_or_clear_bad, the result is
>> - * undefined so behaving like if the pmd was none is safe (because it
>> - * can return none anyway).
>> - * The compiler level barrier() is critically
>> - * important to compute the two checks atomically on the same pmdval.
>> - *
>> - * For 32bit kernels with a 64bit large pmd_t this automatically takes
>> - * care of reading the pmd atomically to avoid SMP race conditions
>> - * against pmd_populate() when the mmap_sem is hold for reading by the
>> - * caller (a special atomic read not done by "gcc" as in the generic
>> - * version above, is also needed when THP is disabled because the page
>> - * fault can populate the pmd from under us).
>> - */
>> -static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>> -{
>> -	pmd_t pmdval = pmd_read_atomic(pmd);
>> -	/*
>> -	 * The barrier will stabilize the pmdval in a register or on
>> -	 * the stack so that it will stop changing under the code.
>> -	 *
>> -	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
>> -	 * pmd_read_atomic is allowed to return a not atomic pmdval
>> -	 * (for example pointing to an hugepage that has never been
>> -	 * mapped in the pmd). The below checks will only care about
>> -	 * the low part of the pmd with 32bit PAE x86 anyway, with the
>> -	 * exception of pmd_none(). So the important thing is that if
>> -	 * the low part of the pmd is found null, the high part will
>> -	 * be also null or the pmd_none() check below would be
>> -	 * confused.
>> -	 */
>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -	barrier();
>> -#endif
>> -	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
>> -		return 1;
>> -	if (unlikely(pmd_bad(pmdval))) {
>> -		pmd_clear_bad(pmd);
>> -		return 1;
>> -	}
>> -	return 0;
>> -}
>> -
>> -/*
>> - * This is a noop if Transparent Hugepage Support is not built into
>> - * the kernel. Otherwise it is equivalent to
>> - * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
>> - * places that already verified the pmd is not none and they want to
>> - * walk ptes while holding the mmap sem in read mode (write mode don't
>> - * need this).
>> - * If THP is not enabled, the pmd can't go away under the
>> - * code even if MADV_DONTNEED runs, but if THP is enabled we need to
>> - * run a pmd_trans_unstable before walking the ptes after
>> - * split_huge_page_pmd returns (because it may have run when the pmd
>> - * become null, but then a page fault can map in a THP and not a
>> - * regular page).
>> - */
>> -static inline int pmd_trans_unstable(pmd_t *pmd)
>> -{
>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -	return pmd_none_or_trans_huge_or_clear_bad(pmd);
>> -#else
>> -	return 0;
>> -#endif
>> -}
>>
>>  #ifndef CONFIG_NUMA_BALANCING
>>  /*
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 83a8d42f9d55..c2e5a4eab84a 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -131,7 +131,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>  #define split_huge_pmd(__vma, __pmd, __address)				\
>>  	do {								\
>>  		pmd_t *____pmd = (__pmd);				\
>> -		if (pmd_trans_huge(*____pmd)				\
>> +		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
>>  					|| pmd_devmap(*____pmd))	\
>>  			__split_huge_pmd(__vma, __pmd, __address,	\
>>  						false, NULL);		\
>> @@ -162,12 +162,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
>>  		struct vm_area_struct *vma);
>>  extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
>>  		struct vm_area_struct *vma);
>> +
>> +static inline int is_swap_pmd(pmd_t pmd)
>> +{
>> +	return !pmd_none(pmd) && !pmd_present(pmd);
>> +}
>> +
>>  /* mmap_sem must be held on entry */
>>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>>  		struct vm_area_struct *vma)
>>  {
>>  	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
>> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>> +	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>>  		return __pmd_trans_huge_lock(pmd, vma);
>>  	else
>>  		return NULL;
>> @@ -192,6 +198,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>>  		pmd_t *pmd, int flags);
>>  struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>>  		pud_t *pud, int flags);
>> +static inline int hpage_order(struct page *page)
>> +{
>> +	if (unlikely(PageTransHuge(page)))
>> +		return HPAGE_PMD_ORDER;
>> +	return 0;
>> +}
>>
>>  extern int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
>>
>> @@ -232,6 +244,7 @@ static inline bool thp_migration_supported(void)
>>  #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
>>
>>  #define hpage_nr_pages(x) 1
>> +#define hpage_order(x) 0
>>
>>  #define transparent_hugepage_enabled(__vma) 0
>>
>> @@ -274,6 +287,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>>  					 long adjust_next)
>>  {
>>  }
>> +static inline int is_swap_pmd(pmd_t pmd)
>> +{
>> +	return 0;
>> +}
>>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>>  		struct vm_area_struct *vma)
>>  {
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 6625bea13869..50e4aa7e7ff9 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -229,6 +229,80 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>>  }
>>  #endif
>>
>> +/*
>> + * This function is meant to be used by sites walking pagetables with
>> + * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
>> + * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
>> + * into a null pmd and the transhuge page fault can convert a null pmd
>> + * into an hugepmd or into a regular pmd (if the hugepage allocation
>> + * fails). While holding the mmap_sem in read mode the pmd becomes
>> + * stable and stops changing under us only if it's not null and not a
>> + * transhuge pmd. When those races occurs and this function makes a
>> + * difference vs the standard pmd_none_or_clear_bad, the result is
>> + * undefined so behaving like if the pmd was none is safe (because it
>> + * can return none anyway).
>> + * The compiler level barrier() is critically
>> + * important to compute the two checks atomically on the same pmdval.
>> + *
>> + * For 32bit kernels with a 64bit large pmd_t this automatically takes
>> + * care of reading the pmd atomically to avoid SMP race conditions
>> + * against pmd_populate() when the mmap_sem is hold for reading by the
>> + * caller (a special atomic read not done by "gcc" as in the generic
>> + * version above, is also needed when THP is disabled because the page
>> + * fault can populate the pmd from under us).
>> + */
>> +static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>> +{
>> +	pmd_t pmdval = pmd_read_atomic(pmd);
>> +	/*
>> +	 * The barrier will stabilize the pmdval in a register or on
>> +	 * the stack so that it will stop changing under the code.
>> +	 *
>> +	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
>> +	 * pmd_read_atomic is allowed to return a not atomic pmdval
>> +	 * (for example pointing to an hugepage that has never been
>> +	 * mapped in the pmd). The below checks will only care about
>> +	 * the low part of the pmd with 32bit PAE x86 anyway, with the
>> +	 * exception of pmd_none(). So the important thing is that if
>> +	 * the low part of the pmd is found null, the high part will
>> +	 * be also null or the pmd_none() check below would be
>> +	 * confused.
>> +	 */
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	barrier();
>> +#endif
>> +	if (pmd_none(pmdval) || pmd_trans_huge(pmdval)
>> +		|| is_pmd_migration_entry(pmdval))
>> +		return 1;
>> +	if (unlikely(pmd_bad(pmdval))) {
>> +		pmd_clear_bad(pmd);
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
>> +
>> +/*
>> + * This is a noop if Transparent Hugepage Support is not built into
>> + * the kernel.
>> + * Otherwise it is equivalent to
>> + * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
>> + * places that already verified the pmd is not none and they want to
>> + * walk ptes while holding the mmap sem in read mode (write mode don't
>> + * need this). If THP is not enabled, the pmd can't go away under the
>> + * code even if MADV_DONTNEED runs, but if THP is enabled we need to
>> + * run a pmd_trans_unstable before walking the ptes after
>> + * split_huge_page_pmd returns (because it may have run when the pmd
>> + * become null, but then a page fault can map in a THP and not a
>> + * regular page).
>> + */
>> +static inline int pmd_trans_unstable(pmd_t *pmd)
>> +{
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	return pmd_none_or_trans_huge_or_clear_bad(pmd);
>> +#else
>> +	return 0;
>> +#endif
>> +}
>> +
>
> These functions are a page table or thp matter, so putting them in swapops.h
> looks weird to me. Maybe you can avoid this code transfer by using
> !pmd_present instead of is_pmd_migration_entry?
> And we have to consider renaming pmd_none_or_trans_huge_or_clear_bad();
> I like a simple name like __pmd_trans_unstable(), but if you have an idea,
> that's great.

Yes. I will move it back.

I am not sure if it is OK to only use !pmd_present; we may miss some pmd_bad.
Kirill and Andrea, can you give some insight on this?

>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 19b460acb5e1..9cb4c83151a8 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>
> Changes on mm/memory_hotplug.c should be with patch 14/14?
> # If that's right, the definition of hpage_order() should also go to 14/14.

Got it. I will move it.
--
Best Regards,
Yan Zi