From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 30 Aug 2018 10:08:25 -0400
From: Jerome Glisse
To: Michal Hocko
Cc: Mike Kravetz, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov", Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Andrew Morton, stable@vger.kernel.org,
	linux-rdma@vger.kernel.org, Matan Barak, Leon Romanovsky,
	Dimitri Sivanich
Subject: Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
Message-ID: <20180830140825.GA3529@redhat.com>
References: <20180823205917.16297-2-mike.kravetz@oracle.com>
	<20180824084157.GD29735@dhcp22.suse.cz>
	<6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com>
	<20180827074645.GB21556@dhcp22.suse.cz>
	<20180827134633.GB3930@redhat.com>
	<9209043d-3240-105b-72a3-b4cd30f1b1f1@oracle.com>
	<20180829181424.GB3784@redhat.com>
	<20180829183906.GF10223@dhcp22.suse.cz>
	<20180829211106.GC3784@redhat.com>
	<20180830105616.GD2656@dhcp22.suse.cz>
In-Reply-To: <20180830105616.GD2656@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.10.0 (2018-05-17)
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Aug 30, 2018 at 12:56:16PM +0200, Michal Hocko wrote:
> On Wed 29-08-18 17:11:07, Jerome Glisse wrote:
> > On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
> > > On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
> > > > On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> > > [...]
> > > > > What would be the best mmu notifier interface to use where there are no
> > > > > start/end calls?
> > > > > Or, is the best solution to add the start/end calls as is done in later
> > > > > versions of the code? If that is the suggestion, has there been any change
> > > > > in invalidate start/end semantics that we should take into account?
> > > >
> > > > start/end would be the ones to add; 4.4 seems broken with respect to THP
> > > > and mmu notification. Another solution is to fix the users of mmu notifiers,
> > > > as there were only a handful back then. For instance, properly adjusting the
> > > > address to match the first address covered by the pmd or pud, and passing
> > > > the correct page size down to mmu_notifier_invalidate_page(), would allow
> > > > fixing this easily.
> > > >
> > > > This is ok because users of try_to_unmap_one() replace the pte/pmd/pud
> > > > with an invalid one (either poison, migration or swap) inside the
> > > > function. So anyone racing would synchronize on those special entries,
> > > > hence why it is fine to delay mmu_notifier_invalidate_page() to after
> > > > dropping the page table lock.
> > > >
> > > > Adding start/end might be the solution with the least code churn, as you
> > > > would only need to change try_to_unmap_one().
> > >
> > > What about dependencies? 369ea8242c0fb sounds like it needs all the
> > > notifiers to be updated as well.
> >
> > This commit removes mmu_notifier_invalidate_page(), hence why everything
> > needs to be updated. But in 4.4 you can get away with just adding start/
> > end and keeping mmu_notifier_invalidate_page() around to minimize disruption.
>
> OK, this is really interesting. I was really worried about changing the
> semantic of the mmu notifiers in stable kernels because this is a
> hard-to-review change and high risk for anybody running those old
> kernels. If we can keep mmu_notifier_invalidate_page and wrap it
> into the range scope API then this sounds like the best way forward.
>
> So just to make sure we are on the same page: does this sound good for
> a stable 4.4 backport?
> Mike's hugetlb pmd shared fixup can be applied on
> top. What do you think?

You need to invalidate outside the page table lock, so before the call to
page_check_address(). For instance like the patch below, which also only
does the range invalidation for huge pages, which avoids too much of a
behavior change for users of mmu notifiers.

>From 1be4109cfbf1c475ad67a5a57c87c74fd183ab1d Mon Sep 17 00:00:00 2001
From: Jérôme Glisse
Date: Thu, 31 Aug 2017 17:17:27 -0400
Subject: [PATCH] mm/rmap: update to new mmu_notifier semantic v2
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

commit 369ea8242c0fb5239b4ddf0dc568f694bd244de4 upstream.

Please note that this patch differs from the mainline because we do not
really replace mmu_notifier_invalidate_page by
mmu_notifier_invalidate_range, because that would require changes to most
of the existing mmu notifiers. We also do not want to change the semantic
of this API in old kernels. Anyway, Jérôme has suggested that it should be
sufficient to simply wrap mmu_notifier_invalidate_page with
*_invalidate_range_start()/end() to fix invalidation of larger-than-pte
mappings (e.g. THP/hugetlb pages during migration).

We need this change to handle large (hugetlb/THP) page migration properly.

Note that because we cannot presume the pmd value or pte value, we have to
assume the worst and unconditionally report an invalidation as happening.

Changed since v2:
  - try_to_unmap_one() only one call to mmu_notifier_invalidate_range()
  - compute end with PAGE_SIZE << compound_order(page)
  - fix PageHuge() case in try_to_unmap_one()

Signed-off-by: Jérôme Glisse
Reviewed-by: Andrea Arcangeli
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Bernhard Held
Cc: Adam Borowski
Cc: Radim Krčmář
Cc: Wanpeng Li
Cc: Paolo Bonzini
Cc: Takashi Iwai
Cc: Nadav Amit
Cc: Mike Galbraith
Cc: Kirill A. Shutemov
Cc: axie
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Michal Hocko # backport to 4.4
---
 mm/rmap.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index b577fbb98d4b..a77f15dc0cf1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1302,15 +1302,30 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	unsigned long start = address, end;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		goto out;
 
+	if (unlikely(PageHuge(page))) {
+		/*
+		 * We have to assume the worse case ie pmd for invalidation.
+		 * Note that the page can not be free in this function as call
+		 * of try_to_unmap() must hold a reference on the page.
+		 *
+		 * This is ok to invalidate even if are not unmapping anything
+		 * ie below page_check_address() returning NULL.
+		 */
+		end = min(vma->vm_end, start + (PAGE_SIZE <<
+						compound_order(page)));
+		mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
+	}
+
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
-		goto out;
+		goto out_notify;
 
 	/*
 	 * If the page is mlock()d, we cannot swap it out.
@@ -1427,6 +1442,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
 		mmu_notifier_invalidate_page(mm, address);
+out_notify:
+	if (unlikely(PageHuge(page)))
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
 out:
 	return ret;
 }
-- 
2.17.1