Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
From: Nadav Amit
Date: Tue, 29 Aug 2017 20:18:41 -0700
Message-Id: <0D062FD4-AAEB-46C9-9D3A-AC91E84414D3@gmail.com>
References: <20170829235447.10050-1-jglisse@redhat.com>
 <20170829235447.10050-3-jglisse@redhat.com>
 <6D58FBE4-5D03-49CC-AAFF-3C1279A5A849@gmail.com>
 <20170830025910.GB2386@redhat.com>
To: Jerome Glisse
Cc: Linux Kernel Mailing List, "open list:MEMORY MANAGEMENT", Dan Williams,
 Ross Zwisler, Linus Torvalds, Bernhard Held, Adam Borowski, Andrea Arcangeli,
 Radim Krčmář, Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
 "Kirill A. Shutemov", axie, Andrew Morton

Nadav Amit wrote:

> Jerome Glisse wrote:
>
>> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>>> Jérôme Glisse wrote:
>>>
>>>> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidate_range()
>>>> and making sure it is bracketed by calls to mmu_notifier_invalidate_range_start/
>>>> end.
>>>>
>>>> Note that because we cannot presume the pmd value or pte value we have to
>>>> assume the worst and unconditionally report an invalidation as happening.
>>>>
>>>> Signed-off-by: Jérôme Glisse
>>>> Cc: Dan Williams
>>>> Cc: Ross Zwisler
>>>> Cc: Linus Torvalds
>>>> Cc: Bernhard Held
>>>> Cc: Adam Borowski
>>>> Cc: Andrea Arcangeli
>>>> Cc: Radim Krčmář
>>>> Cc: Wanpeng Li
>>>> Cc: Paolo Bonzini
>>>> Cc: Takashi Iwai
>>>> Cc: Nadav Amit
>>>> Cc: Mike Galbraith
>>>> Cc: Kirill A. Shutemov
>>>> Cc: axie
>>>> Cc: Andrew Morton
>>>> ---
>>>> mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
>>>> 1 file changed, 41 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index c8993c63eb25..da97ed525088 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> 		.address = address,
>>>> 		.flags = PVMW_SYNC,
>>>> 	};
>>>> +	unsigned long start = address, end;
>>>> 	int *cleaned = arg;
>>>>
>>>> +	/*
>>>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>>>> +	 * the page can not be free from this function.
>>>> +	 */
>>>> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>> +
>>>> 	while (page_vma_mapped_walk(&pvmw)) {
>>>> +		unsigned long cstart, cend;
>>>> 		int ret = 0;
>>>> -		address = pvmw.address;
>>>> +
>>>> +		cstart = address = pvmw.address;
>>>> 		if (pvmw.pte) {
>>>> 			pte_t entry;
>>>> 			pte_t *pte = pvmw.pte;
>>>> @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> 			entry = pte_wrprotect(entry);
>>>> 			entry = pte_mkclean(entry);
>>>> 			set_pte_at(vma->vm_mm, address, pte, entry);
>>>> +			cend = cstart + PAGE_SIZE;
>>>> 			ret = 1;
>>>> 		} else {
>>>> #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
>>>> @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> 			entry = pmd_wrprotect(entry);
>>>> 			entry = pmd_mkclean(entry);
>>>> 			set_pmd_at(vma->vm_mm, address, pmd, entry);
>>>> +			cstart &= PMD_MASK;
>>>> +			cend = cstart + PMD_SIZE;
>>>> 			ret = 1;
>>>> #else
>>>> 			/* unexpected pmd-mapped page? */
>>>> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> 		}
>>>>
>>>> 		if (ret) {
>>>> -			mmu_notifier_invalidate_page(vma->vm_mm, address);
>>>> +			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>>>> 			(*cleaned)++;
>>>> 		}
>>>> 	}
>>>>
>>>> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>>> +
>>>> 	return true;
>>>> }
>>>>
>>>> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> 	pte_t pteval;
>>>> 	struct page *subpage;
>>>> 	bool ret = true;
>>>> +	unsigned long start = address, end;
>>>> 	enum ttu_flags flags = (enum ttu_flags)arg;
>>>>
>>>> 	/* munlock has nothing to gain from examining un-locked vmas */
>>>> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> 					flags & TTU_MIGRATION, page);
>>>> 	}
>>>>
>>>> +	/*
>>>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>>>> +	 * the page can not be free in this function as call of try_to_unmap()
>>>> +	 * must hold a reference on the page.
>>>> +	 */
>>>> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>> +
>>>> 	while (page_vma_mapped_walk(&pvmw)) {
>>>> 		/*
>>>> 		 * If the page is mlock()d, we cannot swap it out.
>>>> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> 				set_huge_swap_pte_at(mm, address,
>>>> 						     pvmw.pte, pteval,
>>>> 						     vma_mmu_pagesize(vma));
>>>> +				mmu_notifier_invalidate_range(mm, address,
>>>> +					address + vma_mmu_pagesize(vma));
>>>
>>> I don't think that the notifier should be called after the PTE is set, but
>>> after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO) or
>>> access/dirty bits are cleared. [There is an exception: if the PFN in the PTE
>>> is changed without clearing the PTE before, but it does not apply here, and
>>> there is a different notifier for this case.]
>>>
>>> Therefore, IIUC, try_to_unmap_one() should only call
>>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>>> ptep_clear_flush() are called. All the other calls to
>>> mmu_notifier_invalidate_range() in this function can be removed.
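
(Just to make the ordering I was describing concrete, here is a rough
sketch for the ptep_clear_flush() path of try_to_unmap_one(); it is only
an illustration of where the call would sit, not a tested patch:)

	/* Nuke the page table entry and flush the TLB first ... */
	pteval = ptep_clear_flush(vma, address, pvmw.pte);

	/*
	 * ... and only then tell the secondary MMUs that the old mapping
	 * is gone.  Whatever gets written afterwards (swap entry,
	 * migration entry, or nothing at all) is a new value and does not
	 * need a notification of its own.  The range would be
	 * vma_mmu_pagesize(vma) rather than PAGE_SIZE for hugetlb.
	 */
	mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE);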
>>
>> Yes it would simplify the patch. I was trying to optimize for the case
>> where we restore the pte to its original value after ptep_clear_flush()
>> or ptep_get_and_clear(), as in this case there is no need to invalidate
>> any secondary page table, but that's an overkill optimization that just
>> adds too much complexity.
>
> Interesting. Actually, prior to your changes, it seems that the break
> statements would skip mmu_notifier_invalidate_page() when the PTE is not
> changed. So why not just change mmu_notifier_invalidate_page() into
> mmu_notifier_invalidate_range()?

Sorry - I noticed it was actually wrong before (as you noted) in at least
one case:

  (unlikely(PageSwapBacked(page) != PageSwapCache(page)))

Regards,
Nadav
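
P.S. The path I am referring to, as the code reads before this patch
(quoted from memory, so the exact lines may be slightly off):

	pteval = ptep_clear_flush(vma, address, pvmw.pte);
	...
	if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
		WARN_ON_ONCE(1);
		ret = false;
		page_vma_mapped_walk_done(&pvmw);
		/*
		 * Leaves the walk loop here and skips the
		 * mmu_notifier_invalidate_page() call at the bottom of the
		 * loop, even though the PTE was already cleared and flushed.
		 */
		break;
	}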