* [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback
@ 2017-10-17  3:10 ` jglisse
  0 siblings, 0 replies; 40+ messages in thread
From: jglisse @ 2017-10-17  3:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrea Arcangeli,
	Andrew Morton, Joerg Roedel, Suravee Suthikulpanit,
	David Woodhouse, Alistair Popple, Michael Ellerman,
	Benjamin Herrenschmidt, Stephen Rothwell, Andrew Donnellan,
	iommu, linuxppc-dev

From: Jérôme Glisse <jglisse@redhat.com>

(Andrew, you already have v1 of patch 1 in your queue; patch 2 is new.
 I think you can drop patch 1 v1 in favor of v2, which is a bit more
 conservative and has the typos fixed.)

All of this only affects users of the invalidate_range callback (at this
time CAPI in arch/powerpc/platforms/powernv/npu-dma.c and IOMMU ATS/PASID
in drivers/iommu/amd_iommu_v2.c and drivers/iommu/intel-svm.c).

This patchset removes useless double calls to the
mmu_notifier->invalidate_range callback wherever it is safe to do so.
The first patch just removes the useless calls and adds documentation
explaining why it is safe to do so. The second patch goes further by
introducing mmu_notifier_invalidate_range_only_end(), which skips the
invalidate_range callback. This can be done when clearing a pte, pmd or
pud with notification, because the notify helpers call invalidate_range()
right after clearing, under the page table lock.
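
As a rough illustration of the pattern patch 2 enables, here is a minimal
sketch (not taken from the patches; example_clear_pmd() and its calling
context are made up for the example, the notifier and pmd helpers are the
existing ones):

  #include <linux/mm.h>
  #include <linux/huge_mm.h>
  #include <linux/mmu_notifier.h>

  static void example_clear_pmd(struct vm_area_struct *vma,
                                unsigned long haddr, pmd_t *pmd)
  {
          struct mm_struct *mm = vma->vm_mm;
          unsigned long start = haddr, end = haddr + HPAGE_PMD_SIZE;
          spinlock_t *ptl;

          mmu_notifier_invalidate_range_start(mm, start, end);
          ptl = pmd_lock(mm, pmd);
          /* The notify helper calls invalidate_range() right after the
           * clear, while the page table lock is held. */
          pmdp_huge_clear_flush_notify(vma, haddr, pmd);
          /* ... repopulate the page table here ... */
          spin_unlock(ptl);
          /* invalidate_range() already ran above, so skip it here. */
          mmu_notifier_invalidate_range_only_end(mm, start, end);
  }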

It should improve performance, but I am lacking hardware and benchmarks
that might show the improvement. Maybe folks on Cc can help here.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org

Jérôme Glisse (2):
  mm/mmu_notifier: avoid double notification when it is useless v2
  mm/mmu_notifier: avoid call to invalidate_range() in range_end()

 Documentation/vm/mmu_notifier.txt | 93 +++++++++++++++++++++++++++++++++++++++
 fs/dax.c                          |  9 +++-
 include/linux/mmu_notifier.h      | 20 +++++++--
 mm/huge_memory.c                  | 66 ++++++++++++++++++++++++---
 mm/hugetlb.c                      | 16 +++++--
 mm/ksm.c                          | 15 ++++++-
 mm/memory.c                       |  6 ++-
 mm/migrate.c                      | 15 +++++--
 mm/mmu_notifier.c                 | 11 ++++-
 mm/rmap.c                         | 59 ++++++++++++++++++++++---
 10 files changed, 281 insertions(+), 29 deletions(-)
 create mode 100644 Documentation/vm/mmu_notifier.txt

-- 
2.13.6

* [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-17  3:10   ` jglisse
  0 siblings, 0 replies; 40+ messages in thread
From: jglisse @ 2017-10-17  3:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrea Arcangeli,
	Nadav Amit, Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu, linuxppc-dev, linux-next

From: Jérôme Glisse <jglisse@redhat.com>

This patch only affects users of the mmu_notifier->invalidate_range
callback, which are device drivers related to ATS/PASID, CAPI, IOMMUv2,
SVM, etc., and it is an optimization for those users. Everyone else is
unaffected by it.

When clearing a pte/pmd we are given a choice to notify the event under
the page table lock (the notify versions of the *_clear_flush helpers do
call mmu_notifier_invalidate_range()). But that notification is not
necessary in all cases.

This patch removes almost all cases where it is useless to have a call to
mmu_notifier_invalidate_range() before mmu_notifier_invalidate_range_end().
It also adds documentation for all those cases, explaining why it is safe.

Below is a more in-depth analysis of why this is fine to do:

For secondary TLBs (non-CPU TLBs) like the IOMMU TLB or a device TLB
(when the device uses something like ATS/PASID to get the IOMMU to walk
the CPU page table to access a process virtual address space), there are
only two cases where you need to notify those secondary TLBs while
holding the page table lock when clearing a pte/pmd:

  A) the page backing the address is freed before
     mmu_notifier_invalidate_range_end()
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious: you do not want to take the risk of the device writing
to a page that might now be used by something completely different.

Case B is more subtle. For correctness it requires the following sequence
to happen:
  - take the page table lock
  - clear the page table entry and notify (pmdp_huge_clear_flush_notify()
    or ptep_clear_flush_notify())
  - set the page table entry to point to the new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value, then you can break a memory model like C11 or C++11
for the device.
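
A minimal sketch of that ordering (illustrative only; cow_replace_pte() is
a made-up helper, the pte helpers are the existing ones):

  static void cow_replace_pte(struct vm_area_struct *vma, unsigned long addr,
                              pte_t *ptep, struct page *new_page,
                              spinlock_t *ptl)
  {
          pte_t entry;

          spin_lock(ptl);                 /* take the page table lock */
          /* clear, flush and call mmu_notifier_invalidate_range() under ptl */
          ptep_clear_flush_notify(vma, addr, ptep);
          entry = mk_pte(new_page, vma->vm_page_prot);
          entry = maybe_mkwrite(pte_mkdirty(entry), vma);
          /* only now make the entry point to the new page */
          set_pte_at(vma->vm_mm, addr, ptep, entry);
          spin_unlock(ptl);
  }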

Consider the following scenario (the device uses a feature similar to
ATS/PASID):

Two addresses, addrA and addrB, such that |addrA - addrB| >= PAGE_SIZE;
we assume they are write protected for COW (the other cases of B apply
too).

[Time N] -----------------------------------------------------------------
CPU-thread-0  {try to write to addrA}
CPU-thread-1  {try to write to addrB}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA and populate device TLB}
DEV-thread-2  {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {write to addrA which is a write to new page}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {}
CPU-thread-3  {write to addrB which is a write to new page}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA from old page}
DEV-thread-2  {read addrB from new page}

So here, because at time N+2 the cleared page table entry was not paired
with a notification to invalidate the secondary TLB, the device sees the
new value for addrB before seeing the new value for addrA. This breaks
total memory ordering for the device.

When changing a pte to write protect, or to point to a new write protected
page with the same content (KSM), it is fine to delay the invalidate_range
callback to mmu_notifier_invalidate_range_end(), outside the page table
lock. This is true even if the thread doing the page table update is
preempted right after releasing the page table lock but before calling
mmu_notifier_invalidate_range_end().
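
For contrast, a sketch of that deferred case (again illustrative only,
wrprotect_pte_example() is made up): when only downgrading protection,
the plain clear-flush helper is enough and the secondary TLB is
invalidated once, from mmu_notifier_invalidate_range_end():

  static void wrprotect_pte_example(struct vm_area_struct *vma,
                                    unsigned long addr, pte_t *ptep,
                                    spinlock_t *ptl)
  {
          struct mm_struct *mm = vma->vm_mm;
          pte_t entry;

          mmu_notifier_invalidate_range_start(mm, addr, addr + PAGE_SIZE);
          spin_lock(ptl);
          entry = ptep_clear_flush(vma, addr, ptep);  /* no notify here */
          entry = pte_wrprotect(entry);
          set_pte_at(mm, addr, ptep, entry);
          spin_unlock(ptl);
          /* ->invalidate_range() is called from here, outside the ptl */
          mmu_notifier_invalidate_range_end(mm, addr, addr + PAGE_SIZE);
  }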

Changed since v1:
  - typos (thanks to Andrea)
  - Avoid unnecessary precaution in try_to_unmap() (Andrea)
  - Be more conservative in try_to_unmap_one()

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-next@vger.kernel.org
---
 Documentation/vm/mmu_notifier.txt | 93 +++++++++++++++++++++++++++++++++++++++
 fs/dax.c                          |  9 +++-
 include/linux/mmu_notifier.h      |  3 +-
 mm/huge_memory.c                  | 20 +++++++--
 mm/hugetlb.c                      | 16 +++++--
 mm/ksm.c                          | 15 ++++++-
 mm/rmap.c                         | 59 ++++++++++++++++++++++---
 7 files changed, 198 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/vm/mmu_notifier.txt

diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.txt
new file mode 100644
index 000000000000..23b462566bb7
--- /dev/null
+++ b/Documentation/vm/mmu_notifier.txt
@@ -0,0 +1,93 @@
+When do you need to notify inside page table lock ?
+
+When clearing a pte/pmd we are given a choice to notify the event through
+(notify version of *_clear_flush call mmu_notifier_invalidate_range) under
+the page table lock. But that notification is not necessary in all cases.
+
+For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use
+thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a
+process virtual address space). There is only 2 cases when you need to notify
+those secondary TLB while holding page table lock when clearing a pte/pmd:
+
+  A) page backing address is free before mmu_notifier_invalidate_range_end()
+  B) a page table entry is updated to point to a new page (COW, write fault
+     on zero page, __replace_page(), ...)
+
+Case A is obvious you do not want to take the risk for the device to write to
+a page that might now be used by some completely different task.
+
+Case B is more subtle. For correctness it requires the following sequence to
+happen:
+  - take page table lock
+  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
+  - set page table entry to point to new page
+
+If clearing the page table entry is not followed by a notify before setting
+the new pte/pmd value then you can break memory model like C11 or C++11 for
+the device.
+
+Consider the following scenario (device use a feature similar to ATS/PASID):
+
+Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume
+they are write protected for COW (other case of B apply too).
+
+[Time N] --------------------------------------------------------------------
+CPU-thread-0  {try to write to addrA}
+CPU-thread-1  {try to write to addrB}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {read addrA and populate device TLB}
+DEV-thread-2  {read addrB and populate device TLB}
+[Time N+1] ------------------------------------------------------------------
+CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
+CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+2] ------------------------------------------------------------------
+CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
+CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+3] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {preempted}
+CPU-thread-2  {write to addrA which is a write to new page}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+3] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {preempted}
+CPU-thread-2  {}
+CPU-thread-3  {write to addrB which is a write to new page}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+4] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+5] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {read addrA from old page}
+DEV-thread-2  {read addrB from new page}
+
+So here because at time N+2 the clear page table entry was not pair with a
+notification to invalidate the secondary TLB, the device see the new value for
+addrB before seing the new value for addrA. This break total memory ordering
+for the device.
+
+When changing a pte to write protect or to point to a new write protected page
+with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range
+call to mmu_notifier_invalidate_range_end() outside the page table lock. This
+is true even if the thread doing the page table update is preempted right after
+releasing page table lock but before call mmu_notifier_invalidate_range_end().
diff --git a/fs/dax.c b/fs/dax.c
index f3a44a7c14b3..9ec797424e4f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -614,6 +614,13 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 		if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl))
 			continue;
 
+		/*
+		 * No need to call mmu_notifier_invalidate_range() as we are
+		 * downgrading page table protection not changing it to point
+		 * to a new page.
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
+		 */
 		if (pmdp) {
 #ifdef CONFIG_FS_DAX_PMD
 			pmd_t pmd;
@@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			pmd = pmd_wrprotect(pmd);
 			pmd = pmd_mkclean(pmd);
 			set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 unlock_pmd:
 			spin_unlock(ptl);
 #endif
@@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			pte = pte_wrprotect(pte);
 			pte = pte_mkclean(pte);
 			set_pte_at(vma->vm_mm, address, ptep, pte);
-			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 unlock_pte:
 			pte_unmap_unlock(ptep, ptl);
 		}
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 6866e8126982..49c925c96b8a 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -155,7 +155,8 @@ struct mmu_notifier_ops {
 	 * shared page-tables, it not necessary to implement the
 	 * invalidate_range_start()/end() notifiers, as
 	 * invalidate_range() alread catches the points in time when an
-	 * external TLB range needs to be flushed.
+	 * external TLB range needs to be flushed. For more in depth
+	 * discussion on this see Documentation/vm/mmu_notifier.txt
 	 *
 	 * The invalidate_range() function is called under the ptl
 	 * spin-lock and not allowed to sleep.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c037d3d34950..ff5bc647b51d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
 		goto out_free_pages;
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
+	/*
+	 * Leave pmd empty until pte is filled note we must notify here as
+	 * concurrent CPU thread might write to new page before the call to
+	 * mmu_notifier_invalidate_range_end() happens which can lead to a
+	 * device seeing memory write in different order than CPU.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
 	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
-	/* leave pmd empty until pte is filled */
 
 	pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
 	pmd_populate(vma->vm_mm, &_pmd, pgtable);
@@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_t _pmd;
 	int i;
 
-	/* leave pmd empty until pte is filled */
-	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+	/*
+	 * Leave pmd empty until pte is filled note that it is fine to delay
+	 * notification until mmu_notifier_invalidate_range_end() as we are
+	 * replacing a zero pmd write protected page with a zero pte write
+	 * protected page.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
+	pmdp_huge_clear_flush(vma, haddr, pmd);
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1768efa4c501..63a63f1b536c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
 		} else {
 			if (cow) {
+				/*
+				 * No need to notify as we are downgrading page
+				 * table protection not changing it to point
+				 * to a new page.
+				 *
+				 * See Documentation/vm/mmu_notifier.txt
+				 */
 				huge_ptep_set_wrprotect(src, addr, src_pte);
-				mmu_notifier_invalidate_range(src, mmun_start,
-								   mmun_end);
 			}
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * and that page table be reused and filled with junk.
 	 */
 	flush_hugetlb_tlb_range(vma, start, end);
-	mmu_notifier_invalidate_range(mm, start, end);
+	/*
+	 * No need to call mmu_notifier_invalidate_range() we are downgrading
+	 * page table protection not changing it to point to a new page.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 6cb60f46cce5..be8f4576f842 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		 * So we clear the pte and flush the tlb before the check
 		 * this assure us that no O_DIRECT can happen after the check
 		 * or in the middle of the check.
+		 *
+		 * No need to notify as we are downgrading page table to read
+		 * only not changing it to point to a new page.
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
 		 */
-		entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
+		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
 		/*
 		 * Check that no O_DIRECT or similar I/O is in progress on the
 		 * page
@@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	}
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
-	ptep_clear_flush_notify(vma, addr, ptep);
+	/*
+	 * No need to notify as we are replacing a read only page with another
+	 * read only page with the same content.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
+	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, newpte);
 
 	page_remove_rmap(page, false);
diff --git a/mm/rmap.c b/mm/rmap.c
index 061826278520..6b5a0f219ac0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 #endif
 		}
 
-		if (ret) {
-			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
+		/*
+		 * No need to call mmu_notifier_invalidate_range() as we are
+		 * downgrading page table protection not changing it to point
+		 * to a new page.
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
+		 */
+		if (ret)
 			(*cleaned)++;
-		}
 	}
 
 	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
@@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
+			/*
+			 * No need to invalidate here it will synchronize on
+			 * against the special swap migration pte.
+			 */
 			goto discard;
 		}
 
@@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * will take care of the rest.
 			 */
 			dec_mm_counter(mm, mm_counter(page));
+			/* We have to invalidate as we cleared the pte */
+			mmu_notifier_invalidate_range(mm, address,
+						      address + PAGE_SIZE);
 		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
 				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
 			swp_entry_t entry;
@@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			/*
+			 * No need to invalidate here it will synchronize on
+			 * against the special swap migration pte.
+			 */
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
@@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				WARN_ON_ONCE(1);
 				ret = false;
 				/* We have to invalidate as we cleared the pte */
+				mmu_notifier_invalidate_range(mm, address,
+							address + PAGE_SIZE);
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
@@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			/* MADV_FREE page check */
 			if (!PageSwapBacked(page)) {
 				if (!PageDirty(page)) {
+					/* Invalidate as we cleared the pte */
+					mmu_notifier_invalidate_range(mm,
+						address, address + PAGE_SIZE);
 					dec_mm_counter(mm, MM_ANONPAGES);
 					goto discard;
 				}
@@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
-		} else
+			/* Invalidate as we cleared the pte */
+			mmu_notifier_invalidate_range(mm, address,
+						      address + PAGE_SIZE);
+		} else {
+			/*
+			 * We should not need to notify here as we reach this
+			 * case only from freeze_page() itself only call from
+			 * split_huge_page_to_list() so everything below must
+			 * be true:
+			 *   - page is not anonymous
+			 *   - page is locked
+			 *
+			 * So as it is a locked file back page thus it can not
+			 * be remove from the page cache and replace by a new
+			 * page before mmu_notifier_invalidate_range_end so no
+			 * concurrent thread might update its page table to
+			 * point at new page while a device still is using this
+			 * page.
+			 *
+			 * See Documentation/vm/mmu_notifier.txt
+			 */
 			dec_mm_counter(mm, mm_counter_file(page));
+		}
 discard:
+		/*
+		 * No need to call mmu_notifier_invalidate_range() it has be
+		 * done above for all cases requiring it to happen under page
+		 * table lock before mmu_notifier_invalidate_range_end()
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
+		 */
 		page_remove_rmap(subpage, PageHuge(page));
 		put_page(page);
-		mmu_notifier_invalidate_range(mm, address,
-					      address + PAGE_SIZE);
 	}
 
 	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-17  3:10   ` jglisse-H+wXaHxf7aLQT0dZR+AlfA
  0 siblings, 0 replies; 40+ messages in thread
From: jglisse-H+wXaHxf7aLQT0dZR+AlfA @ 2017-10-17  3:10 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: Andrea Arcangeli, Stephen Rothwell, Joerg Roedel,
	Benjamin Herrenschmidt, Andrew Donnellan,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Jérôme Glisse, linux-next-u79uwXL29TY76Z2rM5mHXA,
	Michael Ellerman, Alistair Popple, Andrew Morton, Linus Torvalds,
	David Woodhouse

From: Jérôme Glisse <jglisse@redhat.com>

This patch only affects users of mmu_notifier->invalidate_range callback
which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
and it is an optimization for those users. Everyone else is unaffected
by it.

When clearing a pte/pmd we are given a choice to notify the event under
the page table lock (notify version of *_clear_flush helpers do call the
mmu_notifier_invalidate_range). But that notification is not necessary in
all cases.

This patches remove almost all cases where it is useless to have a call
to mmu_notifier_invalidate_range before mmu_notifier_invalidate_range_end.
It also adds documentation in all those cases explaining why.

Below is a more in depth analysis of why this is fine to do this:

For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device
use thing like ATS/PASID to get the IOMMU to walk the CPU page table to
access a process virtual address space). There is only 2 cases when you
need to notify those secondary TLB while holding page table lock when
clearing a pte/pmd:

  A) page backing address is free before mmu_notifier_invalidate_range_end
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious you do not want to take the risk for the device to write
to a page that might now be used by something completely different.

Case B is more subtle. For correctness it requires the following sequence
to happen:
  - take page table lock
  - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
  - set page table entry to point to new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value then you can break memory model like C11 or C++11 for
the device.

Consider the following scenario (device use a feature similar to ATS/
PASID):

Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
assume they are write protected for COW (other case of B apply too).

[Time N] -----------------------------------------------------------------
CPU-thread-0  {try to write to addrA}
CPU-thread-1  {try to write to addrB}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA and populate device TLB}
DEV-thread-2  {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {write to addrA which is a write to new page}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {}
CPU-thread-3  {write to addrB which is a write to new page}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA from old page}
DEV-thread-2  {read addrB from new page}

So here because at time N+2 the clear page table entry was not pair with a
notification to invalidate the secondary TLB, the device see the new value
for addrB before seing the new value for addrA. This break total memory
ordering for the device.

When changing a pte to write protect or to point to a new write protected
page with same content (KSM) it is ok to delay invalidate_range callback to
mmu_notifier_invalidate_range_end() outside the page table lock. This is
true even if the thread doing page table update is preempted right after
releasing page table lock before calling mmu_notifier_invalidate_range_end

Changed since v1:
  - typos (thanks to Andrea)
  - Avoid unnecessary precaution in try_to_unmap() (Andrea)
  - Be more conservative in try_to_unmap_one()

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-next@vger.kernel.org
---
 Documentation/vm/mmu_notifier.txt | 93 +++++++++++++++++++++++++++++++++++++++
 fs/dax.c                          |  9 +++-
 include/linux/mmu_notifier.h      |  3 +-
 mm/huge_memory.c                  | 20 +++++++--
 mm/hugetlb.c                      | 16 +++++--
 mm/ksm.c                          | 15 ++++++-
 mm/rmap.c                         | 59 ++++++++++++++++++++++---
 7 files changed, 198 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/vm/mmu_notifier.txt

diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.txt
new file mode 100644
index 000000000000..23b462566bb7
--- /dev/null
+++ b/Documentation/vm/mmu_notifier.txt
@@ -0,0 +1,93 @@
+When do you need to notify inside page table lock ?
+
+When clearing a pte/pmd we are given a choice to notify the event through
+(notify version of *_clear_flush call mmu_notifier_invalidate_range) under
+the page table lock. But that notification is not necessary in all cases.
+
+For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use
+thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a
+process virtual address space). There is only 2 cases when you need to notify
+those secondary TLB while holding page table lock when clearing a pte/pmd:
+
+  A) page backing address is free before mmu_notifier_invalidate_range_end()
+  B) a page table entry is updated to point to a new page (COW, write fault
+     on zero page, __replace_page(), ...)
+
+Case A is obvious you do not want to take the risk for the device to write to
+a page that might now be used by some completely different task.
+
+Case B is more subtle. For correctness it requires the following sequence to
+happen:
+  - take page table lock
+  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
+  - set page table entry to point to new page
+
+If clearing the page table entry is not followed by a notify before setting
+the new pte/pmd value then you can break memory model like C11 or C++11 for
+the device.
+
+Consider the following scenario (device use a feature similar to ATS/PASID):
+
+Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume
+they are write protected for COW (other case of B apply too).
+
+[Time N] --------------------------------------------------------------------
+CPU-thread-0  {try to write to addrA}
+CPU-thread-1  {try to write to addrB}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {read addrA and populate device TLB}
+DEV-thread-2  {read addrB and populate device TLB}
+[Time N+1] ------------------------------------------------------------------
+CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
+CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+2] ------------------------------------------------------------------
+CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
+CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+3] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {preempted}
+CPU-thread-2  {write to addrA which is a write to new page}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+3] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {preempted}
+CPU-thread-2  {}
+CPU-thread-3  {write to addrB which is a write to new page}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+4] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+5] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {read addrA from old page}
+DEV-thread-2  {read addrB from new page}
+
+So here because at time N+2 the clear page table entry was not pair with a
+notification to invalidate the secondary TLB, the device see the new value for
+addrB before seing the new value for addrA. This break total memory ordering
+for the device.
+
+When changing a pte to write protect or to point to a new write protected page
+with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range
+call to mmu_notifier_invalidate_range_end() outside the page table lock. This
+is true even if the thread doing the page table update is preempted right after
+releasing page table lock but before call mmu_notifier_invalidate_range_end().
diff --git a/fs/dax.c b/fs/dax.c
index f3a44a7c14b3..9ec797424e4f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -614,6 +614,13 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 		if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl))
 			continue;
 
+		/*
+		 * No need to call mmu_notifier_invalidate_range() as we are
+		 * downgrading page table protection not changing it to point
+		 * to a new page.
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
+		 */
 		if (pmdp) {
 #ifdef CONFIG_FS_DAX_PMD
 			pmd_t pmd;
@@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			pmd = pmd_wrprotect(pmd);
 			pmd = pmd_mkclean(pmd);
 			set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 unlock_pmd:
 			spin_unlock(ptl);
 #endif
@@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			pte = pte_wrprotect(pte);
 			pte = pte_mkclean(pte);
 			set_pte_at(vma->vm_mm, address, ptep, pte);
-			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 unlock_pte:
 			pte_unmap_unlock(ptep, ptl);
 		}
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 6866e8126982..49c925c96b8a 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -155,7 +155,8 @@ struct mmu_notifier_ops {
 	 * shared page-tables, it not necessary to implement the
 	 * invalidate_range_start()/end() notifiers, as
 	 * invalidate_range() alread catches the points in time when an
-	 * external TLB range needs to be flushed.
+	 * external TLB range needs to be flushed. For more in depth
+	 * discussion on this see Documentation/vm/mmu_notifier.txt
 	 *
 	 * The invalidate_range() function is called under the ptl
 	 * spin-lock and not allowed to sleep.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c037d3d34950..ff5bc647b51d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
 		goto out_free_pages;
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
+	/*
+	 * Leave pmd empty until pte is filled note we must notify here as
+	 * concurrent CPU thread might write to new page before the call to
+	 * mmu_notifier_invalidate_range_end() happens which can lead to a
+	 * device seeing memory write in different order than CPU.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
 	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
-	/* leave pmd empty until pte is filled */
 
 	pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
 	pmd_populate(vma->vm_mm, &_pmd, pgtable);
@@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_t _pmd;
 	int i;
 
-	/* leave pmd empty until pte is filled */
-	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+	/*
+	 * Leave pmd empty until pte is filled note that it is fine to delay
+	 * notification until mmu_notifier_invalidate_range_end() as we are
+	 * replacing a zero pmd write protected page with a zero pte write
+	 * protected page.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
+	pmdp_huge_clear_flush(vma, haddr, pmd);
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1768efa4c501..63a63f1b536c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
 		} else {
 			if (cow) {
+				/*
+				 * No need to notify as we are downgrading page
+				 * table protection not changing it to point
+				 * to a new page.
+				 *
+				 * See Documentation/vm/mmu_notifier.txt
+				 */
 				huge_ptep_set_wrprotect(src, addr, src_pte);
-				mmu_notifier_invalidate_range(src, mmun_start,
-								   mmun_end);
 			}
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * and that page table be reused and filled with junk.
 	 */
 	flush_hugetlb_tlb_range(vma, start, end);
-	mmu_notifier_invalidate_range(mm, start, end);
+	/*
+	 * No need to call mmu_notifier_invalidate_range() we are downgrading
+	 * page table protection not changing it to point to a new page.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 6cb60f46cce5..be8f4576f842 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		 * So we clear the pte and flush the tlb before the check
 		 * this assure us that no O_DIRECT can happen after the check
 		 * or in the middle of the check.
+		 *
+		 * No need to notify as we are downgrading page table to read
+		 * only not changing it to point to a new page.
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
 		 */
-		entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
+		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
 		/*
 		 * Check that no O_DIRECT or similar I/O is in progress on the
 		 * page
@@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	}
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
-	ptep_clear_flush_notify(vma, addr, ptep);
+	/*
+	 * No need to notify as we are replacing a read only page with another
+	 * read only page with the same content.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
+	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, newpte);
 
 	page_remove_rmap(page, false);
diff --git a/mm/rmap.c b/mm/rmap.c
index 061826278520..6b5a0f219ac0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 #endif
 		}
 
-		if (ret) {
-			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
+		/*
+		 * No need to call mmu_notifier_invalidate_range() as we are
+		 * downgrading page table protection not changing it to point
+		 * to a new page.
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
+		 */
+		if (ret)
 			(*cleaned)++;
-		}
 	}
 
 	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
@@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
+			/*
+			 * No need to invalidate here it will synchronize on
+			 * against the special swap migration pte.
+			 */
 			goto discard;
 		}
 
@@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * will take care of the rest.
 			 */
 			dec_mm_counter(mm, mm_counter(page));
+			/* We have to invalidate as we cleared the pte */
+			mmu_notifier_invalidate_range(mm, address,
+						      address + PAGE_SIZE);
 		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
 				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
 			swp_entry_t entry;
@@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			/*
+			 * No need to invalidate here it will synchronize on
+			 * against the special swap migration pte.
+			 */
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
@@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				WARN_ON_ONCE(1);
 				ret = false;
 				/* We have to invalidate as we cleared the pte */
+				mmu_notifier_invalidate_range(mm, address,
+							address + PAGE_SIZE);
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
@@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			/* MADV_FREE page check */
 			if (!PageSwapBacked(page)) {
 				if (!PageDirty(page)) {
+					/* Invalidate as we cleared the pte */
+					mmu_notifier_invalidate_range(mm,
+						address, address + PAGE_SIZE);
 					dec_mm_counter(mm, MM_ANONPAGES);
 					goto discard;
 				}
@@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
-		} else
+			/* Invalidate as we cleared the pte */
+			mmu_notifier_invalidate_range(mm, address,
+						      address + PAGE_SIZE);
+		} else {
+			/*
+			 * We should not need to notify here as we reach this
+			 * case only from freeze_page() itself only call from
+			 * split_huge_page_to_list() so everything below must
+			 * be true:
+			 *   - page is not anonymous
+			 *   - page is locked
+			 *
+			 * So as it is a locked file back page thus it can not
+			 * be remove from the page cache and replace by a new
+			 * page before mmu_notifier_invalidate_range_end so no
+			 * concurrent thread might update its page table to
+			 * point at new page while a device still is using this
+			 * page.
+			 *
+			 * See Documentation/vm/mmu_notifier.txt
+			 */
 			dec_mm_counter(mm, mm_counter_file(page));
+		}
 discard:
+		/*
+		 * No need to call mmu_notifier_invalidate_range() it has be
+		 * done above for all cases requiring it to happen under page
+		 * table lock before mmu_notifier_invalidate_range_end()
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
+		 */
 		page_remove_rmap(subpage, PageHuge(page));
 		put_page(page);
-		mmu_notifier_invalidate_range(mm, address,
-					      address + PAGE_SIZE);
 	}
 
 	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
-- 
2.13.6

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-17  3:10   ` jglisse-H+wXaHxf7aLQT0dZR+AlfA
  0 siblings, 0 replies; 40+ messages in thread
From: jglisse @ 2017-10-17  3:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrea Arcangeli,
	Nadav Amit, Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu, linuxppc-dev, linux-next

From: JA(C)rA'me Glisse <jglisse@redhat.com>

This patch only affects users of mmu_notifier->invalidate_range callback
which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
and it is an optimization for those users. Everyone else is unaffected
by it.

When clearing a pte/pmd we are given a choice to notify the event under
the page table lock (notify version of *_clear_flush helpers do call the
mmu_notifier_invalidate_range). But that notification is not necessary in
all cases.

This patches remove almost all cases where it is useless to have a call
to mmu_notifier_invalidate_range before mmu_notifier_invalidate_range_end.
It also adds documentation in all those cases explaining why.

Below is a more in depth analysis of why this is fine to do this:

For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device
use thing like ATS/PASID to get the IOMMU to walk the CPU page table to
access a process virtual address space). There is only 2 cases when you
need to notify those secondary TLB while holding page table lock when
clearing a pte/pmd:

  A) page backing address is free before mmu_notifier_invalidate_range_end
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious you do not want to take the risk for the device to write
to a page that might now be used by something completely different.

Case B is more subtle. For correctness it requires the following sequence
to happen:
  - take page table lock
  - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
  - set page table entry to point to new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value then you can break memory model like C11 or C++11 for
the device.

Consider the following scenario (device use a feature similar to ATS/
PASID):

Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
assume they are write protected for COW (other case of B apply too).

[Time N] -----------------------------------------------------------------
CPU-thread-0  {try to write to addrA}
CPU-thread-1  {try to write to addrB}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA and populate device TLB}
DEV-thread-2  {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {write to addrA which is a write to new page}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {}
CPU-thread-3  {write to addrB which is a write to new page}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA from old page}
DEV-thread-2  {read addrB from new page}

So here because at time N+2 the clear page table entry was not pair with a
notification to invalidate the secondary TLB, the device see the new value
for addrB before seing the new value for addrA. This break total memory
ordering for the device.

When changing a pte to write protect or to point to a new write protected
page with same content (KSM) it is ok to delay invalidate_range callback to
mmu_notifier_invalidate_range_end() outside the page table lock. This is
true even if the thread doing page table update is preempted right after
releasing page table lock before calling mmu_notifier_invalidate_range_end

Changed since v1:
  - typos (thanks to Andrea)
  - Avoid unnecessary precaution in try_to_unmap() (Andrea)
  - Be more conservative in try_to_unmap_one()

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-next@vger.kernel.org
---
 Documentation/vm/mmu_notifier.txt | 93 +++++++++++++++++++++++++++++++++++++++
 fs/dax.c                          |  9 +++-
 include/linux/mmu_notifier.h      |  3 +-
 mm/huge_memory.c                  | 20 +++++++--
 mm/hugetlb.c                      | 16 +++++--
 mm/ksm.c                          | 15 ++++++-
 mm/rmap.c                         | 59 ++++++++++++++++++++++---
 7 files changed, 198 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/vm/mmu_notifier.txt

diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.txt
new file mode 100644
index 000000000000..23b462566bb7
--- /dev/null
+++ b/Documentation/vm/mmu_notifier.txt
@@ -0,0 +1,93 @@
+When do you need to notify inside the page table lock?
+
+When clearing a pte/pmd we can choose to notify the event under the page table
+lock (the notify versions of *_clear_flush call mmu_notifier_invalidate_range()).
+But that notification is not necessary in all cases.
+
+For a secondary TLB (non-CPU TLB), such as an IOMMU TLB or a device TLB (when a
+device uses something like ATS/PASID to have the IOMMU walk the CPU page table
+to access a process virtual address space), there are only 2 cases where you
+need to notify that secondary TLB under the page table lock when clearing a pte/pmd:
+
+  A) the page backing the address is freed before mmu_notifier_invalidate_range_end()
+  B) a page table entry is updated to point to a new page (COW, write fault on
+     the zero page, __replace_page(), ...)
+
+Case A is obvious: you do not want to take the risk of the device writing to
+a page that might now be used by some completely different task.
+
+Case B is more subtle. For correctness it requires the following sequence to
+happen:
+  - take page table lock
+  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
+  - set page table entry to point to new page
+
+If clearing the page table entry is not followed by a notify before setting
+the new pte/pmd value, then you can break the memory model (e.g. C11 or C++11)
+for the device.
+
+Consider the following scenario (the device uses a feature similar to ATS/PASID):
+
+Two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we assume
+they are write protected for COW (the other cases of B apply too).
+
+[Time N] --------------------------------------------------------------------
+CPU-thread-0  {try to write to addrA}
+CPU-thread-1  {try to write to addrB}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {read addrA and populate device TLB}
+DEV-thread-2  {read addrB and populate device TLB}
+[Time N+1] ------------------------------------------------------------------
+CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
+CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+2] ------------------------------------------------------------------
+CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
+CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+3] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {preempted}
+CPU-thread-2  {write to addrA which is a write to new page}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+4] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {preempted}
+CPU-thread-2  {}
+CPU-thread-3  {write to addrB which is a write to new page}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+5] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {}
+DEV-thread-2  {}
+[Time N+6] ------------------------------------------------------------------
+CPU-thread-0  {preempted}
+CPU-thread-1  {}
+CPU-thread-2  {}
+CPU-thread-3  {}
+DEV-thread-0  {read addrA from old page}
+DEV-thread-2  {read addrB from new page}
+
+So here, because at time N+2 the cleared page table entry was not paired with
+a notification to invalidate the secondary TLB, the device sees the new value
+for addrB before seeing the new value for addrA. This breaks total memory
+ordering for the device.
+
+When changing a pte to write protect it, or to point to a new write protected page
+with the same content (KSM), it is fine to delay the mmu_notifier_invalidate_range
+call to mmu_notifier_invalidate_range_end() outside the page table lock. This
+is true even if the thread doing the page table update is preempted right after
+releasing the page table lock but before calling mmu_notifier_invalidate_range_end().
diff --git a/fs/dax.c b/fs/dax.c
index f3a44a7c14b3..9ec797424e4f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -614,6 +614,13 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 		if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl))
 			continue;
 
+		/*
+		 * No need to call mmu_notifier_invalidate_range() as we are
+		 * downgrading page table protection not changing it to point
+		 * to a new page.
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
+		 */
 		if (pmdp) {
 #ifdef CONFIG_FS_DAX_PMD
 			pmd_t pmd;
@@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			pmd = pmd_wrprotect(pmd);
 			pmd = pmd_mkclean(pmd);
 			set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 unlock_pmd:
 			spin_unlock(ptl);
 #endif
@@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			pte = pte_wrprotect(pte);
 			pte = pte_mkclean(pte);
 			set_pte_at(vma->vm_mm, address, ptep, pte);
-			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 unlock_pte:
 			pte_unmap_unlock(ptep, ptl);
 		}
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 6866e8126982..49c925c96b8a 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -155,7 +155,8 @@ struct mmu_notifier_ops {
 	 * shared page-tables, it not necessary to implement the
 	 * invalidate_range_start()/end() notifiers, as
 	 * invalidate_range() alread catches the points in time when an
-	 * external TLB range needs to be flushed.
+	 * external TLB range needs to be flushed. For more in depth
+	 * discussion on this see Documentation/vm/mmu_notifier.txt
 	 *
 	 * The invalidate_range() function is called under the ptl
 	 * spin-lock and not allowed to sleep.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c037d3d34950..ff5bc647b51d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
 		goto out_free_pages;
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
+	/*
+	 * Leave pmd empty until pte is filled. Note we must notify here as a
+	 * concurrent CPU thread might write to the new page before the call to
+	 * mmu_notifier_invalidate_range_end() happens, which can lead to a
+	 * device seeing memory writes in a different order than the CPU.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
 	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
-	/* leave pmd empty until pte is filled */
 
 	pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
 	pmd_populate(vma->vm_mm, &_pmd, pgtable);
@@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_t _pmd;
 	int i;
 
-	/* leave pmd empty until pte is filled */
-	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+	/*
+	 * Leave pmd empty until pte is filled. Note that it is fine to delay
+	 * the notification until mmu_notifier_invalidate_range_end() as we are
+	 * replacing a zero pmd write protected page with a zero pte write
+	 * protected page.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
+	pmdp_huge_clear_flush(vma, haddr, pmd);
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1768efa4c501..63a63f1b536c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
 		} else {
 			if (cow) {
+				/*
+				 * No need to notify as we are downgrading page
+				 * table protection not changing it to point
+				 * to a new page.
+				 *
+				 * See Documentation/vm/mmu_notifier.txt
+				 */
 				huge_ptep_set_wrprotect(src, addr, src_pte);
-				mmu_notifier_invalidate_range(src, mmun_start,
-								   mmun_end);
 			}
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * and that page table be reused and filled with junk.
 	 */
 	flush_hugetlb_tlb_range(vma, start, end);
-	mmu_notifier_invalidate_range(mm, start, end);
+	/*
+	 * No need to call mmu_notifier_invalidate_range() we are downgrading
+	 * page table protection not changing it to point to a new page.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 6cb60f46cce5..be8f4576f842 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		 * So we clear the pte and flush the tlb before the check
 		 * this assure us that no O_DIRECT can happen after the check
 		 * or in the middle of the check.
+		 *
+		 * No need to notify as we are downgrading page table to read
+		 * only not changing it to point to a new page.
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
 		 */
-		entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
+		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
 		/*
 		 * Check that no O_DIRECT or similar I/O is in progress on the
 		 * page
@@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	}
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
-	ptep_clear_flush_notify(vma, addr, ptep);
+	/*
+	 * No need to notify as we are replacing a read only page with another
+	 * read only page with the same content.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
+	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, newpte);
 
 	page_remove_rmap(page, false);
diff --git a/mm/rmap.c b/mm/rmap.c
index 061826278520..6b5a0f219ac0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 #endif
 		}
 
-		if (ret) {
-			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
+		/*
+		 * No need to call mmu_notifier_invalidate_range() as we are
+		 * downgrading page table protection not changing it to point
+		 * to a new page.
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
+		 */
+		if (ret)
 			(*cleaned)++;
-		}
 	}
 
 	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
@@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
+			/*
+			 * No need to invalidate here, it will synchronize
+			 * against the special swap migration pte.
+			 */
 			goto discard;
 		}
 
@@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * will take care of the rest.
 			 */
 			dec_mm_counter(mm, mm_counter(page));
+			/* We have to invalidate as we cleared the pte */
+			mmu_notifier_invalidate_range(mm, address,
+						      address + PAGE_SIZE);
 		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
 				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
 			swp_entry_t entry;
@@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			/*
+			 * No need to invalidate here, it will synchronize
+			 * against the special swap migration pte.
+			 */
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
@@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				WARN_ON_ONCE(1);
 				ret = false;
 				/* We have to invalidate as we cleared the pte */
+				mmu_notifier_invalidate_range(mm, address,
+							address + PAGE_SIZE);
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
@@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			/* MADV_FREE page check */
 			if (!PageSwapBacked(page)) {
 				if (!PageDirty(page)) {
+					/* Invalidate as we cleared the pte */
+					mmu_notifier_invalidate_range(mm,
+						address, address + PAGE_SIZE);
 					dec_mm_counter(mm, MM_ANONPAGES);
 					goto discard;
 				}
@@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
-		} else
+			/* Invalidate as we cleared the pte */
+			mmu_notifier_invalidate_range(mm, address,
+						      address + PAGE_SIZE);
+		} else {
+			/*
+			 * We should not need to notify here as we reach this
+			 * case only from freeze_page(), which is itself only
+			 * called from split_huge_page_to_list(), so everything
+			 * below must be true:
+			 *   - page is not anonymous
+			 *   - page is locked
+			 *
+			 * As it is a locked file backed page it cannot be
+			 * removed from the page cache and replaced by a new
+			 * page before mmu_notifier_invalidate_range_end(), so
+			 * no concurrent thread can update its page table to
+			 * point at a new page while a device is still using
+			 * this page.
+			 *
+			 * See Documentation/vm/mmu_notifier.txt
+			 */
 			dec_mm_counter(mm, mm_counter_file(page));
+		}
 discard:
+		/*
+		 * No need to call mmu_notifier_invalidate_range() as it has been
+		 * done above for all cases requiring it to happen under the page
+		 * table lock before mmu_notifier_invalidate_range_end().
+		 *
+		 * See Documentation/vm/mmu_notifier.txt
+		 */
 		page_remove_rmap(subpage, PageHuge(page));
 		put_page(page);
-		mmu_notifier_invalidate_range(mm, address,
-					      address + PAGE_SIZE);
 	}
 
 	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/2] mm/mmu_notifier: avoid call to invalidate_range() in range_end()
  2017-10-17  3:10 ` jglisse
@ 2017-10-17  3:10   ` jglisse
  -1 siblings, 0 replies; 40+ messages in thread
From: jglisse @ 2017-10-17  3:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrea Arcangeli,
	Andrew Morton, Joerg Roedel, Suravee Suthikulpanit,
	David Woodhouse, Alistair Popple, Michael Ellerman,
	Benjamin Herrenschmidt, Stephen Rothwell, Andrew Donnellan,
	iommu, linuxppc-dev

From: Jérôme Glisse <jglisse@redhat.com>

This is an optimization patch that only affects mmu_notifier users which
rely on the invalidate_range() callback. This patch avoids calling that
callback twice in a row from inside __mmu_notifier_invalidate_range_end().

Existing pattern (before this patch):
    mmu_notifier_invalidate_range_start()
        pte/pmd/pud_clear_flush_notify()
            mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_end()
        mmu_notifier_invalidate_range()

New pattern (after this patch):
    mmu_notifier_invalidate_range_start()
        pte/pmd/pud_clear_flush_notify()
            mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_only_end()

We call the invalidate_range callback after clearing the page table entry,
under the page table lock, and we skip the call to invalidate_range inside
the __mmu_notifier_invalidate_range_end() function.
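
As a rough sketch of a converted call site (illustration only; mm, vma, pmd,
haddr, ptl, start and end stand in for whatever the real caller uses):

	mmu_notifier_invalidate_range_start(mm, start, end);
	ptl = pmd_lock(mm, pmd);
	/* invalidate_range() is called here, under the page table lock */
	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
	/* ... repopulate the page table ... */
	spin_unlock(ptl);
	/* skip the second invalidate_range(), only run invalidate_range_end() */
	mmu_notifier_invalidate_range_only_end(mm, start, end);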

Idea from Andrea Arcangeli

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
---
 include/linux/mmu_notifier.h | 17 ++++++++++++++--
 mm/huge_memory.c             | 46 ++++++++++++++++++++++++++++++++++++++++----
 mm/memory.c                  |  6 +++++-
 mm/migrate.c                 | 15 ++++++++++++---
 mm/mmu_notifier.c            | 11 +++++++++--
 5 files changed, 83 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 49c925c96b8a..6665c4624287 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -213,7 +213,8 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+				  unsigned long start, unsigned long end,
+				  bool only_end);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 
@@ -267,7 +268,14 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, false);
+}
+
+static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_end(mm, start, end, true);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -438,6 +446,11 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 {
 }
 
+static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+}
+
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ff5bc647b51d..b2912305994f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1220,7 +1220,12 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
 	page_remove_rmap(page, true);
 	spin_unlock(vmf->ptl);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	/*
+	 * No need to double call mmu_notifier->invalidate_range() callback as
+	 * the above pmdp_huge_clear_flush_notify() did already call it.
+	 */
+	mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start,
+						mmun_end);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1369,7 +1374,12 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	}
 	spin_unlock(vmf->ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	/*
+	 * No need to double call mmu_notifier->invalidate_range() callback as
+	 * the above pmdp_huge_clear_flush_notify() did already call it.
+	 */
+	mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start,
+					       mmun_end);
 out:
 	return ret;
 out_unlock:
@@ -2021,7 +2031,12 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 out:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PUD_SIZE);
+	/*
+	 * No need to double call mmu_notifier->invalidate_range() callback as
+	 * the above pudp_huge_clear_flush_notify() did already call it.
+	 */
+	mmu_notifier_invalidate_range_only_end(mm, haddr, haddr +
+					       HPAGE_PUD_SIZE);
 }
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
@@ -2096,6 +2111,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
 		return;
 	} else if (is_huge_zero_pmd(*pmd)) {
+		/*
+		 * FIXME: Do we want to invalidate the secondary mmu by calling
+		 * mmu_notifier_invalidate_range()? See the comments below inside
+		 * __split_huge_pmd().
+		 *
+		 * We are going from a write protected huge zero page to a write
+		 * protected small zero page, so it does not seem useful to
+		 * invalidate the secondary mmu at this time.
+		 */
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
@@ -2231,7 +2255,21 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	__split_huge_pmd_locked(vma, pmd, haddr, freeze);
 out:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
+	/*
+	 * No need to double call mmu_notifier->invalidate_range() callback.
+	 * There are 3 cases to consider inside __split_huge_pmd_locked():
+	 *  1) pmdp_huge_clear_flush_notify() calls invalidate_range(), obviously
+	 *  2) __split_huge_zero_page_pmd() deals with a read only zero page; any
+	 *    write fault will trigger a flush_notify before pointing to a new
+	 *    page (it is fine if the secondary mmu keeps pointing to the old
+	 *    zero page in the meantime)
+	 *  3) splitting a huge pmd into ptes pointing to the same page. No need
+	 *     to invalidate secondary tlb entries, they are all still valid.
+	 *     Any further change to an individual pte will notify, so no need
+	 *     to call mmu_notifier->invalidate_range()
+	 */
+	mmu_notifier_invalidate_range_only_end(mm, haddr, haddr +
+					       HPAGE_PMD_SIZE);
 }
 
 void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 47cdf4e85c2d..8a0c410037d2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2555,7 +2555,11 @@ static int wp_page_copy(struct vm_fault *vmf)
 		put_page(new_page);
 
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	/*
+	 * No need to double call mmu_notifier->invalidate_range() callback as
+	 * the above ptep_clear_flush_notify() did already call it.
+	 */
+	mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index e00814ca390e..2f0f8190cb6f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2088,7 +2088,11 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	/*
+	 * No need to double call mmu_notifier->invalidate_range() callback as
+	 * the above pmdp_huge_clear_flush_notify() did already call it.
+	 */
+	mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
@@ -2804,9 +2808,14 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 	}
 
+	/*
+	 * No need to double call mmu_notifier->invalidate_range() callback as
+	 * the above ptep_clear_flush_notify() inside migrate_vma_insert_page()
+	 * did already call it.
+	 */
 	if (notified)
-		mmu_notifier_invalidate_range_end(mm, mmu_start,
-						  migrate->end);
+		mmu_notifier_invalidate_range_only_end(mm, mmu_start,
+						       migrate->end);
 }
 
 /*
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 314285284e6e..96edb33fd09a 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -190,7 +190,9 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 bool only_end)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -204,8 +206,13 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		 * subsystem registers either invalidate_range_start()/end() or
 		 * invalidate_range(), so this will be no additional overhead
 		 * (besides the pointer check).
+		 *
+		 * We skip the call to invalidate_range() if we know it is safe,
+		 * i.e. the call site used mmu_notifier_invalidate_range_only_end(),
+		 * which is only done when a call to invalidate_range() has
+		 * already happened under the page table lock.
 		 */
-		if (mn->ops->invalidate_range)
+		if (!only_end && mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 		if (mn->ops->invalidate_range_end)
 			mn->ops->invalidate_range_end(mn, mm, start, end);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback
  2017-10-17  3:10 ` jglisse
@ 2017-10-19  2:43   ` Balbir Singh
  -1 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2017-10-19  2:43 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Joerg Roedel, Suravee Suthikulpanit, David Woodhouse,
	Alistair Popple, Michael Ellerman, Benjamin Herrenschmidt,
	Stephen Rothwell, Andrew Donnellan, iommu, linuxppc-dev

On Mon, 16 Oct 2017 23:10:01 -0400
jglisse@redhat.com wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> (Andrew you already have v1 in your queue of patch 1, patch 2 is new,
>  i think you can drop it patch 1 v1 for v2, v2 is bit more conservative
>  and i fixed typos)
> 
> All this only affect user of invalidate_range callback (at this time
> CAPI arch/powerpc/platforms/powernv/npu-dma.c, IOMMU ATS/PASID in
> drivers/iommu/amd_iommu_v2.c|intel-svm.c)
> 
> This patchset remove useless double call to mmu_notifier->invalidate_range
> callback wherever it is safe to do so. The first patch just remove useless
> call

As in an extra call? Where does that come from?

> and add documentation explaining why it is safe to do so. The second
> patch go further by introducing mmu_notifier_invalidate_range_only_end()
> which skip callback to invalidate_range this can be done when clearing a
> pte, pmd or pud with notification which call invalidate_range right after
> clearing under the page table lock.
>

Balbir Singh.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
  2017-10-17  3:10   ` jglisse-H+wXaHxf7aLQT0dZR+AlfA
@ 2017-10-19  3:04     ` Balbir Singh
  -1 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2017-10-19  3:04 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu, linuxppc-dev, linux-next

On Mon, 16 Oct 2017 23:10:02 -0400
jglisse@redhat.com wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> +		/*
> +		 * No need to call mmu_notifier_invalidate_range() as we are
> +		 * downgrading page table protection not changing it to point
> +		 * to a new page.
> +		 *
> +		 * See Documentation/vm/mmu_notifier.txt
> +		 */
>  		if (pmdp) {
>  #ifdef CONFIG_FS_DAX_PMD
>  			pmd_t pmd;
> @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>  			pmd = pmd_wrprotect(pmd);
>  			pmd = pmd_mkclean(pmd);
>  			set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> -			mmu_notifier_invalidate_range(vma->vm_mm, start, end);

Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?

>  unlock_pmd:
>  			spin_unlock(ptl);
>  #endif
> @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>  			pte = pte_wrprotect(pte);
>  			pte = pte_mkclean(pte);
>  			set_pte_at(vma->vm_mm, address, ptep, pte);
> -			mmu_notifier_invalidate_range(vma->vm_mm, start, end);

Ditto

>  unlock_pte:
>  			pte_unmap_unlock(ptep, ptl);
>  		}
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 6866e8126982..49c925c96b8a 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
>  	 * shared page-tables, it not necessary to implement the
>  	 * invalidate_range_start()/end() notifiers, as
>  	 * invalidate_range() alread catches the points in time when an
> -	 * external TLB range needs to be flushed.
> +	 * external TLB range needs to be flushed. For more in depth
> +	 * discussion on this see Documentation/vm/mmu_notifier.txt
>  	 *
>  	 * The invalidate_range() function is called under the ptl
>  	 * spin-lock and not allowed to sleep.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c037d3d34950..ff5bc647b51d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
>  		goto out_free_pages;
>  	VM_BUG_ON_PAGE(!PageHead(page), page);
>  
> +	/*
> +	 * Leave pmd empty until pte is filled. Note we must notify here as a
> +	 * concurrent CPU thread might write to the new page before the call to
> +	 * mmu_notifier_invalidate_range_end() happens, which can lead to a
> +	 * device seeing memory writes in a different order than the CPU.
> +	 *
> +	 * See Documentation/vm/mmu_notifier.txt
> +	 */
>  	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> -	/* leave pmd empty until pte is filled */
>  
>  	pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
>  	pmd_populate(vma->vm_mm, &_pmd, pgtable);
> @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>  	pmd_t _pmd;
>  	int i;
>  
> -	/* leave pmd empty until pte is filled */
> -	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> +	/*
> +	 * Leave pmd empty until pte is filled. Note that it is fine to delay
> +	 * the notification until mmu_notifier_invalidate_range_end() as we are
> +	 * replacing a zero pmd write protected page with a zero pte write
> +	 * protected page.
> +	 *
> +	 * See Documentation/vm/mmu_notifier.txt
> +	 */
> +	pmdp_huge_clear_flush(vma, haddr, pmd);

Shouldn't the secondary TLB know if the page size changed?

>  
>  	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>  	pmd_populate(mm, &_pmd, pgtable);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1768efa4c501..63a63f1b536c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
>  		} else {
>  			if (cow) {
> +				/*
> +				 * No need to notify as we are downgrading page
> +				 * table protection not changing it to point
> +				 * to a new page.
> +				 *
> +				 * See Documentation/vm/mmu_notifier.txt
> +				 */
>  				huge_ptep_set_wrprotect(src, addr, src_pte);

OK.. so we could get write faults on write accesses from the device.

> -				mmu_notifier_invalidate_range(src, mmun_start,
> -								   mmun_end);
>  			}
>  			entry = huge_ptep_get(src_pte);
>  			ptepage = pte_page(entry);
> @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	 * and that page table be reused and filled with junk.
>  	 */
>  	flush_hugetlb_tlb_range(vma, start, end);
> -	mmu_notifier_invalidate_range(mm, start, end);
> +	/*
> +	 * No need to call mmu_notifier_invalidate_range() we are downgrading
> +	 * page table protection not changing it to point to a new page.
> +	 *
> +	 * See Documentation/vm/mmu_notifier.txt
> +	 */
>  	i_mmap_unlock_write(vma->vm_file->f_mapping);
>  	mmu_notifier_invalidate_range_end(mm, start, end);
>  
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 6cb60f46cce5..be8f4576f842 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  		 * So we clear the pte and flush the tlb before the check
>  		 * this assure us that no O_DIRECT can happen after the check
>  		 * or in the middle of the check.
> +		 *
> +		 * No need to notify as we are downgrading page table to read
> +		 * only not changing it to point to a new page.
> +		 *
> +		 * See Documentation/vm/mmu_notifier.txt
>  		 */
> -		entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> +		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
>  		/*
>  		 * Check that no O_DIRECT or similar I/O is in progress on the
>  		 * page
> @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	}
>  
>  	flush_cache_page(vma, addr, pte_pfn(*ptep));
> -	ptep_clear_flush_notify(vma, addr, ptep);
> +	/*
> +	 * No need to notify as we are replacing a read only page with another
> +	 * read only page with the same content.
> +	 *
> +	 * See Documentation/vm/mmu_notifier.txt
> +	 */
> +	ptep_clear_flush(vma, addr, ptep);
>  	set_pte_at_notify(mm, addr, ptep, newpte);
>  
>  	page_remove_rmap(page, false);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 061826278520..6b5a0f219ac0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  #endif
>  		}
>  
> -		if (ret) {
> -			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> +		/*
> +		 * No need to call mmu_notifier_invalidate_range() as we are
> +		 * downgrading page table protection not changing it to point
> +		 * to a new page.
> +		 *
> +		 * See Documentation/vm/mmu_notifier.txt
> +		 */
> +		if (ret)
>  			(*cleaned)++;
> -		}
>  	}
>  
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> +			/*
> +			 * No need to invalidate here, it will synchronize
> +			 * against the special swap migration pte.
> +			 */
>  			goto discard;
>  		}
>  
> @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			 * will take care of the rest.
>  			 */
>  			dec_mm_counter(mm, mm_counter(page));
> +			/* We have to invalidate as we cleared the pte */
> +			mmu_notifier_invalidate_range(mm, address,
> +						      address + PAGE_SIZE);
>  		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
>  				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
>  			swp_entry_t entry;
> @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> +			/*
> +			 * No need to invalidate here, it will synchronize
> +			 * against the special swap migration pte.
> +			 */
>  		} else if (PageAnon(page)) {
>  			swp_entry_t entry = { .val = page_private(subpage) };
>  			pte_t swp_pte;
> @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				WARN_ON_ONCE(1);
>  				ret = false;
>  				/* We have to invalidate as we cleared the pte */
> +				mmu_notifier_invalidate_range(mm, address,
> +							address + PAGE_SIZE);
>  				page_vma_mapped_walk_done(&pvmw);
>  				break;
>  			}
> @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			/* MADV_FREE page check */
>  			if (!PageSwapBacked(page)) {
>  				if (!PageDirty(page)) {
> +					/* Invalidate as we cleared the pte */
> +					mmu_notifier_invalidate_range(mm,
> +						address, address + PAGE_SIZE);
>  					dec_mm_counter(mm, MM_ANONPAGES);
>  					goto discard;
>  				}
> @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> -		} else
> +			/* Invalidate as we cleared the pte */
> +			mmu_notifier_invalidate_range(mm, address,
> +						      address + PAGE_SIZE);
> +		} else {
> +			/*
> +			 * We should not need to notify here as we reach this
> +			 * case only from freeze_page(), itself only called
> +			 * from split_huge_page_to_list(), so everything below
> +			 * must be true:
> +			 *   - page is not anonymous
> +			 *   - page is locked
> +			 *
> +			 * As it is a locked file-backed page it can not be
> +			 * removed from the page cache and replaced by a new
> +			 * page before mmu_notifier_invalidate_range_end(), so
> +			 * no concurrent thread can update its page table to
> +			 * point at a new page while a device is still using
> +			 * this page.
> +			 *
> +			 * See Documentation/vm/mmu_notifier.txt
> +			 */
>  			dec_mm_counter(mm, mm_counter_file(page));
> +		}
>  discard:
> +		/*
> +		 * No need to call mmu_notifier_invalidate_range(); it has been
> +		 * done above for all cases requiring it to happen under the
> +		 * page table lock, before mmu_notifier_invalidate_range_end().
> +		 *
> +		 * See Documentation/vm/mmu_notifier.txt
> +		 */
>  		page_remove_rmap(subpage, PageHuge(page));
>  		put_page(page);
> -		mmu_notifier_invalidate_range(mm, address,
> -					      address + PAGE_SIZE);
>  	}
>  
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);

Looking at the patchset, I understand the efficiency, but I am concerned
with correctness.

Balbir Singh.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-19  3:04     ` Balbir Singh
  0 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2017-10-19  3:04 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu, linuxppc-dev, linux-next

On Mon, 16 Oct 2017 23:10:02 -0400
jglisse@redhat.com wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> +		/*
> +		 * No need to call mmu_notifier_invalidate_range() as we are
> +		 * downgrading page table protection not changing it to point
> +		 * to a new page.
> +		 *
> +		 * See Documentation/vm/mmu_notifier.txt
> +		 */
>  		if (pmdp) {
>  #ifdef CONFIG_FS_DAX_PMD
>  			pmd_t pmd;
> @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>  			pmd = pmd_wrprotect(pmd);
>  			pmd = pmd_mkclean(pmd);
>  			set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> -			mmu_notifier_invalidate_range(vma->vm_mm, start, end);

Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?

>  unlock_pmd:
>  			spin_unlock(ptl);
>  #endif
> @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>  			pte = pte_wrprotect(pte);
>  			pte = pte_mkclean(pte);
>  			set_pte_at(vma->vm_mm, address, ptep, pte);
> -			mmu_notifier_invalidate_range(vma->vm_mm, start, end);

Ditto

>  unlock_pte:
>  			pte_unmap_unlock(ptep, ptl);
>  		}
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 6866e8126982..49c925c96b8a 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
>  	 * shared page-tables, it not necessary to implement the
>  	 * invalidate_range_start()/end() notifiers, as
>  	 * invalidate_range() alread catches the points in time when an
> -	 * external TLB range needs to be flushed.
> +	 * external TLB range needs to be flushed. For more in depth
> +	 * discussion on this see Documentation/vm/mmu_notifier.txt
>  	 *
>  	 * The invalidate_range() function is called under the ptl
>  	 * spin-lock and not allowed to sleep.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c037d3d34950..ff5bc647b51d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
>  		goto out_free_pages;
>  	VM_BUG_ON_PAGE(!PageHead(page), page);
>  
> +	/*
> +	 * Leave pmd empty until the pte is filled. Note we must notify here as
> +	 * a concurrent CPU thread might write to the new page before the call
> +	 * to mmu_notifier_invalidate_range_end() happens, which can lead to a
> +	 * device seeing memory writes in a different order than the CPU.
> +	 *
> +	 * See Documentation/vm/mmu_notifier.txt
> +	 */
>  	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> -	/* leave pmd empty until pte is filled */
>  
>  	pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
>  	pmd_populate(vma->vm_mm, &_pmd, pgtable);
> @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>  	pmd_t _pmd;
>  	int i;
>  
> -	/* leave pmd empty until pte is filled */
> -	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> +	/*
> +	 * Leave pmd empty until the pte is filled. Note that it is fine to
> +	 * delay notification until mmu_notifier_invalidate_range_end() as we
> +	 * are replacing a zero pmd write-protected page with a zero pte
> +	 * write-protected page.
> +	 *
> +	 * See Documentation/vm/mmu_notifier.txt
> +	 */
> +	pmdp_huge_clear_flush(vma, haddr, pmd);

Shouldn't the secondary TLB know if the page size changed?

>  
>  	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>  	pmd_populate(mm, &_pmd, pgtable);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1768efa4c501..63a63f1b536c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
>  		} else {
>  			if (cow) {
> +				/*
> +				 * No need to notify as we are downgrading page
> +				 * table protection not changing it to point
> +				 * to a new page.
> +				 *
> +				 * See Documentation/vm/mmu_notifier.txt
> +				 */
>  				huge_ptep_set_wrprotect(src, addr, src_pte);

OK.. so we could get write faults on write accesses from the device.

> -				mmu_notifier_invalidate_range(src, mmun_start,
> -								   mmun_end);
>  			}
>  			entry = huge_ptep_get(src_pte);
>  			ptepage = pte_page(entry);
> @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	 * and that page table be reused and filled with junk.
>  	 */
>  	flush_hugetlb_tlb_range(vma, start, end);
> -	mmu_notifier_invalidate_range(mm, start, end);
> +	/*
> +	 * No need to call mmu_notifier_invalidate_range() as we are downgrading
> +	 * page table protection, not changing it to point to a new page.
> +	 *
> +	 * See Documentation/vm/mmu_notifier.txt
> +	 */
>  	i_mmap_unlock_write(vma->vm_file->f_mapping);
>  	mmu_notifier_invalidate_range_end(mm, start, end);
>  
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 6cb60f46cce5..be8f4576f842 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  		 * So we clear the pte and flush the tlb before the check
>  		 * this assure us that no O_DIRECT can happen after the check
>  		 * or in the middle of the check.
> +		 *
> +		 * No need to notify as we are downgrading page table to read
> +		 * only not changing it to point to a new page.
> +		 *
> +		 * See Documentation/vm/mmu_notifier.txt
>  		 */
> -		entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> +		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
>  		/*
>  		 * Check that no O_DIRECT or similar I/O is in progress on the
>  		 * page
> @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	}
>  
>  	flush_cache_page(vma, addr, pte_pfn(*ptep));
> -	ptep_clear_flush_notify(vma, addr, ptep);
> +	/*
> +	 * No need to notify as we are replacing a read only page with another
> +	 * read only page with the same content.
> +	 *
> +	 * See Documentation/vm/mmu_notifier.txt
> +	 */
> +	ptep_clear_flush(vma, addr, ptep);
>  	set_pte_at_notify(mm, addr, ptep, newpte);
>  
>  	page_remove_rmap(page, false);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 061826278520..6b5a0f219ac0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  #endif
>  		}
>  
> -		if (ret) {
> -			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> +		/*
> +		 * No need to call mmu_notifier_invalidate_range() as we are
> +		 * downgrading page table protection not changing it to point
> +		 * to a new page.
> +		 *
> +		 * See Documentation/vm/mmu_notifier.txt
> +		 */
> +		if (ret)
>  			(*cleaned)++;
> -		}
>  	}
>  
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> +			/*
> +			 * No need to invalidate here, it will synchronize
> +			 * against the special swap migration pte.
> +			 */
>  			goto discard;
>  		}
>  
> @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			 * will take care of the rest.
>  			 */
>  			dec_mm_counter(mm, mm_counter(page));
> +			/* We have to invalidate as we cleared the pte */
> +			mmu_notifier_invalidate_range(mm, address,
> +						      address + PAGE_SIZE);
>  		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
>  				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
>  			swp_entry_t entry;
> @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> +			/*
> +			 * No need to invalidate here, it will synchronize
> +			 * against the special swap migration pte.
> +			 */
>  		} else if (PageAnon(page)) {
>  			swp_entry_t entry = { .val = page_private(subpage) };
>  			pte_t swp_pte;
> @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				WARN_ON_ONCE(1);
>  				ret = false;
>  				/* We have to invalidate as we cleared the pte */
> +				mmu_notifier_invalidate_range(mm, address,
> +							address + PAGE_SIZE);
>  				page_vma_mapped_walk_done(&pvmw);
>  				break;
>  			}
> @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			/* MADV_FREE page check */
>  			if (!PageSwapBacked(page)) {
>  				if (!PageDirty(page)) {
> +					/* Invalidate as we cleared the pte */
> +					mmu_notifier_invalidate_range(mm,
> +						address, address + PAGE_SIZE);
>  					dec_mm_counter(mm, MM_ANONPAGES);
>  					goto discard;
>  				}
> @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> -		} else
> +			/* Invalidate as we cleared the pte */
> +			mmu_notifier_invalidate_range(mm, address,
> +						      address + PAGE_SIZE);
> +		} else {
> +			/*
> +			 * We should not need to notify here as we reach this
> +			 * case only from freeze_page(), itself only called
> +			 * from split_huge_page_to_list(), so everything below
> +			 * must be true:
> +			 *   - page is not anonymous
> +			 *   - page is locked
> +			 *
> +			 * As it is a locked file-backed page it can not be
> +			 * removed from the page cache and replaced by a new
> +			 * page before mmu_notifier_invalidate_range_end(), so
> +			 * no concurrent thread can update its page table to
> +			 * point at a new page while a device is still using
> +			 * this page.
> +			 *
> +			 * See Documentation/vm/mmu_notifier.txt
> +			 */
>  			dec_mm_counter(mm, mm_counter_file(page));
> +		}
>  discard:
> +		/*
> +		 * No need to call mmu_notifier_invalidate_range(); it has been
> +		 * done above for all cases requiring it to happen under the
> +		 * page table lock, before mmu_notifier_invalidate_range_end().
> +		 *
> +		 * See Documentation/vm/mmu_notifier.txt
> +		 */
>  		page_remove_rmap(subpage, PageHuge(page));
>  		put_page(page);
> -		mmu_notifier_invalidate_range(mm, address,
> -					      address + PAGE_SIZE);
>  	}
>  
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);

Looking at the patchset, I understand the efficiency, but I am concerned
with correctness.

Balbir Singh.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback
  2017-10-19  2:43   ` Balbir Singh
  (?)
@ 2017-10-19  3:08     ` Jerome Glisse
  -1 siblings, 0 replies; 40+ messages in thread
From: Jerome Glisse @ 2017-10-19  3:08 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Joerg Roedel, Suravee Suthikulpanit, David Woodhouse,
	Alistair Popple, Michael Ellerman, Benjamin Herrenschmidt,
	Stephen Rothwell, Andrew Donnellan, iommu, linuxppc-dev

On Thu, Oct 19, 2017 at 01:43:19PM +1100, Balbir Singh wrote:
> On Mon, 16 Oct 2017 23:10:01 -0400
> jglisse@redhat.com wrote:
> 
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > (Andrew you already have v1 in your queue of patch 1, patch 2 is new,
> >  i think you can drop it patch 1 v1 for v2, v2 is bit more conservative
> >  and i fixed typos)
> > 
> > All this only affect user of invalidate_range callback (at this time
> > CAPI arch/powerpc/platforms/powernv/npu-dma.c, IOMMU ATS/PASID in
> > drivers/iommu/amd_iommu_v2.c|intel-svm.c)
> > 
> > This patchset remove useless double call to mmu_notifier->invalidate_range
> > callback wherever it is safe to do so. The first patch just remove useless
> > call
> 
> As in an extra call? Where does that come from?

Before this patch you had the following pattern:
  mmu_notifier_invalidate_range_start();
  take_page_table_lock()
  ...
  update_page_table()
  mmu_notifier_invalidate_range()
  ...
  drop_page_table_lock()
  mmu_notifier_invalidate_range_end();

It happens that mmu_notifier_invalidate_range_end() also makes an
unconditional call to mmu_notifier_invalidate_range(), so in the
above scenario you had 2 calls to mmu_notifier_invalidate_range().

Obviously one of the 2 calls is useless. In some cases you can drop
the first call (the one under the page table lock); this is what
patch 1 does.

In other cases you can drop the second call, the one that happens
inside mmu_notifier_invalidate_range_end(); that is what patch 2
does.

Hence why I am referring to a useless double call. I have added more
documentation to explain all this in the code and also under
Documentation/vm/mmu_notifier.txt
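
To make the two cases concrete, here is a rough sketch in the same
schematic style as above. This is my illustration, not code lifted from
the patches; argument lists are elided and clear_pte_with_notify() only
stands in for helpers such as ptep_clear_flush_notify() or
pmdp_huge_clear_flush_notify():

  Patch 1 pattern - rely on the notification done by range_end():
    mmu_notifier_invalidate_range_start();
    take_page_table_lock()
    ...
    update_page_table()          /* no invalidate_range() under the lock */
    ...
    drop_page_table_lock()
    mmu_notifier_invalidate_range_end();       /* notifies exactly once */

  Patch 2 pattern - notify under the lock, skip it in range_end():
    mmu_notifier_invalidate_range_start();
    take_page_table_lock()
    ...
    clear_pte_with_notify()                    /* notifies exactly once */
    ...
    drop_page_table_lock()
    mmu_notifier_invalidate_range_only_end();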


> 
> > and add documentation explaining why it is safe to do so. The second
> > patch go further by introducing mmu_notifier_invalidate_range_only_end()
> > which skip callback to invalidate_range this can be done when clearing a
> > pte, pmd or pud with notification which call invalidate_range right after
> > clearing under the page table lock.
> >
> 
> Balbir Singh.
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
  2017-10-19  3:04     ` Balbir Singh
  (?)
@ 2017-10-19  3:28       ` Jerome Glisse
  -1 siblings, 0 replies; 40+ messages in thread
From: Jerome Glisse @ 2017-10-19  3:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu, linuxppc-dev, linux-next

On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> On Mon, 16 Oct 2017 23:10:02 -0400
> jglisse@redhat.com wrote:
> 
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > +		/*
> > +		 * No need to call mmu_notifier_invalidate_range() as we are
> > +		 * downgrading page table protection not changing it to point
> > +		 * to a new page.
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> > +		 */
> >  		if (pmdp) {
> >  #ifdef CONFIG_FS_DAX_PMD
> >  			pmd_t pmd;
> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> >  			pmd = pmd_wrprotect(pmd);
> >  			pmd = pmd_mkclean(pmd);
> >  			set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > -			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> 
> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?

I am assuming the hardware does the sane thing of setting the dirty bit
only when walking the CPU page table, i.e. when the device takes a write
fault: once the device gets a write TLB entry, the dirty bit is set by
the IOMMU while walking the page table, before returning the lookup
result to the device, and it won't be set again later (i.e. propagated
back later).

I should probably have spelled that out, and maybe some of the ATS/PASID
implementers did not do that.

> 
> >  unlock_pmd:
> >  			spin_unlock(ptl);
> >  #endif
> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> >  			pte = pte_wrprotect(pte);
> >  			pte = pte_mkclean(pte);
> >  			set_pte_at(vma->vm_mm, address, ptep, pte);
> > -			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> 
> Ditto
> 
> >  unlock_pte:
> >  			pte_unmap_unlock(ptep, ptl);
> >  		}
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 6866e8126982..49c925c96b8a 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> >  	 * shared page-tables, it not necessary to implement the
> >  	 * invalidate_range_start()/end() notifiers, as
> >  	 * invalidate_range() alread catches the points in time when an
> > -	 * external TLB range needs to be flushed.
> > +	 * external TLB range needs to be flushed. For more in depth
> > +	 * discussion on this see Documentation/vm/mmu_notifier.txt
> >  	 *
> >  	 * The invalidate_range() function is called under the ptl
> >  	 * spin-lock and not allowed to sleep.
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c037d3d34950..ff5bc647b51d 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> >  		goto out_free_pages;
> >  	VM_BUG_ON_PAGE(!PageHead(page), page);
> >  
> > +	/*
> > +	 * Leave pmd empty until pte is filled note we must notify here as
> > +	 * concurrent CPU thread might write to new page before the call to
> > +	 * mmu_notifier_invalidate_range_end() happens which can lead to a
> > +	 * device seeing memory write in different order than CPU.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> >  	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > -	/* leave pmd empty until pte is filled */
> >  
> >  	pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> >  	pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >  	pmd_t _pmd;
> >  	int i;
> >  
> > -	/* leave pmd empty until pte is filled */
> > -	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > +	/*
> > +	 * Leave pmd empty until pte is filled note that it is fine to delay
> > +	 * notification until mmu_notifier_invalidate_range_end() as we are
> > +	 * replacing a zero pmd write protected page with a zero pte write
> > +	 * protected page.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> > +	pmdp_huge_clear_flush(vma, haddr, pmd);
> 
> Shouldn't the secondary TLB know if the page size changed?

It should not matter: we are translating virtual to physical on behalf
of a device against a process address space, so the hardware should not
care about the page size.

Moreover, if any of the new 512 zero 4K pages (assuming a 2MB huge page
and 4K pages) is replaced by something new, then a device TLB shootdown
will happen before the new page is set.

The only issue I can think of is if the IOMMU TLB (if there is one) or
the device TLB (you do expect that there is one) does not invalidate a
TLB entry when the TLB shootdown is smaller than the TLB entry. That
would be idiotic, but yes, I know, hardware bugs happen.
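
To restate the above as a sketch: this is my own simplified illustration
of the __split_huge_zero_page_pmd() path, not the actual code; the
start()/end() pair really lives in the caller, locking is elided, and
zero_pte stands for a pte entry pointing at the zero page:

  mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
  /* under the page table lock: */
  pmdp_huge_clear_flush(vma, haddr, pmd); /* CPU TLB flushed, no notify yet */
  for (i = 0; i < HPAGE_PMD_NR; i++)
          set_pte_at(mm, haddr + i * PAGE_SIZE, pte + i, zero_pte);
  /* page table lock dropped */
  mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
  /* device TLB shootdown happens here, before any of the new ptes can be
   * changed to point at anything other than the zero page */

So even if the device TLB keeps a stale 2MB entry a bit longer, whatever
it reaches through it is still read-only zero page content until the
range_end() shootdown.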


> 
> >  
> >  	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> >  	pmd_populate(mm, &_pmd, pgtable);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 1768efa4c501..63a63f1b536c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> >  		} else {
> >  			if (cow) {
> > +				/*
> > +				 * No need to notify as we are downgrading page
> > +				 * table protection not changing it to point
> > +				 * to a new page.
> > +				 *
> > +				 * See Documentation/vm/mmu_notifier.txt
> > +				 */
> >  				huge_ptep_set_wrprotect(src, addr, src_pte);
> 
> OK.. so we could get write faults on write accesses from the device.
> 
> > -				mmu_notifier_invalidate_range(src, mmun_start,
> > -								   mmun_end);
> >  			}
> >  			entry = huge_ptep_get(src_pte);
> >  			ptepage = pte_page(entry);
> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  	 * and that page table be reused and filled with junk.
> >  	 */
> >  	flush_hugetlb_tlb_range(vma, start, end);
> > -	mmu_notifier_invalidate_range(mm, start, end);
> > +	/*
> > +	 * No need to call mmu_notifier_invalidate_range() we are downgrading
> > +	 * page table protection not changing it to point to a new page.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> >  	i_mmap_unlock_write(vma->vm_file->f_mapping);
> >  	mmu_notifier_invalidate_range_end(mm, start, end);
> >  
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 6cb60f46cce5..be8f4576f842 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >  		 * So we clear the pte and flush the tlb before the check
> >  		 * this assure us that no O_DIRECT can happen after the check
> >  		 * or in the middle of the check.
> > +		 *
> > +		 * No need to notify as we are downgrading page table to read
> > +		 * only not changing it to point to a new page.
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> >  		 */
> > -		entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> > +		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> >  		/*
> >  		 * Check that no O_DIRECT or similar I/O is in progress on the
> >  		 * page
> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  	}
> >  
> >  	flush_cache_page(vma, addr, pte_pfn(*ptep));
> > -	ptep_clear_flush_notify(vma, addr, ptep);
> > +	/*
> > +	 * No need to notify as we are replacing a read only page with another
> > +	 * read only page with the same content.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> > +	ptep_clear_flush(vma, addr, ptep);
> >  	set_pte_at_notify(mm, addr, ptep, newpte);
> >  
> >  	page_remove_rmap(page, false);
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 061826278520..6b5a0f219ac0 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> >  #endif
> >  		}
> >  
> > -		if (ret) {
> > -			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > +		/*
> > +		 * No need to call mmu_notifier_invalidate_range() as we are
> > +		 * downgrading page table protection not changing it to point
> > +		 * to a new page.
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> > +		 */
> > +		if (ret)
> >  			(*cleaned)++;
> > -		}
> >  	}
> >  
> >  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			if (pte_soft_dirty(pteval))
> >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >  			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> > +			/*
> > +			 * No need to invalidate here it will synchronize on
> > +			 * against the special swap migration pte.
> > +			 */
> >  			goto discard;
> >  		}
> >  
> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			 * will take care of the rest.
> >  			 */
> >  			dec_mm_counter(mm, mm_counter(page));
> > +			/* We have to invalidate as we cleared the pte */
> > +			mmu_notifier_invalidate_range(mm, address,
> > +						      address + PAGE_SIZE);
> >  		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
> >  				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> >  			swp_entry_t entry;
> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			if (pte_soft_dirty(pteval))
> >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> > +			/*
> > +			 * No need to invalidate here it will synchronize on
> > +			 * against the special swap migration pte.
> > +			 */
> >  		} else if (PageAnon(page)) {
> >  			swp_entry_t entry = { .val = page_private(subpage) };
> >  			pte_t swp_pte;
> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  				WARN_ON_ONCE(1);
> >  				ret = false;
> >  				/* We have to invalidate as we cleared the pte */
> > +				mmu_notifier_invalidate_range(mm, address,
> > +							address + PAGE_SIZE);
> >  				page_vma_mapped_walk_done(&pvmw);
> >  				break;
> >  			}
> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			/* MADV_FREE page check */
> >  			if (!PageSwapBacked(page)) {
> >  				if (!PageDirty(page)) {
> > +					/* Invalidate as we cleared the pte */
> > +					mmu_notifier_invalidate_range(mm,
> > +						address, address + PAGE_SIZE);
> >  					dec_mm_counter(mm, MM_ANONPAGES);
> >  					goto discard;
> >  				}
> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			if (pte_soft_dirty(pteval))
> >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> > -		} else
> > +			/* Invalidate as we cleared the pte */
> > +			mmu_notifier_invalidate_range(mm, address,
> > +						      address + PAGE_SIZE);
> > +		} else {
> > +			/*
> > +			 * We should not need to notify here as we reach this
> > +			 * case only from freeze_page() itself only call from
> > +			 * split_huge_page_to_list() so everything below must
> > +			 * be true:
> > +			 *   - page is not anonymous
> > +			 *   - page is locked
> > +			 *
> > +			 * So as it is a locked file back page thus it can not
> > +			 * be remove from the page cache and replace by a new
> > +			 * page before mmu_notifier_invalidate_range_end so no
> > +			 * concurrent thread might update its page table to
> > +			 * point at new page while a device still is using this
> > +			 * page.
> > +			 *
> > +			 * See Documentation/vm/mmu_notifier.txt
> > +			 */
> >  			dec_mm_counter(mm, mm_counter_file(page));
> > +		}
> >  discard:
> > +		/*
> > +		 * No need to call mmu_notifier_invalidate_range() it has be
> > +		 * done above for all cases requiring it to happen under page
> > +		 * table lock before mmu_notifier_invalidate_range_end()
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> > +		 */
> >  		page_remove_rmap(subpage, PageHuge(page));
> >  		put_page(page);
> > -		mmu_notifier_invalidate_range(mm, address,
> > -					      address + PAGE_SIZE);
> >  	}
> >  
> >  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> 
> Looking at the patchset, I understand the efficiency, but I am concerned
> with correctness.

I am fine with holding this off from reaching Linus, but the only way to
flush these issues out, if there are any, is to have this patch in
linux-next or somewhere where it gets a chance of being tested.

Note that the second patch is always safe. I agree that this one might
not be if a hardware implementation is idiotic (well, that would be my
opinion, and any opinion/point of view can be challenged :))

> 
> Balbir Singh.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-19  3:28       ` Jerome Glisse
  0 siblings, 0 replies; 40+ messages in thread
From: Jerome Glisse @ 2017-10-19  3:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu, linuxppc-dev, linux-next

On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> On Mon, 16 Oct 2017 23:10:02 -0400
> jglisse@redhat.com wrote:
> 
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > +		/*
> > +		 * No need to call mmu_notifier_invalidate_range() as we are
> > +		 * downgrading page table protection not changing it to point
> > +		 * to a new page.
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> > +		 */
> >  		if (pmdp) {
> >  #ifdef CONFIG_FS_DAX_PMD
> >  			pmd_t pmd;
> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> >  			pmd = pmd_wrprotect(pmd);
> >  			pmd = pmd_mkclean(pmd);
> >  			set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > -			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> 
> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?

I am assuming the hardware does the sane thing of setting the dirty bit
only when walking the CPU page table, i.e. when the device takes a write
fault: once the device gets a write TLB entry, the dirty bit is set by
the IOMMU while walking the page table, before returning the lookup
result to the device, and it won't be set again later (i.e. propagated
back later).

I should probably have spelled that out, and maybe some of the ATS/PASID
implementers did not do that.

> 
> >  unlock_pmd:
> >  			spin_unlock(ptl);
> >  #endif
> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> >  			pte = pte_wrprotect(pte);
> >  			pte = pte_mkclean(pte);
> >  			set_pte_at(vma->vm_mm, address, ptep, pte);
> > -			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> 
> Ditto
> 
> >  unlock_pte:
> >  			pte_unmap_unlock(ptep, ptl);
> >  		}
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 6866e8126982..49c925c96b8a 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> >  	 * shared page-tables, it not necessary to implement the
> >  	 * invalidate_range_start()/end() notifiers, as
> >  	 * invalidate_range() alread catches the points in time when an
> > -	 * external TLB range needs to be flushed.
> > +	 * external TLB range needs to be flushed. For more in depth
> > +	 * discussion on this see Documentation/vm/mmu_notifier.txt
> >  	 *
> >  	 * The invalidate_range() function is called under the ptl
> >  	 * spin-lock and not allowed to sleep.
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c037d3d34950..ff5bc647b51d 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> >  		goto out_free_pages;
> >  	VM_BUG_ON_PAGE(!PageHead(page), page);
> >  
> > +	/*
> > +	 * Leave pmd empty until pte is filled note we must notify here as
> > +	 * concurrent CPU thread might write to new page before the call to
> > +	 * mmu_notifier_invalidate_range_end() happens which can lead to a
> > +	 * device seeing memory write in different order than CPU.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> >  	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > -	/* leave pmd empty until pte is filled */
> >  
> >  	pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> >  	pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >  	pmd_t _pmd;
> >  	int i;
> >  
> > -	/* leave pmd empty until pte is filled */
> > -	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > +	/*
> > +	 * Leave pmd empty until pte is filled note that it is fine to delay
> > +	 * notification until mmu_notifier_invalidate_range_end() as we are
> > +	 * replacing a zero pmd write protected page with a zero pte write
> > +	 * protected page.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> > +	pmdp_huge_clear_flush(vma, haddr, pmd);
> 
> Shouldn't the secondary TLB know if the page size changed?

It should not matter: we are translating virtual to physical on behalf
of a device against a process address space, so the hardware should not
care about the page size.

Moreover, if any of the new 512 zero 4K pages (assuming a 2MB huge page
and 4K pages) is replaced by something new, then a device TLB shootdown
will happen before the new page is set.

The only issue I can think of is if the IOMMU TLB (if there is one) or
the device TLB (you do expect that there is one) does not invalidate a
TLB entry when the TLB shootdown is smaller than the TLB entry. That
would be idiotic, but yes, I know, hardware bugs happen.


> 
> >  
> >  	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> >  	pmd_populate(mm, &_pmd, pgtable);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 1768efa4c501..63a63f1b536c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> >  		} else {
> >  			if (cow) {
> > +				/*
> > +				 * No need to notify as we are downgrading page
> > +				 * table protection not changing it to point
> > +				 * to a new page.
> > +				 *
> > +				 * See Documentation/vm/mmu_notifier.txt
> > +				 */
> >  				huge_ptep_set_wrprotect(src, addr, src_pte);
> 
> OK.. so we could get write faults on write accesses from the device.
> 
> > -				mmu_notifier_invalidate_range(src, mmun_start,
> > -								   mmun_end);
> >  			}
> >  			entry = huge_ptep_get(src_pte);
> >  			ptepage = pte_page(entry);
> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  	 * and that page table be reused and filled with junk.
> >  	 */
> >  	flush_hugetlb_tlb_range(vma, start, end);
> > -	mmu_notifier_invalidate_range(mm, start, end);
> > +	/*
> > +	 * No need to call mmu_notifier_invalidate_range() we are downgrading
> > +	 * page table protection not changing it to point to a new page.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> >  	i_mmap_unlock_write(vma->vm_file->f_mapping);
> >  	mmu_notifier_invalidate_range_end(mm, start, end);
> >  
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 6cb60f46cce5..be8f4576f842 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >  		 * So we clear the pte and flush the tlb before the check
> >  		 * this assure us that no O_DIRECT can happen after the check
> >  		 * or in the middle of the check.
> > +		 *
> > +		 * No need to notify as we are downgrading page table to read
> > +		 * only not changing it to point to a new page.
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> >  		 */
> > -		entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> > +		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> >  		/*
> >  		 * Check that no O_DIRECT or similar I/O is in progress on the
> >  		 * page
> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  	}
> >  
> >  	flush_cache_page(vma, addr, pte_pfn(*ptep));
> > -	ptep_clear_flush_notify(vma, addr, ptep);
> > +	/*
> > +	 * No need to notify as we are replacing a read only page with another
> > +	 * read only page with the same content.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> > +	ptep_clear_flush(vma, addr, ptep);
> >  	set_pte_at_notify(mm, addr, ptep, newpte);
> >  
> >  	page_remove_rmap(page, false);
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 061826278520..6b5a0f219ac0 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> >  #endif
> >  		}
> >  
> > -		if (ret) {
> > -			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > +		/*
> > +		 * No need to call mmu_notifier_invalidate_range() as we are
> > +		 * downgrading page table protection not changing it to point
> > +		 * to a new page.
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> > +		 */
> > +		if (ret)
> >  			(*cleaned)++;
> > -		}
> >  	}
> >  
> >  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			if (pte_soft_dirty(pteval))
> >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >  			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> > +			/*
> > +			 * No need to invalidate here it will synchronize on
> > +			 * against the special swap migration pte.
> > +			 */
> >  			goto discard;
> >  		}
> >  
> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			 * will take care of the rest.
> >  			 */
> >  			dec_mm_counter(mm, mm_counter(page));
> > +			/* We have to invalidate as we cleared the pte */
> > +			mmu_notifier_invalidate_range(mm, address,
> > +						      address + PAGE_SIZE);
> >  		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
> >  				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> >  			swp_entry_t entry;
> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			if (pte_soft_dirty(pteval))
> >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> > +			/*
> > +			 * No need to invalidate here it will synchronize on
> > +			 * against the special swap migration pte.
> > +			 */
> >  		} else if (PageAnon(page)) {
> >  			swp_entry_t entry = { .val = page_private(subpage) };
> >  			pte_t swp_pte;
> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  				WARN_ON_ONCE(1);
> >  				ret = false;
> >  				/* We have to invalidate as we cleared the pte */
> > +				mmu_notifier_invalidate_range(mm, address,
> > +							address + PAGE_SIZE);
> >  				page_vma_mapped_walk_done(&pvmw);
> >  				break;
> >  			}
> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			/* MADV_FREE page check */
> >  			if (!PageSwapBacked(page)) {
> >  				if (!PageDirty(page)) {
> > +					/* Invalidate as we cleared the pte */
> > +					mmu_notifier_invalidate_range(mm,
> > +						address, address + PAGE_SIZE);
> >  					dec_mm_counter(mm, MM_ANONPAGES);
> >  					goto discard;
> >  				}
> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			if (pte_soft_dirty(pteval))
> >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> > -		} else
> > +			/* Invalidate as we cleared the pte */
> > +			mmu_notifier_invalidate_range(mm, address,
> > +						      address + PAGE_SIZE);
> > +		} else {
> > +			/*
> > +			 * We should not need to notify here as we reach this
> > +			 * case only from freeze_page() itself only call from
> > +			 * split_huge_page_to_list() so everything below must
> > +			 * be true:
> > +			 *   - page is not anonymous
> > +			 *   - page is locked
> > +			 *
> > +			 * So as it is a locked file back page thus it can not
> > +			 * be remove from the page cache and replace by a new
> > +			 * page before mmu_notifier_invalidate_range_end so no
> > +			 * concurrent thread might update its page table to
> > +			 * point at new page while a device still is using this
> > +			 * page.
> > +			 *
> > +			 * See Documentation/vm/mmu_notifier.txt
> > +			 */
> >  			dec_mm_counter(mm, mm_counter_file(page));
> > +		}
> >  discard:
> > +		/*
> > +		 * No need to call mmu_notifier_invalidate_range() it has be
> > +		 * done above for all cases requiring it to happen under page
> > +		 * table lock before mmu_notifier_invalidate_range_end()
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> > +		 */
> >  		page_remove_rmap(subpage, PageHuge(page));
> >  		put_page(page);
> > -		mmu_notifier_invalidate_range(mm, address,
> > -					      address + PAGE_SIZE);
> >  	}
> >  
> >  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> 
> Looking at the patchset, I understand the efficiency, but I am concerned
> with correctness.

I am fine with holding this off from reaching Linus, but the only way to
flush these issues out, if there are any, is to have this patch in
linux-next or somewhere where it gets a chance of being tested.

Note that the second patch is always safe. I agree that this one might
not be if the hardware implementation is idiotic (well, that would be my
opinion, and any opinion/point of view can be challenged :))
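
For anyone skimming, the calling pattern the second patch enables looks
roughly like this (a sketch for illustration only, not the exact kernel
code; example_clear_huge_pmd() is a made-up helper):

#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/mmu_notifier.h>

static void example_clear_huge_pmd(struct vm_area_struct *vma,
				   pmd_t *pmd, unsigned long haddr)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long start = haddr, end = haddr + HPAGE_PMD_SIZE;
	spinlock_t *ptl;

	mmu_notifier_invalidate_range_start(mm, start, end);
	ptl = pmd_lock(mm, pmd);

	/* Clears the pmd and calls ->invalidate_range() under ptl. */
	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
	/* ... repopulate the page table here ... */

	spin_unlock(ptl);

	/*
	 * Same as mmu_notifier_invalidate_range_end() except it skips the
	 * ->invalidate_range() call, which would be redundant here: the
	 * device TLB shootdown already happened under ptl via the _notify
	 * helper above.
	 */
	mmu_notifier_invalidate_range_only_end(mm, start, end);
}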

> 
> Balbir Singh.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
  2017-10-19  3:28       ` Jerome Glisse
  (?)
@ 2017-10-19 10:53         ` Balbir Singh
  -1 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2017-10-19 10:53 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-next

On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
>> On Mon, 16 Oct 2017 23:10:02 -0400
>> jglisse@redhat.com wrote:
>>
>> > From: Jérôme Glisse <jglisse@redhat.com>
>> >
>> > +           /*
>> > +            * No need to call mmu_notifier_invalidate_range() as we are
>> > +            * downgrading page table protection not changing it to point
>> > +            * to a new page.
>> > +            *
>> > +            * See Documentation/vm/mmu_notifier.txt
>> > +            */
>> >             if (pmdp) {
>> >  #ifdef CONFIG_FS_DAX_PMD
>> >                     pmd_t pmd;
>> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>> >                     pmd = pmd_wrprotect(pmd);
>> >                     pmd = pmd_mkclean(pmd);
>> >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
>> > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
>>
>> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
>
> I am assuming hardware does sane thing of setting the dirty bit only
> when walking the CPU page table when device does a write fault ie
> once the device get a write TLB entry the dirty is set by the IOMMU
> when walking the page table before returning the lookup result to the
> device and that it won't be set again latter (ie propagated back
> latter).
>

The other possibility is that the hardware thinks the page is writable
and already marked dirty. It allows writes and does not set the dirty bit?

> I should probably have spell that out and maybe some of the ATS/PASID
> implementer did not do that.
>
>>
>> >  unlock_pmd:
>> >                     spin_unlock(ptl);
>> >  #endif
>> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>> >                     pte = pte_wrprotect(pte);
>> >                     pte = pte_mkclean(pte);
>> >                     set_pte_at(vma->vm_mm, address, ptep, pte);
>> > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
>>
>> Ditto
>>
>> >  unlock_pte:
>> >                     pte_unmap_unlock(ptep, ptl);
>> >             }
>> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> > index 6866e8126982..49c925c96b8a 100644
>> > --- a/include/linux/mmu_notifier.h
>> > +++ b/include/linux/mmu_notifier.h
>> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
>> >      * shared page-tables, it not necessary to implement the
>> >      * invalidate_range_start()/end() notifiers, as
>> >      * invalidate_range() alread catches the points in time when an
>> > -    * external TLB range needs to be flushed.
>> > +    * external TLB range needs to be flushed. For more in depth
>> > +    * discussion on this see Documentation/vm/mmu_notifier.txt
>> >      *
>> >      * The invalidate_range() function is called under the ptl
>> >      * spin-lock and not allowed to sleep.
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> > index c037d3d34950..ff5bc647b51d 100644
>> > --- a/mm/huge_memory.c
>> > +++ b/mm/huge_memory.c
>> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
>> >             goto out_free_pages;
>> >     VM_BUG_ON_PAGE(!PageHead(page), page);
>> >
>> > +   /*
>> > +    * Leave pmd empty until pte is filled note we must notify here as
>> > +    * concurrent CPU thread might write to new page before the call to
>> > +    * mmu_notifier_invalidate_range_end() happens which can lead to a
>> > +    * device seeing memory write in different order than CPU.
>> > +    *
>> > +    * See Documentation/vm/mmu_notifier.txt
>> > +    */
>> >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
>> > -   /* leave pmd empty until pte is filled */
>> >
>> >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
>> >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
>> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>> >     pmd_t _pmd;
>> >     int i;
>> >
>> > -   /* leave pmd empty until pte is filled */
>> > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>> > +   /*
>> > +    * Leave pmd empty until pte is filled note that it is fine to delay
>> > +    * notification until mmu_notifier_invalidate_range_end() as we are
>> > +    * replacing a zero pmd write protected page with a zero pte write
>> > +    * protected page.
>> > +    *
>> > +    * See Documentation/vm/mmu_notifier.txt
>> > +    */
>> > +   pmdp_huge_clear_flush(vma, haddr, pmd);
>>
>> Shouldn't the secondary TLB know if the page size changed?
>
> It should not matter, we are talking virtual to physical on behalf
> of a device against a process address space. So the hardware should
> not care about the page size.
>

Does that not indicate how much the device can access? Could it try
to access more than what is mapped?

> Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> 4K pages is replace by something new then a device TLB shootdown will
> happen before the new page is set.
>
> Only issue i can think of is if the IOMMU TLB (if there is one) or
> the device TLB (you do expect that there is one) does not invalidate
> TLB entry if the TLB shootdown is smaller than the TLB entry. That
> would be idiotic but yes i know hardware bug.
>
>
>>
>> >
>> >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> >     pmd_populate(mm, &_pmd, pgtable);
>> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> > index 1768efa4c501..63a63f1b536c 100644
>> > --- a/mm/hugetlb.c
>> > +++ b/mm/hugetlb.c
>> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>> >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
>> >             } else {
>> >                     if (cow) {
>> > +                           /*
>> > +                            * No need to notify as we are downgrading page
>> > +                            * table protection not changing it to point
>> > +                            * to a new page.
>> > +                            *
>> > +                            * See Documentation/vm/mmu_notifier.txt
>> > +                            */
>> >                             huge_ptep_set_wrprotect(src, addr, src_pte);
>>
>> OK.. so we could get write faults on write accesses from the device.
>>
>> > -                           mmu_notifier_invalidate_range(src, mmun_start,
>> > -                                                              mmun_end);
>> >                     }
>> >                     entry = huge_ptep_get(src_pte);
>> >                     ptepage = pte_page(entry);
>> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>> >      * and that page table be reused and filled with junk.
>> >      */
>> >     flush_hugetlb_tlb_range(vma, start, end);
>> > -   mmu_notifier_invalidate_range(mm, start, end);
>> > +   /*
>> > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
>> > +    * page table protection not changing it to point to a new page.
>> > +    *
>> > +    * See Documentation/vm/mmu_notifier.txt
>> > +    */
>> >     i_mmap_unlock_write(vma->vm_file->f_mapping);
>> >     mmu_notifier_invalidate_range_end(mm, start, end);
>> >
>> > diff --git a/mm/ksm.c b/mm/ksm.c
>> > index 6cb60f46cce5..be8f4576f842 100644
>> > --- a/mm/ksm.c
>> > +++ b/mm/ksm.c
>> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>> >              * So we clear the pte and flush the tlb before the check
>> >              * this assure us that no O_DIRECT can happen after the check
>> >              * or in the middle of the check.
>> > +            *
>> > +            * No need to notify as we are downgrading page table to read
>> > +            * only not changing it to point to a new page.
>> > +            *
>> > +            * See Documentation/vm/mmu_notifier.txt
>> >              */
>> > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
>> > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
>> >             /*
>> >              * Check that no O_DIRECT or similar I/O is in progress on the
>> >              * page
>> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>> >     }
>> >
>> >     flush_cache_page(vma, addr, pte_pfn(*ptep));
>> > -   ptep_clear_flush_notify(vma, addr, ptep);
>> > +   /*
>> > +    * No need to notify as we are replacing a read only page with another
>> > +    * read only page with the same content.
>> > +    *
>> > +    * See Documentation/vm/mmu_notifier.txt
>> > +    */
>> > +   ptep_clear_flush(vma, addr, ptep);
>> >     set_pte_at_notify(mm, addr, ptep, newpte);
>> >
>> >     page_remove_rmap(page, false);
>> > diff --git a/mm/rmap.c b/mm/rmap.c
>> > index 061826278520..6b5a0f219ac0 100644
>> > --- a/mm/rmap.c
>> > +++ b/mm/rmap.c
>> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>> >  #endif
>> >             }
>> >
>> > -           if (ret) {
>> > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>> > +           /*
>> > +            * No need to call mmu_notifier_invalidate_range() as we are
>> > +            * downgrading page table protection not changing it to point
>> > +            * to a new page.
>> > +            *
>> > +            * See Documentation/vm/mmu_notifier.txt
>> > +            */
>> > +           if (ret)
>> >                     (*cleaned)++;
>> > -           }
>> >     }
>> >
>> >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                     if (pte_soft_dirty(pteval))
>> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
>> > +                   /*
>> > +                    * No need to invalidate here it will synchronize on
>> > +                    * against the special swap migration pte.
>> > +                    */
>> >                     goto discard;
>> >             }
>> >
>> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                      * will take care of the rest.
>> >                      */
>> >                     dec_mm_counter(mm, mm_counter(page));
>> > +                   /* We have to invalidate as we cleared the pte */
>> > +                   mmu_notifier_invalidate_range(mm, address,
>> > +                                                 address + PAGE_SIZE);
>> >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
>> >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
>> >                     swp_entry_t entry;
>> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                     if (pte_soft_dirty(pteval))
>> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
>> > +                   /*
>> > +                    * No need to invalidate here it will synchronize on
>> > +                    * against the special swap migration pte.
>> > +                    */
>> >             } else if (PageAnon(page)) {
>> >                     swp_entry_t entry = { .val = page_private(subpage) };
>> >                     pte_t swp_pte;
>> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                             WARN_ON_ONCE(1);
>> >                             ret = false;
>> >                             /* We have to invalidate as we cleared the pte */
>> > +                           mmu_notifier_invalidate_range(mm, address,
>> > +                                                   address + PAGE_SIZE);
>> >                             page_vma_mapped_walk_done(&pvmw);
>> >                             break;
>> >                     }
>> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                     /* MADV_FREE page check */
>> >                     if (!PageSwapBacked(page)) {
>> >                             if (!PageDirty(page)) {
>> > +                                   /* Invalidate as we cleared the pte */
>> > +                                   mmu_notifier_invalidate_range(mm,
>> > +                                           address, address + PAGE_SIZE);
>> >                                     dec_mm_counter(mm, MM_ANONPAGES);
>> >                                     goto discard;
>> >                             }
>> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                     if (pte_soft_dirty(pteval))
>> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
>> > -           } else
>> > +                   /* Invalidate as we cleared the pte */
>> > +                   mmu_notifier_invalidate_range(mm, address,
>> > +                                                 address + PAGE_SIZE);
>> > +           } else {
>> > +                   /*
>> > +                    * We should not need to notify here as we reach this
>> > +                    * case only from freeze_page() itself only call from
>> > +                    * split_huge_page_to_list() so everything below must
>> > +                    * be true:
>> > +                    *   - page is not anonymous
>> > +                    *   - page is locked
>> > +                    *
>> > +                    * So as it is a locked file back page thus it can not
>> > +                    * be remove from the page cache and replace by a new
>> > +                    * page before mmu_notifier_invalidate_range_end so no
>> > +                    * concurrent thread might update its page table to
>> > +                    * point at new page while a device still is using this
>> > +                    * page.
>> > +                    *
>> > +                    * See Documentation/vm/mmu_notifier.txt
>> > +                    */
>> >                     dec_mm_counter(mm, mm_counter_file(page));
>> > +           }
>> >  discard:
>> > +           /*
>> > +            * No need to call mmu_notifier_invalidate_range() it has be
>> > +            * done above for all cases requiring it to happen under page
>> > +            * table lock before mmu_notifier_invalidate_range_end()
>> > +            *
>> > +            * See Documentation/vm/mmu_notifier.txt
>> > +            */
>> >             page_remove_rmap(subpage, PageHuge(page));
>> >             put_page(page);
>> > -           mmu_notifier_invalidate_range(mm, address,
>> > -                                         address + PAGE_SIZE);
>> >     }
>> >
>> >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>
>> Looking at the patchset, I understand the efficiency, but I am concerned
>> with correctness.
>
> I am fine in holding this off from reaching Linus but only way to flush this
> issues out if any is to have this patch in linux-next or somewhere were they
> get a chance of being tested.
>

Yep, I would like to see some additional testing around NPU and get
Alistair Popple to comment as well.

> Note that the second patch is always safe. I agree that this one might
> not be if hardware implementation is idiotic (well that would be my
> opinion and any opinion/point of view can be challenge :))


You mean the only_end variant that avoids the shootdown after pmd/pte
changes, keeping _start but replacing _end with the only_end variant? That
seemed reasonable to me, but I've not tested it or evaluated it in depth.

Balbir Singh.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-19 10:53         ` Balbir Singh
  0 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2017-10-19 10:53 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-next

On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
>> On Mon, 16 Oct 2017 23:10:02 -0400
>> jglisse@redhat.com wrote:
>>
>> > From: Jérôme Glisse <jglisse@redhat.com>
>> >
>> > +           /*
>> > +            * No need to call mmu_notifier_invalidate_range() as we are
>> > +            * downgrading page table protection not changing it to point
>> > +            * to a new page.
>> > +            *
>> > +            * See Documentation/vm/mmu_notifier.txt
>> > +            */
>> >             if (pmdp) {
>> >  #ifdef CONFIG_FS_DAX_PMD
>> >                     pmd_t pmd;
>> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>> >                     pmd = pmd_wrprotect(pmd);
>> >                     pmd = pmd_mkclean(pmd);
>> >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
>> > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
>>
>> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
>
> I am assuming hardware does sane thing of setting the dirty bit only
> when walking the CPU page table when device does a write fault ie
> once the device get a write TLB entry the dirty is set by the IOMMU
> when walking the page table before returning the lookup result to the
> device and that it won't be set again latter (ie propagated back
> latter).
>

The other possibility is that the hardware thinks the page is writable
and already marked dirty. It allows writes and does not set the dirty bit?

> I should probably have spell that out and maybe some of the ATS/PASID
> implementer did not do that.
>
>>
>> >  unlock_pmd:
>> >                     spin_unlock(ptl);
>> >  #endif
>> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>> >                     pte = pte_wrprotect(pte);
>> >                     pte = pte_mkclean(pte);
>> >                     set_pte_at(vma->vm_mm, address, ptep, pte);
>> > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
>>
>> Ditto
>>
>> >  unlock_pte:
>> >                     pte_unmap_unlock(ptep, ptl);
>> >             }
>> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> > index 6866e8126982..49c925c96b8a 100644
>> > --- a/include/linux/mmu_notifier.h
>> > +++ b/include/linux/mmu_notifier.h
>> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
>> >      * shared page-tables, it not necessary to implement the
>> >      * invalidate_range_start()/end() notifiers, as
>> >      * invalidate_range() alread catches the points in time when an
>> > -    * external TLB range needs to be flushed.
>> > +    * external TLB range needs to be flushed. For more in depth
>> > +    * discussion on this see Documentation/vm/mmu_notifier.txt
>> >      *
>> >      * The invalidate_range() function is called under the ptl
>> >      * spin-lock and not allowed to sleep.
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> > index c037d3d34950..ff5bc647b51d 100644
>> > --- a/mm/huge_memory.c
>> > +++ b/mm/huge_memory.c
>> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
>> >             goto out_free_pages;
>> >     VM_BUG_ON_PAGE(!PageHead(page), page);
>> >
>> > +   /*
>> > +    * Leave pmd empty until pte is filled note we must notify here as
>> > +    * concurrent CPU thread might write to new page before the call to
>> > +    * mmu_notifier_invalidate_range_end() happens which can lead to a
>> > +    * device seeing memory write in different order than CPU.
>> > +    *
>> > +    * See Documentation/vm/mmu_notifier.txt
>> > +    */
>> >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
>> > -   /* leave pmd empty until pte is filled */
>> >
>> >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
>> >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
>> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>> >     pmd_t _pmd;
>> >     int i;
>> >
>> > -   /* leave pmd empty until pte is filled */
>> > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>> > +   /*
>> > +    * Leave pmd empty until pte is filled note that it is fine to delay
>> > +    * notification until mmu_notifier_invalidate_range_end() as we are
>> > +    * replacing a zero pmd write protected page with a zero pte write
>> > +    * protected page.
>> > +    *
>> > +    * See Documentation/vm/mmu_notifier.txt
>> > +    */
>> > +   pmdp_huge_clear_flush(vma, haddr, pmd);
>>
>> Shouldn't the secondary TLB know if the page size changed?
>
> It should not matter, we are talking virtual to physical on behalf
> of a device against a process address space. So the hardware should
> not care about the page size.
>

Does that not indicate how much the device can access? Could it try
to access more than what is mapped?

> Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> 4K pages is replace by something new then a device TLB shootdown will
> happen before the new page is set.
>
> Only issue i can think of is if the IOMMU TLB (if there is one) or
> the device TLB (you do expect that there is one) does not invalidate
> TLB entry if the TLB shootdown is smaller than the TLB entry. That
> would be idiotic but yes i know hardware bug.
>
>
>>
>> >
>> >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> >     pmd_populate(mm, &_pmd, pgtable);
>> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> > index 1768efa4c501..63a63f1b536c 100644
>> > --- a/mm/hugetlb.c
>> > +++ b/mm/hugetlb.c
>> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>> >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
>> >             } else {
>> >                     if (cow) {
>> > +                           /*
>> > +                            * No need to notify as we are downgrading page
>> > +                            * table protection not changing it to point
>> > +                            * to a new page.
>> > +                            *
>> > +                            * See Documentation/vm/mmu_notifier.txt
>> > +                            */
>> >                             huge_ptep_set_wrprotect(src, addr, src_pte);
>>
>> OK.. so we could get write faults on write accesses from the device.
>>
>> > -                           mmu_notifier_invalidate_range(src, mmun_start,
>> > -                                                              mmun_end);
>> >                     }
>> >                     entry = huge_ptep_get(src_pte);
>> >                     ptepage = pte_page(entry);
>> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>> >      * and that page table be reused and filled with junk.
>> >      */
>> >     flush_hugetlb_tlb_range(vma, start, end);
>> > -   mmu_notifier_invalidate_range(mm, start, end);
>> > +   /*
>> > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
>> > +    * page table protection not changing it to point to a new page.
>> > +    *
>> > +    * See Documentation/vm/mmu_notifier.txt
>> > +    */
>> >     i_mmap_unlock_write(vma->vm_file->f_mapping);
>> >     mmu_notifier_invalidate_range_end(mm, start, end);
>> >
>> > diff --git a/mm/ksm.c b/mm/ksm.c
>> > index 6cb60f46cce5..be8f4576f842 100644
>> > --- a/mm/ksm.c
>> > +++ b/mm/ksm.c
>> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>> >              * So we clear the pte and flush the tlb before the check
>> >              * this assure us that no O_DIRECT can happen after the check
>> >              * or in the middle of the check.
>> > +            *
>> > +            * No need to notify as we are downgrading page table to read
>> > +            * only not changing it to point to a new page.
>> > +            *
>> > +            * See Documentation/vm/mmu_notifier.txt
>> >              */
>> > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
>> > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
>> >             /*
>> >              * Check that no O_DIRECT or similar I/O is in progress on the
>> >              * page
>> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>> >     }
>> >
>> >     flush_cache_page(vma, addr, pte_pfn(*ptep));
>> > -   ptep_clear_flush_notify(vma, addr, ptep);
>> > +   /*
>> > +    * No need to notify as we are replacing a read only page with another
>> > +    * read only page with the same content.
>> > +    *
>> > +    * See Documentation/vm/mmu_notifier.txt
>> > +    */
>> > +   ptep_clear_flush(vma, addr, ptep);
>> >     set_pte_at_notify(mm, addr, ptep, newpte);
>> >
>> >     page_remove_rmap(page, false);
>> > diff --git a/mm/rmap.c b/mm/rmap.c
>> > index 061826278520..6b5a0f219ac0 100644
>> > --- a/mm/rmap.c
>> > +++ b/mm/rmap.c
>> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>> >  #endif
>> >             }
>> >
>> > -           if (ret) {
>> > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>> > +           /*
>> > +            * No need to call mmu_notifier_invalidate_range() as we are
>> > +            * downgrading page table protection not changing it to point
>> > +            * to a new page.
>> > +            *
>> > +            * See Documentation/vm/mmu_notifier.txt
>> > +            */
>> > +           if (ret)
>> >                     (*cleaned)++;
>> > -           }
>> >     }
>> >
>> >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                     if (pte_soft_dirty(pteval))
>> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
>> > +                   /*
>> > +                    * No need to invalidate here it will synchronize on
>> > +                    * against the special swap migration pte.
>> > +                    */
>> >                     goto discard;
>> >             }
>> >
>> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                      * will take care of the rest.
>> >                      */
>> >                     dec_mm_counter(mm, mm_counter(page));
>> > +                   /* We have to invalidate as we cleared the pte */
>> > +                   mmu_notifier_invalidate_range(mm, address,
>> > +                                                 address + PAGE_SIZE);
>> >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
>> >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
>> >                     swp_entry_t entry;
>> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                     if (pte_soft_dirty(pteval))
>> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
>> > +                   /*
>> > +                    * No need to invalidate here it will synchronize on
>> > +                    * against the special swap migration pte.
>> > +                    */
>> >             } else if (PageAnon(page)) {
>> >                     swp_entry_t entry = { .val = page_private(subpage) };
>> >                     pte_t swp_pte;
>> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                             WARN_ON_ONCE(1);
>> >                             ret = false;
>> >                             /* We have to invalidate as we cleared the pte */
>> > +                           mmu_notifier_invalidate_range(mm, address,
>> > +                                                   address + PAGE_SIZE);
>> >                             page_vma_mapped_walk_done(&pvmw);
>> >                             break;
>> >                     }
>> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                     /* MADV_FREE page check */
>> >                     if (!PageSwapBacked(page)) {
>> >                             if (!PageDirty(page)) {
>> > +                                   /* Invalidate as we cleared the pte */
>> > +                                   mmu_notifier_invalidate_range(mm,
>> > +                                           address, address + PAGE_SIZE);
>> >                                     dec_mm_counter(mm, MM_ANONPAGES);
>> >                                     goto discard;
>> >                             }
>> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >                     if (pte_soft_dirty(pteval))
>> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
>> > -           } else
>> > +                   /* Invalidate as we cleared the pte */
>> > +                   mmu_notifier_invalidate_range(mm, address,
>> > +                                                 address + PAGE_SIZE);
>> > +           } else {
>> > +                   /*
>> > +                    * We should not need to notify here as we reach this
>> > +                    * case only from freeze_page() itself only call from
>> > +                    * split_huge_page_to_list() so everything below must
>> > +                    * be true:
>> > +                    *   - page is not anonymous
>> > +                    *   - page is locked
>> > +                    *
>> > +                    * So as it is a locked file back page thus it can not
>> > +                    * be remove from the page cache and replace by a new
>> > +                    * page before mmu_notifier_invalidate_range_end so no
>> > +                    * concurrent thread might update its page table to
>> > +                    * point at new page while a device still is using this
>> > +                    * page.
>> > +                    *
>> > +                    * See Documentation/vm/mmu_notifier.txt
>> > +                    */
>> >                     dec_mm_counter(mm, mm_counter_file(page));
>> > +           }
>> >  discard:
>> > +           /*
>> > +            * No need to call mmu_notifier_invalidate_range() it has be
>> > +            * done above for all cases requiring it to happen under page
>> > +            * table lock before mmu_notifier_invalidate_range_end()
>> > +            *
>> > +            * See Documentation/vm/mmu_notifier.txt
>> > +            */
>> >             page_remove_rmap(subpage, PageHuge(page));
>> >             put_page(page);
>> > -           mmu_notifier_invalidate_range(mm, address,
>> > -                                         address + PAGE_SIZE);
>> >     }
>> >
>> >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>
>> Looking at the patchset, I understand the efficiency, but I am concerned
>> with correctness.
>
> I am fine in holding this off from reaching Linus but only way to flush this
> issues out if any is to have this patch in linux-next or somewhere were they
> get a chance of being tested.
>

Yep, I would like to see some additional testing around npu and get Alistair
Popple to comment as well

> Note that the second patch is always safe. I agree that this one might
> not be if hardware implementation is idiotic (well that would be my
> opinion and any opinion/point of view can be challenge :))


You mean the only_end variant that avoids the extra shootdown after pmd/pte
changes, using only_end in place of the usual _end call? That seemed
reasonable to me, but I've not tested it or evaluated it in depth

Balbir Singh.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
  2017-10-19 10:53         ` Balbir Singh
  (?)
  (?)
@ 2017-10-19 16:58           ` Jerome Glisse
  -1 siblings, 0 replies; 40+ messages in thread
From: Jerome Glisse @ 2017-10-19 16:58 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-next

On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> >> On Mon, 16 Oct 2017 23:10:02 -0400
> >> jglisse@redhat.com wrote:
> >>
> >> > From: Jérôme Glisse <jglisse@redhat.com>
> >> >
> >> > +           /*
> >> > +            * No need to call mmu_notifier_invalidate_range() as we are
> >> > +            * downgrading page table protection not changing it to point
> >> > +            * to a new page.
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> > +            */
> >> >             if (pmdp) {
> >> >  #ifdef CONFIG_FS_DAX_PMD
> >> >                     pmd_t pmd;
> >> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> >> >                     pmd = pmd_wrprotect(pmd);
> >> >                     pmd = pmd_mkclean(pmd);
> >> >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> >> > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> >>
> >> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> >
> > I am assuming hardware does sane thing of setting the dirty bit only
> > when walking the CPU page table when device does a write fault ie
> > once the device get a write TLB entry the dirty is set by the IOMMU
> > when walking the page table before returning the lookup result to the
> > device and that it won't be set again latter (ie propagated back
> > latter).
> >
> 
> The other possibility is that the hardware thinks the page is writable
> and already marked dirty. It allows writes and does not set the dirty bit?

I thought about this some more and the patch cannot regress anything
that is not broken today. So if we assume that the device can propagate
the dirty bit because it can cache the write protection, then all current
code is broken for two reasons:

First one is that the current code clears the pte entry, builds a new pte
value with write protection and updates the pte entry with the new value.
So any PASID/ATS platform that allows the device to cache the write bit
and set the dirty bit anytime after that can race during that window and
you would lose the dirty bit of the device. That is not that bad as you
are gonna propagate the dirty bit to the struct page.

Second one is if the dirty bit is propagated back to the new write
protected pte. A quick look at the code suggests that when we zap a pte
or mkclean we don't check that the pte has write permission but only
care about the dirty bit. So it should not have any bad consequence.

After this patch only the second window is bigger and thus more likely
to happen. But nothing sinister should happen from that.
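
To make the two windows concrete, here is a simplified sketch of the pte
side of the mkclean sequence discussed above (same helpers as in the hunk
quoted earlier, fs/dax.c locking details omitted):

	/* caller holds the pte lock (ptl) for ptep */
	pte = ptep_clear_flush(vma, address, ptep);
	/*
	 * Window 1: the pte is clear but a device that cached a writable
	 * entry could still write and try to set the dirty bit it cached.
	 */
	pte = pte_wrprotect(pte);
	pte = pte_mkclean(pte);
	set_pte_at(vma->vm_mm, address, ptep, pte);
	/*
	 * Window 2: the write protected pte is visible but the device TLB
	 * is only shot down later, at mmu_notifier_invalidate_range_end()
	 * with this patch, instead of at the explicit
	 * mmu_notifier_invalidate_range() call that the hunk removes.
	 */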


> 
> > I should probably have spell that out and maybe some of the ATS/PASID
> > implementer did not do that.
> >
> >>
> >> >  unlock_pmd:
> >> >                     spin_unlock(ptl);
> >> >  #endif
> >> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> >> >                     pte = pte_wrprotect(pte);
> >> >                     pte = pte_mkclean(pte);
> >> >                     set_pte_at(vma->vm_mm, address, ptep, pte);
> >> > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> >>
> >> Ditto
> >>
> >> >  unlock_pte:
> >> >                     pte_unmap_unlock(ptep, ptl);
> >> >             }
> >> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> >> > index 6866e8126982..49c925c96b8a 100644
> >> > --- a/include/linux/mmu_notifier.h
> >> > +++ b/include/linux/mmu_notifier.h
> >> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> >> >      * shared page-tables, it not necessary to implement the
> >> >      * invalidate_range_start()/end() notifiers, as
> >> >      * invalidate_range() alread catches the points in time when an
> >> > -    * external TLB range needs to be flushed.
> >> > +    * external TLB range needs to be flushed. For more in depth
> >> > +    * discussion on this see Documentation/vm/mmu_notifier.txt
> >> >      *
> >> >      * The invalidate_range() function is called under the ptl
> >> >      * spin-lock and not allowed to sleep.
> >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> > index c037d3d34950..ff5bc647b51d 100644
> >> > --- a/mm/huge_memory.c
> >> > +++ b/mm/huge_memory.c
> >> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> >> >             goto out_free_pages;
> >> >     VM_BUG_ON_PAGE(!PageHead(page), page);
> >> >
> >> > +   /*
> >> > +    * Leave pmd empty until pte is filled note we must notify here as
> >> > +    * concurrent CPU thread might write to new page before the call to
> >> > +    * mmu_notifier_invalidate_range_end() happens which can lead to a
> >> > +    * device seeing memory write in different order than CPU.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> >> > -   /* leave pmd empty until pte is filled */
> >> >
> >> >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> >> >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> >> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >> >     pmd_t _pmd;
> >> >     int i;
> >> >
> >> > -   /* leave pmd empty until pte is filled */
> >> > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> >> > +   /*
> >> > +    * Leave pmd empty until pte is filled note that it is fine to delay
> >> > +    * notification until mmu_notifier_invalidate_range_end() as we are
> >> > +    * replacing a zero pmd write protected page with a zero pte write
> >> > +    * protected page.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> >>
> >> Shouldn't the secondary TLB know if the page size changed?
> >
> > It should not matter, we are talking virtual to physical on behalf
> > of a device against a process address space. So the hardware should
> > not care about the page size.
> >
> 
> Does that not indicate how much the device can access? Could it try
> to access more than what is mapped?

Assuming the device has a huge TLB and 2MB huge pages with 4K small pages,
you are going from one TLB entry covering a 2MB zero page to 512 TLB
entries each covering 4K. Both cases are read only and both point
to the same data (ie zero).

It is fine to delay the TLB invalidate on the device to the call of
mmu_notifier_invalidate_range_end(). The device will keep using the
huge TLB entry for a little longer but both CPU and device are looking
at the same data.

Now if there is a racing thread that replaces one of the 512 zero pages
after the split but before mmu_notifier_invalidate_range_end(), that
code path would call mmu_notifier_invalidate_range() before changing
the pte to point to something else, which should shoot down the device
TLB (it would be a serious device bug if this did not work).
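
A write fault that replaces one of those zero pages follows the same
ptep_clear_flush_notify()/set_pte_at_notify() pattern as the replace_page()
hunk quoted above; roughly (a simplified sketch, with new_page as an
illustrative name, not an exact hunk):

	/* caller holds the pte lock; new_page is the page replacing the zero page */
	flush_cache_page(vma, address, pte_pfn(*ptep));
	/*
	 * This fires ->invalidate_range() and shoots down the stale device
	 * TLB entry before the pte is changed to point at new_page.
	 */
	ptep_clear_flush_notify(vma, address, ptep);
	set_pte_at_notify(mm, address, ptep,
			  mk_pte(new_page, vma->vm_page_prot));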


> 
> > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> > 4K pages is replace by something new then a device TLB shootdown will
> > happen before the new page is set.
> >
> > Only issue i can think of is if the IOMMU TLB (if there is one) or
> > the device TLB (you do expect that there is one) does not invalidate
> > TLB entry if the TLB shootdown is smaller than the TLB entry. That
> > would be idiotic but yes i know hardware bug.
> >
> >
> >>
> >> >
> >> >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> >> >     pmd_populate(mm, &_pmd, pgtable);
> >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> > index 1768efa4c501..63a63f1b536c 100644
> >> > --- a/mm/hugetlb.c
> >> > +++ b/mm/hugetlb.c
> >> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >> >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> >> >             } else {
> >> >                     if (cow) {
> >> > +                           /*
> >> > +                            * No need to notify as we are downgrading page
> >> > +                            * table protection not changing it to point
> >> > +                            * to a new page.
> >> > +                            *
> >> > +                            * See Documentation/vm/mmu_notifier.txt
> >> > +                            */
> >> >                             huge_ptep_set_wrprotect(src, addr, src_pte);
> >>
> >> OK.. so we could get write faults on write accesses from the device.
> >>
> >> > -                           mmu_notifier_invalidate_range(src, mmun_start,
> >> > -                                                              mmun_end);
> >> >                     }
> >> >                     entry = huge_ptep_get(src_pte);
> >> >                     ptepage = pte_page(entry);
> >> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >> >      * and that page table be reused and filled with junk.
> >> >      */
> >> >     flush_hugetlb_tlb_range(vma, start, end);
> >> > -   mmu_notifier_invalidate_range(mm, start, end);
> >> > +   /*
> >> > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
> >> > +    * page table protection not changing it to point to a new page.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> >     i_mmap_unlock_write(vma->vm_file->f_mapping);
> >> >     mmu_notifier_invalidate_range_end(mm, start, end);
> >> >
> >> > diff --git a/mm/ksm.c b/mm/ksm.c
> >> > index 6cb60f46cce5..be8f4576f842 100644
> >> > --- a/mm/ksm.c
> >> > +++ b/mm/ksm.c
> >> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >> >              * So we clear the pte and flush the tlb before the check
> >> >              * this assure us that no O_DIRECT can happen after the check
> >> >              * or in the middle of the check.
> >> > +            *
> >> > +            * No need to notify as we are downgrading page table to read
> >> > +            * only not changing it to point to a new page.
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> >              */
> >> > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> >> > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> >> >             /*
> >> >              * Check that no O_DIRECT or similar I/O is in progress on the
> >> >              * page
> >> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >> >     }
> >> >
> >> >     flush_cache_page(vma, addr, pte_pfn(*ptep));
> >> > -   ptep_clear_flush_notify(vma, addr, ptep);
> >> > +   /*
> >> > +    * No need to notify as we are replacing a read only page with another
> >> > +    * read only page with the same content.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> > +   ptep_clear_flush(vma, addr, ptep);
> >> >     set_pte_at_notify(mm, addr, ptep, newpte);
> >> >
> >> >     page_remove_rmap(page, false);
> >> > diff --git a/mm/rmap.c b/mm/rmap.c
> >> > index 061826278520..6b5a0f219ac0 100644
> >> > --- a/mm/rmap.c
> >> > +++ b/mm/rmap.c
> >> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> >> >  #endif
> >> >             }
> >> >
> >> > -           if (ret) {
> >> > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> >> > +           /*
> >> > +            * No need to call mmu_notifier_invalidate_range() as we are
> >> > +            * downgrading page table protection not changing it to point
> >> > +            * to a new page.
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> > +            */
> >> > +           if (ret)
> >> >                     (*cleaned)++;
> >> > -           }
> >> >     }
> >> >
> >> >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> >> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     if (pte_soft_dirty(pteval))
> >> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >> >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> >> > +                   /*
> >> > +                    * No need to invalidate here it will synchronize on
> >> > +                    * against the special swap migration pte.
> >> > +                    */
> >> >                     goto discard;
> >> >             }
> >> >
> >> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                      * will take care of the rest.
> >> >                      */
> >> >                     dec_mm_counter(mm, mm_counter(page));
> >> > +                   /* We have to invalidate as we cleared the pte */
> >> > +                   mmu_notifier_invalidate_range(mm, address,
> >> > +                                                 address + PAGE_SIZE);
> >> >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
> >> >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> >> >                     swp_entry_t entry;
> >> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     if (pte_soft_dirty(pteval))
> >> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >> >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> >> > +                   /*
> >> > +                    * No need to invalidate here it will synchronize on
> >> > +                    * against the special swap migration pte.
> >> > +                    */
> >> >             } else if (PageAnon(page)) {
> >> >                     swp_entry_t entry = { .val = page_private(subpage) };
> >> >                     pte_t swp_pte;
> >> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                             WARN_ON_ONCE(1);
> >> >                             ret = false;
> >> >                             /* We have to invalidate as we cleared the pte */
> >> > +                           mmu_notifier_invalidate_range(mm, address,
> >> > +                                                   address + PAGE_SIZE);
> >> >                             page_vma_mapped_walk_done(&pvmw);
> >> >                             break;
> >> >                     }
> >> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     /* MADV_FREE page check */
> >> >                     if (!PageSwapBacked(page)) {
> >> >                             if (!PageDirty(page)) {
> >> > +                                   /* Invalidate as we cleared the pte */
> >> > +                                   mmu_notifier_invalidate_range(mm,
> >> > +                                           address, address + PAGE_SIZE);
> >> >                                     dec_mm_counter(mm, MM_ANONPAGES);
> >> >                                     goto discard;
> >> >                             }
> >> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     if (pte_soft_dirty(pteval))
> >> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >> >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> >> > -           } else
> >> > +                   /* Invalidate as we cleared the pte */
> >> > +                   mmu_notifier_invalidate_range(mm, address,
> >> > +                                                 address + PAGE_SIZE);
> >> > +           } else {
> >> > +                   /*
> >> > +                    * We should not need to notify here as we reach this
> >> > +                    * case only from freeze_page() itself only call from
> >> > +                    * split_huge_page_to_list() so everything below must
> >> > +                    * be true:
> >> > +                    *   - page is not anonymous
> >> > +                    *   - page is locked
> >> > +                    *
> >> > +                    * So as it is a locked file back page thus it can not
> >> > +                    * be remove from the page cache and replace by a new
> >> > +                    * page before mmu_notifier_invalidate_range_end so no
> >> > +                    * concurrent thread might update its page table to
> >> > +                    * point at new page while a device still is using this
> >> > +                    * page.
> >> > +                    *
> >> > +                    * See Documentation/vm/mmu_notifier.txt
> >> > +                    */
> >> >                     dec_mm_counter(mm, mm_counter_file(page));
> >> > +           }
> >> >  discard:
> >> > +           /*
> >> > +            * No need to call mmu_notifier_invalidate_range() it has be
> >> > +            * done above for all cases requiring it to happen under page
> >> > +            * table lock before mmu_notifier_invalidate_range_end()
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> > +            */
> >> >             page_remove_rmap(subpage, PageHuge(page));
> >> >             put_page(page);
> >> > -           mmu_notifier_invalidate_range(mm, address,
> >> > -                                         address + PAGE_SIZE);
> >> >     }
> >> >
> >> >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> >>
> >> Looking at the patchset, I understand the efficiency, but I am concerned
> >> with correctness.
> >
> > I am fine in holding this off from reaching Linus but only way to flush this
> > issues out if any is to have this patch in linux-next or somewhere were they
> > get a chance of being tested.
> >
> 
> Yep, I would like to see some additional testing around npu and get Alistair
> Popple to comment as well

I think this patch is fine. The only race window that it might make
bigger should have no bad consequences.

> 
> > Note that the second patch is always safe. I agree that this one might
> > not be if hardware implementation is idiotic (well that would be my
> > opinion and any opinion/point of view can be challenge :))
> 
> 
> You mean the only_end variant that avoids the extra shootdown after pmd/pte
> changes, using only_end in place of the usual _end call? That seemed
> reasonable to me, but I've not tested it or evaluated it in depth

Yes, patch 2/2 in this series is definitely fine. It invalidates the device
TLB right after clearing the pte entry and avoids a later unnecessary
invalidation of the same TLB.
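
In other words the pattern patch 2/2 allows is roughly (a sketch of the
intended usage, not an exact hunk from the patch):

	mmu_notifier_invalidate_range_start(mm, start, end);
	/* ... walk to the pte ... */
	/*
	 * Clearing under the page table lock; the _notify variant already
	 * calls ->invalidate_range() so the device TLB is flushed here.
	 */
	pte = ptep_clear_flush_notify(vma, address, ptep);
	/* ... update the pte, drop the page table lock ... */
	/*
	 * Skip the second, redundant ->invalidate_range() that a plain
	 * mmu_notifier_invalidate_range_end() would have issued.
	 */
	mmu_notifier_invalidate_range_only_end(mm, start, end);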

Jérôme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-19 16:58           ` Jerome Glisse
  0 siblings, 0 replies; 40+ messages in thread
From: Jerome Glisse @ 2017-10-19 16:58 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-next

On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> >> On Mon, 16 Oct 2017 23:10:02 -0400
> >> jglisse@redhat.com wrote:
> >>
> >> > From: Jerome Glisse <jglisse@redhat.com>
> >> >
> >> > +           /*
> >> > +            * No need to call mmu_notifier_invalidate_range() as we are
> >> > +            * downgrading page table protection not changing it to point
> >> > +            * to a new page.
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> > +            */
> >> >             if (pmdp) {
> >> >  #ifdef CONFIG_FS_DAX_PMD
> >> >                     pmd_t pmd;
> >> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> >> >                     pmd = pmd_wrprotect(pmd);
> >> >                     pmd = pmd_mkclean(pmd);
> >> >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> >> > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> >>
> >> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> >
> > I am assuming hardware does sane thing of setting the dirty bit only
> > when walking the CPU page table when device does a write fault ie
> > once the device get a write TLB entry the dirty is set by the IOMMU
> > when walking the page table before returning the lookup result to the
> > device and that it won't be set again latter (ie propagated back
> > latter).
> >
> 
> The other possibility is that the hardware thinks the page is writable
> and already marked dirty. It allows writes and does not set the dirty bit?

I thought about this some more and the patch cannot regress anything
that is not broken today. So if we assume that the device can propagate
the dirty bit because it can cache the write protection, then all current
code is broken for two reasons:

First one is that the current code clears the pte entry, builds a new pte
value with write protection and updates the pte entry with the new value.
So any PASID/ATS platform that allows the device to cache the write bit
and set the dirty bit anytime after that can race during that window and
you would lose the dirty bit of the device. That is not that bad as you
are gonna propagate the dirty bit to the struct page.

Second one is if the dirty bit is propagated back to the new write
protected pte. A quick look at the code suggests that when we zap a pte
or mkclean we don't check that the pte has write permission but only
care about the dirty bit. So it should not have any bad consequence.

After this patch only the second window is bigger and thus more likely
to happen. But nothing sinister should happen from that.


> 
> > I should probably have spell that out and maybe some of the ATS/PASID
> > implementer did not do that.
> >
> >>
> >> >  unlock_pmd:
> >> >                     spin_unlock(ptl);
> >> >  #endif
> >> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> >> >                     pte = pte_wrprotect(pte);
> >> >                     pte = pte_mkclean(pte);
> >> >                     set_pte_at(vma->vm_mm, address, ptep, pte);
> >> > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> >>
> >> Ditto
> >>
> >> >  unlock_pte:
> >> >                     pte_unmap_unlock(ptep, ptl);
> >> >             }
> >> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> >> > index 6866e8126982..49c925c96b8a 100644
> >> > --- a/include/linux/mmu_notifier.h
> >> > +++ b/include/linux/mmu_notifier.h
> >> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> >> >      * shared page-tables, it not necessary to implement the
> >> >      * invalidate_range_start()/end() notifiers, as
> >> >      * invalidate_range() alread catches the points in time when an
> >> > -    * external TLB range needs to be flushed.
> >> > +    * external TLB range needs to be flushed. For more in depth
> >> > +    * discussion on this see Documentation/vm/mmu_notifier.txt
> >> >      *
> >> >      * The invalidate_range() function is called under the ptl
> >> >      * spin-lock and not allowed to sleep.
> >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> > index c037d3d34950..ff5bc647b51d 100644
> >> > --- a/mm/huge_memory.c
> >> > +++ b/mm/huge_memory.c
> >> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> >> >             goto out_free_pages;
> >> >     VM_BUG_ON_PAGE(!PageHead(page), page);
> >> >
> >> > +   /*
> >> > +    * Leave pmd empty until pte is filled note we must notify here as
> >> > +    * concurrent CPU thread might write to new page before the call to
> >> > +    * mmu_notifier_invalidate_range_end() happens which can lead to a
> >> > +    * device seeing memory write in different order than CPU.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> >> > -   /* leave pmd empty until pte is filled */
> >> >
> >> >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> >> >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> >> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >> >     pmd_t _pmd;
> >> >     int i;
> >> >
> >> > -   /* leave pmd empty until pte is filled */
> >> > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> >> > +   /*
> >> > +    * Leave pmd empty until pte is filled note that it is fine to delay
> >> > +    * notification until mmu_notifier_invalidate_range_end() as we are
> >> > +    * replacing a zero pmd write protected page with a zero pte write
> >> > +    * protected page.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> >>
> >> Shouldn't the secondary TLB know if the page size changed?
> >
> > It should not matter, we are talking virtual to physical on behalf
> > of a device against a process address space. So the hardware should
> > not care about the page size.
> >
> 
> Does that not indicate how much the device can access? Could it try
> to access more than what is mapped?

Assuming the device has a huge TLB and we have a 2MB huge page with 4K
small pages: you are going from one TLB entry covering the 2MB zero
page to 512 TLB entries each covering 4K. Both cases are read only and
both cases point to the same data (ie zero).

It is fine to delay the device TLB invalidate to the call of
mmu_notifier_invalidate_range_end(). The device will keep using the
huge TLB entry for a little longer but both CPU and device are looking
at the same data.

Now if there is a racing thread that replaces one of the 512 zero pages
after the split but before mmu_notifier_invalidate_range_end(), that
code path would call mmu_notifier_invalidate_range() before changing
the pte to point to something else, which should shoot down the device
TLB (it would be a serious device bug if this did not work).
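
To put the ordering in code, the zero page split we are discussing is
roughly the following (heavily condensed from __split_huge_pmd() and
__split_huge_zero_page_pmd(), locking and error handling omitted):

	unsigned long start = haddr;	/* haddr is pmd aligned */

	mmu_notifier_invalidate_range_start(mm, start, start + HPAGE_PMD_SIZE);
	...
	/* CPU TLB flushed now, device TLB flush deferred to _end() below */
	pmdp_huge_clear_flush(vma, haddr, pmd);

	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
	pmd_populate(mm, &_pmd, pgtable);
	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
		pte_t entry = pte_mkspecial(pfn_pte(my_zero_pfn(haddr),
						    vma->vm_page_prot));
		pte_t *pte = pte_offset_map(&_pmd, haddr);

		/* still read only zero pages, same data as the huge page */
		set_pte_at(mm, haddr, pte, entry);
		pte_unmap(pte);
	}
	smp_wmb();	/* make the ptes visible before the pmd */
	pmd_populate(mm, pmd, pgtable);
	...
	/* device TLB is shot down here at the latest */
	mmu_notifier_invalidate_range_end(mm, start, start + HPAGE_PMD_SIZE);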


> 
> > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> > 4K pages is replace by something new then a device TLB shootdown will
> > happen before the new page is set.
> >
> > Only issue i can think of is if the IOMMU TLB (if there is one) or
> > the device TLB (you do expect that there is one) does not invalidate
> > TLB entry if the TLB shootdown is smaller than the TLB entry. That
> > would be idiotic but yes i know hardware bug.
> >
> >
> >>
> >> >
> >> >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> >> >     pmd_populate(mm, &_pmd, pgtable);
> >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> > index 1768efa4c501..63a63f1b536c 100644
> >> > --- a/mm/hugetlb.c
> >> > +++ b/mm/hugetlb.c
> >> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >> >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> >> >             } else {
> >> >                     if (cow) {
> >> > +                           /*
> >> > +                            * No need to notify as we are downgrading page
> >> > +                            * table protection not changing it to point
> >> > +                            * to a new page.
> >> > +                            *
> >> > +                            * See Documentation/vm/mmu_notifier.txt
> >> > +                            */
> >> >                             huge_ptep_set_wrprotect(src, addr, src_pte);
> >>
> >> OK.. so we could get write faults on write accesses from the device.
> >>
> >> > -                           mmu_notifier_invalidate_range(src, mmun_start,
> >> > -                                                              mmun_end);
> >> >                     }
> >> >                     entry = huge_ptep_get(src_pte);
> >> >                     ptepage = pte_page(entry);
> >> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >> >      * and that page table be reused and filled with junk.
> >> >      */
> >> >     flush_hugetlb_tlb_range(vma, start, end);
> >> > -   mmu_notifier_invalidate_range(mm, start, end);
> >> > +   /*
> >> > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
> >> > +    * page table protection not changing it to point to a new page.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> >     i_mmap_unlock_write(vma->vm_file->f_mapping);
> >> >     mmu_notifier_invalidate_range_end(mm, start, end);
> >> >
> >> > diff --git a/mm/ksm.c b/mm/ksm.c
> >> > index 6cb60f46cce5..be8f4576f842 100644
> >> > --- a/mm/ksm.c
> >> > +++ b/mm/ksm.c
> >> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >> >              * So we clear the pte and flush the tlb before the check
> >> >              * this assure us that no O_DIRECT can happen after the check
> >> >              * or in the middle of the check.
> >> > +            *
> >> > +            * No need to notify as we are downgrading page table to read
> >> > +            * only not changing it to point to a new page.
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> >              */
> >> > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> >> > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> >> >             /*
> >> >              * Check that no O_DIRECT or similar I/O is in progress on the
> >> >              * page
> >> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >> >     }
> >> >
> >> >     flush_cache_page(vma, addr, pte_pfn(*ptep));
> >> > -   ptep_clear_flush_notify(vma, addr, ptep);
> >> > +   /*
> >> > +    * No need to notify as we are replacing a read only page with another
> >> > +    * read only page with the same content.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> > +   ptep_clear_flush(vma, addr, ptep);
> >> >     set_pte_at_notify(mm, addr, ptep, newpte);
> >> >
> >> >     page_remove_rmap(page, false);
> >> > diff --git a/mm/rmap.c b/mm/rmap.c
> >> > index 061826278520..6b5a0f219ac0 100644
> >> > --- a/mm/rmap.c
> >> > +++ b/mm/rmap.c
> >> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> >> >  #endif
> >> >             }
> >> >
> >> > -           if (ret) {
> >> > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> >> > +           /*
> >> > +            * No need to call mmu_notifier_invalidate_range() as we are
> >> > +            * downgrading page table protection not changing it to point
> >> > +            * to a new page.
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> > +            */
> >> > +           if (ret)
> >> >                     (*cleaned)++;
> >> > -           }
> >> >     }
> >> >
> >> >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> >> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     if (pte_soft_dirty(pteval))
> >> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >> >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> >> > +                   /*
> >> > +                    * No need to invalidate here it will synchronize on
> >> > +                    * against the special swap migration pte.
> >> > +                    */
> >> >                     goto discard;
> >> >             }
> >> >
> >> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                      * will take care of the rest.
> >> >                      */
> >> >                     dec_mm_counter(mm, mm_counter(page));
> >> > +                   /* We have to invalidate as we cleared the pte */
> >> > +                   mmu_notifier_invalidate_range(mm, address,
> >> > +                                                 address + PAGE_SIZE);
> >> >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
> >> >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> >> >                     swp_entry_t entry;
> >> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     if (pte_soft_dirty(pteval))
> >> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >> >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> >> > +                   /*
> >> > +                    * No need to invalidate here it will synchronize on
> >> > +                    * against the special swap migration pte.
> >> > +                    */
> >> >             } else if (PageAnon(page)) {
> >> >                     swp_entry_t entry = { .val = page_private(subpage) };
> >> >                     pte_t swp_pte;
> >> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                             WARN_ON_ONCE(1);
> >> >                             ret = false;
> >> >                             /* We have to invalidate as we cleared the pte */
> >> > +                           mmu_notifier_invalidate_range(mm, address,
> >> > +                                                   address + PAGE_SIZE);
> >> >                             page_vma_mapped_walk_done(&pvmw);
> >> >                             break;
> >> >                     }
> >> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     /* MADV_FREE page check */
> >> >                     if (!PageSwapBacked(page)) {
> >> >                             if (!PageDirty(page)) {
> >> > +                                   /* Invalidate as we cleared the pte */
> >> > +                                   mmu_notifier_invalidate_range(mm,
> >> > +                                           address, address + PAGE_SIZE);
> >> >                                     dec_mm_counter(mm, MM_ANONPAGES);
> >> >                                     goto discard;
> >> >                             }
> >> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     if (pte_soft_dirty(pteval))
> >> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >> >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> >> > -           } else
> >> > +                   /* Invalidate as we cleared the pte */
> >> > +                   mmu_notifier_invalidate_range(mm, address,
> >> > +                                                 address + PAGE_SIZE);
> >> > +           } else {
> >> > +                   /*
> >> > +                    * We should not need to notify here as we reach this
> >> > +                    * case only from freeze_page() itself only call from
> >> > +                    * split_huge_page_to_list() so everything below must
> >> > +                    * be true:
> >> > +                    *   - page is not anonymous
> >> > +                    *   - page is locked
> >> > +                    *
> >> > +                    * So as it is a locked file back page thus it can not
> >> > +                    * be remove from the page cache and replace by a new
> >> > +                    * page before mmu_notifier_invalidate_range_end so no
> >> > +                    * concurrent thread might update its page table to
> >> > +                    * point at new page while a device still is using this
> >> > +                    * page.
> >> > +                    *
> >> > +                    * See Documentation/vm/mmu_notifier.txt
> >> > +                    */
> >> >                     dec_mm_counter(mm, mm_counter_file(page));
> >> > +           }
> >> >  discard:
> >> > +           /*
> >> > +            * No need to call mmu_notifier_invalidate_range() it has be
> >> > +            * done above for all cases requiring it to happen under page
> >> > +            * table lock before mmu_notifier_invalidate_range_end()
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> > +            */
> >> >             page_remove_rmap(subpage, PageHuge(page));
> >> >             put_page(page);
> >> > -           mmu_notifier_invalidate_range(mm, address,
> >> > -                                         address + PAGE_SIZE);
> >> >     }
> >> >
> >> >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> >>
> >> Looking at the patchset, I understand the efficiency, but I am concerned
> >> with correctness.
> >
> > I am fine in holding this off from reaching Linus but only way to flush this
> > issues out if any is to have this patch in linux-next or somewhere were they
> > get a chance of being tested.
> >
> 
> Yep, I would like to see some additional testing around npu and get Alistair
> Popple to comment as well

I think this patch is fine. The only race window that it might make
bigger should have no bad consequences.

> 
> > Note that the second patch is always safe. I agree that this one might
> > not be if hardware implementation is idiotic (well that would be my
> > opinion and any opinion/point of view can be challenge :))
> 
> 
> You mean the only_end variant that avoids shootdown after pmd/pte changes
> that avoid the _start/_end and have just the only_end variant? That seemed
> reasonable to me, but I've not tested it or evaluated it in depth

Yes, patch 2/2 in this series is definitely fine. It invalidates the
device TLB right after clearing the pte entry and avoids a later
unnecessary invalidation of the same TLB.
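
For completeness, the pattern patch 2/2 enables looks roughly like this
(a sketch using the names from the series, not the exact code):

	mmu_notifier_invalidate_range_start(mm, start, end);
	...
	/*
	 * Clears the pte and calls the invalidate_range() callback right
	 * away, so the device TLB is already shot down at this point.
	 */
	ptep_clear_flush_notify(vma, address, ptep);
	set_pte_at(mm, address, ptep, new_pte);	/* new_pte: the new page */
	...
	/*
	 * Same as mmu_notifier_invalidate_range_end() except that it
	 * skips the now redundant invalidate_range() callback.
	 */
	mmu_notifier_invalidate_range_only_end(mm, start, end);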

Jerome

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-21  5:54             ` Balbir Singh
  0 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2017-10-21  5:54 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-next

On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote:
> On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> > > > On Mon, 16 Oct 2017 23:10:02 -0400
> > > > jglisse@redhat.com wrote:
> > > > 
> > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > 
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > +            * downgrading page table protection not changing it to point
> > > > > +            * to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > >             if (pmdp) {
> > > > >  #ifdef CONFIG_FS_DAX_PMD
> > > > >                     pmd_t pmd;
> > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > >                     pmd = pmd_wrprotect(pmd);
> > > > >                     pmd = pmd_mkclean(pmd);
> > > > >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > > 
> > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> > > 
> > > I am assuming hardware does sane thing of setting the dirty bit only
> > > when walking the CPU page table when device does a write fault ie
> > > once the device get a write TLB entry the dirty is set by the IOMMU
> > > when walking the page table before returning the lookup result to the
> > > device and that it won't be set again latter (ie propagated back
> > > latter).
> > > 
> > 
> > The other possibility is that the hardware things the page is writable
> > and already
> > marked dirty. It allows writes and does not set the dirty bit?
> 
> I thought about this some more and the patch cannot regress anything
> that is not broken today. So if we assume that the device can propagate
> the dirty bit because it can cache the write bit, then all current code
> is broken for two reasons:
> 
> First one is that the current code clears the pte entry, builds a new
> pte value with write protection and updates the pte entry with the new
> pte value. So any PASID/ATS platform that allows the device to cache
> the write bit and set the dirty bit anytime after that can race during
> that window and you would lose the dirty bit of the device. That is not
> that bad as you are going to propagate the dirty bit to the struct page
> anyway.

But they stay consistent with the notifiers, so from the OS perspective
it notifies of any PTE changes as they happen. When the ATS platform
sees an invalidation, it invalidates its PTEs as well.

I was speaking of the case where the ATS platform assumes it has write
access and has not seen any invalidation: the OS could return to user
space or to the caller with the write bit clear, but the ATS platform
could still do a write since it has not seen the invalidation.
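
For reference, the hook the secondary TLB relies on here is the
invalidate_range() callback, roughly (a generic sketch, not any
particular driver's implementation):

	static void my_invalidate_range(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start,
					unsigned long end)
	{
		/*
		 * Shoot down the device/IOMMU TLB entries covering
		 * [start, end); the device only stays coherent if it sees
		 * this before the pte is changed to point somewhere new.
		 */
	}

so whatever window exists between a pte update and that call is exactly
the window being argued about here.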

> 
> Second one is that the dirty bit may be propagated back to the new
> write protected pte. From a quick look at the code it seems that when
> we zap a pte or do mkclean we don't check that the pte has write
> permission but only care about the dirty bit. So it should not have any
> bad consequence.
> 
> After this patch only the second window is bigger and thus more likely
> to happen. But nothing sinister should happen from that.
> 
> 
> > 
> > > I should probably have spell that out and maybe some of the ATS/PASID
> > > implementer did not do that.
> > > 
> > > > 
> > > > >  unlock_pmd:
> > > > >                     spin_unlock(ptl);
> > > > >  #endif
> > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > >                     pte = pte_wrprotect(pte);
> > > > >                     pte = pte_mkclean(pte);
> > > > >                     set_pte_at(vma->vm_mm, address, ptep, pte);
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > > 
> > > > Ditto
> > > > 
> > > > >  unlock_pte:
> > > > >                     pte_unmap_unlock(ptep, ptl);
> > > > >             }
> > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > > > > index 6866e8126982..49c925c96b8a 100644
> > > > > --- a/include/linux/mmu_notifier.h
> > > > > +++ b/include/linux/mmu_notifier.h
> > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> > > > >      * shared page-tables, it not necessary to implement the
> > > > >      * invalidate_range_start()/end() notifiers, as
> > > > >      * invalidate_range() alread catches the points in time when an
> > > > > -    * external TLB range needs to be flushed.
> > > > > +    * external TLB range needs to be flushed. For more in depth
> > > > > +    * discussion on this see Documentation/vm/mmu_notifier.txt
> > > > >      *
> > > > >      * The invalidate_range() function is called under the ptl
> > > > >      * spin-lock and not allowed to sleep.
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index c037d3d34950..ff5bc647b51d 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > > > >             goto out_free_pages;
> > > > >     VM_BUG_ON_PAGE(!PageHead(page), page);
> > > > > 
> > > > > +   /*
> > > > > +    * Leave pmd empty until pte is filled note we must notify here as
> > > > > +    * concurrent CPU thread might write to new page before the call to
> > > > > +    * mmu_notifier_invalidate_range_end() happens which can lead to a
> > > > > +    * device seeing memory write in different order than CPU.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > > > > -   /* leave pmd empty until pte is filled */
> > > > > 
> > > > >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > > > >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > > > >     pmd_t _pmd;
> > > > >     int i;
> > > > > 
> > > > > -   /* leave pmd empty until pte is filled */
> > > > > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > > > > +   /*
> > > > > +    * Leave pmd empty until pte is filled note that it is fine to delay
> > > > > +    * notification until mmu_notifier_invalidate_range_end() as we are
> > > > > +    * replacing a zero pmd write protected page with a zero pte write
> > > > > +    * protected page.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> > > > 
> > > > Shouldn't the secondary TLB know if the page size changed?
> > > 
> > > It should not matter, we are talking virtual to physical on behalf
> > > of a device against a process address space. So the hardware should
> > > not care about the page size.
> > > 
> > 
> > Does that not indicate how much the device can access? Could it try
> > to access more than what is mapped?
> 
> Assuming the device has a huge TLB and we have a 2MB huge page with 4K
> small pages: you are going from one TLB entry covering the 2MB zero
> page to 512 TLB entries each covering 4K. Both cases are read only and
> both cases point to the same data (ie zero).
> 
> It is fine to delay the device TLB invalidate to the call of
> mmu_notifier_invalidate_range_end(). The device will keep using the
> huge TLB entry for a little longer but both CPU and device are looking
> at the same data.
> 
> Now if there is a racing thread that replaces one of the 512 zero pages
> after the split but before mmu_notifier_invalidate_range_end(), that
> code path would call mmu_notifier_invalidate_range() before changing
> the pte to point to something else, which should shoot down the device
> TLB (it would be a serious device bug if this did not work).

OK, this seems reasonable, but I'd really like to see whether it can be
tested.

> 
> 
> > 
> > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> > > 4K pages is replace by something new then a device TLB shootdown will
> > > happen before the new page is set.
> > > 
> > > Only issue i can think of is if the IOMMU TLB (if there is one) or
> > > the device TLB (you do expect that there is one) does not invalidate
> > > TLB entry if the TLB shootdown is smaller than the TLB entry. That
> > > would be idiotic but yes i know hardware bug.
> > > 
> > > 
> > > > 
> > > > > 
> > > > >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > > > >     pmd_populate(mm, &_pmd, pgtable);
> > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > index 1768efa4c501..63a63f1b536c 100644
> > > > > --- a/mm/hugetlb.c
> > > > > +++ b/mm/hugetlb.c
> > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> > > > >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> > > > >             } else {
> > > > >                     if (cow) {
> > > > > +                           /*
> > > > > +                            * No need to notify as we are downgrading page
> > > > > +                            * table protection not changing it to point
> > > > > +                            * to a new page.
> > > > > +                            *
> > > > > +                            * See Documentation/vm/mmu_notifier.txt
> > > > > +                            */
> > > > >                             huge_ptep_set_wrprotect(src, addr, src_pte);
> > > > 
> > > > OK.. so we could get write faults on write accesses from the device.
> > > > 
> > > > > -                           mmu_notifier_invalidate_range(src, mmun_start,
> > > > > -                                                              mmun_end);
> > > > >                     }
> > > > >                     entry = huge_ptep_get(src_pte);
> > > > >                     ptepage = pte_page(entry);
> > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > > > >      * and that page table be reused and filled with junk.
> > > > >      */
> > > > >     flush_hugetlb_tlb_range(vma, start, end);
> > > > > -   mmu_notifier_invalidate_range(mm, start, end);
> > > > > +   /*
> > > > > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
> > > > > +    * page table protection not changing it to point to a new page.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > >     i_mmap_unlock_write(vma->vm_file->f_mapping);
> > > > >     mmu_notifier_invalidate_range_end(mm, start, end);
> > > > > 
> > > > > diff --git a/mm/ksm.c b/mm/ksm.c
> > > > > index 6cb60f46cce5..be8f4576f842 100644
> > > > > --- a/mm/ksm.c
> > > > > +++ b/mm/ksm.c
> > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> > > > >              * So we clear the pte and flush the tlb before the check
> > > > >              * this assure us that no O_DIRECT can happen after the check
> > > > >              * or in the middle of the check.
> > > > > +            *
> > > > > +            * No need to notify as we are downgrading page table to read
> > > > > +            * only not changing it to point to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > >              */
> > > > > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> > > > > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> > > > >             /*
> > > > >              * Check that no O_DIRECT or similar I/O is in progress on the
> > > > >              * page
> > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> > > > >     }
> > > > > 
> > > > >     flush_cache_page(vma, addr, pte_pfn(*ptep));
> > > > > -   ptep_clear_flush_notify(vma, addr, ptep);
> > > > > +   /*
> > > > > +    * No need to notify as we are replacing a read only page with another
> > > > > +    * read only page with the same content.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > > +   ptep_clear_flush(vma, addr, ptep);
> > > > >     set_pte_at_notify(mm, addr, ptep, newpte);
> > > > > 
> > > > >     page_remove_rmap(page, false);
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > index 061826278520..6b5a0f219ac0 100644
> > > > > --- a/mm/rmap.c
> > > > > +++ b/mm/rmap.c
> > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > > > >  #endif
> > > > >             }
> > > > > 
> > > > > -           if (ret) {
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > +            * downgrading page table protection not changing it to point
> > > > > +            * to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > > +           if (ret)
> > > > >                     (*cleaned)++;
> > > > > -           }
> > > > >     }
> > > > > 
> > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> > > > > +                   /*
> > > > > +                    * No need to invalidate here it will synchronize on
> > > > > +                    * against the special swap migration pte.
> > > > > +                    */
> > > > >                     goto discard;
> > > > >             }
> > > > > 
> > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                      * will take care of the rest.
> > > > >                      */
> > > > >                     dec_mm_counter(mm, mm_counter(page));
> > > > > +                   /* We have to invalidate as we cleared the pte */
> > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                 address + PAGE_SIZE);
> > > > >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
> > > > >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> > > > >                     swp_entry_t entry;
> > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > +                   /*
> > > > > +                    * No need to invalidate here it will synchronize on
> > > > > +                    * against the special swap migration pte.
> > > > > +                    */
> > > > >             } else if (PageAnon(page)) {
> > > > >                     swp_entry_t entry = { .val = page_private(subpage) };
> > > > >                     pte_t swp_pte;
> > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                             WARN_ON_ONCE(1);
> > > > >                             ret = false;
> > > > >                             /* We have to invalidate as we cleared the pte */
> > > > > +                           mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                   address + PAGE_SIZE);
> > > > >                             page_vma_mapped_walk_done(&pvmw);
> > > > >                             break;
> > > > >                     }
> > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     /* MADV_FREE page check */
> > > > >                     if (!PageSwapBacked(page)) {
> > > > >                             if (!PageDirty(page)) {
> > > > > +                                   /* Invalidate as we cleared the pte */
> > > > > +                                   mmu_notifier_invalidate_range(mm,
> > > > > +                                           address, address + PAGE_SIZE);
> > > > >                                     dec_mm_counter(mm, MM_ANONPAGES);
> > > > >                                     goto discard;
> > > > >                             }
> > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > -           } else
> > > > > +                   /* Invalidate as we cleared the pte */
> > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                 address + PAGE_SIZE);
> > > > > +           } else {
> > > > > +                   /*
> > > > > +                    * We should not need to notify here as we reach this
> > > > > +                    * case only from freeze_page() itself only call from
> > > > > +                    * split_huge_page_to_list() so everything below must
> > > > > +                    * be true:
> > > > > +                    *   - page is not anonymous
> > > > > +                    *   - page is locked
> > > > > +                    *
> > > > > +                    * So as it is a locked file back page thus it can not
> > > > > +                    * be remove from the page cache and replace by a new
> > > > > +                    * page before mmu_notifier_invalidate_range_end so no
> > > > > +                    * concurrent thread might update its page table to
> > > > > +                    * point at new page while a device still is using this
> > > > > +                    * page.
> > > > > +                    *
> > > > > +                    * See Documentation/vm/mmu_notifier.txt
> > > > > +                    */
> > > > >                     dec_mm_counter(mm, mm_counter_file(page));
> > > > > +           }
> > > > >  discard:
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() it has be
> > > > > +            * done above for all cases requiring it to happen under page
> > > > > +            * table lock before mmu_notifier_invalidate_range_end()
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > >             page_remove_rmap(subpage, PageHuge(page));
> > > > >             put_page(page);
> > > > > -           mmu_notifier_invalidate_range(mm, address,
> > > > > -                                         address + PAGE_SIZE);
> > > > >     }
> > > > > 
> > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > 
> > > > Looking at the patchset, I understand the efficiency, but I am concerned
> > > > with correctness.
> > > 
> > > I am fine in holding this off from reaching Linus but only way to flush this
> > > issues out if any is to have this patch in linux-next or somewhere were they
> > > get a chance of being tested.
> > > 
> > 
> > Yep, I would like to see some additional testing around npu and get Alistair
> > Popple to comment as well
> 
> I think this patch is fine. The only race window that it might make
> bigger should have no bad consequences.
> 
> > 
> > > Note that the second patch is always safe. I agree that this one might
> > > not be if hardware implementation is idiotic (well that would be my
> > > opinion and any opinion/point of view can be challenge :))
> > 
> > 
> > You mean the only_end variant that avoids shootdown after pmd/pte changes
> > that avoid the _start/_end and have just the only_end variant? That seemed
> > reasonable to me, but I've not tested it or evaluated it in depth
> 
> Yes, patch 2/2 in this series is definitely fine. It invalidates the
> device TLB right after clearing the pte entry and avoids a later
> unnecessary invalidation of the same TLB.
> 
> Jérôme

Balbir Singh.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-21  5:54             ` Balbir Singh
  0 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2017-10-21  5:54 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrea Arcangeli, Stephen Rothwell, Joerg Roedel,
	Benjamin Herrenschmidt, Andrew Donnellan,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-next,
	Michael Ellerman, Alistair Popple, Andrew Morton, Linus Torvalds,
	David Woodhouse

On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote:
> On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> > > > On Mon, 16 Oct 2017 23:10:02 -0400
> > > > jglisse@redhat.com wrote:
> > > > 
> > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > 
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > +            * downgrading page table protection not changing it to point
> > > > > +            * to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > >             if (pmdp) {
> > > > >  #ifdef CONFIG_FS_DAX_PMD
> > > > >                     pmd_t pmd;
> > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > >                     pmd = pmd_wrprotect(pmd);
> > > > >                     pmd = pmd_mkclean(pmd);
> > > > >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > > 
> > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> > > 
> > > I am assuming hardware does the sane thing of setting the dirty bit
> > > only when walking the CPU page table on a device write fault, ie once
> > > the device gets a write TLB entry the dirty bit is set by the IOMMU
> > > while walking the page table before returning the lookup result to the
> > > device, and it won't be set again later (ie propagated back later).
> > > 
> > 
> > The other possibility is that the hardware thinks the page is writable
> > and already marked dirty. It allows writes and does not set the dirty
> > bit?
> 
> I thought about this some more and the patch can not regress anything
> that is not broken today. So if we assume that the device can propagate
> the dirty bit because it can cache the write protection, then all current
> code is broken for two reasons:
> 
> The first one is that the current code clears the pte entry, builds a new
> pte value with write protection and updates the pte entry with that new
> value. So any PASID/ATS platform that allows the device to cache the
> write bit and set the dirty bit anytime after that can race during that
> window and you would lose the device's dirty bit. That is not that bad as
> you are going to propagate the dirty bit to the struct page.
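
(To make the window concrete: the pte side of the quoted
dax_mapping_entry_mkclean() hunk does, in order, roughly the sequence
below. The sketch just restates the quoted hunk and the race described
above; nothing beyond that is implied.)

	pte = ptep_clear_flush(vma, address, ptep);	/* entry cleared: window opens */
	pte = pte_wrprotect(pte);
	pte = pte_mkclean(pte);
	/*
	 * A device that cached a writable entry and sets the dirty bit during
	 * this window would have that dirty bit overwritten by the store below.
	 */
	set_pte_at(vma->vm_mm, address, ptep, pte);	/* clean, read-only entry installed */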

But they stay consistent with the notifiers, so from the OS perspective
it notifies of any PTE changes as they happen. When the ATS platform sees
an invalidation, it invalidates its PTEs as well.

I was speaking of the case where the ATS platform could assume it has
write access and has not seen any invalidation: the OS could return to
user space or to the caller with the write bit clear, but the ATS
platform could still do a write since it has not seen the invalidation.

> 
> The second one is if the dirty bit is propagated back to the new write
> protected pte. From a quick look at the code it seems that when we zap a
> pte or mkclean we don't check that the pte has write permission but only
> care about the dirty bit. So it should not have any bad consequence.
> 
> After this patch only the second window is bigger and thus more likely
> to happen. But nothing sinister should happen from that.
> 
> 
> > 
> > > I should probably have spelled that out, and maybe some of the ATS/PASID
> > > implementers did not do that.
> > > 
> > > > 
> > > > >  unlock_pmd:
> > > > >                     spin_unlock(ptl);
> > > > >  #endif
> > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > >                     pte = pte_wrprotect(pte);
> > > > >                     pte = pte_mkclean(pte);
> > > > >                     set_pte_at(vma->vm_mm, address, ptep, pte);
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > > 
> > > > Ditto
> > > > 
> > > > >  unlock_pte:
> > > > >                     pte_unmap_unlock(ptep, ptl);
> > > > >             }
> > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > > > > index 6866e8126982..49c925c96b8a 100644
> > > > > --- a/include/linux/mmu_notifier.h
> > > > > +++ b/include/linux/mmu_notifier.h
> > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> > > > >      * shared page-tables, it not necessary to implement the
> > > > >      * invalidate_range_start()/end() notifiers, as
> > > > >      * invalidate_range() alread catches the points in time when an
> > > > > -    * external TLB range needs to be flushed.
> > > > > +    * external TLB range needs to be flushed. For more in depth
> > > > > +    * discussion on this see Documentation/vm/mmu_notifier.txt
> > > > >      *
> > > > >      * The invalidate_range() function is called under the ptl
> > > > >      * spin-lock and not allowed to sleep.
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index c037d3d34950..ff5bc647b51d 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > > > >             goto out_free_pages;
> > > > >     VM_BUG_ON_PAGE(!PageHead(page), page);
> > > > > 
> > > > > +   /*
> > > > > +    * Leave pmd empty until pte is filled. Note we must notify here as a
> > > > > +    * concurrent CPU thread might write to the new page before the call to
> > > > > +    * mmu_notifier_invalidate_range_end() happens, which can lead to a
> > > > > +    * device seeing memory writes in a different order than the CPU.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > > > > -   /* leave pmd empty until pte is filled */
> > > > > 
> > > > >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > > > >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > > > >     pmd_t _pmd;
> > > > >     int i;
> > > > > 
> > > > > -   /* leave pmd empty until pte is filled */
> > > > > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > > > > +   /*
> > > > > +    * Leave pmd empty until pte is filled. Note that it is fine to delay
> > > > > +    * the notification until mmu_notifier_invalidate_range_end() as we are
> > > > > +    * replacing a zero pmd write protected page with a zero pte write
> > > > > +    * protected page.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> > > > 
> > > > Shouldn't the secondary TLB know if the page size changed?
> > > 
> > > It should not matter, we are talking virtual to physical on behalf
> > > of a device against a process address space. So the hardware should
> > > not care about the page size.
> > > 
> > 
> > Does that not indicate how much the device can access? Could it try
> > to access more than what is mapped?
> 
> Assuming the device has a huge TLB and 2MB huge pages with 4K small
> pages: you are going from one TLB entry covering a 2MB zero page to 512
> TLB entries each covering 4K. Both cases are read only and both cases
> point to the same data (ie zero).
> 
> It is fine to delay the TLB invalidate on the device to the call of
> mmu_notifier_invalidate_range_end(). The device will keep using the
> huge TLB for a little longer but both CPU and device are looking at the
> same data.
> 
> Now if there is a racing thread that replaces one of the 512 zero pages
> after the split but before mmu_notifier_invalidate_range_end(), that
> code path would call mmu_notifier_invalidate_range() before changing
> the pte to point to something else, which should shoot down the device
> TLB (it would be a serious device bug if this did not work).

OK.. this seems reasonable, but I'd really like to see it tested.
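
(For reference, the ordering being relied on, as I read the quoted
__split_huge_zero_page_pmd() hunk and the explanation above, is roughly
the sketch below. It is only an illustration of that argument, not new
code, and everything shown runs between the _start()/_end() notifiers.)

	/* split path: no device shootdown yet, contents are unchanged */
	pmdp_huge_clear_flush(vma, haddr, pmd);
	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
	pmd_populate(mm, &_pmd, pgtable);
	/* ... the 512 read-only zero ptes are filled in here ... */

	/*
	 * Any racing path that later makes one of those ptes point at a new
	 * page must call mmu_notifier_invalidate_range() (or use a *_notify
	 * helper) under the page table lock, before
	 * mmu_notifier_invalidate_range_end(), so the stale huge device TLB
	 * entry is shot down before the new page becomes visible.
	 */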

> 
> 
> > 
> > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> > > 4K pages is replaced by something new then a device TLB shootdown will
> > > happen before the new page is set.
> > > 
> > > The only issue I can think of is if the IOMMU TLB (if there is one) or
> > > the device TLB (you do expect that there is one) does not invalidate a
> > > TLB entry when the TLB shootdown is smaller than the TLB entry. That
> > > would be idiotic, but yes, I know, hardware bugs do happen.
> > > 
> > > 
> > > > 
> > > > > 
> > > > >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > > > >     pmd_populate(mm, &_pmd, pgtable);
> > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > index 1768efa4c501..63a63f1b536c 100644
> > > > > --- a/mm/hugetlb.c
> > > > > +++ b/mm/hugetlb.c
> > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> > > > >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> > > > >             } else {
> > > > >                     if (cow) {
> > > > > +                           /*
> > > > > +                            * No need to notify as we are downgrading page
> > > > > +                            * table protection not changing it to point
> > > > > +                            * to a new page.
> > > > > +                            *
> > > > > +                            * See Documentation/vm/mmu_notifier.txt
> > > > > +                            */
> > > > >                             huge_ptep_set_wrprotect(src, addr, src_pte);
> > > > 
> > > > OK.. so we could get write faults on write accesses from the device.
> > > > 
> > > > > -                           mmu_notifier_invalidate_range(src, mmun_start,
> > > > > -                                                              mmun_end);
> > > > >                     }
> > > > >                     entry = huge_ptep_get(src_pte);
> > > > >                     ptepage = pte_page(entry);
> > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > > > >      * and that page table be reused and filled with junk.
> > > > >      */
> > > > >     flush_hugetlb_tlb_range(vma, start, end);
> > > > > -   mmu_notifier_invalidate_range(mm, start, end);
> > > > > +   /*
> > > > > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
> > > > > +    * page table protection not changing it to point to a new page.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > >     i_mmap_unlock_write(vma->vm_file->f_mapping);
> > > > >     mmu_notifier_invalidate_range_end(mm, start, end);
> > > > > 
> > > > > diff --git a/mm/ksm.c b/mm/ksm.c
> > > > > index 6cb60f46cce5..be8f4576f842 100644
> > > > > --- a/mm/ksm.c
> > > > > +++ b/mm/ksm.c
> > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> > > > >              * So we clear the pte and flush the tlb before the check
> > > > >              * this assure us that no O_DIRECT can happen after the check
> > > > >              * or in the middle of the check.
> > > > > +            *
> > > > > +            * No need to notify as we are downgrading page table to read
> > > > > +            * only not changing it to point to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > >              */
> > > > > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> > > > > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> > > > >             /*
> > > > >              * Check that no O_DIRECT or similar I/O is in progress on the
> > > > >              * page
> > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> > > > >     }
> > > > > 
> > > > >     flush_cache_page(vma, addr, pte_pfn(*ptep));
> > > > > -   ptep_clear_flush_notify(vma, addr, ptep);
> > > > > +   /*
> > > > > +    * No need to notify as we are replacing a read only page with another
> > > > > +    * read only page with the same content.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > > +   ptep_clear_flush(vma, addr, ptep);
> > > > >     set_pte_at_notify(mm, addr, ptep, newpte);
> > > > > 
> > > > >     page_remove_rmap(page, false);
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > index 061826278520..6b5a0f219ac0 100644
> > > > > --- a/mm/rmap.c
> > > > > +++ b/mm/rmap.c
> > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > > > >  #endif
> > > > >             }
> > > > > 
> > > > > -           if (ret) {
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > +            * downgrading page table protection not changing it to point
> > > > > +            * to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > > +           if (ret)
> > > > >                     (*cleaned)++;
> > > > > -           }
> > > > >     }
> > > > > 
> > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> > > > > +                   /*
> > > > > +                    * No need to invalidate here, it will synchronize
> > > > > +                    * against the special swap migration pte.
> > > > > +                    */
> > > > >                     goto discard;
> > > > >             }
> > > > > 
> > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                      * will take care of the rest.
> > > > >                      */
> > > > >                     dec_mm_counter(mm, mm_counter(page));
> > > > > +                   /* We have to invalidate as we cleared the pte */
> > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                 address + PAGE_SIZE);
> > > > >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
> > > > >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> > > > >                     swp_entry_t entry;
> > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > +                   /*
> > > > > +                    * No need to invalidate here, it will synchronize
> > > > > +                    * against the special swap migration pte.
> > > > > +                    */
> > > > >             } else if (PageAnon(page)) {
> > > > >                     swp_entry_t entry = { .val = page_private(subpage) };
> > > > >                     pte_t swp_pte;
> > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                             WARN_ON_ONCE(1);
> > > > >                             ret = false;
> > > > >                             /* We have to invalidate as we cleared the pte */
> > > > > +                           mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                   address + PAGE_SIZE);
> > > > >                             page_vma_mapped_walk_done(&pvmw);
> > > > >                             break;
> > > > >                     }
> > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     /* MADV_FREE page check */
> > > > >                     if (!PageSwapBacked(page)) {
> > > > >                             if (!PageDirty(page)) {
> > > > > +                                   /* Invalidate as we cleared the pte */
> > > > > +                                   mmu_notifier_invalidate_range(mm,
> > > > > +                                           address, address + PAGE_SIZE);
> > > > >                                     dec_mm_counter(mm, MM_ANONPAGES);
> > > > >                                     goto discard;
> > > > >                             }
> > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > -           } else
> > > > > +                   /* Invalidate as we cleared the pte */
> > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                 address + PAGE_SIZE);
> > > > > +           } else {
> > > > > +                   /*
> > > > > +                    * We should not need to notify here as we reach this
> > > > > +                    * case only from freeze_page(), itself only called from
> > > > > +                    * split_huge_page_to_list(), so everything below must
> > > > > +                    * be true:
> > > > > +                    *   - page is not anonymous
> > > > > +                    *   - page is locked
> > > > > +                    *
> > > > > +                    * So as it is a locked file-backed page it can not
> > > > > +                    * be removed from the page cache and replaced by a
> > > > > +                    * new page before mmu_notifier_invalidate_range_end
> > > > > +                    * so no concurrent thread can update its page table
> > > > > +                    * to point at a new page while a device is still
> > > > > +                    * using this page.
> > > > > +                    *
> > > > > +                    * See Documentation/vm/mmu_notifier.txt
> > > > > +                    */
> > > > >                     dec_mm_counter(mm, mm_counter_file(page));
> > > > > +           }
> > > > >  discard:
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() as it has
> > > > > +            * been done above for all cases requiring it to happen under
> > > > > +            * the page table lock before mmu_notifier_invalidate_range_end()
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > >             page_remove_rmap(subpage, PageHuge(page));
> > > > >             put_page(page);
> > > > > -           mmu_notifier_invalidate_range(mm, address,
> > > > > -                                         address + PAGE_SIZE);
> > > > >     }
> > > > > 
> > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > 
> > > > Looking at the patchset, I understand the efficiency, but I am concerned
> > > > with correctness.
> > > 
> > > I am fine with holding this off from reaching Linus, but the only way to
> > > flush these issues out, if there are any, is to have this patch in
> > > linux-next or somewhere where it gets a chance of being tested.
> > > 
> > 
> > Yep, I would like to see some additional testing around npu and get Alistair
> > Popple to comment as well
> 
> I think this patch is fine. The only race window that it might make
> bigger should have no bad consequences.
> 
> > 
> > > Note that the second patch is always safe. I agree that this one might
> > > not be if a hardware implementation is idiotic (well, that would be my
> > > opinion, and any opinion/point of view can be challenged :))
> > 
> > 
> > You mean the only_end variant that avoids the shootdown after pmd/pte
> > changes, i.e. avoiding the _start/_end and having just the only_end
> > variant? That seemed reasonable to me, but I've not tested it or
> > evaluated it in depth.
> 
> Yes, patch 2/2 in this series is definitely fine. It invalidates the
> device TLB right after clearing the pte entry and avoids a later
> unnecessary invalidation of the same TLB.
> 
> Jérôme

Balbir Singh.

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-21  5:54             ` Balbir Singh
  0 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2017-10-21  5:54 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-next

On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote:
> On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> > > > On Mon, 16 Oct 2017 23:10:02 -0400
> > > > jglisse@redhat.com wrote:
> > > > 
> > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > 
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > +            * downgrading page table protection not changing it to point
> > > > > +            * to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > >             if (pmdp) {
> > > > >  #ifdef CONFIG_FS_DAX_PMD
> > > > >                     pmd_t pmd;
> > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > >                     pmd = pmd_wrprotect(pmd);
> > > > >                     pmd = pmd_mkclean(pmd);
> > > > >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > > 
> > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> > > 
> > > I am assuming hardware does the sane thing of setting the dirty bit
> > > only when walking the CPU page table on a device write fault, ie once
> > > the device gets a write TLB entry the dirty bit is set by the IOMMU
> > > while walking the page table before returning the lookup result to the
> > > device, and it won't be set again later (ie propagated back later).
> > > 
> > 
> > The other possibility is that the hardware thinks the page is writable
> > and already marked dirty. It allows writes and does not set the dirty
> > bit?
> 
> I thought about this some more and the patch can not regress anything
> that is not broken today. So if we assume that the device can propagate
> the dirty bit because it can cache the write protection, then all current
> code is broken for two reasons:
> 
> The first one is that the current code clears the pte entry, builds a new
> pte value with write protection and updates the pte entry with that new
> value. So any PASID/ATS platform that allows the device to cache the
> write bit and set the dirty bit anytime after that can race during that
> window and you would lose the device's dirty bit. That is not that bad as
> you are going to propagate the dirty bit to the struct page.
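
(To make the window concrete: the pte side of the quoted
dax_mapping_entry_mkclean() hunk does, in order, roughly the sequence
below. The sketch just restates the quoted hunk and the race described
above; nothing beyond that is implied.)

	pte = ptep_clear_flush(vma, address, ptep);	/* entry cleared: window opens */
	pte = pte_wrprotect(pte);
	pte = pte_mkclean(pte);
	/*
	 * A device that cached a writable entry and sets the dirty bit during
	 * this window would have that dirty bit overwritten by the store below.
	 */
	set_pte_at(vma->vm_mm, address, ptep, pte);	/* clean, read-only entry installed */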

But they stay consistent with the notifiers, so from the OS perspective
it notifies of any PTE changes as they happen. When the ATS platform sees
an invalidation, it invalidates its PTEs as well.

I was speaking of the case where the ATS platform could assume it has
write access and has not seen any invalidation: the OS could return to
user space or to the caller with the write bit clear, but the ATS
platform could still do a write since it has not seen the invalidation.

> 
> The second one is if the dirty bit is propagated back to the new write
> protected pte. From a quick look at the code it seems that when we zap a
> pte or mkclean we don't check that the pte has write permission but only
> care about the dirty bit. So it should not have any bad consequence.
> 
> After this patch only the second window is bigger and thus more likely
> to happen. But nothing sinister should happen from that.
> 
> 
> > 
> > > I should probably have spelled that out, and maybe some of the ATS/PASID
> > > implementers did not do that.
> > > 
> > > > 
> > > > >  unlock_pmd:
> > > > >                     spin_unlock(ptl);
> > > > >  #endif
> > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > >                     pte = pte_wrprotect(pte);
> > > > >                     pte = pte_mkclean(pte);
> > > > >                     set_pte_at(vma->vm_mm, address, ptep, pte);
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > > 
> > > > Ditto
> > > > 
> > > > >  unlock_pte:
> > > > >                     pte_unmap_unlock(ptep, ptl);
> > > > >             }
> > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > > > > index 6866e8126982..49c925c96b8a 100644
> > > > > --- a/include/linux/mmu_notifier.h
> > > > > +++ b/include/linux/mmu_notifier.h
> > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> > > > >      * shared page-tables, it not necessary to implement the
> > > > >      * invalidate_range_start()/end() notifiers, as
> > > > >      * invalidate_range() alread catches the points in time when an
> > > > > -    * external TLB range needs to be flushed.
> > > > > +    * external TLB range needs to be flushed. For more in depth
> > > > > +    * discussion on this see Documentation/vm/mmu_notifier.txt
> > > > >      *
> > > > >      * The invalidate_range() function is called under the ptl
> > > > >      * spin-lock and not allowed to sleep.
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index c037d3d34950..ff5bc647b51d 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > > > >             goto out_free_pages;
> > > > >     VM_BUG_ON_PAGE(!PageHead(page), page);
> > > > > 
> > > > > +   /*
> > > > > +    * Leave pmd empty until pte is filled. Note we must notify here as a
> > > > > +    * concurrent CPU thread might write to the new page before the call to
> > > > > +    * mmu_notifier_invalidate_range_end() happens, which can lead to a
> > > > > +    * device seeing memory writes in a different order than the CPU.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > > > > -   /* leave pmd empty until pte is filled */
> > > > > 
> > > > >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > > > >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > > > >     pmd_t _pmd;
> > > > >     int i;
> > > > > 
> > > > > -   /* leave pmd empty until pte is filled */
> > > > > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > > > > +   /*
> > > > > +    * Leave pmd empty until pte is filled. Note that it is fine to delay
> > > > > +    * the notification until mmu_notifier_invalidate_range_end() as we are
> > > > > +    * replacing a zero pmd write protected page with a zero pte write
> > > > > +    * protected page.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> > > > 
> > > > Shouldn't the secondary TLB know if the page size changed?
> > > 
> > > It should not matter, we are talking virtual to physical on behalf
> > > of a device against a process address space. So the hardware should
> > > not care about the page size.
> > > 
> > 
> > Does that not indicate how much the device can access? Could it try
> > to access more than what is mapped?
> 
> Assuming the device has a huge TLB and 2MB huge pages with 4K small
> pages: you are going from one TLB entry covering a 2MB zero page to 512
> TLB entries each covering 4K. Both cases are read only and both cases
> point to the same data (ie zero).
> 
> It is fine to delay the TLB invalidate on the device to the call of
> mmu_notifier_invalidate_range_end(). The device will keep using the
> huge TLB for a little longer but both CPU and device are looking at the
> same data.
> 
> Now if there is a racing thread that replaces one of the 512 zero pages
> after the split but before mmu_notifier_invalidate_range_end(), that
> code path would call mmu_notifier_invalidate_range() before changing
> the pte to point to something else, which should shoot down the device
> TLB (it would be a serious device bug if this did not work).

OK.. this seems reasonable, but I'd really like to see it tested.
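
(For reference, the ordering being relied on, as I read the quoted
__split_huge_zero_page_pmd() hunk and the explanation above, is roughly
the sketch below. It is only an illustration of that argument, not new
code, and everything shown runs between the _start()/_end() notifiers.)

	/* split path: no device shootdown yet, contents are unchanged */
	pmdp_huge_clear_flush(vma, haddr, pmd);
	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
	pmd_populate(mm, &_pmd, pgtable);
	/* ... the 512 read-only zero ptes are filled in here ... */

	/*
	 * Any racing path that later makes one of those ptes point at a new
	 * page must call mmu_notifier_invalidate_range() (or use a *_notify
	 * helper) under the page table lock, before
	 * mmu_notifier_invalidate_range_end(), so the stale huge device TLB
	 * entry is shot down before the new page becomes visible.
	 */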

> 
> 
> > 
> > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> > > 4K pages is replaced by something new then a device TLB shootdown will
> > > happen before the new page is set.
> > > 
> > > The only issue I can think of is if the IOMMU TLB (if there is one) or
> > > the device TLB (you do expect that there is one) does not invalidate a
> > > TLB entry when the TLB shootdown is smaller than the TLB entry. That
> > > would be idiotic, but yes, I know, hardware bugs do happen.
> > > 
> > > 
> > > > 
> > > > > 
> > > > >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > > > >     pmd_populate(mm, &_pmd, pgtable);
> > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > index 1768efa4c501..63a63f1b536c 100644
> > > > > --- a/mm/hugetlb.c
> > > > > +++ b/mm/hugetlb.c
> > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> > > > >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> > > > >             } else {
> > > > >                     if (cow) {
> > > > > +                           /*
> > > > > +                            * No need to notify as we are downgrading page
> > > > > +                            * table protection not changing it to point
> > > > > +                            * to a new page.
> > > > > +                            *
> > > > > +                            * See Documentation/vm/mmu_notifier.txt
> > > > > +                            */
> > > > >                             huge_ptep_set_wrprotect(src, addr, src_pte);
> > > > 
> > > > OK.. so we could get write faults on write accesses from the device.
> > > > 
> > > > > -                           mmu_notifier_invalidate_range(src, mmun_start,
> > > > > -                                                              mmun_end);
> > > > >                     }
> > > > >                     entry = huge_ptep_get(src_pte);
> > > > >                     ptepage = pte_page(entry);
> > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > > > >      * and that page table be reused and filled with junk.
> > > > >      */
> > > > >     flush_hugetlb_tlb_range(vma, start, end);
> > > > > -   mmu_notifier_invalidate_range(mm, start, end);
> > > > > +   /*
> > > > > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
> > > > > +    * page table protection not changing it to point to a new page.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > >     i_mmap_unlock_write(vma->vm_file->f_mapping);
> > > > >     mmu_notifier_invalidate_range_end(mm, start, end);
> > > > > 
> > > > > diff --git a/mm/ksm.c b/mm/ksm.c
> > > > > index 6cb60f46cce5..be8f4576f842 100644
> > > > > --- a/mm/ksm.c
> > > > > +++ b/mm/ksm.c
> > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> > > > >              * So we clear the pte and flush the tlb before the check
> > > > >              * this assure us that no O_DIRECT can happen after the check
> > > > >              * or in the middle of the check.
> > > > > +            *
> > > > > +            * No need to notify as we are downgrading page table to read
> > > > > +            * only not changing it to point to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > >              */
> > > > > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> > > > > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> > > > >             /*
> > > > >              * Check that no O_DIRECT or similar I/O is in progress on the
> > > > >              * page
> > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> > > > >     }
> > > > > 
> > > > >     flush_cache_page(vma, addr, pte_pfn(*ptep));
> > > > > -   ptep_clear_flush_notify(vma, addr, ptep);
> > > > > +   /*
> > > > > +    * No need to notify as we are replacing a read only page with another
> > > > > +    * read only page with the same content.
> > > > > +    *
> > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > +    */
> > > > > +   ptep_clear_flush(vma, addr, ptep);
> > > > >     set_pte_at_notify(mm, addr, ptep, newpte);
> > > > > 
> > > > >     page_remove_rmap(page, false);
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > index 061826278520..6b5a0f219ac0 100644
> > > > > --- a/mm/rmap.c
> > > > > +++ b/mm/rmap.c
> > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > > > >  #endif
> > > > >             }
> > > > > 
> > > > > -           if (ret) {
> > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > +            * downgrading page table protection not changing it to point
> > > > > +            * to a new page.
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > > +           if (ret)
> > > > >                     (*cleaned)++;
> > > > > -           }
> > > > >     }
> > > > > 
> > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> > > > > +                   /*
> > > > > +                    * No need to invalidate here, it will synchronize
> > > > > +                    * against the special swap migration pte.
> > > > > +                    */
> > > > >                     goto discard;
> > > > >             }
> > > > > 
> > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                      * will take care of the rest.
> > > > >                      */
> > > > >                     dec_mm_counter(mm, mm_counter(page));
> > > > > +                   /* We have to invalidate as we cleared the pte */
> > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                 address + PAGE_SIZE);
> > > > >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
> > > > >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> > > > >                     swp_entry_t entry;
> > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > +                   /*
> > > > > +                    * No need to invalidate here, it will synchronize
> > > > > +                    * against the special swap migration pte.
> > > > > +                    */
> > > > >             } else if (PageAnon(page)) {
> > > > >                     swp_entry_t entry = { .val = page_private(subpage) };
> > > > >                     pte_t swp_pte;
> > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                             WARN_ON_ONCE(1);
> > > > >                             ret = false;
> > > > >                             /* We have to invalidate as we cleared the pte */
> > > > > +                           mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                   address + PAGE_SIZE);
> > > > >                             page_vma_mapped_walk_done(&pvmw);
> > > > >                             break;
> > > > >                     }
> > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     /* MADV_FREE page check */
> > > > >                     if (!PageSwapBacked(page)) {
> > > > >                             if (!PageDirty(page)) {
> > > > > +                                   /* Invalidate as we cleared the pte */
> > > > > +                                   mmu_notifier_invalidate_range(mm,
> > > > > +                                           address, address + PAGE_SIZE);
> > > > >                                     dec_mm_counter(mm, MM_ANONPAGES);
> > > > >                                     goto discard;
> > > > >                             }
> > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >                     if (pte_soft_dirty(pteval))
> > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > -           } else
> > > > > +                   /* Invalidate as we cleared the pte */
> > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > +                                                 address + PAGE_SIZE);
> > > > > +           } else {
> > > > > +                   /*
> > > > > +                    * We should not need to notify here as we reach this
> > > > > +                    * case only from freeze_page(), itself only called from
> > > > > +                    * split_huge_page_to_list(), so everything below must
> > > > > +                    * be true:
> > > > > +                    *   - page is not anonymous
> > > > > +                    *   - page is locked
> > > > > +                    *
> > > > > +                    * As it is a locked file backed page it cannot be
> > > > > +                    * removed from the page cache and replaced by a new
> > > > > +                    * page before mmu_notifier_invalidate_range_end(), so
> > > > > +                    * no concurrent thread might update its page table to
> > > > > +                    * point at a new page while a device is still using
> > > > > +                    * this page.
> > > > > +                    *
> > > > > +                    * See Documentation/vm/mmu_notifier.txt
> > > > > +                    */
> > > > >                     dec_mm_counter(mm, mm_counter_file(page));
> > > > > +           }
> > > > >  discard:
> > > > > +           /*
> > > > > +            * No need to call mmu_notifier_invalidate_range() as it has
> > > > > +            * been done above for all cases requiring it to happen under
> > > > > +            * the page table lock before mmu_notifier_invalidate_range_end()
> > > > > +            *
> > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > +            */
> > > > >             page_remove_rmap(subpage, PageHuge(page));
> > > > >             put_page(page);
> > > > > -           mmu_notifier_invalidate_range(mm, address,
> > > > > -                                         address + PAGE_SIZE);
> > > > >     }
> > > > > 
> > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > 
> > > > Looking at the patchset, I understand the efficiency, but I am concerned
> > > > with correctness.
> > > 
> > > I am fine with holding this off from reaching Linus, but the only way to
> > > flush these issues out, if any, is to have this patch in linux-next or
> > > somewhere where it gets a chance of being tested.
> > > 
> > 
> > Yep, I would like to see some additional testing around npu and get Alistair
> > Popple to comment as well
> 
> I think this patch is fine. The only race window that it might make bigger
> should have no bad consequences.
> 
> > 
> > > Note that the second patch is always safe. I agree that this one might
> > > not be if the hardware implementation is idiotic (well, that would be my
> > > opinion, and any opinion/point of view can be challenged :))
> > 
> > 
> > You mean the only_end variant that avoids the shootdown after pmd/pte
> > changes, skipping the _start/_end pair and using just the only_end call?
> > That seemed reasonable to me, but I've not tested it or evaluated it in depth
> 
> Yes, patch 2/2 in this series is definitely fine. It invalidates the device
> TLB right after clearing the pte entry and avoids a later unnecessary
> invalidation of the same TLB.
> 
> Jérôme
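
For reference, the pattern that patch 2/2 enables looks roughly like the
sketch below. This is hand-written for illustration only and is not lifted
from the series; the wrapper remap_one_pte() and its arguments are invented
for the example, while the mmu_notifier_*() helpers are the ones discussed
above.

	/*
	 * Sketch: point a pte at a new page, shooting down the device TLB
	 * once, under the page table lock, then skip the redundant flush
	 * that mmu_notifier_invalidate_range_end() would otherwise issue.
	 */
	static void remap_one_pte(struct vm_area_struct *vma, unsigned long addr,
				  pte_t *ptep, pte_t newpte, spinlock_t *ptl)
	{
		struct mm_struct *mm = vma->vm_mm;
		unsigned long end = addr + PAGE_SIZE;

		mmu_notifier_invalidate_range_start(mm, addr, end);

		spin_lock(ptl);
		/* clears the pte, flushes the CPU TLB and calls
		 * mmu_notifier_invalidate_range() for the device TLB */
		ptep_clear_flush_notify(vma, addr, ptep);
		set_pte_at(mm, addr, ptep, newpte);
		spin_unlock(ptl);

		/* only_end() skips the invalidate_range() callback: the device
		 * TLB was already shot down under the ptl above */
		mmu_notifier_invalidate_range_only_end(mm, addr, end);
	}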

Balbir Singh.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
  2017-10-21  5:54             ` Balbir Singh
  (?)
  (?)
@ 2017-10-21 15:47               ` Jerome Glisse
  -1 siblings, 0 replies; 40+ messages in thread
From: Jerome Glisse @ 2017-10-21 15:47 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-next

On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote:
> On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote:
> > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> > > > > On Mon, 16 Oct 2017 23:10:02 -0400
> > > > > jglisse@redhat.com wrote:
> > > > > 
> > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > > 
> > > > > > +           /*
> > > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > > +            * downgrading page table protection not changing it to point
> > > > > > +            * to a new page.
> > > > > > +            *
> > > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > > +            */
> > > > > >             if (pmdp) {
> > > > > >  #ifdef CONFIG_FS_DAX_PMD
> > > > > >                     pmd_t pmd;
> > > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > > >                     pmd = pmd_wrprotect(pmd);
> > > > > >                     pmd = pmd_mkclean(pmd);
> > > > > >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > > > 
> > > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> > > > 
> > > > I am assuming the hardware does the sane thing of setting the dirty bit
> > > > only when walking the CPU page table when the device does a write fault,
> > > > ie once the device gets a write TLB entry the dirty bit is set by the
> > > > IOMMU when walking the page table before returning the lookup result to
> > > > the device, and that it won't be set again later (ie propagated back
> > > > later).
> > > > 
> > > 
> > > The other possibility is that the hardware thinks the page is writable
> > > and already marked dirty. It allows writes and does not set the dirty bit?
> > 
> > I thought about this some more and the patch cannot regress anything
> > that is not broken today. So if we assume that the device can propagate
> > the dirty bit because it can cache the write protection, then all current
> > code is broken for two reasons:
> > 
> > First one is that the current code clears the pte entry, builds a new pte
> > value with write protection and updates the pte entry with the new value.
> > So any PASID/ATS platform that allows the device to cache the write bit
> > and set the dirty bit anytime after that can race during that window and
> > you would lose the dirty bit of the device. That is not that bad as you
> > are gonna propagate the dirty bit to the struct page.
> 
> But they stay consistent with the notifiers, so from the OS perspective
> it notifies of any PTE changes as they happen. When the ATS platform sees
> an invalidation, it invalidates its PTEs as well.
> 
> I was speaking of the case where the ATS platform could assume it has
> write access and has not seen any invalidation: the OS could return
> back to user space or the caller with the write bit clear, but the ATS
> platform could still do a write since it has not seen the invalidation.

I understood what you said and what is above applies. I am removing only
one of the invalidations, not both. So with that patch the invalidation is
delayed until after the page table lock is dropped, but before
dax/page_mkclean returns. Hence any further activity will be read only on
any device too once we exit those functions.

The only difference is the window during which the device can report a
dirty pte. Before that patch the 2 "~bogus~" windows were small:
  First window: between pmd/pte_get_clear_flush and set_pte/pmd
  Second window: between set_pte/pmd and mmu_notifier_invalidate_range

The first window stays the same; the second window is bigger, potentially a
lot bigger if the thread is preempted before mmu_notifier_invalidate_range_end().

But that is fine, as in that case the page is reported as dirty and thus
we are not missing anything, and the kernel code does not care about
seeing a read only pte marked as dirty.
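
To make the two windows concrete, here is a stripped down sketch of the
write protect sequence being discussed (page_mkclean_one()/dax style). It is
drawn by hand for illustration, not copied from the patch, and the locals
are invented:

	spin_lock(ptl);
	entry = ptep_clear_flush(vma, addr, ptep);	/* CPU TLB flushed */
	/* window 1: pte is empty; unchanged by this patch */
	entry = pte_wrprotect(entry);
	entry = pte_mkclean(entry);
	set_pte_at(mm, addr, ptep, entry);
	/* window 2: CPU sees a clean read only pte, but a device TLB may
	 * still hold a writable entry; it used to end at the (now removed)
	 * mmu_notifier_invalidate_range() call here, and after the patch it
	 * extends to mmu_notifier_invalidate_range_end() below */
	spin_unlock(ptl);
	...
	mmu_notifier_invalidate_range_end(mm, start, end);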

> 
> > 
> > Second one is if the dirty bit is propagated back to the new write
> > protected pte. From a quick look at the code it seems that when we zap a
> > pte or mkclean we don't check that the pte has write permission but only
> > care about the dirty bit. So it should not have any bad consequence.
> > 
> > After this patch only the second window is bigger and thus more likely
> > to be hit. But nothing sinister should happen from that.
> > 
> > 
> > > 
> > > > I should probably have spelled that out, and maybe some of the ATS/PASID
> > > > implementers did not do that.
> > > > 
> > > > > 
> > > > > >  unlock_pmd:
> > > > > >                     spin_unlock(ptl);
> > > > > >  #endif
> > > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > > >                     pte = pte_wrprotect(pte);
> > > > > >                     pte = pte_mkclean(pte);
> > > > > >                     set_pte_at(vma->vm_mm, address, ptep, pte);
> > > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > > > 
> > > > > Ditto
> > > > > 
> > > > > >  unlock_pte:
> > > > > >                     pte_unmap_unlock(ptep, ptl);
> > > > > >             }
> > > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > > > > > index 6866e8126982..49c925c96b8a 100644
> > > > > > --- a/include/linux/mmu_notifier.h
> > > > > > +++ b/include/linux/mmu_notifier.h
> > > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> > > > > >      * shared page-tables, it not necessary to implement the
> > > > > >      * invalidate_range_start()/end() notifiers, as
> > > > > >      * invalidate_range() alread catches the points in time when an
> > > > > > -    * external TLB range needs to be flushed.
> > > > > > +    * external TLB range needs to be flushed. For more in depth
> > > > > > +    * discussion on this see Documentation/vm/mmu_notifier.txt
> > > > > >      *
> > > > > >      * The invalidate_range() function is called under the ptl
> > > > > >      * spin-lock and not allowed to sleep.
> > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > index c037d3d34950..ff5bc647b51d 100644
> > > > > > --- a/mm/huge_memory.c
> > > > > > +++ b/mm/huge_memory.c
> > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > > > > >             goto out_free_pages;
> > > > > >     VM_BUG_ON_PAGE(!PageHead(page), page);
> > > > > > 
> > > > > > +   /*
> > > > > > +    * Leave pmd empty until pte is filled. Note we must notify here as
> > > > > > +    * a concurrent CPU thread might write to the new page before the call
> > > > > > +    * to mmu_notifier_invalidate_range_end() happens, which can lead to a
> > > > > > +    * device seeing memory writes in a different order than the CPU.
> > > > > > +    *
> > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > +    */
> > > > > >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > > > > > -   /* leave pmd empty until pte is filled */
> > > > > > 
> > > > > >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > > > > >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > > > > >     pmd_t _pmd;
> > > > > >     int i;
> > > > > > 
> > > > > > -   /* leave pmd empty until pte is filled */
> > > > > > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > > > > > +   /*
> > > > > > +    * Leave pmd empty until pte is filled. Note that it is fine to delay
> > > > > > +    * the notification until mmu_notifier_invalidate_range_end() as we are
> > > > > > +    * replacing a zero pmd write protected page with a zero pte write
> > > > > > +    * protected page.
> > > > > > +    *
> > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > +    */
> > > > > > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> > > > > 
> > > > > Shouldn't the secondary TLB know if the page size changed?
> > > > 
> > > > It should not matter, we are talking virtual to physical on behalf
> > > > of a device against a process address space. So the hardware should
> > > > not care about the page size.
> > > > 
> > > 
> > > Does that not indicate how much the device can access? Could it try
> > > to access more than what is mapped?
> > 
> > Assuming the device has a huge TLB and 2MB huge pages with 4K small pages,
> > you are going from 1 TLB entry covering a 2MB zero page to 512 TLB entries
> > each covering 4K. Both cases are read only and both cases are pointing to
> > the same data (ie zero).
> > 
> > It is fine to delay the device TLB invalidate to the call of
> > mmu_notifier_invalidate_range_end(). The device will keep using the huge
> > TLB entry for a little longer but both CPU and device are looking at the
> > same data.
> > 
> > Now if there is a racing thread that replaces one of the 512 zero pages
> > after the split but before mmu_notifier_invalidate_range_end(), that code
> > path would call mmu_notifier_invalidate_range() before changing the pte to
> > point to something else, which should shoot down the device TLB (it would
> > be a serious device bug if this did not work).
> 
> OK.. This seems reasonable, but I'd really like to see if it can be
> tested

Well, it is hard to test; many factors. First, each device might react
differently. Devices that only store TLB entries at 4k granularity are fine.
Clever devices that can store TLB entries for 4k, 2M, ... might ignore an
invalidation that is smaller than their TLB entry, ie getting a 4K
invalidation would not invalidate a 2MB TLB entry in the device. I consider
this buggy. I will go look at the PCIE ATS specification one more time and
see if there is any wording related to that. I might bring up a question to
the PCIE standard body if not.

Second factor is that it is a race between the zero page split and a write
fault. I can probably do a crappy patch that msleeps if a split happens
against a given mm to increase the race window. But I would be testing
against one device (right now I can only access AMD IOMMUv2 devices with a
discrete ATS GPU)
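
Such a crappy test patch could look something like the sketch below. This is
purely hypothetical and only illustrates the idea: the race_test_mm selector
does not exist anywhere, and the delay has to run after the ptl is dropped
but before mmu_notifier_invalidate_range_end():

	/* Hypothetical debug-only hack to widen the split vs write fault
	 * race window for the mm under test. */
	static struct mm_struct *race_test_mm;	/* set via debugfs or similar */

	static void widen_split_race_window(struct mm_struct *mm)
	{
		if (READ_ONCE(race_test_mm) == mm)
			msleep(100);	/* sleepable context, no lock held */
	}

	/*
	 * Call site sketch, in __split_huge_pmd():
	 *
	 *	__split_huge_pmd_locked(vma, pmd, haddr, freeze);
	 *	spin_unlock(ptl);
	 *	widen_split_race_window(mm);		<-- inserted delay
	 *	mmu_notifier_invalidate_range_end(mm, haddr,
	 *					  haddr + HPAGE_PMD_SIZE);
	 */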


> 
> > 
> > 
> > > 
> > > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> > > > 4K pages is replaced by something new then a device TLB shootdown will
> > > > happen before the new page is set.
> > > > 
> > > > The only issue I can think of is if the IOMMU TLB (if there is one) or
> > > > the device TLB (you do expect that there is one) does not invalidate a
> > > > TLB entry if the TLB shootdown is smaller than the TLB entry. That
> > > > would be idiotic but yes, I know, hardware bugs happen.
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > > > > >     pmd_populate(mm, &_pmd, pgtable);
> > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > > index 1768efa4c501..63a63f1b536c 100644
> > > > > > --- a/mm/hugetlb.c
> > > > > > +++ b/mm/hugetlb.c
> > > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> > > > > >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> > > > > >             } else {
> > > > > >                     if (cow) {
> > > > > > +                           /*
> > > > > > +                            * No need to notify as we are downgrading page
> > > > > > +                            * table protection not changing it to point
> > > > > > +                            * to a new page.
> > > > > > +                            *
> > > > > > +                            * See Documentation/vm/mmu_notifier.txt
> > > > > > +                            */
> > > > > >                             huge_ptep_set_wrprotect(src, addr, src_pte);
> > > > > 
> > > > > OK.. so we could get write faults on write accesses from the device.
> > > > > 
> > > > > > -                           mmu_notifier_invalidate_range(src, mmun_start,
> > > > > > -                                                              mmun_end);
> > > > > >                     }
> > > > > >                     entry = huge_ptep_get(src_pte);
> > > > > >                     ptepage = pte_page(entry);
> > > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > > > > >      * and that page table be reused and filled with junk.
> > > > > >      */
> > > > > >     flush_hugetlb_tlb_range(vma, start, end);
> > > > > > -   mmu_notifier_invalidate_range(mm, start, end);
> > > > > > +   /*
> > > > > > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
> > > > > > +    * page table protection not changing it to point to a new page.
> > > > > > +    *
> > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > +    */
> > > > > >     i_mmap_unlock_write(vma->vm_file->f_mapping);
> > > > > >     mmu_notifier_invalidate_range_end(mm, start, end);
> > > > > > 
> > > > > > diff --git a/mm/ksm.c b/mm/ksm.c
> > > > > > index 6cb60f46cce5..be8f4576f842 100644
> > > > > > --- a/mm/ksm.c
> > > > > > +++ b/mm/ksm.c
> > > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> > > > > >              * So we clear the pte and flush the tlb before the check
> > > > > >              * this assure us that no O_DIRECT can happen after the check
> > > > > >              * or in the middle of the check.
> > > > > > +            *
> > > > > > +            * No need to notify as we are downgrading page table to read
> > > > > > +            * only not changing it to point to a new page.
> > > > > > +            *
> > > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > >              */
> > > > > > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> > > > > > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> > > > > >             /*
> > > > > >              * Check that no O_DIRECT or similar I/O is in progress on the
> > > > > >              * page
> > > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> > > > > >     }
> > > > > > 
> > > > > >     flush_cache_page(vma, addr, pte_pfn(*ptep));
> > > > > > -   ptep_clear_flush_notify(vma, addr, ptep);
> > > > > > +   /*
> > > > > > +    * No need to notify as we are replacing a read only page with another
> > > > > > +    * read only page with the same content.
> > > > > > +    *
> > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > +    */
> > > > > > +   ptep_clear_flush(vma, addr, ptep);
> > > > > >     set_pte_at_notify(mm, addr, ptep, newpte);
> > > > > > 
> > > > > >     page_remove_rmap(page, false);
> > > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > > index 061826278520..6b5a0f219ac0 100644
> > > > > > --- a/mm/rmap.c
> > > > > > +++ b/mm/rmap.c
> > > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > > > > >  #endif
> > > > > >             }
> > > > > > 
> > > > > > -           if (ret) {
> > > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > > > > > +           /*
> > > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > > +            * downgrading page table protection not changing it to point
> > > > > > +            * to a new page.
> > > > > > +            *
> > > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > > +            */
> > > > > > +           if (ret)
> > > > > >                     (*cleaned)++;
> > > > > > -           }
> > > > > >     }
> > > > > > 
> > > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                     if (pte_soft_dirty(pteval))
> > > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > > >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> > > > > > +                   /*
> > > > > > +                    * No need to invalidate here, it will synchronize
> > > > > > +                    * against the special swap migration pte.
> > > > > > +                    */
> > > > > >                     goto discard;
> > > > > >             }
> > > > > > 
> > > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                      * will take care of the rest.
> > > > > >                      */
> > > > > >                     dec_mm_counter(mm, mm_counter(page));
> > > > > > +                   /* We have to invalidate as we cleared the pte */
> > > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > > +                                                 address + PAGE_SIZE);
> > > > > >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
> > > > > >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> > > > > >                     swp_entry_t entry;
> > > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                     if (pte_soft_dirty(pteval))
> > > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > > +                   /*
> > > > > > +                    * No need to invalidate here, it will synchronize
> > > > > > +                    * against the special swap migration pte.
> > > > > > +                    */
> > > > > >             } else if (PageAnon(page)) {
> > > > > >                     swp_entry_t entry = { .val = page_private(subpage) };
> > > > > >                     pte_t swp_pte;
> > > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                             WARN_ON_ONCE(1);
> > > > > >                             ret = false;
> > > > > >                             /* We have to invalidate as we cleared the pte */
> > > > > > +                           mmu_notifier_invalidate_range(mm, address,
> > > > > > +                                                   address + PAGE_SIZE);
> > > > > >                             page_vma_mapped_walk_done(&pvmw);
> > > > > >                             break;
> > > > > >                     }
> > > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                     /* MADV_FREE page check */
> > > > > >                     if (!PageSwapBacked(page)) {
> > > > > >                             if (!PageDirty(page)) {
> > > > > > +                                   /* Invalidate as we cleared the pte */
> > > > > > +                                   mmu_notifier_invalidate_range(mm,
> > > > > > +                                           address, address + PAGE_SIZE);
> > > > > >                                     dec_mm_counter(mm, MM_ANONPAGES);
> > > > > >                                     goto discard;
> > > > > >                             }
> > > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                     if (pte_soft_dirty(pteval))
> > > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > > -           } else
> > > > > > +                   /* Invalidate as we cleared the pte */
> > > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > > +                                                 address + PAGE_SIZE);
> > > > > > +           } else {
> > > > > > +                   /*
> > > > > > +                    * We should not need to notify here as we reach this
> > > > > > +                    * case only from freeze_page(), itself only called from
> > > > > > +                    * split_huge_page_to_list(), so everything below must
> > > > > > +                    * be true:
> > > > > > +                    *   - page is not anonymous
> > > > > > +                    *   - page is locked
> > > > > > +                    *
> > > > > > +                    * As it is a locked file backed page it cannot be
> > > > > > +                    * removed from the page cache and replaced by a new
> > > > > > +                    * page before mmu_notifier_invalidate_range_end(), so
> > > > > > +                    * no concurrent thread might update its page table to
> > > > > > +                    * point at a new page while a device is still using
> > > > > > +                    * this page.
> > > > > > +                    *
> > > > > > +                    * See Documentation/vm/mmu_notifier.txt
> > > > > > +                    */
> > > > > >                     dec_mm_counter(mm, mm_counter_file(page));
> > > > > > +           }
> > > > > >  discard:
> > > > > > +           /*
> > > > > > +            * No need to call mmu_notifier_invalidate_range() as it has
> > > > > > +            * been done above for all cases requiring it to happen under
> > > > > > +            * the page table lock before mmu_notifier_invalidate_range_end()
> > > > > > +            *
> > > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > > +            */
> > > > > >             page_remove_rmap(subpage, PageHuge(page));
> > > > > >             put_page(page);
> > > > > > -           mmu_notifier_invalidate_range(mm, address,
> > > > > > -                                         address + PAGE_SIZE);
> > > > > >     }
> > > > > > 
> > > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > > 
> > > > > Looking at the patchset, I understand the efficiency, but I am concerned
> > > > > with correctness.
> > > > 
> > > > I am fine with holding this off from reaching Linus, but the only way to
> > > > flush these issues out, if any, is to have this patch in linux-next or
> > > > somewhere where it gets a chance of being tested.
> > > > 
> > > 
> > > Yep, I would like to see some additional testing around npu and get Alistair
> > > Popple to comment as well
> > 
> > I think this patch is fine. The only race window that it might make bigger
> > should have no bad consequences.
> > 
> > > 
> > > > Note that the second patch is always safe. I agree that this one might
> > > > not be if the hardware implementation is idiotic (well, that would be my
> > > > opinion, and any opinion/point of view can be challenged :))
> > > 
> > > 
> > > > You mean the only_end variant that avoids the shootdown after pmd/pte
> > > > changes, skipping the _start/_end pair and using just the only_end call?
> > > > That seemed reasonable to me, but I've not tested it or evaluated it in depth
> > 
> > Yes, patch 2/2 in this series is definitely fine. It invalidates the device
> > TLB right after clearing the pte entry and avoids a later unnecessary
> > invalidation of the same TLB.
> > 
> > Jérôme
> 
> Balbir Singh.
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-21 15:47               ` Jerome Glisse
  0 siblings, 0 replies; 40+ messages in thread
From: Jerome Glisse @ 2017-10-21 15:47 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-next

On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote:
> On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote:
> > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> > > > > On Mon, 16 Oct 2017 23:10:02 -0400
> > > > > jglisse@redhat.com wrote:
> > > > > 
> > > > > > From: Jerome Glisse <jglisse@redhat.com>
> > > > > > 
> > > > > > +           /*
> > > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > > +            * downgrading page table protection not changing it to point
> > > > > > +            * to a new page.
> > > > > > +            *
> > > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > > +            */
> > > > > >             if (pmdp) {
> > > > > >  #ifdef CONFIG_FS_DAX_PMD
> > > > > >                     pmd_t pmd;
> > > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > > >                     pmd = pmd_wrprotect(pmd);
> > > > > >                     pmd = pmd_mkclean(pmd);
> > > > > >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > > > 
> > > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> > > > 
> > > > I am assuming the hardware does the sane thing of setting the dirty bit
> > > > only when walking the CPU page table when the device does a write fault,
> > > > ie once the device gets a write TLB entry the dirty bit is set by the
> > > > IOMMU when walking the page table before returning the lookup result to
> > > > the device, and that it won't be set again later (ie propagated back
> > > > later).
> > > > 
> > > 
> > > The other possibility is that the hardware thinks the page is writable
> > > and already marked dirty. It allows writes and does not set the dirty bit?
> > 
> > I thought about this some more and the patch can not regress anything
> > that is not broken today. So if we assume that the device can propagate
> > the dirty bit because it can cache the write protection, then all current
> > code is broken for two reasons:
> > 
> > First one is that the current code clears the pte entry, builds a new pte
> > value with write protection and updates the pte entry with the new pte
> > value. So any PASID/ATS platform that allows the device to cache the write
> > bit and set the dirty bit anytime after that can race during that window
> > and you would lose the dirty bit of the device. That is not that bad as you
> > are going to propagate the dirty bit to the struct page.
> 
> But they stay consistent with the notifiers, so from the OS perspective
> it notifies of any PTE changes as they happen. When the ATS platform sees
> an invalidation, it invalidates its PTEs as well.
> 
> I was speaking of the case where the ATS platform could assume it has
> write access and has not seen any invalidation; the OS could return
> to user space or the caller with the write bit clear, but the ATS
> platform could still do a write since it has not seen the invalidation.

I understood what you said and what is above applies. I am removing only
one of the invalidations, not both. So with that patch the invalidation
is delayed until after the page table lock is dropped, but it still happens
before dax/page_mkclean returns. Hence any further activity will be read
only on any device too once we exit those functions.

The only difference is the window during which the device can report a
dirty pte. Before that patch the two "~bogus~" windows were small:
  First window: between pmd/pte_get_clear_flush and set_pte/pmd
  Second window: between set_pte/pmd and mmu_notifier_invalidate_range

The first window stays the same; the second window is bigger, potentially a
lot bigger if the thread is preempted before mmu_notifier_invalidate_range_end()

But that is fine: in that case the page is reported as dirty, so we are
not missing anything, and the kernel code does not care about seeing a
read only pte marked as dirty.
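
To make the two windows concrete, here is a rough sketch of the pte side of
dax_mapping_entry_mkclean() after this patch (simplified, error handling and
the pmd case elided; the comments mark the windows discussed above):

    pte = ptep_clear_flush(vma, address, ptep);
    /* first window: pte cleared, device may still hold a stale writable TLB */
    pte = pte_wrprotect(pte);
    pte = pte_mkclean(pte);
    set_pte_at(vma->vm_mm, address, ptep, pte);
    /* second window: read only pte in place, device TLB not yet shot down */
    pte_unmap_unlock(ptep, ptl);
    ...
    /* device TLB invalidated here, before dax_mapping_entry_mkclean() returns */
    mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);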

> 
> > 
> > Second one is if the dirty bit is propagated back to the new write
> > protected pte. From a quick look at the code it seems that when we zap a
> > pte or mkclean we don't check that the pte has write permission but only
> > care about the dirty bit. So it should not have any bad consequence.
> > 
> > After this patch only the second window is bigger and thus more likely
> > to happen. But nothing sinister should happen from that.
> > 
> > 
> > > 
> > > > I should probably have spelled that out, and maybe some of the ATS/PASID
> > > > implementers did not do that.
> > > > 
> > > > > 
> > > > > >  unlock_pmd:
> > > > > >                     spin_unlock(ptl);
> > > > > >  #endif
> > > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > > > > >                     pte = pte_wrprotect(pte);
> > > > > >                     pte = pte_mkclean(pte);
> > > > > >                     set_pte_at(vma->vm_mm, address, ptep, pte);
> > > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> > > > > 
> > > > > Ditto
> > > > > 
> > > > > >  unlock_pte:
> > > > > >                     pte_unmap_unlock(ptep, ptl);
> > > > > >             }
> > > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > > > > > index 6866e8126982..49c925c96b8a 100644
> > > > > > --- a/include/linux/mmu_notifier.h
> > > > > > +++ b/include/linux/mmu_notifier.h
> > > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> > > > > >      * shared page-tables, it not necessary to implement the
> > > > > >      * invalidate_range_start()/end() notifiers, as
> > > > > >      * invalidate_range() alread catches the points in time when an
> > > > > > -    * external TLB range needs to be flushed.
> > > > > > +    * external TLB range needs to be flushed. For more in depth
> > > > > > +    * discussion on this see Documentation/vm/mmu_notifier.txt
> > > > > >      *
> > > > > >      * The invalidate_range() function is called under the ptl
> > > > > >      * spin-lock and not allowed to sleep.
> > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > index c037d3d34950..ff5bc647b51d 100644
> > > > > > --- a/mm/huge_memory.c
> > > > > > +++ b/mm/huge_memory.c
> > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > > > > >             goto out_free_pages;
> > > > > >     VM_BUG_ON_PAGE(!PageHead(page), page);
> > > > > > 
> > > > > > +   /*
> > > > > > +    * Leave pmd empty until pte is filled note we must notify here as
> > > > > > +    * concurrent CPU thread might write to new page before the call to
> > > > > > +    * mmu_notifier_invalidate_range_end() happens which can lead to a
> > > > > > +    * device seeing memory write in different order than CPU.
> > > > > > +    *
> > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > +    */
> > > > > >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > > > > > -   /* leave pmd empty until pte is filled */
> > > > > > 
> > > > > >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > > > > >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > > > > >     pmd_t _pmd;
> > > > > >     int i;
> > > > > > 
> > > > > > -   /* leave pmd empty until pte is filled */
> > > > > > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > > > > > +   /*
> > > > > > +    * Leave pmd empty until pte is filled note that it is fine to delay
> > > > > > +    * notification until mmu_notifier_invalidate_range_end() as we are
> > > > > > +    * replacing a zero pmd write protected page with a zero pte write
> > > > > > +    * protected page.
> > > > > > +    *
> > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > +    */
> > > > > > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> > > > > 
> > > > > Shouldn't the secondary TLB know if the page size changed?
> > > > 
> > > > It should not matter; we are talking about virtual to physical translation
> > > > on behalf of a device against a process address space. So the hardware
> > > > should not care about the page size.
> > > > 
> > > 
> > > Does that not indicate how much the device can access? Could it try
> > > to access more than what is mapped?
> > 
> > Assume the device has a huge TLB and 2MB huge pages with 4K small pages.
> > You are going from one TLB entry covering a 2MB zero page to 512 TLB entries
> > each covering 4K. Both cases are read only and both cases are pointing
> > to the same data (ie zero).
> > 
> > It is fine to delay the TLB invalidate on the device to the call of
> > mmu_notifier_invalidate_range_end(). The device will keep using the
> > huge TLB entry for a little longer but both CPU and device are looking
> > at the same data.
> > 
> > Now if there is a racing thread that replaces one of the 512 zero pages
> > after the split but before mmu_notifier_invalidate_range_end(), that
> > code path would call mmu_notifier_invalidate_range() before changing
> > the pte to point to something else, which should shoot down the device
> > TLB (it would be a serious device bug if this did not work).
> 
> OK.. This seems reasonable, but I'd really like to see if it can be
> tested

Well, it is hard to test, for many reasons. First, each device might react
differently. Devices that only store TLB entries at 4k granularity are fine.
A clever device that can store TLB entries for 4k, 2M, ... could ignore an
invalidation that is smaller than its TLB entry, ie getting a 4K invalidation
would not invalidate a 2MB TLB entry in the device. I consider this buggy.
I will go look at the PCIE ATS specification one more time and see if there
is any wording related to that. I might bring a question to the PCIE standard
body if not.

The second factor is that it is a race between the zero page split and a
write fault. I can probably do a crappy patch that msleeps if a split happens
against a given mm, to increase the race window. But I would be testing
against only one device (right now I can only access AMD IOMMUv2 devices with
a discrete ATS GPU).
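
As a rough illustration of the kind of throwaway debug hack meant here (an
untested sketch, not a real patch; race_test_mm is a hypothetical global set
to the mm of the process under test), the delay could go right after the page
table lock is dropped in __split_huge_pmd(), before the range_end call:

    spin_unlock(ptl);
    /* debug only: widen the window between the pmd split and range_end() */
    if (mm == race_test_mm)         /* hypothetical: mm of the test process */
        msleep(100);                /* sleeping is fine here, the ptl is released */
    mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);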


> 
> > 
> > 
> > > 
> > > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> > > > 4K pages is replaced by something new then a device TLB shootdown will
> > > > happen before the new page is set.
> > > > 
> > > > The only issue I can think of is if the IOMMU TLB (if there is one) or
> > > > the device TLB (you do expect that there is one) does not invalidate a
> > > > TLB entry if the TLB shootdown is smaller than the TLB entry. That
> > > > would be idiotic, but yes, I know, hardware bugs happen.
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > > > > >     pmd_populate(mm, &_pmd, pgtable);
> > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > > index 1768efa4c501..63a63f1b536c 100644
> > > > > > --- a/mm/hugetlb.c
> > > > > > +++ b/mm/hugetlb.c
> > > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> > > > > >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> > > > > >             } else {
> > > > > >                     if (cow) {
> > > > > > +                           /*
> > > > > > +                            * No need to notify as we are downgrading page
> > > > > > +                            * table protection not changing it to point
> > > > > > +                            * to a new page.
> > > > > > +                            *
> > > > > > +                            * See Documentation/vm/mmu_notifier.txt
> > > > > > +                            */
> > > > > >                             huge_ptep_set_wrprotect(src, addr, src_pte);
> > > > > 
> > > > > OK.. so we could get write faults on write accesses from the device.
> > > > > 
> > > > > > -                           mmu_notifier_invalidate_range(src, mmun_start,
> > > > > > -                                                              mmun_end);
> > > > > >                     }
> > > > > >                     entry = huge_ptep_get(src_pte);
> > > > > >                     ptepage = pte_page(entry);
> > > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > > > > >      * and that page table be reused and filled with junk.
> > > > > >      */
> > > > > >     flush_hugetlb_tlb_range(vma, start, end);
> > > > > > -   mmu_notifier_invalidate_range(mm, start, end);
> > > > > > +   /*
> > > > > > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
> > > > > > +    * page table protection not changing it to point to a new page.
> > > > > > +    *
> > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > +    */
> > > > > >     i_mmap_unlock_write(vma->vm_file->f_mapping);
> > > > > >     mmu_notifier_invalidate_range_end(mm, start, end);
> > > > > > 
> > > > > > diff --git a/mm/ksm.c b/mm/ksm.c
> > > > > > index 6cb60f46cce5..be8f4576f842 100644
> > > > > > --- a/mm/ksm.c
> > > > > > +++ b/mm/ksm.c
> > > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> > > > > >              * So we clear the pte and flush the tlb before the check
> > > > > >              * this assure us that no O_DIRECT can happen after the check
> > > > > >              * or in the middle of the check.
> > > > > > +            *
> > > > > > +            * No need to notify as we are downgrading page table to read
> > > > > > +            * only not changing it to point to a new page.
> > > > > > +            *
> > > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > >              */
> > > > > > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> > > > > > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> > > > > >             /*
> > > > > >              * Check that no O_DIRECT or similar I/O is in progress on the
> > > > > >              * page
> > > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> > > > > >     }
> > > > > > 
> > > > > >     flush_cache_page(vma, addr, pte_pfn(*ptep));
> > > > > > -   ptep_clear_flush_notify(vma, addr, ptep);
> > > > > > +   /*
> > > > > > +    * No need to notify as we are replacing a read only page with another
> > > > > > +    * read only page with the same content.
> > > > > > +    *
> > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > +    */
> > > > > > +   ptep_clear_flush(vma, addr, ptep);
> > > > > >     set_pte_at_notify(mm, addr, ptep, newpte);
> > > > > > 
> > > > > >     page_remove_rmap(page, false);
> > > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > > index 061826278520..6b5a0f219ac0 100644
> > > > > > --- a/mm/rmap.c
> > > > > > +++ b/mm/rmap.c
> > > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > > > > >  #endif
> > > > > >             }
> > > > > > 
> > > > > > -           if (ret) {
> > > > > > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > > > > > +           /*
> > > > > > +            * No need to call mmu_notifier_invalidate_range() as we are
> > > > > > +            * downgrading page table protection not changing it to point
> > > > > > +            * to a new page.
> > > > > > +            *
> > > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > > +            */
> > > > > > +           if (ret)
> > > > > >                     (*cleaned)++;
> > > > > > -           }
> > > > > >     }
> > > > > > 
> > > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                     if (pte_soft_dirty(pteval))
> > > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > > >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> > > > > > +                   /*
> > > > > > +                    * No need to invalidate here it will synchronize on
> > > > > > +                    * against the special swap migration pte.
> > > > > > +                    */
> > > > > >                     goto discard;
> > > > > >             }
> > > > > > 
> > > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                      * will take care of the rest.
> > > > > >                      */
> > > > > >                     dec_mm_counter(mm, mm_counter(page));
> > > > > > +                   /* We have to invalidate as we cleared the pte */
> > > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > > +                                                 address + PAGE_SIZE);
> > > > > >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
> > > > > >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> > > > > >                     swp_entry_t entry;
> > > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                     if (pte_soft_dirty(pteval))
> > > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > > +                   /*
> > > > > > +                    * No need to invalidate here it will synchronize on
> > > > > > +                    * against the special swap migration pte.
> > > > > > +                    */
> > > > > >             } else if (PageAnon(page)) {
> > > > > >                     swp_entry_t entry = { .val = page_private(subpage) };
> > > > > >                     pte_t swp_pte;
> > > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                             WARN_ON_ONCE(1);
> > > > > >                             ret = false;
> > > > > >                             /* We have to invalidate as we cleared the pte */
> > > > > > +                           mmu_notifier_invalidate_range(mm, address,
> > > > > > +                                                   address + PAGE_SIZE);
> > > > > >                             page_vma_mapped_walk_done(&pvmw);
> > > > > >                             break;
> > > > > >                     }
> > > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                     /* MADV_FREE page check */
> > > > > >                     if (!PageSwapBacked(page)) {
> > > > > >                             if (!PageDirty(page)) {
> > > > > > +                                   /* Invalidate as we cleared the pte */
> > > > > > +                                   mmu_notifier_invalidate_range(mm,
> > > > > > +                                           address, address + PAGE_SIZE);
> > > > > >                                     dec_mm_counter(mm, MM_ANONPAGES);
> > > > > >                                     goto discard;
> > > > > >                             }
> > > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >                     if (pte_soft_dirty(pteval))
> > > > > >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > > > >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > > > > -           } else
> > > > > > +                   /* Invalidate as we cleared the pte */
> > > > > > +                   mmu_notifier_invalidate_range(mm, address,
> > > > > > +                                                 address + PAGE_SIZE);
> > > > > > +           } else {
> > > > > > +                   /*
> > > > > > +                    * We should not need to notify here as we reach this
> > > > > > +                    * case only from freeze_page() itself only call from
> > > > > > +                    * split_huge_page_to_list() so everything below must
> > > > > > +                    * be true:
> > > > > > +                    *   - page is not anonymous
> > > > > > +                    *   - page is locked
> > > > > > +                    *
> > > > > > +                    * So as it is a locked file back page thus it can not
> > > > > > +                    * be remove from the page cache and replace by a new
> > > > > > +                    * page before mmu_notifier_invalidate_range_end so no
> > > > > > +                    * concurrent thread might update its page table to
> > > > > > +                    * point at new page while a device still is using this
> > > > > > +                    * page.
> > > > > > +                    *
> > > > > > +                    * See Documentation/vm/mmu_notifier.txt
> > > > > > +                    */
> > > > > >                     dec_mm_counter(mm, mm_counter_file(page));
> > > > > > +           }
> > > > > >  discard:
> > > > > > +           /*
> > > > > > +            * No need to call mmu_notifier_invalidate_range() it has be
> > > > > > +            * done above for all cases requiring it to happen under page
> > > > > > +            * table lock before mmu_notifier_invalidate_range_end()
> > > > > > +            *
> > > > > > +            * See Documentation/vm/mmu_notifier.txt
> > > > > > +            */
> > > > > >             page_remove_rmap(subpage, PageHuge(page));
> > > > > >             put_page(page);
> > > > > > -           mmu_notifier_invalidate_range(mm, address,
> > > > > > -                                         address + PAGE_SIZE);
> > > > > >     }
> > > > > > 
> > > > > >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > > > > 
> > > > > Looking at the patchset, I understand the efficiency, but I am concerned
> > > > > with correctness.
> > > > 
> > > > I am fine with holding this off from reaching Linus, but the only way to
> > > > flush these issues out, if any, is to have this patch in linux-next or
> > > > somewhere where it gets a chance of being tested.
> > > > 
> > > 
> > > Yep, I would like to see some additional testing around npu and get Alistair
> > > Popple to comment as well
> > 
> > I think this patch is fine. The only race window that it might make
> > bigger should have no bad consequences.
> > 
> > > 
> > > > Note that the second patch is always safe. I agree that this one might
> > > > not be if the hardware implementation is idiotic (well, that would be my
> > > > opinion, and any opinion/point of view can be challenged :))
> > > 
> > > 
> > > You mean the only_end variant that avoids the shootdown after pmd/pte
> > > changes, skipping the _start/_end pair and using just the only_end variant?
> > > That seemed reasonable to me, but I've not tested it or evaluated it in depth.
> > 
> > Yes, patch 2/2 in this series is definitely fine. It invalidates the device
> > TLB right after clearing the pte entry and avoids a later unnecessary
> > invalidation of the same TLB.
> > 
> > Jerome
> 
> Balbir Singh.
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
  2017-10-21 15:47               ` Jerome Glisse
  (?)
  (?)
@ 2017-10-23 20:35                 ` Jerome Glisse
  -1 siblings, 0 replies; 40+ messages in thread
From: Jerome Glisse @ 2017-10-23 20:35 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-next

On Sat, Oct 21, 2017 at 11:47:03AM -0400, Jerome Glisse wrote:
> On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote:
> > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote:
> > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> > > > > > On Mon, 16 Oct 2017 23:10:02 -0400
> > > > > > jglisse@redhat.com wrote:
> > > > > > > From: Jérôme Glisse <jglisse@redhat.com>

[...]

> > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > > index c037d3d34950..ff5bc647b51d 100644
> > > > > > > --- a/mm/huge_memory.c
> > > > > > > +++ b/mm/huge_memory.c
> > > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > > > > > >             goto out_free_pages;
> > > > > > >     VM_BUG_ON_PAGE(!PageHead(page), page);
> > > > > > > 
> > > > > > > +   /*
> > > > > > > +    * Leave pmd empty until pte is filled note we must notify here as
> > > > > > > +    * concurrent CPU thread might write to new page before the call to
> > > > > > > +    * mmu_notifier_invalidate_range_end() happens which can lead to a
> > > > > > > +    * device seeing memory write in different order than CPU.
> > > > > > > +    *
> > > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > > +    */
> > > > > > >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > > > > > > -   /* leave pmd empty until pte is filled */
> > > > > > > 
> > > > > > >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > > > > > >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > > > > > >     pmd_t _pmd;
> > > > > > >     int i;
> > > > > > > 
> > > > > > > -   /* leave pmd empty until pte is filled */
> > > > > > > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > > > > > > +   /*
> > > > > > > +    * Leave pmd empty until pte is filled note that it is fine to delay
> > > > > > > +    * notification until mmu_notifier_invalidate_range_end() as we are
> > > > > > > +    * replacing a zero pmd write protected page with a zero pte write
> > > > > > > +    * protected page.
> > > > > > > +    *
> > > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > > +    */
> > > > > > > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> > > > > > 
> > > > > > Shouldn't the secondary TLB know if the page size changed?
> > > > > 
> > > > > It should not matter; we are talking about virtual to physical translation
> > > > > on behalf of a device against a process address space. So the hardware
> > > > > should not care about the page size.
> > > > > 
> > > > 
> > > > Does that not indicate how much the device can access? Could it try
> > > > to access more than what is mapped?
> > > 
> > > Assume the device has a huge TLB and 2MB huge pages with 4K small pages.
> > > You are going from one TLB entry covering a 2MB zero page to 512 TLB entries
> > > each covering 4K. Both cases are read only and both cases are pointing
> > > to the same data (ie zero).
> > > 
> > > It is fine to delay the TLB invalidate on the device to the call of
> > > mmu_notifier_invalidate_range_end(). The device will keep using the
> > > huge TLB entry for a little longer but both CPU and device are looking
> > > at the same data.
> > > 
> > > Now if there is a racing thread that replaces one of the 512 zero pages
> > > after the split but before mmu_notifier_invalidate_range_end(), that
> > > code path would call mmu_notifier_invalidate_range() before changing
> > > the pte to point to something else, which should shoot down the device
> > > TLB (it would be a serious device bug if this did not work).
> > 
> > OK.. This seems reasonable, but I'd really like to see if it can be
> > tested
> 
> Well, it is hard to test, for many reasons. First, each device might react
> differently. Devices that only store TLB entries at 4k granularity are fine.
> A clever device that can store TLB entries for 4k, 2M, ... could ignore an
> invalidation that is smaller than its TLB entry, ie getting a 4K invalidation
> would not invalidate a 2MB TLB entry in the device. I consider this buggy.
> I will go look at the PCIE ATS specification one more time and see if there
> is any wording related to that. I might bring a question to the PCIE standard
> body if not.

So inside PCIE ATS there is the definition of a "minimum translation or
invalidate size", which is 4096 bytes. So my understanding is that
hardware must support 4K invalidation in all cases and thus we should
be safe from the possible hazard above.

But nonetheless I will repost without the optimization for huge pages,
to be more conservative, as we want to be correct before we care about
the last bit of optimization.
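
Concretely (a sketch of what the more conservative repost would keep, not the
actual v3 diff), the huge zero page split would go back to the notifying
variant that the current code already uses:

    /* leave pmd empty until pte is filled */
    pmdp_huge_clear_flush_notify(vma, haddr, pmd);

ie __split_huge_zero_page_pmd() would keep shooting down the secondary TLB
under the page table lock instead of relying only on
mmu_notifier_invalidate_range_end().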

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
@ 2017-10-23 20:35                 ` Jerome Glisse
  0 siblings, 0 replies; 40+ messages in thread
From: Jerome Glisse @ 2017-10-23 20:35 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Nadav Amit,
	Linus Torvalds, Andrew Morton, Joerg Roedel,
	Suravee Suthikulpanit, David Woodhouse, Alistair Popple,
	Michael Ellerman, Benjamin Herrenschmidt, Stephen Rothwell,
	Andrew Donnellan, iommu,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-next

On Sat, Oct 21, 2017 at 11:47:03AM -0400, Jerome Glisse wrote:
> On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote:
> > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote:
> > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> > > > > > On Mon, 16 Oct 2017 23:10:02 -0400
> > > > > > jglisse@redhat.com wrote:
> > > > > > > From: Jérôme Glisse <jglisse@redhat.com>

[...]

> > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > > index c037d3d34950..ff5bc647b51d 100644
> > > > > > > --- a/mm/huge_memory.c
> > > > > > > +++ b/mm/huge_memory.c
> > > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > > > > > >             goto out_free_pages;
> > > > > > >     VM_BUG_ON_PAGE(!PageHead(page), page);
> > > > > > > 
> > > > > > > +   /*
> > > > > > > +    * Leave pmd empty until pte is filled note we must notify here as
> > > > > > > +    * concurrent CPU thread might write to new page before the call to
> > > > > > > +    * mmu_notifier_invalidate_range_end() happens which can lead to a
> > > > > > > +    * device seeing memory write in different order than CPU.
> > > > > > > +    *
> > > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > > +    */
> > > > > > >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > > > > > > -   /* leave pmd empty until pte is filled */
> > > > > > > 
> > > > > > >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> > > > > > >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> > > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > > > > > >     pmd_t _pmd;
> > > > > > >     int i;
> > > > > > > 
> > > > > > > -   /* leave pmd empty until pte is filled */
> > > > > > > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > > > > > > +   /*
> > > > > > > +    * Leave pmd empty until pte is filled note that it is fine to delay
> > > > > > > +    * notification until mmu_notifier_invalidate_range_end() as we are
> > > > > > > +    * replacing a zero pmd write protected page with a zero pte write
> > > > > > > +    * protected page.
> > > > > > > +    *
> > > > > > > +    * See Documentation/vm/mmu_notifier.txt
> > > > > > > +    */
> > > > > > > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> > > > > > 
> > > > > > Shouldn't the secondary TLB know if the page size changed?
> > > > > 
> > > > > It should not matter, we are talking virtual to physical on behalf
> > > > > of a device against a process address space. So the hardware should
> > > > > not care about the page size.
> > > > > 
> > > > 
> > > > Does that not indicate how much the device can access? Could it try
> > > > to access more than what is mapped?
> > > 
> > > Assuming device has huge TLB and 2MB huge page with 4K small page.
> > > You are going from one 1 TLB covering a 2MB zero page to 512 TLB
> > > each covering 4K. Both case is read only and both case are pointing
> > > to same data (ie zero).
> > > 
> > > It is fine to delay the TLB invalidate on the device to the call of
> > > mmu_notifier_invalidate_range_end(). The device will keep using the
> > > huge TLB for a little longer but both CPU and device are looking at
> > > same data.
> > > 
> > > Now if there is a racing thread that replace one of the 512 zeor page
> > > after the split but before mmu_notifier_invalidate_range_end() that
> > > code path would call mmu_notifier_invalidate_range() before changing
> > > the pte to point to something else. Which should shoot down the device
> > > TLB (it would be a serious device bug if this did not work).
> > 
> > OK.. This seems reasonable, but I'd really like to see if it can be
> > tested
> 
> Well hard to test, many factors first each device might react differently.
> Device that only store TLB at 4k granularity are fine. Clever device that
> can store TLB for 4k, 2M, ... can ignore an invalidation that is smaller
> than their TLB entry ie getting a 4K invalidation would not invalidate a
> 2MB TLB entry in the device. I consider this as buggy. I will go look at
> the PCIE ATS specification one more time and see if there is any wording
> related that. I might bring up a question to the PCIE standard body if not.

So inside PCIE ATS there is the definition of "minimum translation or
invalidate size" which says 4096 bytes. So my understanding is that
hardware must support 4K invalidation in all the case and thus we shoud
be safe from possible hazard above.

But none the less i will repost without the optimization for huge page
to be more concervative as anyway we want to be correct before we care
about last bit of optimization.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2017-10-23 20:35 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-17  3:10 [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback jglisse
2017-10-17  3:10 ` jglisse
2017-10-17  3:10 ` jglisse
2017-10-17  3:10 ` [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2 jglisse
2017-10-17  3:10   ` jglisse
2017-10-17  3:10   ` jglisse-H+wXaHxf7aLQT0dZR+AlfA
2017-10-19  3:04   ` Balbir Singh
2017-10-19  3:04     ` Balbir Singh
2017-10-19  3:04     ` Balbir Singh
2017-10-19  3:28     ` Jerome Glisse
2017-10-19  3:28       ` Jerome Glisse
2017-10-19  3:28       ` Jerome Glisse
2017-10-19 10:53       ` Balbir Singh
2017-10-19 10:53         ` Balbir Singh
2017-10-19 10:53         ` Balbir Singh
2017-10-19 16:58         ` Jerome Glisse
2017-10-19 16:58           ` Jerome Glisse
2017-10-19 16:58           ` Jerome Glisse
2017-10-19 16:58           ` Jerome Glisse
2017-10-21  5:54           ` Balbir Singh
2017-10-21  5:54             ` Balbir Singh
2017-10-21  5:54             ` Balbir Singh
2017-10-21  5:54             ` Balbir Singh
2017-10-21 15:47             ` Jerome Glisse
2017-10-21 15:47               ` Jerome Glisse
2017-10-21 15:47               ` Jerome Glisse
2017-10-21 15:47               ` Jerome Glisse
2017-10-23 20:35               ` Jerome Glisse
2017-10-23 20:35                 ` Jerome Glisse
2017-10-23 20:35                 ` Jerome Glisse
2017-10-23 20:35                 ` Jerome Glisse
2017-10-17  3:10 ` [PATCH 2/2] mm/mmu_notifier: avoid call to invalidate_range() in range_end() jglisse
2017-10-17  3:10   ` jglisse
2017-10-17  3:10   ` jglisse
2017-10-19  2:43 ` [PATCH 0/2] Optimize mmu_notifier->invalidate_range callback Balbir Singh
2017-10-19  2:43   ` Balbir Singh
2017-10-19  2:43   ` Balbir Singh
2017-10-19  3:08   ` Jerome Glisse
2017-10-19  3:08     ` Jerome Glisse
2017-10-19  3:08     ` Jerome Glisse
