* [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
@ 2014-09-09 15:43 ` Joerg Roedel
  0 siblings, 0 replies; 36+ messages in thread
From: Joerg Roedel @ 2014-09-09 15:43 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel,
	Hugh Dickins, Mel Gorman, Johannes Weiner
  Cc: Jerome Glisse, jroedel, joro, Jay.Cornwall, Oded.Gabbay,
	John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes,
	David Woodhouse, linux-kernel, linux-mm, iommu

Changes V2->V3:

* Rebased to v3.17-rc4
* Fixed compile error because pmdp_get_and_clear_notify was
  missing

Changes V1->V2:

* Rebase to v3.16-rc7
* Added a call to ->invalidate_range() in
  __mmu_notifier_invalidate_range_end() so that a subsystem
  doesn't need to register an ->invalidate_range_end()
  call-back as well. Subsystems will likely register either
  invalidate_range_start/end or invalidate_range, so that
  should be fine.
* Re-ordered declarations a bit to reflect that
  invalidate_range is not only called between
  invalidate_range_start/end
* Updated documentation to cover the case where
  invalidate_range is called outside of
  invalidate_range_start/end to flush page-table pages out
  of the TLB

Hi,

here is a patch-set to extend the mmu_notifiers in the Linux
kernel to allow managing CPU-external TLBs. Such TLBs may be
implemented in IOMMUs or in any other external device, e.g.
ATS/PRI-capable PCI devices.

The problem with managing these TLBs is the semantics of
the invalidate_range_start/end call-backs currently
available. The subsystem using mmu_notifiers has to
guarantee that no new TLB entries are established between
invalidate_range_start/end. Furthermore,
invalidate_range_start() is called while all pages are
still mapped, and invalidate_range_end() only when the
pages are unmapped and already freed.

So neither call-back can be used to safely flush a non-CPU
TLB, because _start() is called too early and _end() too
late.
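
To illustrate the window (simplified pseudo-code of a
typical unmap path; clear_ptes_and_flush_cpu_tlb() and
free_the_pages() are placeholders, not real kernel
functions):

	mmu_notifier_invalidate_range_start(mm, start, end);
	/* the device can still (re-)fill its TLB for this range */

	clear_ptes_and_flush_cpu_tlb(mm, start, end);
	/* no notifier is invoked at this point today */

	free_the_pages(start, end);

	mmu_notifier_invalidate_range_end(mm, start, end);
	/* too late: the pages are already freed */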

In the AMD IOMMUv2 driver this is currently worked around
by assigning an empty page-table to the external device
between _start() and _end(). But as tests have shown, this
doesn't work: external devices don't re-fault indefinitely
but enter a failure state after some time.

Another problem with this workaround is that it causes an
interrupt storm of IO page faults while the empty
page-table is assigned.

Furthermore, the _start()/_end() notifiers only catch the
moment when page mappings are released, but not when
page-table pages are freed. Catching that is necessary for
managing external TLBs when the page-table is shared with
the CPU.

To solve this situation I wrote this patch-set to introduce
a new notifier call-back: mmu_notifier_invalidate_range().
This notifier lifts the strict requirement that no new
references are taken in the range between _start() and
_end(). When the subsystem can't guarantee that no new
references are taken, it has to provide the
invalidate_range() call-back to clear any references taken
in that window.

It is called between invalidate_range_start() and _end()
every time the VMM has to wipe out references to a range of
pages. These are usually the places where the CPU TLBs are
flushed too, and where it is important that this happens
before invalidate_range_end() is called.
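
For illustration only (not part of this series), a
subsystem driving an external TLB could hook the new
call-back roughly like this; my_flush_dev_tlb() and the
other my_* names are made-up placeholders:

	#include <linux/mmu_notifier.h>

	static void my_invalidate_range(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start,
					unsigned long end)
	{
		/* Called under the ptl spin-lock - must not sleep */
		my_flush_dev_tlb(mm, start, end);
	}

	static const struct mmu_notifier_ops my_mn_ops = {
		/*
		 * ->invalidate_range alone is enough when the device
		 * walks the CPU page-table; it is also invoked from
		 * invalidate_range_end(), see patch 3.
		 */
		.invalidate_range = my_invalidate_range,
	};

	static struct mmu_notifier my_mn = { .ops = &my_mn_ops };

	static int my_bind_mm(struct mm_struct *mm)
	{
		/* mmu_notifier_register() takes mmap_sem internally */
		return mmu_notifier_register(&my_mn, mm);
	}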

Any comments and review appreciated!

Thanks,

	Joerg

Joerg Roedel (3):
  mmu_notifier: Add mmu_notifier_invalidate_range()
  mmu_notifier: Call mmu_notifier_invalidate_range() from VMM
  mmu_notifier: Add the call-back for mmu_notifier_invalidate_range()

 include/linux/mmu_notifier.h | 88 +++++++++++++++++++++++++++++++++++++++++---
 kernel/events/uprobes.c      |  2 +-
 mm/fremap.c                  |  2 +-
 mm/huge_memory.c             |  9 +++--
 mm/hugetlb.c                 |  7 +++-
 mm/ksm.c                     |  4 +-
 mm/memory.c                  |  3 +-
 mm/migrate.c                 |  3 +-
 mm/mmu_notifier.c            | 25 +++++++++++++
 mm/rmap.c                    |  2 +-
 10 files changed, 128 insertions(+), 17 deletions(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/3] mmu_notifier: Add mmu_notifier_invalidate_range()
  2014-09-09 15:43 ` Joerg Roedel
@ 2014-09-09 15:43   ` Joerg Roedel
  -1 siblings, 0 replies; 36+ messages in thread
From: Joerg Roedel @ 2014-09-09 15:43 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel,
	Hugh Dickins, Mel Gorman, Johannes Weiner
  Cc: Jerome Glisse, jroedel, joro, Jay.Cornwall, Oded.Gabbay,
	John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes,
	David Woodhouse, linux-kernel, linux-mm, iommu

From: Joerg Roedel <jroedel@suse.de>

This notifier closes two important gaps in the current
invalidate_range_start()/end() notifiers. The _start() part
is called while all pages are still mapped, whereas the
_end() notifier is called when all pages are potentially
unmapped and already freed.

This does not allow managing external (non-CPU) hardware
TLBs with MMU-notifiers, because there is no way to prevent
the hardware from establishing new TLB entries between the
calls to these two functions. But this is a requirement
placed on the subsystem that implements these existing
notifiers.

To allow managing external TLBs, the MMU-notifiers need to
catch the moment when pages are unmapped but not yet freed.
The new notifier catches exactly that moment and notifies
the interested subsystem when unmapped pages are about to
be freed. It is called between
invalidate_range_start()/end().

For non-CPU TLBs it is also necessary to know when
page-table pages are freed. This is the second gap in
current mmu_notifiers. At those events the new notifier will
also be called, without calling invalidate_range_start() and
invalidate_range_end() around it.

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 include/linux/mmu_notifier.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 2728869..5d03f31 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -237,6 +237,11 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		__mmu_notifier_invalidate_range_end(mm, start, end);
 }
 
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+}
+
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
 {
 	mm->mmu_notifier_mm = NULL;
@@ -332,6 +337,11 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 {
 }
 
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+}
+
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
 {
 }
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/3] mmu_notifier: Call mmu_notifier_invalidate_range() from VMM
  2014-09-09 15:43 ` Joerg Roedel
@ 2014-09-09 15:43   ` Joerg Roedel
  -1 siblings, 0 replies; 36+ messages in thread
From: Joerg Roedel @ 2014-09-09 15:43 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel,
	Hugh Dickins, Mel Gorman, Johannes Weiner
  Cc: Jerome Glisse, jroedel, joro, Jay.Cornwall, Oded.Gabbay,
	John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes,
	David Woodhouse, linux-kernel, linux-mm, iommu

From: Joerg Roedel <jroedel@suse.de>

Add calls to the new mmu_notifier_invalidate_range()
function to all places in the VMM that need it.

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 include/linux/mmu_notifier.h | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel/events/uprobes.c      |  2 +-
 mm/fremap.c                  |  2 +-
 mm/huge_memory.c             |  9 +++++----
 mm/hugetlb.c                 |  7 ++++++-
 mm/ksm.c                     |  4 ++--
 mm/memory.c                  |  3 ++-
 mm/migrate.c                 |  3 ++-
 mm/rmap.c                    |  2 +-
 9 files changed, 61 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 5d03f31..877d1c8 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -275,6 +275,44 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 	__young;							\
 })
 
+#define	ptep_clear_flush_notify(__vma, __address, __ptep)		\
+({									\
+	unsigned long ___addr = __address & PAGE_MASK;			\
+	struct mm_struct *___mm = (__vma)->vm_mm;			\
+	pte_t ___pte;							\
+									\
+	___pte = ptep_clear_flush(__vma, __address, __ptep);		\
+	mmu_notifier_invalidate_range(___mm, ___addr,			\
+					___addr + PAGE_SIZE);		\
+									\
+	___pte;								\
+})
+
+#define pmdp_clear_flush_notify(__vma, __haddr, __pmd)			\
+({									\
+	unsigned long ___haddr = __haddr & HPAGE_PMD_MASK;		\
+	struct mm_struct *___mm = (__vma)->vm_mm;			\
+	pmd_t ___pmd;							\
+									\
+	___pmd = pmdp_clear_flush(__vma, __haddr, __pmd);		\
+	mmu_notifier_invalidate_range(___mm, ___haddr,			\
+				      ___haddr + HPAGE_PMD_SIZE);	\
+									\
+	___pmd;								\
+})
+
+#define pmdp_get_and_clear_notify(__mm, __haddr, __pmd)			\
+({									\
+	unsigned long ___haddr = __haddr & HPAGE_PMD_MASK;		\
+	pmd_t ___pmd;							\
+									\
+	___pmd = pmdp_get_and_clear(__mm, __haddr, __pmd);		\
+	mmu_notifier_invalidate_range(__mm, ___haddr,			\
+				      ___haddr + HPAGE_PMD_SIZE);	\
+									\
+	___pmd;								\
+})
+
 /*
  * set_pte_at_notify() sets the pte _after_ running the notifier.
  * This is safe to start by updating the secondary MMUs, because the primary MMU
@@ -352,6 +390,9 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
+#define	ptep_clear_flush_notify ptep_clear_flush
+#define pmdp_clear_flush_notify pmdp_clear_flush
+#define pmdp_get_and_clear_notify pmdp_get_and_clear
 #define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1d0af8a..bc143cf 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -193,7 +193,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	}
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
-	ptep_clear_flush(vma, addr, ptep);
+	ptep_clear_flush_notify(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
 	page_remove_rmap(page);
diff --git a/mm/fremap.c b/mm/fremap.c
index 72b8fa3..9129013 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -37,7 +37,7 @@ static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (pte_present(pte)) {
 		flush_cache_page(vma, addr, pte_pfn(pte));
-		pte = ptep_clear_flush(vma, addr, ptep);
+		pte = ptep_clear_flush_notify(vma, addr, ptep);
 		page = vm_normal_page(vma, addr, pte);
 		if (page) {
 			if (pte_dirty(pte))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..c343ebd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1036,7 +1036,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		goto out_free_pages;
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
-	pmdp_clear_flush(vma, haddr, pmd);
+	pmdp_clear_flush_notify(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
@@ -1179,7 +1179,7 @@ alloc:
 		pmd_t entry;
 		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		pmdp_clear_flush(vma, haddr, pmd);
+		pmdp_clear_flush_notify(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
@@ -1512,7 +1512,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		pmd_t entry;
 		ret = 1;
 		if (!prot_numa) {
-			entry = pmdp_get_and_clear(mm, addr, pmd);
+			entry = pmdp_get_and_clear_notify(mm, addr, pmd);
 			if (pmd_numa(entry))
 				entry = pmd_mknonnuma(entry);
 			entry = pmd_modify(entry, newprot);
@@ -1644,6 +1644,7 @@ static int __split_huge_page_splitting(struct page *page,
 		 * serialize against split_huge_page*.
 		 */
 		pmdp_splitting_flush(vma, address, pmd);
+
 		ret = 1;
 		spin_unlock(ptl);
 	}
@@ -2836,7 +2837,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_t _pmd;
 	int i;
 
-	pmdp_clear_flush(vma, haddr, pmd);
+	pmdp_clear_flush_notify(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eeceeeb..393f2cd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2598,8 +2598,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			}
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		} else {
-			if (cow)
+			if (cow) {
 				huge_ptep_set_wrprotect(src, addr, src_pte);
+				mmu_notifier_invalidate_range(src, mmun_start,
+								   mmun_end);
+			}
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
 			get_page(ptepage);
@@ -2899,6 +2902,7 @@ retry_avoidcopy:
 
 		/* Break COW */
 		huge_ptep_clear_flush(vma, address, ptep);
+		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
 		page_remove_rmap(old_page);
@@ -3374,6 +3378,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * and that page table be reused and filled with junk.
 	 */
 	flush_tlb_range(vma, start, end);
+	mmu_notifier_invalidate_range(mm, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 
diff --git a/mm/ksm.c b/mm/ksm.c
index fb75902..6b1c239 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -892,7 +892,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		 * this assure us that no O_DIRECT can happen after the check
 		 * or in the middle of the check.
 		 */
-		entry = ptep_clear_flush(vma, addr, ptep);
+		entry = ptep_clear_flush_notify(vma, addr, ptep);
 		/*
 		 * Check that no O_DIRECT or similar I/O is in progress on the
 		 * page
@@ -960,7 +960,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	page_add_anon_rmap(kpage, vma, addr);
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
-	ptep_clear_flush(vma, addr, ptep);
+	ptep_clear_flush_notify(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
 	page_remove_rmap(page);
diff --git a/mm/memory.c b/mm/memory.c
index adeac30..a332714 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -236,6 +236,7 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	tlb->need_flush = 0;
 	tlb_flush(tlb);
+	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
@@ -2230,7 +2231,7 @@ gotten:
 		 * seen in the presence of one thread doing SMC and another
 		 * thread doing COW.
 		 */
-		ptep_clear_flush(vma, address, page_table);
+		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
diff --git a/mm/migrate.c b/mm/migrate.c
index f78ec9b..44dbe91 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1859,7 +1859,7 @@ fail_putback:
 	 */
 	flush_cache_range(vma, mmun_start, mmun_end);
 	page_add_anon_rmap(new_page, vma, mmun_start);
-	pmdp_clear_flush(vma, mmun_start, pmd);
+	pmdp_clear_flush_notify(vma, mmun_start, pmd);
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	flush_tlb_range(vma, mmun_start, mmun_end);
 	update_mmu_cache_pmd(vma, address, &entry);
@@ -1867,6 +1867,7 @@ fail_putback:
 	if (page_count(page) != 2) {
 		set_pmd_at(mm, mmun_start, pmd, orig_entry);
 		flush_tlb_range(vma, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
 		update_mmu_cache_pmd(vma, address, &entry);
 		page_remove_rmap(new_page);
 		goto fail_putback;
diff --git a/mm/rmap.c b/mm/rmap.c
index 3e8491c..0b192fe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1360,7 +1360,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		pteval = ptep_clear_flush(vma, address, pte);
+		pteval = ptep_clear_flush_notify(vma, address, pte);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address)) {
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/3] mmu_notifier: Add the call-back for mmu_notifier_invalidate_range()
  2014-09-09 15:43 ` Joerg Roedel
@ 2014-09-09 15:43   ` Joerg Roedel
  -1 siblings, 0 replies; 36+ messages in thread
From: Joerg Roedel @ 2014-09-09 15:43 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel,
	Hugh Dickins, Mel Gorman, Johannes Weiner
  Cc: Jerome Glisse, jroedel, joro, Jay.Cornwall, Oded.Gabbay,
	John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes,
	David Woodhouse, linux-kernel, linux-mm, iommu

From: Joerg Roedel <jroedel@suse.de>

Now that the mmu_notifier_invalidate_range() calls are in
place, add the call-back to allow subsystems to register
against it.

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 include/linux/mmu_notifier.h | 37 ++++++++++++++++++++++++++++++++-----
 mm/mmu_notifier.c            | 25 +++++++++++++++++++++++++
 2 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 877d1c8..aa1a6bf 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -95,11 +95,11 @@ struct mmu_notifier_ops {
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
 	 * paired and are called only when the mmap_sem and/or the
-	 * locks protecting the reverse maps are held. The subsystem
-	 * must guarantee that no additional references are taken to
-	 * the pages in the range established between the call to
-	 * invalidate_range_start() and the matching call to
-	 * invalidate_range_end().
+	 * locks protecting the reverse maps are held. If the subsystem
+	 * can't guarantee that no additional references are taken to
+	 * the pages in the range, it has to implement the
+	 * invalidate_range() notifier to remove any references taken
+	 * after invalidate_range_start().
 	 *
 	 * Invalidation of multiple concurrent ranges may be
 	 * optionally permitted by the driver. Either way the
@@ -141,6 +141,29 @@ struct mmu_notifier_ops {
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start, unsigned long end);
+
+	/*
+	 * invalidate_range() is either called between
+	 * invalidate_range_start() and invalidate_range_end() when the
+	 * VM has to free pages that were unmapped, but before the
+	 * pages are actually freed, or outside of _start()/_end() when
+	 * page-table pages are about to be freed.
+	 *
+	 * If invalidate_range() is used to manage a non-CPU TLB with
+	 * shared page-tables, it is not necessary to implement the
+	 * invalidate_range_start()/end() notifiers, as
+	 * invalidate_range() already catches the points in time when an
+	 * external TLB range needs to be flushed.
+	 *
+	 * The invalidate_range() function is called under the ptl
+	 * spin-lock and not allowed to sleep.
+	 *
+	 * Note that this function might be called with just a sub-range
+	 * of what was passed to invalidate_range_start()/end(), if
+	 * called between those functions.
+	 */
+	void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm,
+				 unsigned long start, unsigned long end);
 };
 
 /*
@@ -186,6 +209,8 @@ extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -240,6 +265,8 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range(mm, start, end);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 950813b..a900637 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -192,6 +192,16 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		/*
+		 * Call invalidate_range here too to avoid the need for the
+		 * subsystem to register an invalidate_range_end
+		 * call-back when there is invalidate_range already. Usually a
+		 * subsystem registers either invalidate_range_start()/end() or
+		 * invalidate_range(), so this will be no additional overhead
+		 * (besides the pointer check).
+		 */
+		if (mn->ops->invalidate_range)
+			mn->ops->invalidate_range(mn, mm, start, end);
 		if (mn->ops->invalidate_range_end)
 			mn->ops->invalidate_range_end(mn, mm, start, end);
 	}
@@ -199,6 +209,21 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
+void __mmu_notifier_invalidate_range(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	struct mmu_notifier *mn;
+	int id;
+
+	id = srcu_read_lock(&srcu);
+	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->invalidate_range)
+			mn->ops->invalidate_range(mn, mm, start, end);
+	}
+	srcu_read_unlock(&srcu, id);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
+
 static int do_mmu_notifier_register(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
 				    int take_mmap_sem)
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/3] mmu_notifier: Add the call-back for mmu_notifier_invalidate_range()
@ 2014-09-09 15:43   ` Joerg Roedel
  0 siblings, 0 replies; 36+ messages in thread
From: Joerg Roedel @ 2014-09-09 15:43 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel,
	Hugh Dickins, Mel Gorman, Johannes Weiner
  Cc: Jerome Glisse, jroedel, joro, Jay.Cornwall, Oded.Gabbay,
	John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes,
	David Woodhouse, linux-kernel, linux-mm, iommu

From: Joerg Roedel <jroedel@suse.de>

Now that the mmu_notifier_invalidate_range() calls are in
place, add the call-back to allow subsystems to register
against it.

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 include/linux/mmu_notifier.h | 37 ++++++++++++++++++++++++++++++++-----
 mm/mmu_notifier.c            | 25 +++++++++++++++++++++++++
 2 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 877d1c8..aa1a6bf 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -95,11 +95,11 @@ struct mmu_notifier_ops {
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
 	 * paired and are called only when the mmap_sem and/or the
-	 * locks protecting the reverse maps are held. The subsystem
-	 * must guarantee that no additional references are taken to
-	 * the pages in the range established between the call to
-	 * invalidate_range_start() and the matching call to
-	 * invalidate_range_end().
+	 * locks protecting the reverse maps are held. If the subsystem
+	 * can't guarantee that no additional references are taken to
+	 * the pages in the range, it has to implement the
+	 * invalidate_range() notifier to remove any references taken
+	 * after invalidate_range_start().
 	 *
 	 * Invalidation of multiple concurrent ranges may be
 	 * optionally permitted by the driver. Either way the
@@ -141,6 +141,29 @@ struct mmu_notifier_ops {
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start, unsigned long end);
+
+	/*
+	 * invalidate_range() is either called between
+	 * invalidate_range_start() and invalidate_range_end() when the
+	 * VM has to free pages that where unmapped, but before the
+	 * pages are actually freed, or outside of _start()/_end() when
+	 * page-table pages are about to be freed.
+	 *
+	 * If invalidate_range() is used to manage a non-CPU TLB with
+	 * shared page-tables, it not necessary to implement the
+	 * invalidate_range_start()/end() notifiers, as
+	 * invalidate_range() alread catches the points in time when an
+	 * external TLB range needs to be flushed.
+	 *
+	 * The invalidate_range() function is called under the ptl
+	 * spin-lock and not allowed to sleep.
+	 *
+	 * Note that this function might be called with just a sub-range
+	 * of what was passed to invalidate_range_start()/end(), if
+	 * called between those functions.
+	 */
+	void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm,
+				 unsigned long start, unsigned long end);
 };
 
 /*
@@ -186,6 +209,8 @@ extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -240,6 +265,8 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range(mm, start, end);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 950813b..a900637 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -192,6 +192,16 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		/*
+		 * Call invalidate_range() here too so that a subsystem does
+		 * not need to register an invalidate_range_end() call-back
+		 * when it already provides invalidate_range(). Usually a
+		 * subsystem registers either invalidate_range_start()/end()
+		 * or invalidate_range(), so this adds no additional overhead
+		 * (besides the pointer check).
+		 */
+		if (mn->ops->invalidate_range)
+			mn->ops->invalidate_range(mn, mm, start, end);
 		if (mn->ops->invalidate_range_end)
 			mn->ops->invalidate_range_end(mn, mm, start, end);
 	}
@@ -199,6 +209,21 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
+void __mmu_notifier_invalidate_range(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	struct mmu_notifier *mn;
+	int id;
+
+	id = srcu_read_lock(&srcu);
+	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->invalidate_range)
+			mn->ops->invalidate_range(mn, mm, start, end);
+	}
+	srcu_read_unlock(&srcu, id);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
+
 static int do_mmu_notifier_register(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
 				    int take_mmap_sem)
-- 
1.9.1
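
For illustration, here is a minimal sketch of how a subsystem that shares the
CPU page-table with an external device might register only the new
invalidate_range() call-back, per the comment block above. This is not part of
the patch; example_*() and flush_external_tlb() are hypothetical names, only
the mmu_notifier registration API itself is real:

	#include <linux/mmu_notifier.h>
	#include <linux/mm.h>
	#include <linux/sched.h>

	/* Hypothetical driver-specific IOMMU/device TLB shoot-down. */
	static void flush_external_tlb(struct mm_struct *mm,
				       unsigned long start, unsigned long end);

	/*
	 * Called under the ptl spin-lock, so it must not sleep. It fires
	 * every time the VMM wipes out mappings in [start, end), after the
	 * CPU page-table was updated but before the old pages are freed.
	 */
	static void example_invalidate_range(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end)
	{
		flush_external_tlb(mm, start, end);
	}

	static const struct mmu_notifier_ops example_mn_ops = {
		.invalidate_range = example_invalidate_range,
	};

	static struct mmu_notifier example_mn = {
		.ops = &example_mn_ops,
	};

	/* Bind the notifier to the current process' address space. */
	static int example_bind_current_mm(void)
	{
		return mmu_notifier_register(&example_mn, current->mm);
	}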


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
  2014-09-09 15:43 ` Joerg Roedel
@ 2014-09-10 22:01   ` Andrew Morton
  -1 siblings, 0 replies; 36+ messages in thread
From: Andrew Morton @ 2014-09-10 22:01 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andrea Arcangeli, Peter Zijlstra, Rik van Riel, Hugh Dickins,
	Mel Gorman, Johannes Weiner, Jerome Glisse, jroedel,
	Jay.Cornwall, Oded.Gabbay, John.Bridgman, Suravee.Suthikulpanit,
	ben.sander, Jesse Barnes, David Woodhouse, linux-kernel,
	linux-mm, iommu

On Tue,  9 Sep 2014 17:43:51 +0200 Joerg Roedel <joro@8bytes.org> wrote:

> here is a patch-set to extend the mmu_notifiers in the Linux
> kernel to allow managing CPU external TLBs. Those TLBs may
> be implemented in IOMMUs or any other external device, e.g.
> ATS/PRI capable PCI devices.
> 
> The problem with managing these TLBs are the semantics of
> the invalidate_range_start/end call-backs currently
> available. Currently the subsystem using mmu_notifiers has
> to guarantee that no new TLB entries are established between
> invalidate_range_start/end. Furthermore the
> invalidate_range_start() function is called when all pages
> are still mapped and invalidate_range_end() when the pages
> are unmapped an already freed.
> 
> So both call-backs can't be used to safely flush any non-CPU
> TLB because _start() is called too early and _end() too
> late.

There's a lot of missing information here.  Why don't the existing
callbacks suit non-CPU TLBs?  What is different about them?  Please
update the changelog to contain all this context.

> In the AMD IOMMUv2 driver this is currently implemented by
> assigning an empty page-table to the external device between
> _start() and _end(). But as tests have shown this doesn't
> work as external devices don't re-fault infinitly but enter
> a failure state after some time.

More missing info.  Why are these faults occurring?  Is there some
device activity which is trying to fault in pages, but the CPU is
executing code between _start() and _end() so the driver must refuse to
instantiate a page to satisfy the fault?  That's just a guess, and I
shouldn't be guessing.  Please update the changelog to fully describe
the dynamic activity which is causing this.

> Next problem with this solution is that it causes an
> interrupt storm for IO page faults to be handled when an
> empty page-table is assigned.

Also too skimpy.  I *think* this is a variant of the problem in the
preceding paragraph.  We get a fault storm (which is problem 2) and
sometimes the faulting device gives up (which is problem 1).

Or something.  Please de-fog all of this.

> Furthermore the _start()/end() notifiers only catch the
> moment when page mappings are released, but not page-table
> pages. But this is necessary for managing external TLBs when
> the page-table is shared with the CPU.

How come?

> To solve this situation I wrote a patch-set to introduce a
> new notifier call-back: mmu_notifer_invalidate_range(). This
> notifier lifts the strict requirements that no new
> references are taken in the range between _start() and
> _end(). When the subsystem can't guarantee that any new
> references are taken is has to provide the
> invalidate_range() call-back to clear any new references in
> there.
> 
> It is called between invalidate_range_start() and _end()
> every time the VMM has to wipe out any references to a
> couple of pages. This are usually the places where the CPU
> TLBs are flushed too and where its important that this
> happens before invalidate_range_end() is called.
> 
> Any comments and review appreciated!

The patchset looks decent, although I find it hard to review because I
just wasn't provided with enough of the thinking that went into it.  I
have enough info to look at the C code, but not enough info to identify
and evaluate alternative implementation approaches, to identify
possible future extensions, etc.

The patchset does appear to add significant additional overhead to hot
code paths when mm_has_notifiers(mm).  Please let's update the
changelog to address this rather important concern.  How significant is
the impact on such mm's, how common are such mm's now and in the
future, should we (for example) look at short-circuiting
__mmu_notifier_invalidate_range() if none of the registered notifiers
implement ->invalidate_range(), etc.
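
One possible shape of such a short-circuit, purely as a sketch and not part of
the posted series: the has_invalidate_range flag and its maintenance at
(un)register time are hypothetical.

	/* Hypothetical flag, set at register time if any notifier on
	 * this mm provides an ->invalidate_range() call-back. */
	struct mmu_notifier_mm {
		struct hlist_head list;
		spinlock_t lock;
		bool has_invalidate_range;	/* hypothetical */
	};

	static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
					unsigned long start, unsigned long end)
	{
		/* Skip the SRCU walk when no notifier cares about ranges. */
		if (mm_has_notifiers(mm) &&
		    mm->mmu_notifier_mm->has_invalidate_range)
			__mmu_notifier_invalidate_range(mm, start, end);
	}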


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
  2014-09-10 22:01   ` Andrew Morton
  (?)
@ 2014-09-11  0:02     ` Jerome Glisse
  -1 siblings, 0 replies; 36+ messages in thread
From: Jerome Glisse @ 2014-09-11  0:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joerg Roedel, Andrea Arcangeli, Rik van Riel, jroedel,
	Peter Zijlstra, John.Bridgman, Jesse Barnes, Hugh Dickins,
	linux-kernel, ben.sander, linux-mm, Jerome Glisse, Jay.Cornwall,
	Mel Gorman, David Woodhouse, Johannes Weiner, iommu

On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote:
> On Tue,  9 Sep 2014 17:43:51 +0200 Joerg Roedel <joro@8bytes.org> wrote:
> 
> > here is a patch-set to extend the mmu_notifiers in the Linux
> > kernel to allow managing CPU external TLBs. Those TLBs may
> > be implemented in IOMMUs or any other external device, e.g.
> > ATS/PRI capable PCI devices.
> > 
> > The problem with managing these TLBs are the semantics of
> > the invalidate_range_start/end call-backs currently
> > available. Currently the subsystem using mmu_notifiers has
> > to guarantee that no new TLB entries are established between
> > invalidate_range_start/end. Furthermore the
> > invalidate_range_start() function is called when all pages
> > are still mapped and invalidate_range_end() when the pages
> > are unmapped an already freed.
> > 
> > So both call-backs can't be used to safely flush any non-CPU
> > TLB because _start() is called too early and _end() too
> > late.
> 
> There's a lot of missing information here.  Why don't the existing
> callbacks suit non-CPU TLBs?  What is different about them?  Please
> update the changelog to contain all this context.

So unlike KVM or any current user of the mmu_notifier API, PCIe
ATS/PASID capable hardware (IOMMUv2 for AMD, VT-d for Intel) walks the
CPU page table on its own and has its own TLB cache. In fact not only
the IOMMU can have a TLB cache, any PCIe device can implement its own
local TLB cache as well.

So if we flush the IOMMU and device TLB inside the range_start callback
there is a chance that the hw will just rewalk the CPU page table and
repopulate its TLB before the CPU page table is actually updated.

Now if we shoot down the TLB inside the range_end callback, then we
are too late, i.e. the CPU page table is already populated with new
entries and the TLBs in the IOMMU and in the device might still be
pointing to the old pages.

So the aim of this callback is to fire right after the CPU page table
is updated but before the old pages are freed or recycled. Note that it
is also safe for COW and other transitions such as read-only to
read-write or the other way around.
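
In other words, the ordering this call-back relies on looks roughly like
the following simplified sketch of an unmap path (the helper names are
illustrative, not the real mm/ functions):

	mmu_notifier_invalidate_range_start(mm, start, end);

	clear_page_table_entries(mm, start, end);	/* CPU page table updated    */
	flush_tlb_range(vma, start, end);		/* CPU TLB flushed           */
	mmu_notifier_invalidate_range(mm, start, end);	/* IOMMU/device TLB flushed  */

	free_or_recycle_old_pages();			/* safe only after the flushes */

	mmu_notifier_invalidate_range_end(mm, start, end);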

> 
> > In the AMD IOMMUv2 driver this is currently implemented by
> > assigning an empty page-table to the external device between
> > _start() and _end(). But as tests have shown this doesn't
> > work as external devices don't re-fault infinitly but enter
> > a failure state after some time.
> 
> More missing info.  Why are these faults occurring?  Is there some
> device activity which is trying to fault in pages, but the CPU is
> executing code between _start() and _end() so the driver must refuse to
> instantiate a page to satisfy the fault?  That's just a guess, and I
> shouldn't be guessing.  Please update the changelog to fully describe
> the dynamic activity which is causing this.

The hack that was used prior to this patch was to point the IOMMU to an
empty page table (a zero page) inside the range_start() callback and
shoot down the TLB, but this meant that the device might enter a storm
of page faults. A GPU can have thousands of threads, and because the
empty page table is in use during the invalidation they all start
triggering page faults, even if they were not trying to access the
range being invalidated. It turns out that when this happens, current
hw like the AMD GPU actually stops working after a while, i.e. the hw
stumbles because there are too many faults going on.

> 
> > Next problem with this solution is that it causes an
> > interrupt storm for IO page faults to be handled when an
> > empty page-table is assigned.
> 
> Also too skimpy.  I *think* this is a variant of the problem in the
> preceding paragraph.  We get a fault storm (which is problem 2) and
> sometimes the faulting device gives up (which is problem 1).
> 
> Or something.  Please de-fog all of this.
> 

Does the above explanation help to understand the issue? Keep in mind
that each device page fault triggers an IRQ (well, the way the hw works
is a bit more complex than that).

> > Furthermore the _start()/end() notifiers only catch the
> > moment when page mappings are released, but not page-table
> > pages. But this is necessary for managing external TLBs when
> > the page-table is shared with the CPU.
> 
> How come?

As explained above, end() might happen after pages that were previously
mapped have already been freed or recycled.

> 
> > To solve this situation I wrote a patch-set to introduce a
> > new notifier call-back: mmu_notifer_invalidate_range(). This
> > notifier lifts the strict requirements that no new
> > references are taken in the range between _start() and
> > _end(). When the subsystem can't guarantee that any new
> > references are taken is has to provide the
> > invalidate_range() call-back to clear any new references in
> > there.
> > 
> > It is called between invalidate_range_start() and _end()
> > every time the VMM has to wipe out any references to a
> > couple of pages. This are usually the places where the CPU
> > TLBs are flushed too and where its important that this
> > happens before invalidate_range_end() is called.
> > 
> > Any comments and review appreciated!
> 
> The patchset looks decent, although I find it had to review because I
> just wasn't provided with enough of the thinking that went into it.  I
> have enough info to look at the C code, but not enough info to identify
> and evaluate alternative implementation approaches, to identify
> possible future extensions, etc.
> 
> The patchset does appear to add significant additional overhead to hot
> code paths when mm_has_notifiers(mm).  Please let's update the
> changelog to address this rather important concern.  How significant is
> the impact on such mm's, how common are such mm's now and in the
> future, should we (for example) look at short-circuiting
> __mmu_notifier_invalidate_range() if none of the registered notifiers
> implement ->invalidate_range(), etc.

So one might feel like just completely removing range_start()/end()
from the mmu_notifier and sticking to this one callback, but that will
not work with other hardware like the one I am doing the HMM patchset
for (I sent it again a couple of weeks ago:
https://lkml.org/lkml/2014/8/29/423).

Right now there are very few users of the mmu_notifier: SGI, Xen, KVM
and sadly, recently, the GPU folks (which I am part of too). So as the
GPUs start to use it we will see a lot more applications going through
the mmu_notifier callbacks. You do want to leverage hw like the GPU,
you do want to use the same address space on the GPU as on the CPU, and
thus you do want to share, or at least keep synchronized, the GPU view
of the CPU page table.

Right now, for IOMMUv2 this means adding this callback; for devices
that do not rely on the ATS/PASID PCIe extension it means something
like HMM. Also note that HMM is a superset of IOMMUv2 as it could be
used at the same time and provides more features, mainly allowing
migration of some pages to device-local memory for performance
purposes.

I think this sums up all the motivation behind this patchset and also
behind my other patchset. As usual I am happy to discuss alternative
ways to do things, but I think the path of least disruption from the
current code is the one implemented by these patchsets.

Cheers,
Jérôme


> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
  2014-09-11  0:02     ` Jerome Glisse
  (?)
@ 2014-09-12 18:39       ` Jerome Glisse
  -1 siblings, 0 replies; 36+ messages in thread
From: Jerome Glisse @ 2014-09-12 18:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joerg Roedel, Andrea Arcangeli, Rik van Riel, jroedel,
	Peter Zijlstra, John.Bridgman, Jesse Barnes, Hugh Dickins,
	linux-kernel, ben.sander, linux-mm, Jerome Glisse, Jay.Cornwall,
	Mel Gorman, David Woodhouse, Johannes Weiner, iommu

On Wed, Sep 10, 2014 at 08:02:12PM -0400, Jerome Glisse wrote:
> On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote:
> > On Tue,  9 Sep 2014 17:43:51 +0200 Joerg Roedel <joro@8bytes.org> wrote:
> > 
> > > here is a patch-set to extend the mmu_notifiers in the Linux
> > > kernel to allow managing CPU external TLBs. Those TLBs may
> > > be implemented in IOMMUs or any other external device, e.g.
> > > ATS/PRI capable PCI devices.
> > > 
> > > The problem with managing these TLBs are the semantics of
> > > the invalidate_range_start/end call-backs currently
> > > available. Currently the subsystem using mmu_notifiers has
> > > to guarantee that no new TLB entries are established between
> > > invalidate_range_start/end. Furthermore the
> > > invalidate_range_start() function is called when all pages
> > > are still mapped and invalidate_range_end() when the pages
> > > are unmapped an already freed.
> > > 
> > > So both call-backs can't be used to safely flush any non-CPU
> > > TLB because _start() is called too early and _end() too
> > > late.
> > 
> > There's a lot of missing information here.  Why don't the existing
> > callbacks suit non-CPU TLBs?  What is different about them?  Please
> > update the changelog to contain all this context.
> 
> So unlike KVM or any current user of the mmu_notifier api, any PCIE
> ATS/PASID capable hardware IOMMUv2 for AMD and forgot the marketing
> name for Intel (probably VTd) implementation walk the cpu page table
> on there own and have there own TLB cache. In fact not only the iommu
> can have a TLB cache but any single PCIE hardware can implement its
> own local TLB cache.
> 
> So if we flush the IOMMU and device TLB inside the range_start callback
> there is a chance that the hw will just rewalk the cpu page table and
> repopulate its TLB before the CPU page table is actually updated.
> 
> Now if we shoot down the TLB inside the range_end callback, then we
> are too late ie the CPU page table is already populated with new entry
> and all the TLB in the IOMMU an in device might be pointing to the old
> pages.
> 
> So the aim of this callback is to happen right after the CPU page table
> is updated but before the old page is freed or recycled. Note that it
> is also safe for COW and other transition from like read only to read
> and write or the other way around.
> 
> > 
> > > In the AMD IOMMUv2 driver this is currently implemented by
> > > assigning an empty page-table to the external device between
> > > _start() and _end(). But as tests have shown this doesn't
> > > work as external devices don't re-fault infinitly but enter
> > > a failure state after some time.
> > 
> > More missing info.  Why are these faults occurring?  Is there some
> > device activity which is trying to fault in pages, but the CPU is
> > executing code between _start() and _end() so the driver must refuse to
> > instantiate a page to satisfy the fault?  That's just a guess, and I
> > shouldn't be guessing.  Please update the changelog to fully describe
> > the dynamic activity which is causing this.
> 
> The hack that was use prior to this patch was to point the IOMMU to an
> empty page table (a zero page) inside the range_start() callback and
> shoot down the TLB but this meant that the device might enter inside a
> storm of page fault. GPU can have thousand of threads and because during
> invalidation the empty page table is use they all starts triggering page
> fault even if they were not trying to access the range being invalidated.
> It turns out that when this happens current hw like AMD GPU actually stop
> working after a while ie the hw stumble because there is too much fault
> going on.
> 
> > 
> > > Next problem with this solution is that it causes an
> > > interrupt storm for IO page faults to be handled when an
> > > empty page-table is assigned.
> > 
> > Also too skimpy.  I *think* this is a variant of the problem in the
> > preceding paragraph.  We get a fault storm (which is problem 2) and
> > sometimes the faulting device gives up (which is problem 1).
> > 
> > Or something.  Please de-fog all of this.
> > 
> 
> Does above explanation help understand the issue ? Given that on each
> device page fault an IRQ is trigger (well the way the hw works is bit
> more complex than that).
> 
> > > Furthermore the _start()/end() notifiers only catch the
> > > moment when page mappings are released, but not page-table
> > > pages. But this is necessary for managing external TLBs when
> > > the page-table is shared with the CPU.
> > 
> > How come?
> 
> As explained above end() might happens after page that were previously
> mapped are free or recycled.
> 
> > 
> > > To solve this situation I wrote a patch-set to introduce a
> > > new notifier call-back: mmu_notifer_invalidate_range(). This
> > > notifier lifts the strict requirements that no new
> > > references are taken in the range between _start() and
> > > _end(). When the subsystem can't guarantee that any new
> > > references are taken is has to provide the
> > > invalidate_range() call-back to clear any new references in
> > > there.
> > > 
> > > It is called between invalidate_range_start() and _end()
> > > every time the VMM has to wipe out any references to a
> > > couple of pages. This are usually the places where the CPU
> > > TLBs are flushed too and where its important that this
> > > happens before invalidate_range_end() is called.
> > > 
> > > Any comments and review appreciated!
> > 
> > The patchset looks decent, although I find it had to review because I
> > just wasn't provided with enough of the thinking that went into it.  I
> > have enough info to look at the C code, but not enough info to identify
> > and evaluate alternative implementation approaches, to identify
> > possible future extensions, etc.
> > 
> > The patchset does appear to add significant additional overhead to hot
> > code paths when mm_has_notifiers(mm).  Please let's update the
> > changelog to address this rather important concern.  How significant is
> > the impact on such mm's, how common are such mm's now and in the
> > future, should we (for example) look at short-circuiting
> > __mmu_notifier_invalidate_range() if none of the registered notifiers
> > implement ->invalidate_range(), etc.
> 
> So one might feel like just completely removing the range_start()/end()
> from the mmu_notifier and stick to this one callback but it will not work
> with other hardware like the one i am doing HMM patchset for (i send it
> again a couple weeks ago https://lkml.org/lkml/2014/8/29/423).
> 
> Right now there are very few user of the mmu_notifier, SGI, xen, KVM and
> sadly recently the GPU folks (which i am part of too). So as the GPU are
> starting to use it we will see a lot more application going through the
> mmu_notifier callback. Yet you do want to leverage the hw like GPU and
> you do want to use the same address space on the GPU as on the CPU and
> thus you do want to share or at least keep synchronize the GPU view of
> the CPU page table.
> 
> Right now, for IOMMUv2 this means adding this callback, for device that
> do not rely on ATS/PASID PCIE extension this means something like HMM.
> Also note that HMM is a superset of IOMMUv2 as it could be use at the
> same time and provide more feature mainly allowing migrating some page
> to device local memory for performances purposes.
> 
> I think this sumup all motivation behind this patchset and also behind
> my other patchset. As usual i am happy to discuss alternative way to do
> things but i think that the path of least disruption from current code
> is the one implemented by those patchset.
> 
> Cheers,
> Jérôme

Hi Andrew, I wanted to touch base to see if those explanations were
enough to understand the problem we are trying to solve. Do you think
the patchset is OK, or do you feel there should be a better solution?

Cheers,
Jérôme

> 
> 
> > _______________________________________________
> > iommu mailing list
> > iommu@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
@ 2014-09-12 18:39       ` Jerome Glisse
  0 siblings, 0 replies; 36+ messages in thread
From: Jerome Glisse @ 2014-09-12 18:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joerg Roedel, Andrea Arcangeli, Rik van Riel, jroedel,
	Peter Zijlstra, John.Bridgman, Jesse Barnes, Hugh Dickins,
	linux-kernel, ben.sander, linux-mm, Jerome Glisse, Jay.Cornwall,
	Mel Gorman, David Woodhouse, Johannes Weiner, iommu

On Wed, Sep 10, 2014 at 08:02:12PM -0400, Jerome Glisse wrote:
> On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote:
> > On Tue,  9 Sep 2014 17:43:51 +0200 Joerg Roedel <joro@8bytes.org> wrote:
> > 
> > > here is a patch-set to extend the mmu_notifiers in the Linux
> > > kernel to allow managing CPU external TLBs. Those TLBs may
> > > be implemented in IOMMUs or any other external device, e.g.
> > > ATS/PRI capable PCI devices.
> > > 
> > > The problem with managing these TLBs are the semantics of
> > > the invalidate_range_start/end call-backs currently
> > > available. Currently the subsystem using mmu_notifiers has
> > > to guarantee that no new TLB entries are established between
> > > invalidate_range_start/end. Furthermore the
> > > invalidate_range_start() function is called when all pages
> > > are still mapped and invalidate_range_end() when the pages
> > > are unmapped an already freed.
> > > 
> > > So both call-backs can't be used to safely flush any non-CPU
> > > TLB because _start() is called too early and _end() too
> > > late.
> > 
> > There's a lot of missing information here.  Why don't the existing
> > callbacks suit non-CPU TLBs?  What is different about them?  Please
> > update the changelog to contain all this context.
> 
> So unlike KVM or any current user of the mmu_notifier api, any PCIE
> ATS/PASID capable hardware IOMMUv2 for AMD and forgot the marketing
> name for Intel (probably VTd) implementation walk the cpu page table
> on there own and have there own TLB cache. In fact not only the iommu
> can have a TLB cache but any single PCIE hardware can implement its
> own local TLB cache.
> 
> So if we flush the IOMMU and device TLB inside the range_start callback
> there is a chance that the hw will just rewalk the cpu page table and
> repopulate its TLB before the CPU page table is actually updated.
> 
> Now if we shoot down the TLB inside the range_end callback, then we
> are too late ie the CPU page table is already populated with new entry
> and all the TLB in the IOMMU an in device might be pointing to the old
> pages.
> 
> So the aim of this callback is to happen right after the CPU page table
> is updated but before the old page is freed or recycled. Note that it
> is also safe for COW and other transition from like read only to read
> and write or the other way around.
> 
> > 
> > > In the AMD IOMMUv2 driver this is currently implemented by
> > > assigning an empty page-table to the external device between
> > > _start() and _end(). But as tests have shown this doesn't
> > > work as external devices don't re-fault infinitly but enter
> > > a failure state after some time.
> > 
> > More missing info.  Why are these faults occurring?  Is there some
> > device activity which is trying to fault in pages, but the CPU is
> > executing code between _start() and _end() so the driver must refuse to
> > instantiate a page to satisfy the fault?  That's just a guess, and I
> > shouldn't be guessing.  Please update the changelog to fully describe
> > the dynamic activity which is causing this.
> 
> The hack that was use prior to this patch was to point the IOMMU to an
> empty page table (a zero page) inside the range_start() callback and
> shoot down the TLB but this meant that the device might enter inside a
> storm of page fault. GPU can have thousand of threads and because during
> invalidation the empty page table is use they all starts triggering page
> fault even if they were not trying to access the range being invalidated.
> It turns out that when this happens current hw like AMD GPU actually stop
> working after a while ie the hw stumble because there is too much fault
> going on.
> 
> > 
> > > Next problem with this solution is that it causes an
> > > interrupt storm for IO page faults to be handled when an
> > > empty page-table is assigned.
> > 
> > Also too skimpy.  I *think* this is a variant of the problem in the
> > preceding paragraph.  We get a fault storm (which is problem 2) and
> > sometimes the faulting device gives up (which is problem 1).
> > 
> > Or something.  Please de-fog all of this.
> > 
> 
> Does above explanation help understand the issue ? Given that on each
> device page fault an IRQ is trigger (well the way the hw works is bit
> more complex than that).
> 
> > > Furthermore the _start()/end() notifiers only catch the
> > > moment when page mappings are released, but not page-table
> > > pages. But this is necessary for managing external TLBs when
> > > the page-table is shared with the CPU.
> > 
> > How come?
> 
> As explained above end() might happens after page that were previously
> mapped are free or recycled.
> 
> > 
> > > To solve this situation I wrote a patch-set to introduce a
> > > new notifier call-back: mmu_notifer_invalidate_range(). This
> > > notifier lifts the strict requirements that no new
> > > references are taken in the range between _start() and
> > > _end(). When the subsystem can't guarantee that any new
> > > references are taken is has to provide the
> > > invalidate_range() call-back to clear any new references in
> > > there.
> > > 
> > > It is called between invalidate_range_start() and _end()
> > > every time the VMM has to wipe out any references to a
> > > couple of pages. This are usually the places where the CPU
> > > TLBs are flushed too and where its important that this
> > > happens before invalidate_range_end() is called.
> > > 
> > > Any comments and review appreciated!
> > 
> > The patchset looks decent, although I find it had to review because I
> > just wasn't provided with enough of the thinking that went into it.  I
> > have enough info to look at the C code, but not enough info to identify
> > and evaluate alternative implementation approaches, to identify
> > possible future extensions, etc.
> > 
> > The patchset does appear to add significant additional overhead to hot
> > code paths when mm_has_notifiers(mm).  Please let's update the
> > changelog to address this rather important concern.  How significant is
> > the impact on such mm's, how common are such mm's now and in the
> > future, should we (for example) look at short-circuiting
> > __mmu_notifier_invalidate_range() if none of the registered notifiers
> > implement ->invalidate_range(), etc.
> 
> So one might feel like completely removing range_start()/end() from the
> mmu_notifier and sticking to this one callback, but that will not work
> with other hardware, like the one I am doing the HMM patchset for (I
> sent it again a couple of weeks ago: https://lkml.org/lkml/2014/8/29/423).
> 
> Right now there are very few users of the mmu_notifier: SGI, Xen, KVM
> and, sadly, recently the GPU folks (which I am part of too). So as GPUs
> start to use it we will see a lot more applications going through the
> mmu_notifier callbacks. Yet you do want to leverage hw like the GPU, you
> do want to use the same address space on the GPU as on the CPU, and thus
> you do want to share, or at least keep synchronized, the GPU view of the
> CPU page table.
> 
> Right now, for IOMMUv2 this means adding this callback; for devices that
> do not rely on the ATS/PASID PCIE extensions this means something like
> HMM. Also note that HMM is a superset of IOMMUv2, as both could be used
> at the same time, with HMM providing more features, mainly the ability
> to migrate some pages to device-local memory for performance reasons.
> 
> I think this sums up all the motivation behind this patchset and also
> behind my other patchset. As usual I am happy to discuss alternative
> ways to do things, but I think the path of least disruption from the
> current code is the one implemented by these patchsets.
> 
> Cheers,
> Jerome

Hi Andrew, I wanted to touch base to see if those explanations were enough
to understand the problem we are trying to solve. Do you think the patchset
is OK, or do you feel like there should be a better solution?

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 36+ messages in thread


* Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
  2014-09-10 22:01   ` Andrew Morton
  (?)
@ 2014-09-12 18:47     ` Joerg Roedel
  -1 siblings, 0 replies; 36+ messages in thread
From: Joerg Roedel @ 2014-09-12 18:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joerg Roedel, Andrea Arcangeli, Peter Zijlstra, Rik van Riel,
	Hugh Dickins, Mel Gorman, Johannes Weiner, Jerome Glisse,
	Jay.Cornwall, Oded.Gabbay, John.Bridgman, Suravee.Suthikulpanit,
	ben.sander, Jesse Barnes, David Woodhouse, linux-kernel,
	linux-mm, iommu

Hi Andrew,

thanks for your review, I tried to answer your questions below.

On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote:
> On Tue,  9 Sep 2014 17:43:51 +0200 Joerg Roedel <joro@8bytes.org> wrote:
> > So both call-backs can't be used to safely flush any non-CPU
> > TLB because _start() is called too early and _end() too
> > late.
> 
> There's a lot of missing information here.  Why don't the existing
> callbacks suit non-CPU TLBs?  What is different about them?  Please
> update the changelog to contain all this context.

The existing call-backs are called too early or too late. Specifically,
invalidate_range_start() is called when all pages are still mapped and 
invalidate_range_end() when all pages are unmapped and potentially
freed.

This is fine when the users of the mmu_notifiers manage their own
SoftTLB, like KVM does. When the TLB is managed in software it is easy
to wipe out entries for a given range and to prevent new entries from
being established until invalidate_range_end is called.
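
For illustration, this is roughly the pattern such a software-TLB user
can implement on top of the existing hooks (a simplified sketch, not the
actual KVM code; the soft_tlb structure and helpers are made up):

	static void soft_tlb_invalidate_range_start(struct mmu_notifier *mn,
						    struct mm_struct *mm,
						    unsigned long start,
						    unsigned long end)
	{
		struct soft_tlb *tlb = container_of(mn, struct soft_tlb, mn);

		spin_lock(&tlb->lock);
		tlb->invalidate_count++;		/* no new entries from now on */
		soft_tlb_wipe_range(tlb, start, end);	/* drop existing entries */
		spin_unlock(&tlb->lock);
	}

	static void soft_tlb_invalidate_range_end(struct mmu_notifier *mn,
						  struct mm_struct *mm,
						  unsigned long start,
						  unsigned long end)
	{
		struct soft_tlb *tlb = container_of(mn, struct soft_tlb, mn);

		spin_lock(&tlb->lock);
		tlb->invalidate_count--;		/* new entries may be added again */
		spin_unlock(&tlb->lock);
	}

A hardware TLB that walks the CPU page table on its own has no way to
implement the "no new entries from now on" step, which is exactly the
problem.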

But when the user of mmu_notifiers has to manage a hardware TLB it can
still wipe out TLB entries in invalidate_range_start, but it can't make
sure that no new TLB entries in the given range are established between
invalidate_range_start and invalidate_range_end.

	[ Actually the current AMD IOMMUv2 code tries to do that by
	  setting an empty page-table for the non-CPU TLB, but this
	  causes address translation errors which end up in device
	  failures. ]

But to avoid silent data corruption the TLB entries need to be flushed
out of the non-CPU hardware TLB when the pages are unmapped (at this
point in time no _new_ TLB entries can be established in the non-CPU
TLB) but not yet freed (as the non-CPU TLB may still have _existing_
entries pointing to the pages about to be freed).

So to fix this problem we need to catch the moment when the Linux VMM
flushes remote TLBs (as a non-CPU TLB is not very different in its
flushing requirements from any other remote CPU TLB), as this is the
point in time when the pages are unmapped but _not_ yet freed.

The mmu_notifier_invalidate_range() function aims to catch that moment.
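
For illustration, the way this gets wired up can be thought of as a
*_notify() variant of the existing flush primitives, roughly like the
following sketch (simplified, arch details omitted, not a verbatim copy
of the patch):

	static inline pte_t ptep_clear_flush_notify(struct vm_area_struct *vma,
						    unsigned long address,
						    pte_t *ptep)
	{
		pte_t pte;
		struct mm_struct *mm = vma->vm_mm;

		pte = ptep_clear_flush(vma, address, ptep);	/* clears the pte, flushes CPU TLBs */
		mmu_notifier_invalidate_range(mm, address,	/* flushes non-CPU TLBs at the same spot */
					      address + PAGE_SIZE);
		return pte;
	}

The pages behind the old pte are still around at this point, so a
non-CPU TLB that races with the invalidation can at worst still see the
old, valid data, never freed memory.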

> > In the AMD IOMMUv2 driver this is currently implemented by
> > assigning an empty page-table to the external device between
> > _start() and _end(). But as tests have shown this doesn't
> > work as external devices don't re-fault infinitly but enter
> > a failure state after some time.
> 
> More missing info.  Why are these faults occurring?  Is there some
> device activity which is trying to fault in pages, but the CPU is
> executing code between _start() and _end() so the driver must refuse to
> instantiate a page to satisfy the fault?  That's just a guess, and I
> shouldn't be guessing.  Please update the changelog to fully describe
> the dynamic activity which is causing this.

The device (usually a GPU) runs some process (for example a compute job)
that directly accesses a Linux process address space. Any access
to a process address space can cause a page-fault, whether the access
comes from the CPU or a remote device.

When the page-fault comes from a compute job running on a GPU, it is
reported to Linux by an IOMMU interrupt.

The current implementation of invalidate_range_start/end assigns an
empty page-table, which causes many page-faults from the GPU process,
resulting in an interrupt storm for the IOMMU.

The fault handler doesn't handle the fault if an
invalidate_range_start/end pair is active, but just reports back SUCCESS
to the device to let it re-fault the page later (re-faulting is the same
strategy KVM implements). But existing GPUs that make use of this
feature don't re-fault indefinitely; after a certain number of faults
for the same address the device enters a failure state and needs to be
reset.
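
To illustrate the dynamic (heavily simplified pseudo-code, not the
actual amd_iommu_v2 fault path; the pasid_state structure and the helper
names are made up):

	static void handle_iommu_page_fault(struct pasid_state *ps, u64 address)
	{
		if (atomic_read(&ps->invalidate_active)) {
			/*
			 * An invalidate_range_start/end pair is running:
			 * don't touch the page tables, just tell the device
			 * to retry.  This only works if the device really
			 * does retry indefinitely -- current GPUs don't.
			 */
			iommu_complete_fault(ps, address, PPR_SUCCESS);
			return;
		}

		/* Normal case: fault the page in and report the result. */
		iommu_fault_page_in(ps, address);
	}

Each of these faults arrives as an IOMMU interrupt, so a storm of them
is also an interrupt storm on the CPU.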

> > Next problem with this solution is that it causes an
> > interrupt storm for IO page faults to be handled when an
> > empty page-table is assigned.
> 
> Also too skimpy.  I *think* this is a variant of the problem in the
> preceding paragraph.  We get a fault storm (which is problem 2) and
> sometimes the faulting device gives up (which is problem 1).
> 
> Or something.  Please de-fog all of this.

Right, I will update the description to be more clear.

> > Furthermore the _start()/end() notifiers only catch the
> > moment when page mappings are released, but not page-table
> > pages. But this is necessary for managing external TLBs when
> > the page-table is shared with the CPU.
> 
> How come?

As mmu_notifiers have not been used so far for managing TLBs that share
the same page-table as the CPU, there was no need to catch page-table
freeing events, so such a notification is not available yet.

> > Any comments and review appreciated!
> 
> The patchset looks decent, although I find it hard to review because I
> just wasn't provided with enough of the thinking that went into it.  I
> have enough info to look at the C code, but not enough info to identify
> and evaluate alternative implementation approaches, to identify
> possible future extensions, etc.

Fair enough, I hope I clarified a few things with my explanations
above. I will also update the description of the patch-set when I
re-send.

> The patchset does appear to add significant additional overhead to hot
> code paths when mm_has_notifiers(mm).  Please let's update the
> changelog to address this rather important concern.  How significant is
> the impact on such mm's, how common are such mm's now and in the
> future, should we (for example) look at short-circuiting
> __mmu_notifier_invalidate_range() if none of the registered notifiers
> implement ->invalidate_range(), etc.

I think it adds the most overhead to single-CPU kernels. The
invalidate_range notifier is called in the paths that also do a remote
TLB flush, which is very expensive on its own. To those paths it adds
just another remote TLB that needs to be flushed.
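
Regarding short-circuiting: with the per-mm notifier list the call
already degenerates into a pointer check per registered notifier. A
sketch of the notifier walk (simplified, not a verbatim copy of the
patch):

	void __mmu_notifier_invalidate_range(struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end)
	{
		struct mmu_notifier *mn;
		int id;

		id = srcu_read_lock(&srcu);
		hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
			/* Notifiers that don't implement it are skipped. */
			if (mn->ops->invalidate_range)
				mn->ops->invalidate_range(mn, mm, start, end);
		}
		srcu_read_unlock(&srcu, id);
	}

So the cost for an mm that has notifiers registered, but none with an
->invalidate_range() implementation, is one list walk per remote flush;
an extra per-mm flag could avoid even that if it turns out to matter.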

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 36+ messages in thread


* Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
  2014-09-11  0:02     ` Jerome Glisse
  (?)
@ 2014-09-12 19:10       ` Andrew Morton
  -1 siblings, 0 replies; 36+ messages in thread
From: Andrew Morton @ 2014-09-12 19:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Joerg Roedel, Andrea Arcangeli, Rik van Riel, jroedel,
	Peter Zijlstra, John.Bridgman, Jesse Barnes, Hugh Dickins,
	linux-kernel, ben.sander, linux-mm, Jerome Glisse, Jay.Cornwall,
	Mel Gorman, David Woodhouse, Johannes Weiner, iommu

On Wed, 10 Sep 2014 20:02:12 -0400 Jerome Glisse <j.glisse@gmail.com> wrote:

> On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote:
> > On Tue,  9 Sep 2014 17:43:51 +0200 Joerg Roedel <joro@8bytes.org> wrote:
> > 
> > > here is a patch-set to extend the mmu_notifiers in the Linux
> > > kernel to allow managing CPU external TLBs. Those TLBs may
> > > be implemented in IOMMUs or any other external device, e.g.
> > > ATS/PRI capable PCI devices.
> > > 
> > > The problem with managing these TLBs are the semantics of
> > > the invalidate_range_start/end call-backs currently
> > > available. Currently the subsystem using mmu_notifiers has
> > > to guarantee that no new TLB entries are established between
> > > invalidate_range_start/end. Furthermore the
> > > invalidate_range_start() function is called when all pages
> > > are still mapped and invalidate_range_end() when the pages
> > > are unmapped an already freed.
> > > 
> > > So both call-backs can't be used to safely flush any non-CPU
> > > TLB because _start() is called too early and _end() too
> > > late.
> > 
> > There's a lot of missing information here.  Why don't the existing
> > callbacks suit non-CPU TLBs?  What is different about them?  Please
> > update the changelog to contain all this context.
> 
> So unlike KVM or any current user of the mmu_notifier api, any PCIE
> ATS/PASID capable hardware (IOMMUv2 for AMD, and I forget the marketing
> name of the Intel implementation, probably VT-d) walks the cpu page
> table on its own and has its own TLB cache. In fact not only the iommu
> can have a TLB cache, but any single PCIE device can implement its own
> local TLB cache.
> 
> So if we flush the IOMMU and device TLB inside the range_start callback
> there is a chance that the hw will just rewalk the cpu page table and
> repopulate its TLB before the CPU page table is actually updated.
> 
> Now if we shoot down the TLB inside the range_end callback, then we
> are too late, ie the CPU page table is already populated with new
> entries and all the TLBs in the IOMMU and in the device might still be
> pointing to the old pages.
> 
> So the aim of this callback is to run right after the CPU page table
> is updated but before the old page is freed or recycled.

You had me up to here.  What is "the old page"?  pagetable page, I
assume?  upper, middle, lower, all the above?  Why does the callback
function need to get at that page?

> > Also too skimpy.  I *think* this is a variant of the problem in the
> > preceding paragraph.  We get a fault storm (which is problem 2) and
> > sometimes the faulting device gives up (which is problem 1).
> > 
> > Or something.  Please de-fog all of this.
> > 
> 
> Does the above explanation help understand the issue? Note that on each
> device page fault an IRQ is triggered (well, the way the hw works is a
> bit more complex than that).

It helps.  Please understand that my main aim here is not to ask
specific questions.  I seek to demonstrate the level of detail which I
believe should be in the changelog so that others can effectively
understand and review this patchset, and to understand this code in the
future.

So it would be good if you were to update the changelog to answer these
questions.  It would be better if it were also to answer similar
questions which I didn't think to ask!

> > > Furthermore the _start()/end() notifiers only catch the
> > > moment when page mappings are released, but not page-table
> > > pages. But this is necessary for managing external TLBs when
> > > the page-table is shared with the CPU.
> > 
> > How come?
> 
> As explained above, end() might happen after pages that were previously
> mapped are freed or recycled.

Again, some explanation of why the driver wishes to inspect the
pagetable pages would be useful.

> > 
> > > To solve this situation I wrote a patch-set to introduce a
> > > new notifier call-back: mmu_notifer_invalidate_range(). This
> > > notifier lifts the strict requirements that no new
> > > references are taken in the range between _start() and
> > > _end(). When the subsystem can't guarantee that any new
> > > references are taken is has to provide the
> > > invalidate_range() call-back to clear any new references in
> > > there.
> > > 
> > > It is called between invalidate_range_start() and _end()
> > > every time the VMM has to wipe out any references to a
> > > couple of pages. This are usually the places where the CPU
> > > TLBs are flushed too and where its important that this
> > > happens before invalidate_range_end() is called.
> > > 
> > > Any comments and review appreciated!
> > 
> > The patchset looks decent, although I find it hard to review because I
> > just wasn't provided with enough of the thinking that went into it.  I
> > have enough info to look at the C code, but not enough info to identify
> > and evaluate alternative implementation approaches, to identify
> > possible future extensions, etc.
> > 
> > The patchset does appear to add significant additional overhead to hot
> > code paths when mm_has_notifiers(mm).  Please let's update the
> > changelog to address this rather important concern.  How significant is
> > the impact on such mm's, how common are such mm's now and in the
> > future, should we (for example) look at short-circuiting
> > __mmu_notifier_invalidate_range() if none of the registered notifiers
> > implement ->invalidate_range(), etc.
> 
> So one might feel like completely removing range_start()/end() from the
> mmu_notifier and sticking to this one callback, but that will not work
> with other hardware, like the one I am doing the HMM patchset for (I
> sent it again a couple of weeks ago: https://lkml.org/lkml/2014/8/29/423).
> 
> Right now there are very few users of the mmu_notifier: SGI, Xen, KVM
> and, sadly, recently the GPU folks (which I am part of too). So as GPUs
> start to use it we will see a lot more applications going through the
> mmu_notifier callbacks. Yet you do want to leverage hw like the GPU, you
> do want to use the same address space on the GPU as on the CPU, and thus
> you do want to share, or at least keep synchronized, the GPU view of the
> CPU page table.
> 
> Right now, for IOMMUv2 this means adding this callback; for devices that
> do not rely on the ATS/PASID PCIE extensions this means something like
> HMM. Also note that HMM is a superset of IOMMUv2, as both could be used
> at the same time, with HMM providing more features, mainly the ability
> to migrate some pages to device-local memory for performance reasons.

OK, but none of that addressed my question regarding the performance
cost.

> I think this sums up all the motivation behind this patchset and also
> behind my other patchset. As usual I am happy to discuss alternative
> ways to do things, but I think the path of least disruption from the
> current code is the one implemented by these patchsets.

"least disruption" is nice, but "good implementation" is better.  In
other words I'd prefer the more complex implementation if the end
result is better.  Depending on the magnitudes of "complex" and
"better" :) Two years from now (which isn't very long), I don't think
we'll regret having chosen the better implementation.

Does this patchset make compromises to achieve low disruption?


^ permalink raw reply	[flat|nested] 36+ messages in thread


* Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
  2014-09-12 18:47     ` Joerg Roedel
  (?)
@ 2014-09-12 19:19       ` Andrew Morton
  -1 siblings, 0 replies; 36+ messages in thread
From: Andrew Morton @ 2014-09-12 19:19 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Joerg Roedel, Andrea Arcangeli, Peter Zijlstra, Rik van Riel,
	Hugh Dickins, Mel Gorman, Johannes Weiner, Jerome Glisse,
	Jay.Cornwall, Oded.Gabbay, John.Bridgman, Suravee.Suthikulpanit,
	ben.sander, Jesse Barnes, David Woodhouse, linux-kernel,
	linux-mm, iommu

On Fri, 12 Sep 2014 20:47:39 +0200 Joerg Roedel <jroedel@suse.de> wrote:

> thanks for your review, I tried to answer your questions below.

You'd be amazed how helpful that was ;)

> Fair enough, I hope I clarified a few things with my explanations
> above. I will also update the description of the patch-set when I
> re-send.

Sounds good, thanks.


How does HMM play into all of this?  Would HMM make this patchset
obsolete, or could HMM be evolved to do so?  

^ permalink raw reply	[flat|nested] 36+ messages in thread


* Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
  2014-09-12 19:10       ` Andrew Morton
@ 2014-09-12 19:21         ` Jerome Glisse
  -1 siblings, 0 replies; 36+ messages in thread
From: Jerome Glisse @ 2014-09-12 19:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joerg Roedel, Andrea Arcangeli, Rik van Riel, jroedel,
	Peter Zijlstra, John.Bridgman, Jesse Barnes, Hugh Dickins,
	linux-kernel, ben.sander, linux-mm, Jerome Glisse, Jay.Cornwall,
	Mel Gorman, David Woodhouse, Johannes Weiner, iommu

On Fri, Sep 12, 2014 at 12:10:36PM -0700, Andrew Morton wrote:
> On Wed, 10 Sep 2014 20:02:12 -0400 Jerome Glisse <j.glisse@gmail.com> wrote:
> 
> > On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote:
> > > On Tue,  9 Sep 2014 17:43:51 +0200 Joerg Roedel <joro@8bytes.org> wrote:
> > > 
> > > > here is a patch-set to extend the mmu_notifiers in the Linux
> > > > kernel to allow managing CPU external TLBs. Those TLBs may
> > > > be implemented in IOMMUs or any other external device, e.g.
> > > > ATS/PRI capable PCI devices.
> > > > 
> > > > The problem with managing these TLBs are the semantics of
> > > > the invalidate_range_start/end call-backs currently
> > > > available. Currently the subsystem using mmu_notifiers has
> > > > to guarantee that no new TLB entries are established between
> > > > invalidate_range_start/end. Furthermore the
> > > > invalidate_range_start() function is called when all pages
> > > > are still mapped and invalidate_range_end() when the pages
> > > > are unmapped an already freed.
> > > > 
> > > > So both call-backs can't be used to safely flush any non-CPU
> > > > TLB because _start() is called too early and _end() too
> > > > late.
> > > 
> > > There's a lot of missing information here.  Why don't the existing
> > > callbacks suit non-CPU TLBs?  What is different about them?  Please
> > > update the changelog to contain all this context.
> > 
> > So unlike KVM or any current user of the mmu_notifier api, any PCIE
> > ATS/PASID capable hardware IOMMUv2 for AMD and forgot the marketing
> > name for Intel (probably VTd) implementation walk the cpu page table
> > on there own and have there own TLB cache. In fact not only the iommu
> > can have a TLB cache but any single PCIE hardware can implement its
> > own local TLB cache.
> > 
> > So if we flush the IOMMU and device TLB inside the range_start callback
> > there is a chance that the hw will just rewalk the cpu page table and
> > repopulate its TLB before the CPU page table is actually updated.
> > 
> > Now if we shoot down the TLB inside the range_end callback, then we
> > are too late ie the CPU page table is already populated with new entry
> > and all the TLB in the IOMMU an in device might be pointing to the old
> > pages.
> > 
> > So the aim of this callback is to happen right after the CPU page table
> > is updated but before the old page is freed or recycled.
> 
> You had me up to here.  What is "the old page"?  pagetable page, I
> assume?  upper, middle, lower, all the above?  Why does the callback
> function need to get at that page?

No, the old page is not a page freed from the CPU page-table directory but a
page that used to belong to the process, i.e. a page that was backing an
address range of the process we are interested in. The kernel flushes the
CPU TLB and then frees those pages before calling the mmu_notifier end
callback (for good reasons).

So we want to flush the hardware TLB (which is distinct from the CPU TLB)
before any of the pages that used to back a valid address range of the
process are freed or recycled.

Sorry for the confusion my wording caused.
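
To make the intended ordering concrete, here is a rough sketch of an unmap
path with the proposed callback in place. This is simplified pseudo-kernel
code, not the literal mm/ call chain (the real call sites are what the
second patch wires up), and the page-freeing step is schematic:

        mmu_notifier_invalidate_range_start(mm, start, end);  /* pages still mapped */

        pte = ptep_get_and_clear(mm, address, ptep);  /* CPU page table updated */
        flush_tlb_range(vma, start, end);             /* CPU TLB flushed */

        /* New hook: flush IOMMU/device TLBs while the old page still exists. */
        mmu_notifier_invalidate_range(mm, start, end);

        page_remove_rmap(page);
        put_page(page);                               /* old page freed/recycled */

        mmu_notifier_invalidate_range_end(mm, start, end);  /* too late for HW TLBs */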

> 
> > > Also too skimpy.  I *think* this is a variant of the problem in the
> > > preceding paragraph.  We get a fault storm (which is problem 2) and
> > > sometimes the faulting device gives up (which is problem 1).
> > > 
> > > Or something.  Please de-fog all of this.
> > > 
> > 
> > Does above explanation help understand the issue ? Given that on each
> > device page fault an IRQ is trigger (well the way the hw works is bit
> > more complex than that).
> 
> It helps.  Please understand that my main aim here is not to ask
> specific questions.  I seek to demonstrate the level of detail which I
> believe should be in the changelog so that others can effectively
> understand and review this patchset, and to understand this code in the
> future.

Totally agree, and I am rooting for better commit messages in other kernel
components too; since I started reviewing more patches in distinct
components I now clearly see the value of a good and exhaustive commit
message.

> 
> So it would be good if you were to update the changelog to answer these
> questions.  It would be better if it were also to answer similar
> questions which I didn't think to ask!
> 
> > > > Furthermore the _start()/end() notifiers only catch the
> > > > moment when page mappings are released, but not page-table
> > > > pages. But this is necessary for managing external TLBs when
> > > > the page-table is shared with the CPU.
> > > 
> > > How come?
> > 
> > As explained above end() might happens after page that were previously
> > mapped are free or recycled.
> 
> Again, some explanation of why the driver wishes to inspect the
> pagetable pages would be useful.
> 
> > > 
> > > > To solve this situation I wrote a patch-set to introduce a
> > > > new notifier call-back: mmu_notifer_invalidate_range(). This
> > > > notifier lifts the strict requirements that no new
> > > > references are taken in the range between _start() and
> > > > _end(). When the subsystem can't guarantee that any new
> > > > references are taken is has to provide the
> > > > invalidate_range() call-back to clear any new references in
> > > > there.
> > > > 
> > > > It is called between invalidate_range_start() and _end()
> > > > every time the VMM has to wipe out any references to a
> > > > couple of pages. This are usually the places where the CPU
> > > > TLBs are flushed too and where its important that this
> > > > happens before invalidate_range_end() is called.
> > > > 
> > > > Any comments and review appreciated!
> > > 
> > > The patchset looks decent, although I find it had to review because I
> > > just wasn't provided with enough of the thinking that went into it.  I
> > > have enough info to look at the C code, but not enough info to identify
> > > and evaluate alternative implementation approaches, to identify
> > > possible future extensions, etc.
> > > 
> > > The patchset does appear to add significant additional overhead to hot
> > > code paths when mm_has_notifiers(mm).  Please let's update the
> > > changelog to address this rather important concern.  How significant is
> > > the impact on such mm's, how common are such mm's now and in the
> > > future, should we (for example) look at short-circuiting
> > > __mmu_notifier_invalidate_range() if none of the registered notifiers
> > > implement ->invalidate_range(), etc.
> > 
> > So one might feel like just completely removing the range_start()/end()
> > from the mmu_notifier and stick to this one callback but it will not work
> > with other hardware like the one i am doing HMM patchset for (i send it
> > again a couple weeks ago https://lkml.org/lkml/2014/8/29/423).
> > 
> > Right now there are very few user of the mmu_notifier, SGI, xen, KVM and
> > sadly recently the GPU folks (which i am part of too). So as the GPU are
> > starting to use it we will see a lot more application going through the
> > mmu_notifier callback. Yet you do want to leverage the hw like GPU and
> > you do want to use the same address space on the GPU as on the CPU and
> > thus you do want to share or at least keep synchronize the GPU view of
> > the CPU page table.
> > 
> > Right now, for IOMMUv2 this means adding this callback, for device that
> > do not rely on ATS/PASID PCIE extension this means something like HMM.
> > Also note that HMM is a superset of IOMMUv2 as it could be use at the
> > same time and provide more feature mainly allowing migrating some page
> > to device local memory for performances purposes.
> 
> OK, but none of that addressed my question regarding the performance
> cost.

I think Joerg addressed the performance concern in his email: basically, a
CPU TLB flush is already a costly operation and the hardware TLB flush
should barely register as noise on top of it. This is the current theory,
but it is hard to say until we have a full working stack, including
userspace, with which proper and extensive benchmarking and profiling can
be performed.
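
For reference, the kind of short-circuit mentioned above would presumably
look something like the sketch below: walk the registered notifiers and
only call the ones that actually implement the new hook, so existing users
such as KVM that never set ->invalidate_range() pay nothing beyond the
SRCU-protected list walk. This is modelled on the existing mm/mmu_notifier.c
helpers and is illustrative only, not copied from the posted patches:

        void __mmu_notifier_invalidate_range(struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end)
        {
                struct mmu_notifier *mn;
                int id;

                id = srcu_read_lock(&srcu);
                hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
                        /* Skip notifiers that do not care about the new hook. */
                        if (mn->ops->invalidate_range)
                                mn->ops->invalidate_range(mn, mm, start, end);
                }
                srcu_read_unlock(&srcu, id);
        }

A stronger variant could cache a per-mm flag saying whether any registered
notifier implements ->invalidate_range() and skip even the list walk when
it does not.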

> 
> > I think this sumup all motivation behind this patchset and also behind
> > my other patchset. As usual i am happy to discuss alternative way to do
> > things but i think that the path of least disruption from current code
> > is the one implemented by those patchset.
> 
> "least disruption" is nice, but "good implementation" is better.  In
> other words I'd prefer the more complex implementation if the end
> result is better.  Depending on the magnitudes of "complex" and
> "better" :) Two years from now (which isn't very long), I don't think
> we'll regret having chosen the better implementation.
> 
> Does this patchset make compromises to achieve low disruption?

Well, right now I think we are lacking proper userspace support with which
this code and the new use case as a whole can be stress-tested, allowing us
to gather profiling information.

I think as a first step we should take this least disruptive path, and if
it proves to perform badly then we should work toward a possibly more
complex design. Note that the only more complex design I can think of
involves an overhaul of how process memory management is done, probably
linking CPU page-table updates with the scheduler to try to hide the cost
of those updates by scheduling other threads in the meantime.

Cheers,
Jérôme

* Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
  2014-09-12 19:21         ` Jerome Glisse
@ 2014-09-12 19:27           ` Andrew Morton
  -1 siblings, 0 replies; 36+ messages in thread
From: Andrew Morton @ 2014-09-12 19:27 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Joerg Roedel, Andrea Arcangeli, Rik van Riel, jroedel,
	Peter Zijlstra, John.Bridgman, Jesse Barnes, Hugh Dickins,
	linux-kernel, ben.sander, linux-mm, Jerome Glisse, Jay.Cornwall,
	Mel Gorman, David Woodhouse, Johannes Weiner, iommu

On Fri, 12 Sep 2014 15:21:01 -0400 Jerome Glisse <j.glisse@gmail.com> wrote:

> > > I think this sumup all motivation behind this patchset and also behind
> > > my other patchset. As usual i am happy to discuss alternative way to do
> > > things but i think that the path of least disruption from current code
> > > is the one implemented by those patchset.
> > 
> > "least disruption" is nice, but "good implementation" is better.  In
> > other words I'd prefer the more complex implementation if the end
> > result is better.  Depending on the magnitudes of "complex" and
> > "better" :) Two years from now (which isn't very long), I don't think
> > we'll regret having chosen the better implementation.
> > 
> > Does this patchset make compromises to achieve low disruption?
> 
> Well, right now I think we are lacking proper userspace support with which
> this code and the new use case as a whole can be stress-tested, allowing us
> to gather profiling information.
> 
> I think as a first step we should take this least disruptive path, and if
> it proves to perform badly then we should work toward a possibly more
> complex design. Note that the only more complex design I can think of
> involves an overhaul of how process memory management is done, probably
> linking CPU page-table updates with the scheduler to try to hide the cost
> of those updates by scheduling other threads in the meantime.

OK.  Often I'd resist merging a patchset when we're not sure it will be
sufficient.  But there are practical advantages to doing so and the
present patchset is quite simple.  So if we decide to remove it later
on the impact will be small.  If the patchset made userspace-visible
changes then things would be different!


* Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
  2014-09-12 19:19       ` Andrew Morton
@ 2014-09-12 19:28         ` Jerome Glisse
  -1 siblings, 0 replies; 36+ messages in thread
From: Jerome Glisse @ 2014-09-12 19:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joerg Roedel, Andrea Arcangeli, Rik van Riel, Jay.Cornwall,
	Peter Zijlstra, John.Bridgman, Hugh Dickins, linux-kernel,
	ben.sander, linux-mm, Jerome Glisse, iommu, Jesse Barnes,
	Mel Gorman, David Woodhouse, Johannes Weiner

On Fri, Sep 12, 2014 at 12:19:37PM -0700, Andrew Morton wrote:
> On Fri, 12 Sep 2014 20:47:39 +0200 Joerg Roedel <jroedel@suse.de> wrote:
> 
> > thanks for your review, I tried to answer your questions below.
> 
> You'd be amazed how helpful that was ;)
> 
> > Fair enough, I hope I clarified a few things with my explanations
> > above. I will also update the description of the patch-set when I
> > re-send.
> 
> Sounds good, thanks.
> 
> 
> How does HMM play into all of this?  Would HMM make this patchset
> obsolete, or could HMM be evolved to do so?  

HMM should be considered distinct from this. The hardware TLB we are talking
about in this patchset can be flushed by the CPU from inside an atomic
context (i.e. while holding the CPU page-table spinlock, for instance).

HMM, on the other hand, deals with hardware that has its own page table,
i.e. it does not necessarily walk the CPU page table. Flushing the TLB for
this kind of hardware means scheduling some job on the hardware, and that
cannot be done from kernel atomic context as the job might take a long time
to complete (imagine preempting thousands of threads on a GPU).
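
As a purely hypothetical illustration of that difference (every name below
is made up, only the shape matters): an IOMMUv2-style ->invalidate_range()
can poke the hardware directly because it never needs to sleep, whereas an
HMM-style device has to do its invalidation from sleepable context:

        /* IOMMUv2-style flush: a few register writes / queued invalidation
         * commands, safe under the page-table spinlock. */
        static void example_iommu_invalidate_range(struct mmu_notifier *mn,
                                                   struct mm_struct *mm,
                                                   unsigned long start,
                                                   unsigned long end)
        {
                example_iommu_flush_iotlb(to_example_iommu(mn), start, end);
        }

        /* HMM-style hardware: invalidation means preempting GPU threads and
         * waiting for them, which can sleep, so it must run outside atomic
         * context (which is part of what HMM arranges). */
        static void example_gpu_invalidate(struct example_gpu *gpu,
                                           unsigned long start,
                                           unsigned long end)
        {
                example_gpu_preempt_range(gpu, start, end);   /* may sleep */
                wait_for_completion(&gpu->preempt_done);      /* sleeps    */
        }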

Still, HMM can be used in a mixed environment where IOMMUv2 is used for
memory that resides in system RAM while HMM only handles memory that has
been migrated to device memory.

So while HMM intends to provide more features than the IOMMUv2 hardware
allows, it does not intend to replace it. On the contrary, the hope is that
both can work at the same time.

Cheers,
Jérôme
