* [PATCH v2 0/5] Invalidate secondary IOMMU TLB on permission upgrade
From: Alistair Popple @ 2023-07-19 12:18 UTC (permalink / raw)
  To: akpm
  Cc: ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard, kevin.tian,
	kvm, linux-arm-kernel, linux-kernel, linux-mm, linuxppc-dev, mpe,
	nicolinc, npiggin, robin.murphy, seanjc, will, x86,
	zhi.wang.linux, Alistair Popple

The main change is to move secondary TLB invalidation mmu notifier
callbacks into the architecture specific TLB flushing functions. This
makes secondary TLB invalidation mostly match CPU invalidation while
still allowing efficient range-based invalidations built on the
existing TLB batching code.

Changes for v2:

 - Rebased on linux-next commit 906fa30154ef ("mm/rmap: correct stale
   comment of rmap_walk_anon and rmap_walk_file") to fix a minor
   integration conflict with "arm64: support batched/deferred tlb
   shootdown during page reclamation/migration". This series will need
   to be applied after the conflicting patch.

 - Reordered the function rename until the end of the series as many
   places that were getting renamed ended up being removed anyway.

 - Fixed a couple of build issues which broke bisection.

 - Added a minor patch to fix up a stale/incorrect comment.

==========
Background
==========

The arm64 architecture specifies that TLB permission bits may be
cached and therefore the TLB must be invalidated during permission
upgrades. For the CPU this currently occurs in the architecture
specific ptep_set_access_flags() routine.
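
For illustration, the CPU-side flow is roughly the following (a
simplified sketch, not the exact arm64 implementation; includes and
helpers come from the usual mm headers):

  int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
                            pte_t *ptep, pte_t entry, int dirty)
  {
          pte_t old = ptep_get(ptep);

          /* Nothing changed, nothing to flush. */
          if (pte_same(old, entry))
                  return 0;

          /* Install the more permissive entry... */
          set_pte_at(vma->vm_mm, address, ptep, entry);

          /* ...and flush, as stale permission bits may be cached. */
          flush_tlb_page(vma, address);
          return 1;
  }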

Secondary TLBs, such as those implemented by the SMMU IOMMU, follow
the CPU architecture specification, so they may also cache permission
bits and require the same TLB invalidations. This may be achieved in
one of two ways.

Some SMMU implementations support broadcast TLB maintenance (BTM),
which snoops CPU TLB invalidations and invalidates the secondary TLB
at the same time as the CPU. However, implementations are not required
to support BTM.

Implementations without BTM rely on mmu notifier callbacks to send
explicit TLB invalidation commands to invalidate the SMMU TLB.
Therefore either generic kernel code or architecture specific code
needs to call the mmu notifier on permission upgrade.
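
As an illustration, a driver without BTM hooks into this today roughly
as follows (a sketch using hypothetical helper and driver names, based
on the current .invalidate_range() callback before the rename proposed
below):

  /* Placeholder for the device specific queueing of invalidation commands. */
  static void my_iommu_send_tlbi(struct mmu_notifier *mn,
                                 unsigned long start, unsigned long size);

  static void my_iommu_invalidate_range(struct mmu_notifier *mn,
                                        struct mm_struct *mm,
                                        unsigned long start, unsigned long end)
  {
          /* Send explicit TLB invalidation commands covering [start, end). */
          my_iommu_send_tlbi(mn, start, end - start);
  }

  static const struct mmu_notifier_ops my_iommu_mn_ops = {
          .invalidate_range = my_iommu_invalidate_range,
  };

The notifier is attached to the process mm with mmu_notifier_register(),
after which core mm code is responsible for calling it whenever the CPU
TLB is invalidated, which is exactly what is currently missed on
permission upgrade.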

Currently that doesn't happen, so devices fault indefinitely when
writing to a PTE that was previously read-only, as nothing invalidates
the SMMU TLB.

========
Solution
========

To fix this, the series first renames the .invalidate_range() callback
to .arch_invalidate_secondary_tlbs(), as suggested by Jason and Sean,
to make it clear this callback is only used for secondary TLBs. That
was made possible thanks to Sean's series [1] to remove KVM's
incorrect usage.
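
After the rename the hook in mmu_notifier_ops has the following shape
(a sketch of the intended post-series state, other callbacks omitted):

  struct mmu_notifier_ops {
          /* ... existing callbacks unchanged ... */

          /*
           * Invalidate architectural secondary TLBs (e.g. an IOMMU TLB)
           * for the range [start, end) of mm. The new name makes it clear
           * this is not for software secondary MMUs such as KVM.
           */
          void (*arch_invalidate_secondary_tlbs)(struct mmu_notifier *mn,
                                                 struct mm_struct *mm,
                                                 unsigned long start,
                                                 unsigned long end);
  };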

Based on feedback from Jason [2] the proposed solution to the bug is
to move the calls to mmu_notifier_arch_invalidate_secondary_tlbs()
closer to the architecture specific TLB invalidation code. This
ensures the secondary TLB won't miss invalidations, including the
existing invalidation in the ARM64 code to deal with permission
upgrade.

Currently only ARM64, PowerPC and x86 have IOMMUs with secondary TLBs
requiring software invalidation, so the notifier is only called for
those architectures. It is also not called for invalidation of kernel
mappings, as no secondary IOMMU implementation can access those and
hence it is not required.

[1] - https://lore.kernel.org/all/20230602011518.787006-1-seanjc@google.com/
[2] - https://lore.kernel.org/linux-mm/ZJMR5bw8l+BbzdJ7@ziepe.ca/

Alistair Popple (5):
  arm64/smmu: Use TLBI ASID when invalidating entire range
  mmu_notifiers: Fixup comment in mmu_interval_read_begin()
  mmu_notifiers: Call invalidate_range() when invalidating TLBs
  mmu_notifiers: Don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
  mmu_notifiers: Rename invalidate_range notifier

 arch/arm64/include/asm/tlbflush.h               |   5 +-
 arch/powerpc/include/asm/book3s/64/tlbflush.h   |   1 +-
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c    |   1 +-
 arch/powerpc/mm/book3s64/radix_tlb.c            |   6 +-
 arch/x86/mm/tlb.c                               |   3 +-
 drivers/iommu/amd/iommu_v2.c                    |  10 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c |  29 +++--
 drivers/iommu/intel/svm.c                       |   8 +-
 drivers/misc/ocxl/link.c                        |   8 +-
 include/asm-generic/tlb.h                       |   1 +-
 include/linux/mmu_notifier.h                    | 104 ++++-------------
 kernel/events/uprobes.c                         |   2 +-
 mm/huge_memory.c                                |  29 +----
 mm/hugetlb.c                                    |   8 +-
 mm/memory.c                                     |   8 +-
 mm/migrate_device.c                             |   9 +-
 mm/mmu_notifier.c                               |  49 +++-----
 mm/rmap.c                                       |  40 +-------
 18 files changed, 110 insertions(+), 211 deletions(-)

base-commit: 906fa30154ef42f93d28d7322860e76c6ae392ac
-- 
git-series 0.9.1


* [PATCH v2 1/5] arm64/smmu: Use TLBI ASID when invalidating entire range
From: Alistair Popple @ 2023-07-19 12:18 UTC (permalink / raw)
  To: akpm
  Cc: ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard, kevin.tian,
	kvm, linux-arm-kernel, linux-kernel, linux-mm, linuxppc-dev, mpe,
	nicolinc, npiggin, robin.murphy, seanjc, will, x86,
	zhi.wang.linux, Alistair Popple

The ARM SMMU has a specific command for invalidating the TLB for an
entire ASID. Currently this is used for the IO_PGTABLE API but not for
ATS when called from the MMU notifier.

The current implementation of notifiers does not attempt to invalidate
such a large address range, instead walking each VMA and invalidating
   each range individually during mmap removal. However, in future, SMMU
   TLB invalidations are going to be sent as part of the normal
   flush_tlb_*() kernel calls. To better deal with that, add handling to
   use TLBI ASID when invalidating the entire address space.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index a5a63b1..2a19784 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -200,10 +200,20 @@ static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
 	 * range. So do a simple translation here by calculating size correctly.
 	 */
 	size = end - start;
+	if (size == ULONG_MAX)
+		size = 0;
+
+	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM)) {
+		if (!size)
+			arm_smmu_tlb_inv_asid(smmu_domain->smmu,
+					      smmu_mn->cd->asid);
+		else
+			arm_smmu_tlb_inv_range_asid(start, size,
+						    smmu_mn->cd->asid,
+						    PAGE_SIZE, false,
+						    smmu_domain);
+	}
 
-	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM))
-		arm_smmu_tlb_inv_range_asid(start, size, smmu_mn->cd->asid,
-					    PAGE_SIZE, false, smmu_domain);
 	arm_smmu_atc_inv_domain(smmu_domain, mm->pasid, start, size);
 }
 
-- 
git-series 0.9.1


* [PATCH v2 2/5] mmu_notifiers: Fixup comment in mmu_interval_read_begin()
From: Alistair Popple @ 2023-07-19 12:18 UTC (permalink / raw)
  To: akpm
  Cc: ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard, kevin.tian,
	kvm, linux-arm-kernel, linux-kernel, linux-mm, linuxppc-dev, mpe,
	nicolinc, npiggin, robin.murphy, seanjc, will, x86,
	zhi.wang.linux, Alistair Popple

The comment in mmu_interval_read_begin() refers to a function that
doesn't exist and uses the wrong callback name. The op for mmu
interval notifiers is mmu_interval_notifier_ops->invalidate() so fix
the comment up to reflect that.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 mm/mmu_notifier.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 50c0dde..b7ad155 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -199,7 +199,7 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
 	 * invalidate_start/end and is colliding.
 	 *
 	 * The locking looks broadly like this:
-	 *   mn_tree_invalidate_start():          mmu_interval_read_begin():
+	 *   mn_itree_inv_start():                 mmu_interval_read_begin():
 	 *                                         spin_lock
 	 *                                          seq = READ_ONCE(interval_sub->invalidate_seq);
 	 *                                          seq == subs->invalidate_seq
@@ -207,7 +207,7 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
 	 *    spin_lock
 	 *     seq = ++subscriptions->invalidate_seq
 	 *    spin_unlock
-	 *     op->invalidate_range():
+	 *     op->invalidate():
 	 *       user_lock
 	 *        mmu_interval_set_seq()
 	 *         interval_sub->invalidate_seq = seq
-- 
git-series 0.9.1


* [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs
From: Alistair Popple @ 2023-07-19 12:18 UTC (permalink / raw)
  To: akpm
  Cc: ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard, kevin.tian,
	kvm, linux-arm-kernel, linux-kernel, linux-mm, linuxppc-dev, mpe,
	nicolinc, npiggin, robin.murphy, seanjc, will, x86,
	zhi.wang.linux, Alistair Popple

The invalidate_range() notifier is going to become an architecture
specific mmu notifier used to keep the TLBs of secondary MMUs such as
an IOMMU in sync with the CPU page tables. Currently it is called from
code paths separate from the main CPU TLB invalidations. This can lead
to a secondary TLB not being invalidated when required and makes it
hard to reason about exactly when the secondary TLB is invalidated.

To fix this move the notifier call to the architecture specific TLB
maintenance functions for architectures that have secondary MMUs
requiring explicit software invalidations.

This fixes an SMMU bug on ARM64. On ARM64, PTE permission upgrades
require a TLB invalidation. This invalidation is done by the
architecture specific ptep_set_access_flags() which calls
flush_tlb_page() if required. However this doesn't call the notifier,
resulting in infinite faults being generated by devices using the SMMU
if it has previously cached a read-only PTE in its TLB.

Moving the invalidations into the TLB invalidation functions ensures
all invalidations happen at the same time as the CPU invalidation. The
architecture specific flush_tlb_all() routines do not call the
notifier as none of the IOMMUs require this.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
---
 arch/arm64/include/asm/tlbflush.h             | 5 +++++
 arch/powerpc/include/asm/book3s/64/tlbflush.h | 1 +
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c  | 1 +
 arch/powerpc/mm/book3s64/radix_tlb.c          | 6 ++++++
 arch/x86/mm/tlb.c                             | 3 +++
 include/asm-generic/tlb.h                     | 1 -
 6 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 3456866..a99349d 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -13,6 +13,7 @@
 #include <linux/bitfield.h>
 #include <linux/mm_types.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
 #include <asm/cputype.h>
 #include <asm/mmu.h>
 
@@ -252,6 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	__tlbi(aside1is, asid);
 	__tlbi_user(aside1is, asid);
 	dsb(ish);
+	mmu_notifier_invalidate_range(mm, 0, -1UL);
 }
 
 static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
@@ -263,6 +265,8 @@ static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
+	mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
+						(uaddr & PAGE_MASK) + PAGE_SIZE);
 }
 
 static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
@@ -396,6 +400,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 		scale++;
 	}
 	dsb(ish);
+	mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 0d0c144..dca0477 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -5,6 +5,7 @@
 #define MMU_NO_CONTEXT	~0UL
 
 #include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
 #include <asm/book3s/64/tlbflush-hash.h>
 #include <asm/book3s/64/tlbflush-radix.h>
 
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index 5e31955..f3fb49f 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -39,6 +39,7 @@ void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long st
 		radix__flush_tlb_pwc_range_psize(vma->vm_mm, start, end, psize);
 	else
 		radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
+	mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 }
 
 void radix__huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0bd4866..9724b26 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -752,6 +752,8 @@ void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmadd
 		return radix__local_flush_hugetlb_page(vma, vmaddr);
 #endif
 	radix__local_flush_tlb_page_psize(vma->vm_mm, vmaddr, mmu_virtual_psize);
+	mmu_notifier_invalidate_range(vma->vm_mm, vmaddr,
+						vmaddr + mmu_virtual_psize);
 }
 EXPORT_SYMBOL(radix__local_flush_tlb_page);
 
@@ -987,6 +989,7 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 		}
 	}
 	preempt_enable();
+	mmu_notifier_invalidate_range(mm, 0, -1UL);
 }
 EXPORT_SYMBOL(radix__flush_tlb_mm);
 
@@ -1020,6 +1023,7 @@ static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
 			_tlbiel_pid_multicast(mm, pid, RIC_FLUSH_ALL);
 	}
 	preempt_enable();
+	mmu_notifier_invalidate_range(mm, 0, -1UL);
 }
 
 void radix__flush_all_mm(struct mm_struct *mm)
@@ -1228,6 +1232,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 	}
 out:
 	preempt_enable();
+	mmu_notifier_invalidate_range(mm, start, end);
 }
 
 void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
@@ -1392,6 +1397,7 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 	}
 out:
 	preempt_enable();
+	mmu_notifier_invalidate_range(mm, start, end);
 }
 
 void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long start,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 267acf2..c30fbcd 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -10,6 +10,7 @@
 #include <linux/debugfs.h>
 #include <linux/sched/smt.h>
 #include <linux/task_work.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1036,6 +1037,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 	put_flush_tlb_info();
 	put_cpu();
+	mmu_notifier_invalidate_range(mm, start, end);
 }
 
 
@@ -1263,6 +1265,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	put_flush_tlb_info();
 	put_cpu();
+	mmu_notifier_invalidate_range(current->mm, 0, -1UL);
 }
 
 /*
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b466172..bc32a22 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -456,7 +456,6 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 		return;
 
 	tlb_flush(tlb);
-	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
 	__tlb_reset_range(tlb);
 }
 
-- 
git-series 0.9.1


* [PATCH v2 4/5] mmu_notifiers: Don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
From: Alistair Popple @ 2023-07-19 12:18 UTC (permalink / raw)
  To: akpm
  Cc: ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard, kevin.tian,
	kvm, linux-arm-kernel, linux-kernel, linux-mm, linuxppc-dev, mpe,
	nicolinc, npiggin, robin.murphy, seanjc, will, x86,
	zhi.wang.linux, Alistair Popple

Secondary TLBs are now invalidated from the architecture specific TLB
invalidation functions. Therefore there is no need to explicitly
notify or invalidate as part of the range end functions. This means we
can remove mmu_notifier_invalidate_range_only_end() and some of the
ptep_*_notify() functions.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/mmu_notifier.h | 56 +------------------------------------
 kernel/events/uprobes.c      |  2 +-
 mm/huge_memory.c             | 25 ++---------------
 mm/hugetlb.c                 |  1 +-
 mm/memory.c                  |  8 +----
 mm/migrate_device.c          |  9 +-----
 mm/mmu_notifier.c            | 25 ++---------------
 mm/rmap.c                    | 40 +--------------------------
 8 files changed, 14 insertions(+), 152 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e05..f2e9edc 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -395,8 +395,7 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
-extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r,
-				  bool only_end);
+extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 extern bool
@@ -481,14 +480,7 @@ mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
 		might_sleep();
 
 	if (mm_has_notifiers(range->mm))
-		__mmu_notifier_invalidate_range_end(range, false);
-}
-
-static inline void
-mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range)
-{
-	if (mm_has_notifiers(range->mm))
-		__mmu_notifier_invalidate_range_end(range, true);
+		__mmu_notifier_invalidate_range_end(range);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -582,45 +574,6 @@ static inline void mmu_notifier_range_init_owner(
 	__young;							\
 })
 
-#define	ptep_clear_flush_notify(__vma, __address, __ptep)		\
-({									\
-	unsigned long ___addr = __address & PAGE_MASK;			\
-	struct mm_struct *___mm = (__vma)->vm_mm;			\
-	pte_t ___pte;							\
-									\
-	___pte = ptep_clear_flush(__vma, __address, __ptep);		\
-	mmu_notifier_invalidate_range(___mm, ___addr,			\
-					___addr + PAGE_SIZE);		\
-									\
-	___pte;								\
-})
-
-#define pmdp_huge_clear_flush_notify(__vma, __haddr, __pmd)		\
-({									\
-	unsigned long ___haddr = __haddr & HPAGE_PMD_MASK;		\
-	struct mm_struct *___mm = (__vma)->vm_mm;			\
-	pmd_t ___pmd;							\
-									\
-	___pmd = pmdp_huge_clear_flush(__vma, __haddr, __pmd);		\
-	mmu_notifier_invalidate_range(___mm, ___haddr,			\
-				      ___haddr + HPAGE_PMD_SIZE);	\
-									\
-	___pmd;								\
-})
-
-#define pudp_huge_clear_flush_notify(__vma, __haddr, __pud)		\
-({									\
-	unsigned long ___haddr = __haddr & HPAGE_PUD_MASK;		\
-	struct mm_struct *___mm = (__vma)->vm_mm;			\
-	pud_t ___pud;							\
-									\
-	___pud = pudp_huge_clear_flush(__vma, __haddr, __pud);		\
-	mmu_notifier_invalidate_range(___mm, ___haddr,			\
-				      ___haddr + HPAGE_PUD_SIZE);	\
-									\
-	___pud;								\
-})
-
 /*
  * set_pte_at_notify() sets the pte _after_ running the notifier.
  * This is safe to start by updating the secondary MMUs, because the primary MMU
@@ -711,11 +664,6 @@ void mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
 {
 }
 
-static inline void
-mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range)
-{
-}
-
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f0ac5b8..3048589 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -193,7 +193,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	}
 
 	flush_cache_page(vma, addr, pte_pfn(ptep_get(pvmw.pte)));
-	ptep_clear_flush_notify(vma, addr, pvmw.pte);
+	ptep_clear_flush(vma, addr, pvmw.pte);
 	if (new_page)
 		set_pte_at_notify(mm, addr, pvmw.pte,
 				  mk_pte(new_page, vma->vm_page_prot));
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 762be2f..3ece117 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2003,7 +2003,7 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
 
 	count_vm_event(THP_SPLIT_PUD);
 
-	pudp_huge_clear_flush_notify(vma, haddr, pud);
+	pudp_huge_clear_flush(vma, haddr, pud);
 }
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
@@ -2023,11 +2023,7 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 out:
 	spin_unlock(ptl);
-	/*
-	 * No need to double call mmu_notifier->invalidate_range() callback as
-	 * the above pudp_huge_clear_flush_notify() did already call it.
-	 */
-	mmu_notifier_invalidate_range_only_end(&range);
+	mmu_notifier_invalidate_range_end(&range);
 }
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
@@ -2094,7 +2090,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	count_vm_event(THP_SPLIT_PMD);
 
 	if (!vma_is_anonymous(vma)) {
-		old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+		old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
 		/*
 		 * We are going to unmap this huge page. So
 		 * just go ahead and zap it
@@ -2304,20 +2300,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 out:
 	spin_unlock(ptl);
-	/*
-	 * No need to double call mmu_notifier->invalidate_range() callback.
-	 * They are 3 cases to consider inside __split_huge_pmd_locked():
-	 *  1) pmdp_huge_clear_flush_notify() call invalidate_range() obvious
-	 *  2) __split_huge_zero_page_pmd() read only zero page and any write
-	 *    fault will trigger a flush_notify before pointing to a new page
-	 *    (it is fine if the secondary mmu keeps pointing to the old zero
-	 *    page in the meantime)
-	 *  3) Split a huge pmd into pte pointing to the same page. No need
-	 *     to invalidate secondary tlb entry they are all still valid.
-	 *     any further changes to individual pte will notify. So no need
-	 *     to call mmu_notifier->invalidate_range()
-	 */
-	mmu_notifier_invalidate_range_only_end(&range);
+	mmu_notifier_invalidate_range_end(&range);
 }
 
 void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dc1ec19..9c6e431 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5715,7 +5715,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		/* Break COW or unshare */
 		huge_ptep_clear_flush(vma, haddr, ptep);
-		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		page_remove_rmap(&old_folio->page, vma, true);
 		hugepage_add_new_anon_rmap(new_folio, vma, haddr);
 		if (huge_pte_uffd_wp(pte))
diff --git a/mm/memory.c b/mm/memory.c
index ad79039..8dca544 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3158,7 +3158,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * that left a window where the new PTE could be loaded into
 		 * some TLBs while the old PTE remains in others.
 		 */
-		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
+		ptep_clear_flush(vma, vmf->address, vmf->pte);
 		folio_add_new_anon_rmap(new_folio, vma, vmf->address);
 		folio_add_lru_vma(new_folio, vma);
 		/*
@@ -3204,11 +3204,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 	}
 
-	/*
-	 * No need to double call mmu_notifier->invalidate_range() callback as
-	 * the above ptep_clear_flush_notify() did already call it.
-	 */
-	mmu_notifier_invalidate_range_only_end(&range);
+	mmu_notifier_invalidate_range_end(&range);
 
 	if (new_folio)
 		folio_put(new_folio);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index e29626e..6c556b5 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -658,7 +658,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 
 	if (flush) {
 		flush_cache_page(vma, addr, pte_pfn(orig_pte));
-		ptep_clear_flush_notify(vma, addr, ptep);
+		ptep_clear_flush(vma, addr, ptep);
 		set_pte_at_notify(mm, addr, ptep, entry);
 		update_mmu_cache(vma, addr, ptep);
 	} else {
@@ -763,13 +763,8 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
 	}
 
-	/*
-	 * No need to double call mmu_notifier->invalidate_range() callback as
-	 * the above ptep_clear_flush_notify() inside migrate_vma_insert_page()
-	 * did already call it.
-	 */
 	if (notified)
-		mmu_notifier_invalidate_range_only_end(&range);
+		mmu_notifier_invalidate_range_end(&range);
 }
 
 /**
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index b7ad155..453a156 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -551,7 +551,7 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 
 static void
 mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subscriptions,
-			struct mmu_notifier_range *range, bool only_end)
+			struct mmu_notifier_range *range)
 {
 	struct mmu_notifier *subscription;
 	int id;
@@ -559,24 +559,6 @@ mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subscriptions,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
-		/*
-		 * Call invalidate_range here too to avoid the need for the
-		 * subsystem of having to register an invalidate_range_end
-		 * call-back when there is invalidate_range already. Usually a
-		 * subsystem registers either invalidate_range_start()/end() or
-		 * invalidate_range(), so this will be no additional overhead
-		 * (besides the pointer check).
-		 *
-		 * We skip call to invalidate_range() if we know it is safe ie
-		 * call site use mmu_notifier_invalidate_range_only_end() which
-		 * is safe to do when we know that a call to invalidate_range()
-		 * already happen under page table lock.
-		 */
-		if (!only_end && subscription->ops->invalidate_range)
-			subscription->ops->invalidate_range(subscription,
-							    range->mm,
-							    range->start,
-							    range->end);
 		if (subscription->ops->invalidate_range_end) {
 			if (!mmu_notifier_range_blockable(range))
 				non_block_start();
@@ -589,8 +571,7 @@ mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subscriptions,
 	srcu_read_unlock(&srcu, id);
 }
 
-void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
-					 bool only_end)
+void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
 {
 	struct mmu_notifier_subscriptions *subscriptions =
 		range->mm->notifier_subscriptions;
@@ -600,7 +581,7 @@ void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
 		mn_itree_inv_end(subscriptions);
 
 	if (!hlist_empty(&subscriptions->list))
-		mn_hlist_invalidate_end(subscriptions, range, only_end);
+		mn_hlist_invalidate_end(subscriptions, range);
 	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 1355bf6..51ec8aa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -985,13 +985,6 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 #endif
 		}
 
-		/*
-		 * No need to call mmu_notifier_invalidate_range() as we are
-		 * downgrading page table protection not changing it to point
-		 * to a new page.
-		 *
-		 * See Documentation/mm/mmu_notifier.rst
-		 */
 		if (ret)
 			cleaned++;
 	}
@@ -1549,8 +1542,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					hugetlb_vma_unlock_write(vma);
 					flush_tlb_range(vma,
 						range.start, range.end);
-					mmu_notifier_invalidate_range(mm,
-						range.start, range.end);
 					/*
 					 * The ref count of the PMD page was
 					 * dropped which is part of the way map
@@ -1623,9 +1614,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * copied pages.
 			 */
 			dec_mm_counter(mm, mm_counter(&folio->page));
-			/* We have to invalidate as we cleared the pte */
-			mmu_notifier_invalidate_range(mm, address,
-						      address + PAGE_SIZE);
 		} else if (folio_test_anon(folio)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
@@ -1637,9 +1625,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					folio_test_swapcache(folio))) {
 				WARN_ON_ONCE(1);
 				ret = false;
-				/* We have to invalidate as we cleared the pte */
-				mmu_notifier_invalidate_range(mm, address,
-							address + PAGE_SIZE);
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
@@ -1670,9 +1655,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				if (ref_count == 1 + map_count &&
 				    !folio_test_dirty(folio)) {
-					/* Invalidate as we cleared the pte */
-					mmu_notifier_invalidate_range(mm,
-						address, address + PAGE_SIZE);
 					dec_mm_counter(mm, MM_ANONPAGES);
 					goto discard;
 				}
@@ -1727,9 +1709,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			if (pte_uffd_wp(pteval))
 				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
-			/* Invalidate as we cleared the pte */
-			mmu_notifier_invalidate_range(mm, address,
-						      address + PAGE_SIZE);
 		} else {
 			/*
 			 * This is a locked file-backed folio,
@@ -1745,13 +1724,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			dec_mm_counter(mm, mm_counter_file(&folio->page));
 		}
 discard:
-		/*
-		 * No need to call mmu_notifier_invalidate_range() it has be
-		 * done above for all cases requiring it to happen under page
-		 * table lock before mmu_notifier_invalidate_range_end()
-		 *
-		 * See Documentation/mm/mmu_notifier.rst
-		 */
 		page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
@@ -1930,8 +1902,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 					hugetlb_vma_unlock_write(vma);
 					flush_tlb_range(vma,
 						range.start, range.end);
-					mmu_notifier_invalidate_range(mm,
-						range.start, range.end);
 
 					/*
 					 * The ref count of the PMD page was
@@ -2036,9 +2006,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * copied pages.
 			 */
 			dec_mm_counter(mm, mm_counter(&folio->page));
-			/* We have to invalidate as we cleared the pte */
-			mmu_notifier_invalidate_range(mm, address,
-						      address + PAGE_SIZE);
 		} else {
 			swp_entry_t entry;
 			pte_t swp_pte;
@@ -2102,13 +2069,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 */
 		}
 
-		/*
-		 * No need to call mmu_notifier_invalidate_range() it has be
-		 * done above for all cases requiring it to happen under page
-		 * table lock before mmu_notifier_invalidate_range_end()
-		 *
-		 * See Documentation/mm/mmu_notifier.rst
-		 */
 		page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread
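
To summarise the conversion visible in the hunks above: callers that used the
_notify variants now pair the plain architecture flush helpers with the
ordinary range-end call. Abridged from the mm/memory.c hunk (a schematic
sketch, not a literal excerpt from the tree):

	/* wp_page_copy(): the CPU TLB flush now also invalidates secondary TLBs */
	ptep_clear_flush(vma, vmf->address, vmf->pte);
	...
	mmu_notifier_invalidate_range_end(&range);

where the old code called ptep_clear_flush_notify() followed by
mmu_notifier_invalidate_range_only_end().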

* [PATCH v2 5/5] mmu_notifiers: Rename invalidate_range notifier
@ 2023-07-19 12:18   ` Alistair Popple
  0 siblings, 0 replies; 35+ messages in thread
From: Alistair Popple @ 2023-07-19 12:18 UTC (permalink / raw)
  To: akpm
  Cc: ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard, kevin.tian,
	kvm, linux-arm-kernel, linux-kernel, linux-mm, linuxppc-dev, mpe,
	nicolinc, npiggin, robin.murphy, seanjc, will, x86,
	zhi.wang.linux, Alistair Popple, Jason Gunthorpe

There are two main use cases for mmu notifiers. One is by KVM which
uses mmu_notifier_invalidate_range_start()/end() to manage a software
TLB.

The other is to manage hardware TLBs which need to use the
invalidate_range() callback because HW can establish new TLB entries
at any time. Hence using start()/end() can lead to memory corruption, as
these callbacks happen either too early or too late relative to the page
being unmapped.

mmu notifier users should therefore use either the start()/end()
callbacks or the invalidate_range() callback, but not both. To make this
usage clearer rename the invalidate_range() callback to
arch_invalidate_secondary_tlbs() and update the documentation.
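
As a minimal sketch of what a driver-side user looks like after this rename
(illustrative only: the example_* names and the device flush helper are made
up, while the ops field and callback signature follow this patch):

static void example_invalidate_secondary_tlbs(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long start,
					      unsigned long end)
{
	/* May be called under the ptl spin-lock, so must not sleep. */
	example_flush_device_tlb(mn, start, end);	/* hypothetical helper */
}

static const struct mmu_notifier_ops example_mmu_notifier_ops = {
	/* Register this hook or invalidate_range_start()/end(), not both. */
	.arch_invalidate_secondary_tlbs = example_invalidate_secondary_tlbs,
};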

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
---
 arch/arm64/include/asm/tlbflush.h               |  6 +-
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c    |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c            | 10 ++--
 arch/x86/mm/tlb.c                               |  4 +-
 drivers/iommu/amd/iommu_v2.c                    | 10 ++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 13 ++---
 drivers/iommu/intel/svm.c                       |  8 +--
 drivers/misc/ocxl/link.c                        |  8 +--
 include/linux/mmu_notifier.h                    | 48 +++++++++---------
 mm/huge_memory.c                                |  4 +-
 mm/hugetlb.c                                    |  7 +--
 mm/mmu_notifier.c                               | 20 ++++++--
 12 files changed, 76 insertions(+), 64 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index a99349d..84a05a0 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -253,7 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	__tlbi(aside1is, asid);
 	__tlbi_user(aside1is, asid);
 	dsb(ish);
-	mmu_notifier_invalidate_range(mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
 static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
@@ -265,7 +265,7 @@ static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
-	mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
 						(uaddr & PAGE_MASK) + PAGE_SIZE);
 }
 
@@ -400,7 +400,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 		scale++;
 	}
 	dsb(ish);
-	mmu_notifier_invalidate_range(vma->vm_mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index f3fb49f..17075c7 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -39,7 +39,7 @@ void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long st
 		radix__flush_tlb_pwc_range_psize(vma->vm_mm, start, end, psize);
 	else
 		radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
-	mmu_notifier_invalidate_range(vma->vm_mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
 void radix__huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 9724b26..64c11a4 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -752,7 +752,7 @@ void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmadd
 		return radix__local_flush_hugetlb_page(vma, vmaddr);
 #endif
 	radix__local_flush_tlb_page_psize(vma->vm_mm, vmaddr, mmu_virtual_psize);
-	mmu_notifier_invalidate_range(vma->vm_mm, vmaddr,
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, vmaddr,
 						vmaddr + mmu_virtual_psize);
 }
 EXPORT_SYMBOL(radix__local_flush_tlb_page);
@@ -989,7 +989,7 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 		}
 	}
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 EXPORT_SYMBOL(radix__flush_tlb_mm);
 
@@ -1023,7 +1023,7 @@ static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
 			_tlbiel_pid_multicast(mm, pid, RIC_FLUSH_ALL);
 	}
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
 void radix__flush_all_mm(struct mm_struct *mm)
@@ -1232,7 +1232,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 	}
 out:
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
@@ -1397,7 +1397,7 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 	}
 out:
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long start,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c30fbcd..0b990fb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1037,7 +1037,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 	put_flush_tlb_info();
 	put_cpu();
-	mmu_notifier_invalidate_range(mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 
@@ -1265,7 +1265,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	put_flush_tlb_info();
 	put_cpu();
-	mmu_notifier_invalidate_range(current->mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(current->mm, 0, -1UL);
 }
 
 /*
diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
index 261352a..2596466 100644
--- a/drivers/iommu/amd/iommu_v2.c
+++ b/drivers/iommu/amd/iommu_v2.c
@@ -355,9 +355,9 @@ static struct pasid_state *mn_to_state(struct mmu_notifier *mn)
 	return container_of(mn, struct pasid_state, mn);
 }
 
-static void mn_invalidate_range(struct mmu_notifier *mn,
-				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+static void mn_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					unsigned long start, unsigned long end)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
@@ -391,8 +391,8 @@ static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
 }
 
 static const struct mmu_notifier_ops iommu_mn = {
-	.release		= mn_release,
-	.invalidate_range       = mn_invalidate_range,
+	.release			= mn_release,
+	.arch_invalidate_secondary_tlbs	= mn_arch_invalidate_secondary_tlbs,
 };
 
 static void set_pri_tag_status(struct pasid_state *pasid_state,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 2a19784..dbc812a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -186,9 +186,10 @@ static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd)
 	}
 }
 
-static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
-					 struct mm_struct *mm,
-					 unsigned long start, unsigned long end)
+static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
+						struct mm_struct *mm,
+						unsigned long start,
+						unsigned long end)
 {
 	struct arm_smmu_mmu_notifier *smmu_mn = mn_to_smmu(mn);
 	struct arm_smmu_domain *smmu_domain = smmu_mn->domain;
@@ -247,9 +248,9 @@ static void arm_smmu_mmu_notifier_free(struct mmu_notifier *mn)
 }
 
 static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
-	.invalidate_range	= arm_smmu_mm_invalidate_range,
-	.release		= arm_smmu_mm_release,
-	.free_notifier		= arm_smmu_mmu_notifier_free,
+	.arch_invalidate_secondary_tlbs	= arm_smmu_mm_arch_invalidate_secondary_tlbs,
+	.release			= arm_smmu_mm_release,
+	.free_notifier			= arm_smmu_mmu_notifier_free,
 };
 
 /* Allocate or get existing MMU notifier for this {domain, mm} pair */
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index e95b339..8f6d680 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -219,9 +219,9 @@ static void intel_flush_svm_range(struct intel_svm *svm, unsigned long address,
 }
 
 /* Pages have been freed at this point */
-static void intel_invalidate_range(struct mmu_notifier *mn,
-				   struct mm_struct *mm,
-				   unsigned long start, unsigned long end)
+static void intel_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					unsigned long start, unsigned long end)
 {
 	struct intel_svm *svm = container_of(mn, struct intel_svm, notifier);
 
@@ -256,7 +256,7 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 static const struct mmu_notifier_ops intel_mmuops = {
 	.release = intel_mm_release,
-	.invalidate_range = intel_invalidate_range,
+	.arch_invalidate_secondary_tlbs = intel_arch_invalidate_secondary_tlbs,
 };
 
 static DEFINE_MUTEX(pasid_mutex);
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 4cf4c55..c06c699 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -491,9 +491,9 @@ void ocxl_link_release(struct pci_dev *dev, void *link_handle)
 }
 EXPORT_SYMBOL_GPL(ocxl_link_release);
 
-static void invalidate_range(struct mmu_notifier *mn,
-			     struct mm_struct *mm,
-			     unsigned long start, unsigned long end)
+static void arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					unsigned long start, unsigned long end)
 {
 	struct pe_data *pe_data = container_of(mn, struct pe_data, mmu_notifier);
 	struct ocxl_link *link = pe_data->link;
@@ -509,7 +509,7 @@ static void invalidate_range(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops ocxl_mmu_notifier_ops = {
-	.invalidate_range = invalidate_range,
+	.arch_invalidate_secondary_tlbs = arch_invalidate_secondary_tlbs,
 };
 
 static u64 calculate_cfg_state(bool kernel)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f2e9edc..6e3c857 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -187,27 +187,27 @@ struct mmu_notifier_ops {
 				     const struct mmu_notifier_range *range);
 
 	/*
-	 * invalidate_range() is either called between
-	 * invalidate_range_start() and invalidate_range_end() when the
-	 * VM has to free pages that where unmapped, but before the
-	 * pages are actually freed, or outside of _start()/_end() when
-	 * a (remote) TLB is necessary.
+	 * arch_invalidate_secondary_tlbs() is used to manage a non-CPU TLB
+	 * which shares page-tables with the CPU. The
+	 * invalidate_range_start()/end() callbacks should not be implemented as
+	 * invalidate_secondary_tlbs() already catches the points in time when
+	 * an external TLB needs to be flushed.
 	 *
-	 * If invalidate_range() is used to manage a non-CPU TLB with
-	 * shared page-tables, it not necessary to implement the
-	 * invalidate_range_start()/end() notifiers, as
-	 * invalidate_range() already catches the points in time when an
-	 * external TLB range needs to be flushed. For more in depth
-	 * discussion on this see Documentation/mm/mmu_notifier.rst
+	 * This requires arch_invalidate_secondary_tlbs() to be called while
+	 * holding the ptl spin-lock and therefore this callback is not allowed
+	 * to sleep.
 	 *
-	 * Note that this function might be called with just a sub-range
-	 * of what was passed to invalidate_range_start()/end(), if
-	 * called between those functions.
+	 * This is called by architecture code whenever invalidating a TLB
+	 * entry. It is assumed that any secondary TLB has the same rules for
+	 * when invalidations are required. If this is not the case architecture
+	 * code will need to call this explicitly when required for secondary
+	 * TLB invalidation.
 	 */
-	void (*invalidate_range)(struct mmu_notifier *subscription,
-				 struct mm_struct *mm,
-				 unsigned long start,
-				 unsigned long end);
+	void (*arch_invalidate_secondary_tlbs)(
+					struct mmu_notifier *subscription,
+					struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end);
 
 	/*
 	 * These callbacks are used with the get/put interface to manage the
@@ -396,8 +396,8 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
-extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
+					unsigned long start, unsigned long end);
 extern bool
 mmu_notifier_range_update_to_read_only(const struct mmu_notifier_range *range);
 
@@ -483,11 +483,11 @@ mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
 		__mmu_notifier_invalidate_range_end(range);
 }
 
-static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+static inline void mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
+					unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range(mm, start, end);
+		__mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 static inline void mmu_notifier_subscriptions_init(struct mm_struct *mm)
@@ -664,7 +664,7 @@ void mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
 {
 }
 
-static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
+static inline void mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3ece117..e0420de 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2120,8 +2120,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	if (is_huge_zero_pmd(*pmd)) {
 		/*
 		 * FIXME: Do we want to invalidate secondary mmu by calling
-		 * mmu_notifier_invalidate_range() see comments below inside
-		 * __split_huge_pmd() ?
+		 * mmu_notifier_arch_invalidate_secondary_tlbs() see comments below
+		 * inside __split_huge_pmd() ?
 		 *
 		 * We are going from a zero huge page write protected to zero
 		 * small page also write protected so it does not seems useful
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9c6e431..e0028cb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6676,8 +6676,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 	else
 		flush_hugetlb_tlb_range(vma, start, end);
 	/*
-	 * No need to call mmu_notifier_invalidate_range() we are downgrading
-	 * page table protection not changing it to point to a new page.
+	 * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
+	 * downgrading page table protection not changing it to point to a new
+	 * page.
 	 *
 	 * See Documentation/mm/mmu_notifier.rst
 	 */
@@ -7321,7 +7322,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	hugetlb_vma_unlock_write(vma);
 	/*
-	 * No need to call mmu_notifier_invalidate_range(), see
+	 * No need to call mmu_notifier_arch_invalidate_secondary_tlbs(), see
 	 * Documentation/mm/mmu_notifier.rst.
 	 */
 	mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 453a156..63c8eb7 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -585,8 +585,8 @@ void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
 	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 
-void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
+					unsigned long start, unsigned long end)
 {
 	struct mmu_notifier *subscription;
 	int id;
@@ -595,9 +595,10 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 	hlist_for_each_entry_rcu(subscription,
 				 &mm->notifier_subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
-		if (subscription->ops->invalidate_range)
-			subscription->ops->invalidate_range(subscription, mm,
-							    start, end);
+		if (subscription->ops->arch_invalidate_secondary_tlbs)
+			subscription->ops->arch_invalidate_secondary_tlbs(
+				subscription, mm,
+				start, end);
 	}
 	srcu_read_unlock(&srcu, id);
 }
@@ -616,6 +617,15 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	mmap_assert_write_locked(mm);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
+	/*
+	 * Subsystems should only register for invalidate_secondary_tlbs() or
+	 * invalidate_range_start()/end() callbacks, not both.
+	 */
+	if (WARN_ON_ONCE(subscription->ops->arch_invalidate_secondary_tlbs &&
+				(subscription->ops->invalidate_range_start ||
+				subscription->ops->invalidate_range_end)))
+		return -EINVAL;
+
 	if (!mm->notifier_subscriptions) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 5/5] mmu_notifiers: Rename invalidate_range notifier
@ 2023-07-19 12:18   ` Alistair Popple
  0 siblings, 0 replies; 35+ messages in thread
From: Alistair Popple @ 2023-07-19 12:18 UTC (permalink / raw)
  To: akpm
  Cc: ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard, kevin.tian,
	kvm, linux-arm-kernel, linux-kernel, linux-mm, linuxppc-dev, mpe,
	nicolinc, npiggin, robin.murphy, seanjc, will, x86,
	zhi.wang.linux, Alistair Popple, Jason Gunthorpe

There are two main use cases for mmu notifiers. One is by KVM, which
uses mmu_notifier_invalidate_range_start()/end() to manage a software
TLB.

The other is to manage hardware TLBs which need to use the
invalidate_range() callback because hardware can establish new TLB
entries at any time. Hence using start()/end() can lead to memory
corruption as these callbacks happen too soon or too late during page
unmap.

mmu notifier users should therefore use either the start()/end()
callbacks or the invalidate_range() callback, but not both. To make this
usage clearer, rename the invalidate_range() callback to
arch_invalidate_secondary_tlbs() and update the documentation.
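
Not part of the patch, just an illustrative sketch of what a driver for
a secondary TLB looks like once this rename lands. The "mydev" names and
the flush helper are made up; the ops wiring and the constraints in the
comments mirror the arm-smmu-v3/intel/amd/ocxl hunks and the new
registration check below:

#include <linux/mmu_notifier.h>

/* Hypothetical driver; registered with mmu_notifier_register(&mn, mm). */
static void mydev_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
						 struct mm_struct *mm,
						 unsigned long start,
						 unsigned long end)
{
	/*
	 * May be called under the ptl spin-lock from the architecture
	 * TLB flush code, so it must not sleep. Issue the device TLB
	 * invalidation for the start/end range here.
	 */
	mydev_flush_device_tlb(mn, start, end);	/* made-up helper */
}

static const struct mmu_notifier_ops mydev_mmu_notifier_ops = {
	/*
	 * Do not also set invalidate_range_start()/end();
	 * __mmu_notifier_register() now rejects that combination.
	 */
	.arch_invalidate_secondary_tlbs = mydev_arch_invalidate_secondary_tlbs,
};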

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
---
 arch/arm64/include/asm/tlbflush.h               |  6 +-
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c    |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c            | 10 ++--
 arch/x86/mm/tlb.c                               |  4 +-
 drivers/iommu/amd/iommu_v2.c                    | 10 ++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 13 ++---
 drivers/iommu/intel/svm.c                       |  8 +--
 drivers/misc/ocxl/link.c                        |  8 +--
 include/linux/mmu_notifier.h                    | 48 +++++++++---------
 mm/huge_memory.c                                |  4 +-
 mm/hugetlb.c                                    |  7 +--
 mm/mmu_notifier.c                               | 20 ++++++--
 12 files changed, 76 insertions(+), 64 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index a99349d..84a05a0 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -253,7 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	__tlbi(aside1is, asid);
 	__tlbi_user(aside1is, asid);
 	dsb(ish);
-	mmu_notifier_invalidate_range(mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
 static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
@@ -265,7 +265,7 @@ static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
-	mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
 						(uaddr & PAGE_MASK) + PAGE_SIZE);
 }
 
@@ -400,7 +400,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 		scale++;
 	}
 	dsb(ish);
-	mmu_notifier_invalidate_range(vma->vm_mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index f3fb49f..17075c7 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -39,7 +39,7 @@ void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long st
 		radix__flush_tlb_pwc_range_psize(vma->vm_mm, start, end, psize);
 	else
 		radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
-	mmu_notifier_invalidate_range(vma->vm_mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
 void radix__huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 9724b26..64c11a4 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -752,7 +752,7 @@ void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmadd
 		return radix__local_flush_hugetlb_page(vma, vmaddr);
 #endif
 	radix__local_flush_tlb_page_psize(vma->vm_mm, vmaddr, mmu_virtual_psize);
-	mmu_notifier_invalidate_range(vma->vm_mm, vmaddr,
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, vmaddr,
 						vmaddr + mmu_virtual_psize);
 }
 EXPORT_SYMBOL(radix__local_flush_tlb_page);
@@ -989,7 +989,7 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 		}
 	}
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 EXPORT_SYMBOL(radix__flush_tlb_mm);
 
@@ -1023,7 +1023,7 @@ static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
 			_tlbiel_pid_multicast(mm, pid, RIC_FLUSH_ALL);
 	}
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
 void radix__flush_all_mm(struct mm_struct *mm)
@@ -1232,7 +1232,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 	}
 out:
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
@@ -1397,7 +1397,7 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 	}
 out:
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long start,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c30fbcd..0b990fb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1037,7 +1037,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 	put_flush_tlb_info();
 	put_cpu();
-	mmu_notifier_invalidate_range(mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 
@@ -1265,7 +1265,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	put_flush_tlb_info();
 	put_cpu();
-	mmu_notifier_invalidate_range(current->mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(current->mm, 0, -1UL);
 }
 
 /*
diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
index 261352a..2596466 100644
--- a/drivers/iommu/amd/iommu_v2.c
+++ b/drivers/iommu/amd/iommu_v2.c
@@ -355,9 +355,9 @@ static struct pasid_state *mn_to_state(struct mmu_notifier *mn)
 	return container_of(mn, struct pasid_state, mn);
 }
 
-static void mn_invalidate_range(struct mmu_notifier *mn,
-				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+static void mn_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					unsigned long start, unsigned long end)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
@@ -391,8 +391,8 @@ static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
 }
 
 static const struct mmu_notifier_ops iommu_mn = {
-	.release		= mn_release,
-	.invalidate_range       = mn_invalidate_range,
+	.release			= mn_release,
+	.arch_invalidate_secondary_tlbs	= mn_arch_invalidate_secondary_tlbs,
 };
 
 static void set_pri_tag_status(struct pasid_state *pasid_state,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 2a19784..dbc812a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -186,9 +186,10 @@ static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd)
 	}
 }
 
-static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
-					 struct mm_struct *mm,
-					 unsigned long start, unsigned long end)
+static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
+						struct mm_struct *mm,
+						unsigned long start,
+						unsigned long end)
 {
 	struct arm_smmu_mmu_notifier *smmu_mn = mn_to_smmu(mn);
 	struct arm_smmu_domain *smmu_domain = smmu_mn->domain;
@@ -247,9 +248,9 @@ static void arm_smmu_mmu_notifier_free(struct mmu_notifier *mn)
 }
 
 static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
-	.invalidate_range	= arm_smmu_mm_invalidate_range,
-	.release		= arm_smmu_mm_release,
-	.free_notifier		= arm_smmu_mmu_notifier_free,
+	.arch_invalidate_secondary_tlbs	= arm_smmu_mm_arch_invalidate_secondary_tlbs,
+	.release			= arm_smmu_mm_release,
+	.free_notifier			= arm_smmu_mmu_notifier_free,
 };
 
 /* Allocate or get existing MMU notifier for this {domain, mm} pair */
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index e95b339..8f6d680 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -219,9 +219,9 @@ static void intel_flush_svm_range(struct intel_svm *svm, unsigned long address,
 }
 
 /* Pages have been freed at this point */
-static void intel_invalidate_range(struct mmu_notifier *mn,
-				   struct mm_struct *mm,
-				   unsigned long start, unsigned long end)
+static void intel_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					unsigned long start, unsigned long end)
 {
 	struct intel_svm *svm = container_of(mn, struct intel_svm, notifier);
 
@@ -256,7 +256,7 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 static const struct mmu_notifier_ops intel_mmuops = {
 	.release = intel_mm_release,
-	.invalidate_range = intel_invalidate_range,
+	.arch_invalidate_secondary_tlbs = intel_arch_invalidate_secondary_tlbs,
 };
 
 static DEFINE_MUTEX(pasid_mutex);
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 4cf4c55..c06c699 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -491,9 +491,9 @@ void ocxl_link_release(struct pci_dev *dev, void *link_handle)
 }
 EXPORT_SYMBOL_GPL(ocxl_link_release);
 
-static void invalidate_range(struct mmu_notifier *mn,
-			     struct mm_struct *mm,
-			     unsigned long start, unsigned long end)
+static void arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					unsigned long start, unsigned long end)
 {
 	struct pe_data *pe_data = container_of(mn, struct pe_data, mmu_notifier);
 	struct ocxl_link *link = pe_data->link;
@@ -509,7 +509,7 @@ static void invalidate_range(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops ocxl_mmu_notifier_ops = {
-	.invalidate_range = invalidate_range,
+	.arch_invalidate_secondary_tlbs = arch_invalidate_secondary_tlbs,
 };
 
 static u64 calculate_cfg_state(bool kernel)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f2e9edc..6e3c857 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -187,27 +187,27 @@ struct mmu_notifier_ops {
 				     const struct mmu_notifier_range *range);
 
 	/*
-	 * invalidate_range() is either called between
-	 * invalidate_range_start() and invalidate_range_end() when the
-	 * VM has to free pages that where unmapped, but before the
-	 * pages are actually freed, or outside of _start()/_end() when
-	 * a (remote) TLB is necessary.
+	 * arch_invalidate_secondary_tlbs() is used to manage a non-CPU TLB
+	 * which shares page-tables with the CPU. The
+	 * invalidate_range_start()/end() callbacks should not be implemented as
+	 * invalidate_secondary_tlbs() already catches the points in time when
+	 * an external TLB needs to be flushed.
 	 *
-	 * If invalidate_range() is used to manage a non-CPU TLB with
-	 * shared page-tables, it not necessary to implement the
-	 * invalidate_range_start()/end() notifiers, as
-	 * invalidate_range() already catches the points in time when an
-	 * external TLB range needs to be flushed. For more in depth
-	 * discussion on this see Documentation/mm/mmu_notifier.rst
+	 * This requires arch_invalidate_secondary_tlbs() to be called while
+	 * holding the ptl spin-lock and therefore this callback is not allowed
+	 * to sleep.
 	 *
-	 * Note that this function might be called with just a sub-range
-	 * of what was passed to invalidate_range_start()/end(), if
-	 * called between those functions.
+	 * This is called by architecture code whenever invalidating a TLB
+	 * entry. It is assumed that any secondary TLB has the same rules for
+	 * when invalidations are required. If this is not the case architecture
+	 * code will need to call this explicitly when required for secondary
+	 * TLB invalidation.
 	 */
-	void (*invalidate_range)(struct mmu_notifier *subscription,
-				 struct mm_struct *mm,
-				 unsigned long start,
-				 unsigned long end);
+	void (*arch_invalidate_secondary_tlbs)(
+					struct mmu_notifier *subscription,
+					struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end);
 
 	/*
 	 * These callbacks are used with the get/put interface to manage the
@@ -396,8 +396,8 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
-extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
+					unsigned long start, unsigned long end);
 extern bool
 mmu_notifier_range_update_to_read_only(const struct mmu_notifier_range *range);
 
@@ -483,11 +483,11 @@ mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
 		__mmu_notifier_invalidate_range_end(range);
 }
 
-static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+static inline void mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
+					unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range(mm, start, end);
+		__mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 static inline void mmu_notifier_subscriptions_init(struct mm_struct *mm)
@@ -664,7 +664,7 @@ void mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
 {
 }
 
-static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
+static inline void mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3ece117..e0420de 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2120,8 +2120,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	if (is_huge_zero_pmd(*pmd)) {
 		/*
 		 * FIXME: Do we want to invalidate secondary mmu by calling
-		 * mmu_notifier_invalidate_range() see comments below inside
-		 * __split_huge_pmd() ?
+		 * mmu_notifier_arch_invalidate_secondary_tlbs() see comments below
+		 * inside __split_huge_pmd() ?
 		 *
 		 * We are going from a zero huge page write protected to zero
 		 * small page also write protected so it does not seems useful
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9c6e431..e0028cb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6676,8 +6676,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 	else
 		flush_hugetlb_tlb_range(vma, start, end);
 	/*
-	 * No need to call mmu_notifier_invalidate_range() we are downgrading
-	 * page table protection not changing it to point to a new page.
+	 * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
+	 * downgrading page table protection not changing it to point to a new
+	 * page.
 	 *
 	 * See Documentation/mm/mmu_notifier.rst
 	 */
@@ -7321,7 +7322,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	hugetlb_vma_unlock_write(vma);
 	/*
-	 * No need to call mmu_notifier_invalidate_range(), see
+	 * No need to call mmu_notifier_arch_invalidate_secondary_tlbs(), see
 	 * Documentation/mm/mmu_notifier.rst.
 	 */
 	mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 453a156..63c8eb7 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -585,8 +585,8 @@ void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
 	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 
-void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
+					unsigned long start, unsigned long end)
 {
 	struct mmu_notifier *subscription;
 	int id;
@@ -595,9 +595,10 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 	hlist_for_each_entry_rcu(subscription,
 				 &mm->notifier_subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
-		if (subscription->ops->invalidate_range)
-			subscription->ops->invalidate_range(subscription, mm,
-							    start, end);
+		if (subscription->ops->arch_invalidate_secondary_tlbs)
+			subscription->ops->arch_invalidate_secondary_tlbs(
+				subscription, mm,
+				start, end);
 	}
 	srcu_read_unlock(&srcu, id);
 }
@@ -616,6 +617,15 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	mmap_assert_write_locked(mm);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
+	/*
+	 * Subsystems should only register for invalidate_secondary_tlbs() or
+	 * invalidate_range_start()/end() callbacks, not both.
+	 */
+	if (WARN_ON_ONCE(subscription->ops->arch_invalidate_secondary_tlbs &&
+				(subscription->ops->invalidate_range_start ||
+				subscription->ops->invalidate_range_end)))
+		return -EINVAL;
+
 	if (!mm->notifier_subscriptions) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs
  2023-07-19 12:18   ` Alistair Popple
  (?)
@ 2023-07-19 22:51     ` SeongJae Park
  -1 siblings, 0 replies; 35+ messages in thread
From: SeongJae Park @ 2023-07-19 22:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: akpm, ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard,
	kevin.tian, kvm, linux-arm-kernel, linux-kernel, linux-mm,
	linuxppc-dev, mpe, nicolinc, npiggin, robin.murphy, seanjc, will,
	x86, zhi.wang.linux

Hi Alistair,

On Wed, 19 Jul 2023 22:18:44 +1000 Alistair Popple <apopple@nvidia.com> wrote:

> The invalidate_range() is going to become an architecture specific mmu
> notifier used to keep the TLB of secondary MMUs such as an IOMMU in
> sync with the CPU page tables. Currently it is called from separate
> code paths to the main CPU TLB invalidations. This can lead to a
> secondary TLB not getting invalidated when required and makes it hard
> to reason about when exactly the secondary TLB is invalidated.
> 
> To fix this move the notifier call to the architecture specific TLB
> maintenance functions for architectures that have secondary MMUs
> requiring explicit software invalidations.
> 
> This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
> require a TLB invalidation. This invalidation is done by the
> architecture specific ptep_set_access_flags() which calls
> flush_tlb_page() if required. However this doesn't call the notifier
> resulting in infinite faults being generated by devices using the SMMU
> if it has previously cached a read-only PTE in its TLB.
> 
> Moving the invalidations into the TLB invalidation functions ensures
> all invalidations happen at the same time as the CPU invalidation. The
> architecture specific flush_tlb_all() routines do not call the
> notifier as none of the IOMMUs require this.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>

I found the kernel NULL-pointer dereference below on the latest
mm-unstable tree, and bisect points to the commit of this patch, namely
75c400f82d347af1307010a3e06f3aa5d831d995.

To reproduce, I use 'stress-ng --bigheap $(nproc)'.  The issue happens as
soon as it starts reclaiming memory.  I haven't dug into this deeply yet,
but am reporting the issue first since you might already have an idea.
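
For what it's worth, the faulting RIP decodes to the
mmu_notifier_arch_invalidate_secondary_tlbs() call this series adds at
the end of arch_tlbbatch_flush() (see the arch/x86/mm/tlb.c hunk earlier
in the thread), and the fault address 0000000000000498 would be
consistent with 'mm' being NULL there, since the batched flush can run
from a kernel thread (here a writeback kworker) that has no mm. A purely
speculative, untested sketch of that suspicion, not a proposed fix:

void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	/* ... existing batched flush, put_flush_tlb_info(), put_cpu() ... */

	/*
	 * Hypothetical guard: current->mm is NULL for kernel threads,
	 * so mm_has_notifiers(current->mm) would dereference NULL.
	 */
	if (current->mm)
		mmu_notifier_arch_invalidate_secondary_tlbs(current->mm,
							    0, -1UL);
}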


[   69.824805] BUG: kernel NULL pointer dereference, address: 0000000000000498
[   69.826983] #PF: supervisor read access in kernel mode
[   69.828716] #PF: error_code(0x0000) - not-present page
[   69.830249] PGD 0 P4D 0
[   69.830784] Oops: 0000 [#4] PREEMPT SMP PTI
[   69.831881] CPU: 2 PID: 201 Comm: kworker/u72:2 Tainted: G      D W          6.5.0-rc1+ #311
[   69.834221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-pr4
[   69.837459] Workqueue: writeback wb_workfn (flush-251:0)
[   69.839422] RIP: 0010:arch_tlbbatch_flush (include/linux/mmu_notifier.h:497 (discriminator 3) arch/x86/mm/tlb.c:1268 (discriminator 3))
[ 69.841140] Code: e8 c5 dd d2 00 bf 01 00 00 00 e8 0b df 05 00 65 8b 05 cc 90 f9 63 85 c0 74 4b 65 48c

Code starting with the faulting instruction
===========================================
   0:   e8 c5 dd d2 00          callq  0xd2ddca
   5:   bf 01 00 00 00          mov    $0x1,%edi
   a:   e8 0b df 05 00          callq  0x5df1a
   f:   65 8b 05 cc 90 f9 63    mov    %gs:0x63f990cc(%rip),%eax        # 0x63f990e2
  16:   85 c0                   test   %eax,%eax
  18:   74 4b                   je     0x65
  1a:   65                      gs
  1b:   8c                      .byte 0x8c
[   69.846840] RSP: 0000:ffffbf77007fb048 EFLAGS: 00010286
[   69.848464] RAX: ffff9bfd81970000 RBX: 0000000000000002 RCX: 0000000000000000
[   69.851016] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[   69.853070] RBP: ffffbf77007fb068 R08: 0000000000000001 R09: 0000000000000000
[   69.855213] R10: ffff9bfd81970f10 R11: 0000000000000000 R12: ffff9bfd81970f10
[   69.857456] R13: ffff9c1bbd4b28c0 R14: 0000000000000003 R15: ffffe75b88f4a808
[   69.860213] FS:  0000000000000000(0000) GS:ffff9c1bbd480000(0000) knlGS:0000000000000000
[   69.862211] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   69.863624] CR2: 0000000000000498 CR3: 00000001112ca000 CR4: 00000000000006e0
[   69.865907] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   69.868146] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   69.870414] Call Trace:
[   69.871222]  <TASK>
[   69.871823] ? show_regs (arch/x86/kernel/dumpstack.c:479)
[   69.872855] ? __die_body (arch/x86/kernel/dumpstack.c:421)
[   69.873733] ? __die (arch/x86/kernel/dumpstack.c:435)
[   69.874576] ? page_fault_oops (arch/x86/mm/fault.c:707)
[   69.875657] ? search_bpf_extables (kernel/bpf/core.c:751)
[   69.876854] ? arch_tlbbatch_flush (include/linux/mmu_notifier.h:497 (discriminator 3) arch/x86/mm/tlb.c:1268 (discriminator 3))
[   69.878146] ? search_exception_tables (kernel/extable.c:64)
[   69.879931] ? kernelmode_fixup_or_oops (arch/x86/mm/fault.c:762)
[   69.881284] ? __bad_area_nosemaphore (arch/x86/mm/fault.c:860)
[   69.882722] ? bad_area_nosemaphore (arch/x86/mm/fault.c:867)
[   69.884276] ? do_user_addr_fault (arch/x86/mm/fault.c:1458)
[   69.885800] ? check_preemption_disabled (lib/smp_processor_id.c:42)
[   69.887428] ? exc_page_fault (arch/x86/include/asm/paravirt.h:695 arch/x86/mm/fault.c:1495 arch/x86/mm/fault.c:1543)
[   69.888435] ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570)
[   69.889955] ? arch_tlbbatch_flush (include/linux/mmu_notifier.h:497 (discriminator 3) arch/x86/mm/tlb.c:1268 (discriminator 3))
[   69.891564] ? arch_tlbbatch_flush (arch/x86/include/asm/preempt.h:104 (discriminator 1) arch/x86/mm/tlb.c:1267 (discriminator 1))
[   69.892779] try_to_unmap_flush_dirty (mm/rmap.c:622 mm/rmap.c:632)
[   69.893970] shrink_folio_list (mm/vmscan.c:2015)
[   69.895243] shrink_inactive_list (include/linux/spinlock.h:376 mm/vmscan.c:2616)
[   69.896816] shrink_lruvec (mm/vmscan.c:2855 mm/vmscan.c:6303)
[   69.897952] ? rmqueue_bulk (mm/page_alloc.c:2228)
[   69.899385] shrink_node (mm/vmscan.c:6491 mm/vmscan.c:6524)
[   69.900447] do_try_to_free_pages (mm/vmscan.c:6705 mm/vmscan.c:6825)
[   69.901548] try_to_free_pages (mm/vmscan.c:7060)
[   69.902540] __alloc_pages_slowpath.constprop.0 (include/linux/sched/mm.h:380 mm/page_alloc.c:3716 mm/page_alloc.c:3735 mm/page_alloc.c:4140)
[   69.904307] ? sched_cpu_deactivate (kernel/sched/core.c:9728)
[   69.905427] __alloc_pages (mm/page_alloc.c:4525)
[   69.906735] alloc_pages (mm/mempolicy.c:2284)
[   69.907604] new_slab (mm/slub.c:1866 mm/slub.c:2009 mm/slub.c:2062)
[   69.908414] ? chksum_update (crypto/crc32c_generic.c:88)
[   69.909348] ___slab_alloc (mm/slub.c:3216 (discriminator 3))
[   69.910093] ? ext4_mb_new_blocks (fs/ext4/mballoc.c:5538 fs/ext4/mballoc.c:6129)
[   69.911885] ? __enqueue_entity (kernel/sched/fair.c:646)
[   69.913085] ? x2apic_send_IPI (arch/x86/include/asm/paravirt.h:196 arch/x86/include/asm/paravirt.h:229 arch/x86/include/asm/apic.h:240 arch/x86/kernel/apic/x2apic_phys.c:126 arch/x86/kernel/apic/x2apic_phys.c:48)
[   69.914143] ? native_smp_send_reschedule (arch/x86/kernel/apic/ipi.c:72)
[   69.915777] ? sbitmap_find_bit (lib/sbitmap.c:146 lib/sbitmap.c:178 lib/sbitmap.c:199)
[   69.917011] ? ext4_mb_new_blocks (fs/ext4/mballoc.c:5538 fs/ext4/mballoc.c:6129)
[   69.917996] __slab_alloc.isra.0 (mm/slub.c:3314)
[   69.919309] kmem_cache_alloc (mm/slub.c:3367 mm/slub.c:3460 mm/slub.c:3478 mm/slub.c:3485 mm/slub.c:3494)
[   69.920239] ? ext4_mb_new_blocks (fs/ext4/mballoc.c:5538 fs/ext4/mballoc.c:6129)
[   69.921291] ext4_mb_new_blocks (fs/ext4/mballoc.c:5538 fs/ext4/mballoc.c:6129)
[   69.922496] ? ext4_cache_extents (fs/ext4/extents.c:543 (discriminator 2))
[   69.923591] ext4_ext_map_blocks (fs/ext4/extents.c:4286)
[   69.925039] ext4_map_blocks (fs/ext4/inode.c:621)
[   69.926478] ? kmem_cache_alloc (arch/x86/include/asm/jump_label.h:55 include/linux/memcontrol.h:1777 mm/slab.h:522 mm/slab.h:770 mm/slub.c:3470 mm/slub.c:3478 mm/slub.c:3485 mm/slub.c:3494)
[   69.927813] ? ext4_alloc_io_end_vec (fs/ext4/page-io.c:61)
[   69.928851] ext4_do_writepages (fs/ext4/inode.c:2159 fs/ext4/inode.c:2212 fs/ext4/inode.c:2677)
[   69.930008] ? update_sd_lb_stats.constprop.0 (kernel/sched/fair.c:9501 kernel/sched/fair.c:10163)
[   69.931662] ext4_writepages (fs/ext4/inode.c:2766)
[   69.932728] do_writepages (mm/page-writeback.c:2553)
[   69.933403] ? __wb_calc_thresh (mm/page-writeback.c:875)
[   69.934168] __writeback_single_inode (fs/fs-writeback.c:1603)
[   69.935101] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:104 include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:186)
[   69.936219] writeback_sb_inodes (fs/fs-writeback.c:1896)
[   69.937408] __writeback_inodes_wb (fs/fs-writeback.c:1966)
[   69.938612] wb_writeback (fs/fs-writeback.c:2072)
[   69.939548] wb_workfn (fs/fs-writeback.c:2142 fs/fs-writeback.c:2230 fs/fs-writeback.c:2257)
[   69.940439] ? __switch_to (arch/x86/include/asm/paravirt.h:300 arch/x86/kernel/process_64.c:583)
[   69.941694] process_one_work (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/workqueue.h:108 kernel/workqueue.c:2602)
[   69.943189] worker_thread (include/linux/list.h:292 kernel/workqueue.c:2749)
[   69.944403] ? __pfx_worker_thread (kernel/workqueue.c:2691)
[   69.945491] kthread (kernel/kthread.c:389)
[   69.946333] ? __pfx_kthread (kernel/kthread.c:342)
[   69.947109] ret_from_fork (arch/x86/entry/entry_64.S:314)
[   69.948097]  </TASK>
[   69.948944] Modules linked in: ppdev input_leds joydev parport_pc parport serio_raw qemu_fw_cfg mac_hi
[   69.958916] Dumping ftrace buffer:
[   69.961197]    (ftrace buffer empty)
[   69.962797] CR2: 0000000000000498
[   69.964431] ---[ end trace 0000000000000000 ]---
[   69.966440] RIP: 0010:arch_tlbbatch_flush (include/linux/mmu_notifier.h:497 (discriminator 3) arch/x86/mm/tlb.c:1268 (discriminator 3))
[ 69.969546] Code: e8 c5 dd d2 00 bf 01 00 00 00 e8 0b df 05 00 65 8b 05 cc 90 f9 63 85 c0 74 4b 65 48c

Code starting with the faulting instruction
===========================================
   0:   e8 c5 dd d2 00          callq  0xd2ddca
   5:   bf 01 00 00 00          mov    $0x1,%edi
   a:   e8 0b df 05 00          callq  0x5df1a
   f:   65 8b 05 cc 90 f9 63    mov    %gs:0x63f990cc(%rip),%eax        # 0x63f990e2
  16:   85 c0                   test   %eax,%eax
  18:   74 4b                   je     0x65
  1a:   65                      gs
  1b:   8c                      .byte 0x8c
[   69.979017] RSP: 0000:ffffbf770005f6d8 EFLAGS: 00010286
[   69.981894] RAX: ffff9bfd8090af80 RBX: 0000000000000003 RCX: 0000000000000000
[   69.985344] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[   69.988499] RBP: ffffbf770005f6f8 R08: 0000000000000001 R09: 0000000000000000
[   69.991677] R10: ffff9bfd8090be90 R11: 0000000000000000 R12: ffff9bfd8090be90
[   69.994926] R13: ffff9c1bbd4f28c0 R14: 0000000000000004 R15: ffffe75b84499d08
[   69.998152] FS:  0000000000000000(0000) GS:ffff9c1bbd480000(0000) knlGS:0000000000000000
[   70.001874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.004419] CR2: 0000000000000498 CR3: 00000001112ca000 CR4: 00000000000006e0
[   70.007517] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   70.010951] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400



Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 35+ messages in thread

===========================================
   0:   e8 c5 dd d2 00          callq  0xd2ddca
   5:   bf 01 00 00 00          mov    $0x1,%edi
   a:   e8 0b df 05 00          callq  0x5df1a
   f:   65 8b 05 cc 90 f9 63    mov    %gs:0x63f990cc(%rip),%eax        # 0x63f990e2
  16:   85 c0                   test   %eax,%eax
  18:   74 4b                   je     0x65
  1a:   65                      gs
  1b:   8c                      .byte 0x8c
[   69.979017] RSP: 0000:ffffbf770005f6d8 EFLAGS: 00010286
[   69.981894] RAX: ffff9bfd8090af80 RBX: 0000000000000003 RCX: 0000000000000000
[   69.985344] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[   69.988499] RBP: ffffbf770005f6f8 R08: 0000000000000001 R09: 0000000000000000
[   69.991677] R10: ffff9bfd8090be90 R11: 0000000000000000 R12: ffff9bfd8090be90
[   69.994926] R13: ffff9c1bbd4f28c0 R14: 0000000000000004 R15: ffffe75b84499d08
[   69.998152] FS:  0000000000000000(0000) GS:ffff9c1bbd480000(0000) knlGS:0000000000000000
[   70.001874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.004419] CR2: 0000000000000498 CR3: 00000001112ca000 CR4: 00000000000006e0
[   70.007517] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   70.010951] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400



Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs
  2023-07-19 22:51     ` SeongJae Park
  (?)
@ 2023-07-20  0:52       ` Alistair Popple
  -1 siblings, 0 replies; 35+ messages in thread
From: Alistair Popple @ 2023-07-20  0:52 UTC (permalink / raw)
  To: SeongJae Park
  Cc: akpm, ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard,
	kevin.tian, kvm, linux-arm-kernel, linux-kernel, linux-mm,
	linuxppc-dev, mpe, nicolinc, npiggin, robin.murphy, seanjc, will,
	x86, zhi.wang.linux


SeongJae Park <sj@kernel.org> writes:

> Hi Alistair,
>
> On Wed, 19 Jul 2023 22:18:44 +1000 Alistair Popple <apopple@nvidia.com> wrote:
>
>> The invalidate_range() is going to become an architecture specific mmu
>> notifier used to keep the TLB of secondary MMUs such as an IOMMU in
>> sync with the CPU page tables. Currently it is called from separate
>> code paths to the main CPU TLB invalidations. This can lead to a
>> secondary TLB not getting invalidated when required and makes it hard
>> to reason about when exactly the secondary TLB is invalidated.
>> 
>> To fix this move the notifier call to the architecture specific TLB
>> maintenance functions for architectures that have secondary MMUs
>> requiring explicit software invalidations.
>> 
>> This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
>> require a TLB invalidation. This invalidation is done by the
>> architecutre specific ptep_set_access_flags() which calls
>> flush_tlb_page() if required. However this doesn't call the notifier
>> resulting in infinite faults being generated by devices using the SMMU
>> if it has previously cached a read-only PTE in it's TLB.
>> 
>> Moving the invalidations into the TLB invalidation functions ensures
>> all invalidations happen at the same time as the CPU invalidation. The
>> architecture specific flush_tlb_all() routines do not call the
>> notifier as none of the IOMMUs require this.
>> 
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
>
> I found below kernel NULL-dereference issue on latest mm-unstable tree, and
> bisect points me to the commit of this patch, namely
> 75c400f82d347af1307010a3e06f3aa5d831d995.
>
> To reproduce, I use 'stress-ng --bigheap $(nproc)'.  The issue happens as soon
> as it starts reclaiming memory.  I didn't dive deep into this yet, but
> reporting this issue first, since you might have an idea already.

Thanks for the report SJ!

I see the problem - current->mm can (obviously!) be NULL, which is what's
leading to the NULL dereference. Instead I think on x86 I need to call
the notifier when adding the invalidate to the tlbbatch in
arch_tlbbatch_add_pending(), which is equivalent to what ARM64 does.
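
For reference, the faulting load is consistent with the inline notifier
wrapper checking the mm before walking the notifier list. A minimal
sketch of that path (assuming the usual mm_has_notifiers() helper, not
the exact upstream source):

    static inline void
    mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
                                                unsigned long start,
                                                unsigned long end)
    {
            /*
             * mm_has_notifiers() loads mm->notifier_subscriptions. With
             * mm == NULL - e.g. a kworker doing writeback/reclaim as in
             * the trace above - that load faults at a small offset from
             * zero, matching the CR2 value reported.
             */
            if (mm_has_notifiers(mm))
                    __mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
    }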

The below should fix it. Will do a respin with this.

---

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 837e4a50281a..79c46da919b9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -4,6 +4,7 @@
 
 #include <linux/mm_types.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/processor.h>
 #include <asm/cpufeature.h>
@@ -282,6 +283,7 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
 static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 0b990fb56b66..2d253919b3e8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1265,7 +1265,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	put_flush_tlb_info();
 	put_cpu();
-	mmu_notifier_arch_invalidate_secondary_tlbs(current->mm, 0, -1UL);
 }
 
 /*
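
For comparison, a rough sketch of the arm64 arrangement being mirrored
here (based on the arm64 batched TLB shootdown series; helper names and
signatures may differ slightly from the final code):

    static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
                                                 struct mm_struct *mm,
                                                 unsigned long uaddr)
    {
            /*
             * Issues the per-page TLBI and, with this series, the
             * secondary TLB notification, so the notifier always sees
             * the mm being unmapped rather than current->mm.
             */
            __flush_tlb_page_nosync(mm, uaddr);
    }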

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs
  2023-07-20  0:52       ` Alistair Popple
  (?)
@ 2023-07-20  1:31         ` SeongJae Park
  -1 siblings, 0 replies; 35+ messages in thread
From: SeongJae Park @ 2023-07-20  1:31 UTC (permalink / raw)
  To: Alistair Popple
  Cc: SeongJae Park, akpm, ajd, catalin.marinas, fbarrat, iommu, jgg,
	jhubbard, kevin.tian, kvm, linux-arm-kernel, linux-kernel,
	linux-mm, linuxppc-dev, mpe, nicolinc, npiggin, robin.murphy,
	seanjc, will, x86, zhi.wang.linux

On Thu, 20 Jul 2023 10:52:59 +1000 Alistair Popple <apopple@nvidia.com> wrote:

> 
> SeongJae Park <sj@kernel.org> writes:
> 
> > Hi Alistair,
> >
> > On Wed, 19 Jul 2023 22:18:44 +1000 Alistair Popple <apopple@nvidia.com> wrote:
> >
> >> The invalidate_range() is going to become an architecture specific mmu
> >> notifier used to keep the TLB of secondary MMUs such as an IOMMU in
> >> sync with the CPU page tables. Currently it is called from separate
> >> code paths to the main CPU TLB invalidations. This can lead to a
> >> secondary TLB not getting invalidated when required and makes it hard
> >> to reason about when exactly the secondary TLB is invalidated.
> >> 
> >> To fix this move the notifier call to the architecture specific TLB
> >> maintenance functions for architectures that have secondary MMUs
> >> requiring explicit software invalidations.
> >> 
> >> This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
> >> require a TLB invalidation. This invalidation is done by the
> >> architecutre specific ptep_set_access_flags() which calls
> >> flush_tlb_page() if required. However this doesn't call the notifier
> >> resulting in infinite faults being generated by devices using the SMMU
> >> if it has previously cached a read-only PTE in it's TLB.
> >> 
> >> Moving the invalidations into the TLB invalidation functions ensures
> >> all invalidations happen at the same time as the CPU invalidation. The
> >> architecture specific flush_tlb_all() routines do not call the
> >> notifier as none of the IOMMUs require this.
> >> 
> >> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> >
> > I found below kernel NULL-dereference issue on latest mm-unstable tree, and
> > bisect points me to the commit of this patch, namely
> > 75c400f82d347af1307010a3e06f3aa5d831d995.
> >
> > To reproduce, I use 'stress-ng --bigheap $(nproc)'.  The issue happens as soon
> > as it starts reclaiming memory.  I didn't dive deep into this yet, but
> > reporting this issue first, since you might have an idea already.
> 
> Thanks for the report SJ!
> 
> I see the problem - current->mm can (obviously!) be NULL which is what's
> leading to the NULL dereference. Instead I think on x86 I need to call
> the notifier when adding the invalidate to the tlbbatch in
> arch_tlbbatch_add_pending() which is equivalent to what ARM64 does.
> 
> The below should fix it. Will do a respin with this.

Thank you for this quick reply!  I confirm this fixes my issue.


Tested-by: SeongJae Park <sj@kernel.org>

> 
> ---
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 837e4a50281a..79c46da919b9 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -4,6 +4,7 @@
>  
>  #include <linux/mm_types.h>
>  #include <linux/sched.h>
> +#include <linux/mmu_notifier.h>

Nit.  How about putting it between mm_types.h and sched.h, so that it looks
alphabetically sorted?
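
I.e., just to illustrate the suggested ordering:

    #include <linux/mm_types.h>
    #include <linux/mmu_notifier.h>
    #include <linux/sched.h>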

>  
>  #include <asm/processor.h>
>  #include <asm/cpufeature.h>
> @@ -282,6 +283,7 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b
>  {
>  	inc_mm_tlb_gen(mm);
>  	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> +	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>  }
>  
>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 0b990fb56b66..2d253919b3e8 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1265,7 +1265,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  
>  	put_flush_tlb_info();
>  	put_cpu();
> -	mmu_notifier_arch_invalidate_secondary_tlbs(current->mm, 0, -1UL);
>  }
>  
>  /*
> 
> 


Thanks,
SJ

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs
  2023-07-20  0:52       ` Alistair Popple
@ 2023-07-24 18:18         ` Luis Chamberlain
  -1 siblings, 0 replies; 35+ messages in thread
From: Luis Chamberlain @ 2023-07-24 18:18 UTC (permalink / raw)
  To: Alistair Popple, linux-fsdevel, linux-xfs, Pankaj Raghav
  Cc: SeongJae Park, kevin.tian, x86, ajd, kvm, linux-mm,
	catalin.marinas, seanjc, will, linux-kernel, npiggin,
	zhi.wang.linux, jgg, iommu, nicolinc, jhubbard, fbarrat, akpm,
	linuxppc-dev, linux-arm-kernel, robin.murphy, mcgrof

Cc'ing fsdevel + xfs folks as this fixes a regression seen with the XFS
fstests case generic/176.

On Thu, Jul 20, 2023 at 10:52:59AM +1000, Alistair Popple wrote:
> 
> SeongJae Park <sj@kernel.org> writes:
> 
> > Hi Alistair,
> >
> > On Wed, 19 Jul 2023 22:18:44 +1000 Alistair Popple <apopple@nvidia.com> wrote:
> >
> >> The invalidate_range() is going to become an architecture specific mmu
> >> notifier used to keep the TLB of secondary MMUs such as an IOMMU in
> >> sync with the CPU page tables. Currently it is called from separate
> >> code paths to the main CPU TLB invalidations. This can lead to a
> >> secondary TLB not getting invalidated when required and makes it hard
> >> to reason about when exactly the secondary TLB is invalidated.
> >> 
> >> To fix this move the notifier call to the architecture specific TLB
> >> maintenance functions for architectures that have secondary MMUs
> >> requiring explicit software invalidations.
> >> 
> >> This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
> >> require a TLB invalidation. This invalidation is done by the
> >> architecutre specific ptep_set_access_flags() which calls
> >> flush_tlb_page() if required. However this doesn't call the notifier
> >> resulting in infinite faults being generated by devices using the SMMU
> >> if it has previously cached a read-only PTE in it's TLB.
> >> 
> >> Moving the invalidations into the TLB invalidation functions ensures
> >> all invalidations happen at the same time as the CPU invalidation. The
> >> architecture specific flush_tlb_all() routines do not call the
> >> notifier as none of the IOMMUs require this.
> >> 
> >> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> >
> > I found below kernel NULL-dereference issue on latest mm-unstable tree, and
> > bisect points me to the commit of this patch, namely
> > 75c400f82d347af1307010a3e06f3aa5d831d995.
> >
> > To reproduce, I use 'stress-ng --bigheap $(nproc)'.  The issue happens as soon
> > as it starts reclaiming memory.  I didn't dive deep into this yet, but
> > reporting this issue first, since you might have an idea already.
> 
> Thanks for the report SJ!
> 
> I see the problem - current->mm can (obviously!) be NULL which is what's
> leading to the NULL dereference. Instead I think on x86 I need to call
> the notifier when adding the invalidate to the tlbbatch in
> arch_tlbbatch_add_pending() which is equivalent to what ARM64 does.
> 
> The below should fix it. Will do a respin with this.
> 
> ---
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 837e4a50281a..79c46da919b9 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -4,6 +4,7 @@
>  
>  #include <linux/mm_types.h>
>  #include <linux/sched.h>
> +#include <linux/mmu_notifier.h>
>  
>  #include <asm/processor.h>
>  #include <asm/cpufeature.h>
> @@ -282,6 +283,7 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b
>  {
>  	inc_mm_tlb_gen(mm);
>  	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> +	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>  }
>  
>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 0b990fb56b66..2d253919b3e8 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1265,7 +1265,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  
>  	put_flush_tlb_info();
>  	put_cpu();
> -	mmu_notifier_arch_invalidate_secondary_tlbs(current->mm, 0, -1UL);
>  }
>  
>  /*

This patch also fixes a regression introduced on linux-next: the same
crash in arch_tlbbatch_flush() is reproducible with fstests generic/176
on XFS, and this patch fixes that regression [0]. It should also close
out the syzbot crash [1].

[0] https://gist.github.com/mcgrof/b37fc8cf7e6e1b3935242681de1a83e2
[1] https://lore.kernel.org/all/0000000000003afcb4060135a664@google.com/

Tested-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs
  2023-07-24 18:18         ` Luis Chamberlain
@ 2023-07-25  0:20           ` Alistair Popple
  -1 siblings, 0 replies; 35+ messages in thread
From: Alistair Popple @ 2023-07-25  0:20 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-fsdevel, linux-xfs, Pankaj Raghav, SeongJae Park,
	kevin.tian, x86, ajd, kvm, linux-mm, catalin.marinas, seanjc,
	will, linux-kernel, npiggin, zhi.wang.linux, jgg, iommu,
	nicolinc, jhubbard, fbarrat, akpm, linuxppc-dev,
	linux-arm-kernel, robin.murphy


Luis Chamberlain <mcgrof@kernel.org> writes:

>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>> index 837e4a50281a..79c46da919b9 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -4,6 +4,7 @@
>>  
>>  #include <linux/mm_types.h>
>>  #include <linux/sched.h>
>> +#include <linux/mmu_notifier.h>
>>  
>>  #include <asm/processor.h>
>>  #include <asm/cpufeature.h>
>> @@ -282,6 +283,7 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b
>>  {
>>  	inc_mm_tlb_gen(mm);
>>  	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
>> +	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>>  }
>>  
>>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 0b990fb56b66..2d253919b3e8 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -1265,7 +1265,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>>  
>>  	put_flush_tlb_info();
>>  	put_cpu();
>> -	mmu_notifier_arch_invalidate_secondary_tlbs(current->mm, 0, -1UL);
>>  }
>>  
>>  /*
>
> This patch also fixes a regression introduced on linux-next, the same
> crash on arch_tlbbatch_flush() is reproducible with fstests generic/176
> on XFS. This patch fixes that regression [0]. This should also close out
> the syzbot crash too [1]
>
> [0] https://gist.github.com/mcgrof/b37fc8cf7e6e1b3935242681de1a83e2
> [1] https://lore.kernel.org/all/0000000000003afcb4060135a664@google.com/
>
> Tested-by: Luis Chamberlain <mcgrof@kernel.org>

Thanks Luis. The above fix/respin is already in yesterday's linux-next
(next-20230724), so hopefully you are no longer seeing issues.

>   Luis


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs
  2023-07-19 12:18   ` Alistair Popple
@ 2023-07-25  3:41     ` Michael Ellerman
  -1 siblings, 0 replies; 35+ messages in thread
From: Michael Ellerman @ 2023-07-25  3:41 UTC (permalink / raw)
  To: Alistair Popple, akpm
  Cc: ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard, kevin.tian,
	kvm, linux-arm-kernel, linux-kernel, linux-mm, linuxppc-dev,
	nicolinc, npiggin, robin.murphy, seanjc, will, x86,
	zhi.wang.linux, Alistair Popple

Alistair Popple <apopple@nvidia.com> writes:
> The invalidate_range() is going to become an architecture specific mmu
> notifier used to keep the TLB of secondary MMUs such as an IOMMU in
> sync with the CPU page tables. Currently it is called from separate
> code paths to the main CPU TLB invalidations. This can lead to a
> secondary TLB not getting invalidated when required and makes it hard
> to reason about when exactly the secondary TLB is invalidated.
>
> To fix this move the notifier call to the architecture specific TLB
> maintenance functions for architectures that have secondary MMUs
> requiring explicit software invalidations.
>
> This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
> require a TLB invalidation. This invalidation is done by the
> architecutre specific ptep_set_access_flags() which calls
  ^
  architecture
  
> flush_tlb_page() if required. However this doesn't call the notifier
> resulting in infinite faults being generated by devices using the SMMU
> if it has previously cached a read-only PTE in it's TLB.
>
> Moving the invalidations into the TLB invalidation functions ensures
> all invalidations happen at the same time as the CPU invalidation. The
> architecture specific flush_tlb_all() routines do not call the
> notifier as none of the IOMMUs require this.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> 
...

> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index 0bd4866..9724b26 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -752,6 +752,8 @@ void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmadd
>  		return radix__local_flush_hugetlb_page(vma, vmaddr);
>  #endif
>  	radix__local_flush_tlb_page_psize(vma->vm_mm, vmaddr, mmu_virtual_psize);
> +	mmu_notifier_invalidate_range(vma->vm_mm, vmaddr,
> +						vmaddr + mmu_virtual_psize);
>  }
>  EXPORT_SYMBOL(radix__local_flush_tlb_page);

I think we can skip calling the notifier there? It's explicitly a local flush.

cheers

^ permalink raw reply	[flat|nested] 35+ messages in thread
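
The distinction being raised is roughly the following (a simplified sketch,
not the actual kernel code; the real routines in
arch/powerpc/mm/book3s64/radix_tlb.c also handle hugetlb pages and multiple
page sizes): a broadcast flush is where a secondary (IOMMU) TLB needs to be
invalidated as well, whereas a purely local CPU flush has no secondary-TLB
counterpart.

/* Simplified sketch, not the actual kernel code. */
void radix__flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr)
{
	/* Broadcast invalidation: other CPUs via tlbie, the IOMMU via the notifier. */
	radix__flush_tlb_page_psize(vma->vm_mm, vmaddr, mmu_virtual_psize);
	mmu_notifier_invalidate_range(vma->vm_mm, vmaddr,
				      vmaddr + mmu_virtual_psize);
}

void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr)
{
	/* Local-only invalidation (tlbiel): nothing beyond this CPU to notify. */
	radix__local_flush_tlb_page_psize(vma->vm_mm, vmaddr, mmu_virtual_psize);
}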

* Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs
  2023-07-25  3:41     ` Michael Ellerman
@ 2023-07-25  5:51       ` Alistair Popple
  -1 siblings, 0 replies; 35+ messages in thread
From: Alistair Popple @ 2023-07-25  5:51 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: akpm, ajd, catalin.marinas, fbarrat, iommu, jgg, jhubbard,
	kevin.tian, kvm, linux-arm-kernel, linux-kernel, linux-mm,
	linuxppc-dev, nicolinc, npiggin, robin.murphy, seanjc, will, x86,
	zhi.wang.linux


Michael Ellerman <mpe@ellerman.id.au> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>> The invalidate_range() is going to become an architecture specific mmu
>> notifier used to keep the TLB of secondary MMUs such as an IOMMU in
>> sync with the CPU page tables. Currently it is called from separate
>> code paths to the main CPU TLB invalidations. This can lead to a
>> secondary TLB not getting invalidated when required and makes it hard
>> to reason about when exactly the secondary TLB is invalidated.
>>
>> To fix this move the notifier call to the architecture specific TLB
>> maintenance functions for architectures that have secondary MMUs
>> requiring explicit software invalidations.
>>
>> This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
>> require a TLB invalidation. This invalidation is done by the
>> arahitecutre specific ptep_set_access_flags() which calls
>   ^
>   architecture

Oh. I'd forgotten to apt install codespell ;-)
  
>> flush_tlb_page() if required. However this doesn't call the notifier
>> resulting in infinite faults being generated by devices using the SMMU
>> if it has previously cached a read-only PTE in it's TLB.
>>
>> Moving the invalidations into the TLB invalidation functions ensures
>> all invalidations happen at the same time as the CPU invalidation. The
>> architecture specific flush_tlb_all() routines do not call the
>> notifier as none of the IOMMUs require this.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
>> 
> ...
>
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index 0bd4866..9724b26 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -752,6 +752,8 @@ void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmadd
>>  		return radix__local_flush_hugetlb_page(vma, vmaddr);
>>  #endif
>>  	radix__local_flush_tlb_page_psize(vma->vm_mm, vmaddr, mmu_virtual_psize);
>> +	mmu_notifier_invalidate_range(vma->vm_mm, vmaddr,
>> +						vmaddr + mmu_virtual_psize);
>>  }
>>  EXPORT_SYMBOL(radix__local_flush_tlb_page);
>
> I think we can skip calling the notifier there? It's explicitly a local flush.

I suspect you're correct. It's been a while since I last worked on the
PPC TLB invalidation code though, and it's changed a fair bit since
then, so I was being conservative and would appreciate any comments
there. I was worried I may have missed some clever optimisation that
detects when a local flush is all that's needed, but I see OCXL calls
mm_context_add_copro() — sketched below — so that should be ok. Will
respin and drop it.

> cheers


^ permalink raw reply	[flat|nested] 35+ messages in thread
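
The mm_context_add_copro() argument above boils down to this: once a
coprocessor (OCXL/Nest MMU) context is attached to an mm, TLB invalidations
for that mm are forced to be broadcast rather than local, so the explicitly
local flush helpers are never the only invalidation a device-visible mapping
receives. A simplified paraphrase of the powerpc helper (the real definition
is in arch/powerpc/include/asm/mmu_context.h and may differ in detail):

static inline void mm_context_add_copro(struct mm_struct *mm)
{
	/*
	 * Record that a coprocessor is using this mm and bump the active
	 * CPU count so that later TLB invalidations for the mm are issued
	 * as broadcast (tlbie) operations that also reach the Nest MMU,
	 * rather than CPU-local (tlbiel) ones.
	 */
	if (atomic_inc_return(&mm->context.copros) == 1)
		inc_mm_active_cpus(mm);
}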

end of thread, other threads:[~2023-07-25  5:53 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-19 12:18 [PATCH v2 0/5] Invalidate secondary IOMMU TLB on permission upgrade Alistair Popple
2023-07-19 12:18 ` Alistair Popple
2023-07-19 12:18 ` Alistair Popple
2023-07-19 12:18 ` [PATCH v2 1/5] arm64/smmu: Use TLBI ASID when invalidating entire range Alistair Popple
2023-07-19 12:18   ` Alistair Popple
2023-07-19 12:18   ` Alistair Popple
2023-07-19 12:18 ` [PATCH v2 2/5] mmu_notifiers: Fixup comment in mmu_interval_read_begin() Alistair Popple
2023-07-19 12:18   ` Alistair Popple
2023-07-19 12:18   ` Alistair Popple
2023-07-19 12:18 ` [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs Alistair Popple
2023-07-19 12:18   ` Alistair Popple
2023-07-19 12:18   ` Alistair Popple
2023-07-19 22:51   ` SeongJae Park
2023-07-19 22:51     ` SeongJae Park
2023-07-19 22:51     ` SeongJae Park
2023-07-20  0:52     ` Alistair Popple
2023-07-20  0:52       ` Alistair Popple
2023-07-20  0:52       ` Alistair Popple
2023-07-20  1:31       ` SeongJae Park
2023-07-20  1:31         ` SeongJae Park
2023-07-20  1:31         ` SeongJae Park
2023-07-24 18:18       ` Luis Chamberlain
2023-07-24 18:18         ` Luis Chamberlain
2023-07-25  0:20         ` Alistair Popple
2023-07-25  0:20           ` Alistair Popple
2023-07-25  3:41   ` Michael Ellerman
2023-07-25  3:41     ` Michael Ellerman
2023-07-25  5:51     ` Alistair Popple
2023-07-25  5:51       ` Alistair Popple
2023-07-19 12:18 ` [PATCH v2 4/5] mmu_notifiers: Don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end() Alistair Popple
2023-07-19 12:18   ` Alistair Popple
2023-07-19 12:18   ` Alistair Popple
2023-07-19 12:18 ` [PATCH v2 5/5] mmu_notifiers: Rename invalidate_range notifier Alistair Popple
2023-07-19 12:18   ` Alistair Popple
2023-07-19 12:18   ` Alistair Popple

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.