All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] powerpc: pSeries: vfio: iommu: Re-enable support for SPAPR TCE VFIO
@ 2024-03-12 18:14 ` Shivaprasad G Bhat
  0 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-12 18:14 UTC (permalink / raw)
  To: tpearson, alex.williamson, linuxppc-dev
  Cc: mpe, npiggin, christophe.leroy, aneesh.kumar, naveen.n.rao,
	gbatra, brking, sbhat, aik, jgg, robh, linux-kernel, kvm, aik,
	msuchanek, jroedel, vaibhav, svaidy

This patch-set fills the missing gaps on pSeries for VFIO SPAPR TCE
sub-driver thereby re-enabling the VFIO support on POWER pSeries
machines.

Structure of the patchset
=========================

* Currently, due to [1] VFIO_IOMMU_UNMAP_{DMA,UNMAP} ioctls are broken
on pSeries as it only supports single level TCEs. This is addressed in
the first patch as the lost functionality is restored.

On pSeries, the DMA windows (default 32-bit and the 64-bit DDW) are
"borrowed" between the host and the vfio driver. The necessary
mechanism for this is already been in place since [2]. However,
the VFIO SPAPR-TCE sub-driver doesn't open the 64-bit window if
it wasn't already done by a host driver like NVME. So the user-space
only gets access to the default 32-bit DMA window alone. This poses a
challenge for devices having no host kernel drivers and completely
depend on VFIO user-space interface to request DMA. The
VFIO_SPAPR_TCE_CREATE ioctl currently just returns EPERM without
attempting to open the second window for such devices.

* The second patch is just code movement for pSeries specific
functions from arch iommu to the pSeries platform iommu file.
This is needed as the DMA window manipulation operations
introduced in the third patch depend on these functions and are
entirely pSeries specific.

* The third patch adds necessary support to open up the 64-bit DMA
window on VFIO_SPAPR_TCE_CREATE ioctl from the user-space. It also
collects the DDW information from the platform for exposing it to
user through 'struct vfio_iommu_spapr_tce_ddw_info'.

Testing
========
These patches are tested with by attaching a nvme disk to a nested
kvm guest running a pSeries lpar. Also vfio-test [3] by Alex Willamson,
was forked and updated to add support for pSeries guest and used to
test these patches[4].

Limitations/Known Issues
========================
* Does not work for SRIOV VFs, as they have only one DMA window.
* Does not work for multi-function cards.
* Bugs
  - mmdrop() in tce_iommu_release() when container detached
    with pending unmaps.

[1] Commit: 090bad39b237 ("powerpc/powernv: Add indirect levels to it_userspace")
[2] Commit: 9d67c9433509 ("powerpc/iommu: Add \"borrowing\" iommu_table_group_ops")
[3] https://github.com/awilliam/tests
[4] https://github.com/nnmwebmin/vfio-ppc-tests/tree/vfio-ppc-ex

---

Shivaprasad G Bhat (3):
      powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
      powerpc/iommu: Move pSeries specific functions to pseries/iommu.c
      pseries/iommu: Enable DDW for VFIO TCE create


 arch/powerpc/include/asm/iommu.h       |   7 +-
 arch/powerpc/kernel/iommu.c            | 156 +-------
 arch/powerpc/platforms/pseries/iommu.c | 514 ++++++++++++++++++++++++-
 drivers/vfio/vfio_iommu_spapr_tce.c    |  51 +++
 4 files changed, 571 insertions(+), 157 deletions(-)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 0/3] powerpc: pSeries: vfio: iommu: Re-enable support for SPAPR TCE VFIO
@ 2024-03-12 18:14 ` Shivaprasad G Bhat
  0 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-12 18:14 UTC (permalink / raw)
  To: tpearson, alex.williamson, linuxppc-dev
  Cc: robh, jroedel, sbhat, gbatra, jgg, aik, linux-kernel, svaidy,
	aneesh.kumar, brking, npiggin, kvm, naveen.n.rao, vaibhav,
	msuchanek, aik

This patch-set fills the missing gaps on pSeries for VFIO SPAPR TCE
sub-driver thereby re-enabling the VFIO support on POWER pSeries
machines.

Structure of the patchset
=========================

* Currently, due to [1] VFIO_IOMMU_UNMAP_{DMA,UNMAP} ioctls are broken
on pSeries as it only supports single level TCEs. This is addressed in
the first patch as the lost functionality is restored.

On pSeries, the DMA windows (default 32-bit and the 64-bit DDW) are
"borrowed" between the host and the vfio driver. The necessary
mechanism for this is already been in place since [2]. However,
the VFIO SPAPR-TCE sub-driver doesn't open the 64-bit window if
it wasn't already done by a host driver like NVME. So the user-space
only gets access to the default 32-bit DMA window alone. This poses a
challenge for devices having no host kernel drivers and completely
depend on VFIO user-space interface to request DMA. The
VFIO_SPAPR_TCE_CREATE ioctl currently just returns EPERM without
attempting to open the second window for such devices.

* The second patch is just code movement for pSeries specific
functions from arch iommu to the pSeries platform iommu file.
This is needed as the DMA window manipulation operations
introduced in the third patch depend on these functions and are
entirely pSeries specific.

* The third patch adds necessary support to open up the 64-bit DMA
window on VFIO_SPAPR_TCE_CREATE ioctl from the user-space. It also
collects the DDW information from the platform for exposing it to
user through 'struct vfio_iommu_spapr_tce_ddw_info'.

Testing
========
These patches are tested with by attaching a nvme disk to a nested
kvm guest running a pSeries lpar. Also vfio-test [3] by Alex Willamson,
was forked and updated to add support for pSeries guest and used to
test these patches[4].

Limitations/Known Issues
========================
* Does not work for SRIOV VFs, as they have only one DMA window.
* Does not work for multi-function cards.
* Bugs
  - mmdrop() in tce_iommu_release() when container detached
    with pending unmaps.

[1] Commit: 090bad39b237 ("powerpc/powernv: Add indirect levels to it_userspace")
[2] Commit: 9d67c9433509 ("powerpc/iommu: Add \"borrowing\" iommu_table_group_ops")
[3] https://github.com/awilliam/tests
[4] https://github.com/nnmwebmin/vfio-ppc-tests/tree/vfio-ppc-ex

---

Shivaprasad G Bhat (3):
      powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
      powerpc/iommu: Move pSeries specific functions to pseries/iommu.c
      pseries/iommu: Enable DDW for VFIO TCE create


 arch/powerpc/include/asm/iommu.h       |   7 +-
 arch/powerpc/kernel/iommu.c            | 156 +-------
 arch/powerpc/platforms/pseries/iommu.c | 514 ++++++++++++++++++++++++-
 drivers/vfio/vfio_iommu_spapr_tce.c    |  51 +++
 4 files changed, 571 insertions(+), 157 deletions(-)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
  2024-03-12 18:14 ` Shivaprasad G Bhat
@ 2024-03-12 18:14   ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-12 18:14 UTC (permalink / raw)
  To: tpearson, alex.williamson, linuxppc-dev
  Cc: mpe, npiggin, christophe.leroy, aneesh.kumar, naveen.n.rao,
	gbatra, brking, sbhat, aik, jgg, robh, linux-kernel, kvm, aik,
	msuchanek, jroedel, vaibhav, svaidy

The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
it_userspace") which implemented the tce indirect levels
support for PowerNV ended up removing the single level support
which existed by default(generic tce_iommu_userspace_view_alloc/free()
calls). On pSeries the TCEs are single level, and the allocation
of userspace view is lost with the removal of generic code.

The patch attempts to bring it back for pseries on the refactored
code base.

On pSeries, the windows/tables are "borrowed", so the it_ops->free()
is not called during the container detach or the tce release call paths
as the table is not really freed. So, decoupling the userspace view
array free and alloc from table's it_ops just the way it was before.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/iommu.c |   19 ++++++++++--
 drivers/vfio/vfio_iommu_spapr_tce.c    |   51 ++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e8c4129697b1..40de8d55faef 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -143,7 +143,7 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
 }
 
 
-static void tce_free_pSeries(struct iommu_table *tbl, long index, long npages)
+static void tce_clear_pSeries(struct iommu_table *tbl, long index, long npages)
 {
 	__be64 *tcep;
 
@@ -162,6 +162,11 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
 	return be64_to_cpu(*tcep);
 }
 
+static void tce_free_pSeries(struct iommu_table *tbl)
+{
+	/* Do nothing. */
+}
+
 static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
 static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
@@ -576,7 +581,7 @@ struct iommu_table_ops iommu_table_lpar_multi_ops;
 
 struct iommu_table_ops iommu_table_pseries_ops = {
 	.set = tce_build_pSeries,
-	.clear = tce_free_pSeries,
+	.clear = tce_clear_pSeries,
 	.get = tce_get_pseries
 };
 
@@ -685,15 +690,23 @@ static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned
 
 	return rc;
 }
+
+static __be64 *tce_useraddr_pSeriesLP(struct iommu_table *tbl, long index,
+				      bool __always_unused alloc)
+{
+	return tbl->it_userspace ? &tbl->it_userspace[index - tbl->it_offset] : NULL;
+}
 #endif
 
 struct iommu_table_ops iommu_table_lpar_multi_ops = {
 	.set = tce_buildmulti_pSeriesLP,
 #ifdef CONFIG_IOMMU_API
 	.xchg_no_kill = tce_exchange_pseries,
+	.useraddrptr = tce_useraddr_pSeriesLP,
 #endif
 	.clear = tce_freemulti_pSeriesLP,
-	.get = tce_get_pSeriesLP
+	.get = tce_get_pSeriesLP,
+	.free = tce_free_pSeries
 };
 
 /*
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a94ec6225d31..1cf36d687559 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -177,6 +177,50 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	return ret;
 }
 
+static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
+		struct mm_struct *mm)
+{
+	unsigned long cb = ALIGN(sizeof(tbl->it_userspace[0]) *
+			tbl->it_size, PAGE_SIZE);
+	unsigned long *uas;
+	long ret;
+
+	if (tbl->it_indirect_levels)
+		return 0;
+
+	WARN_ON(tbl->it_userspace);
+
+	ret = account_locked_vm(mm, cb >> PAGE_SHIFT, true);
+	if (ret)
+		return ret;
+
+	uas = vzalloc(cb);
+	if (!uas) {
+		account_locked_vm(mm, cb >> PAGE_SHIFT, false);
+		return -ENOMEM;
+	}
+	tbl->it_userspace = (__be64 *) uas;
+
+	return 0;
+}
+
+static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
+		struct mm_struct *mm)
+{
+	unsigned long cb = ALIGN(sizeof(tbl->it_userspace[0]) *
+			tbl->it_size, PAGE_SIZE);
+
+	if (!tbl->it_userspace)
+		return;
+
+	if (tbl->it_indirect_levels)
+		return;
+
+	vfree(tbl->it_userspace);
+	tbl->it_userspace = NULL;
+	account_locked_vm(mm, cb >> PAGE_SHIFT, false);
+}
+
 static bool tce_page_is_contained(struct mm_struct *mm, unsigned long hpa,
 		unsigned int it_page_shift)
 {
@@ -554,6 +598,12 @@ static long tce_iommu_build_v2(struct tce_container *container,
 	unsigned long hpa;
 	enum dma_data_direction dirtmp;
 
+	if (!tbl->it_userspace) {
+		ret = tce_iommu_userspace_view_alloc(tbl, container->mm);
+		if (ret)
+			return ret;
+	}
+
 	for (i = 0; i < pages; ++i) {
 		struct mm_iommu_table_group_mem_t *mem = NULL;
 		__be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry + i);
@@ -637,6 +687,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 {
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
+	tce_iommu_userspace_view_free(tbl, container->mm);
 	iommu_tce_table_put(tbl);
 	account_locked_vm(container->mm, pages, false);
 }



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
@ 2024-03-12 18:14   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-12 18:14 UTC (permalink / raw)
  To: tpearson, alex.williamson, linuxppc-dev
  Cc: robh, jroedel, sbhat, gbatra, jgg, aik, linux-kernel, svaidy,
	aneesh.kumar, brking, npiggin, kvm, naveen.n.rao, vaibhav,
	msuchanek, aik

The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
it_userspace") which implemented the tce indirect levels
support for PowerNV ended up removing the single level support
which existed by default(generic tce_iommu_userspace_view_alloc/free()
calls). On pSeries the TCEs are single level, and the allocation
of userspace view is lost with the removal of generic code.

The patch attempts to bring it back for pseries on the refactored
code base.

On pSeries, the windows/tables are "borrowed", so the it_ops->free()
is not called during the container detach or the tce release call paths
as the table is not really freed. So, decoupling the userspace view
array free and alloc from table's it_ops just the way it was before.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/iommu.c |   19 ++++++++++--
 drivers/vfio/vfio_iommu_spapr_tce.c    |   51 ++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e8c4129697b1..40de8d55faef 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -143,7 +143,7 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
 }
 
 
-static void tce_free_pSeries(struct iommu_table *tbl, long index, long npages)
+static void tce_clear_pSeries(struct iommu_table *tbl, long index, long npages)
 {
 	__be64 *tcep;
 
@@ -162,6 +162,11 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
 	return be64_to_cpu(*tcep);
 }
 
+static void tce_free_pSeries(struct iommu_table *tbl)
+{
+	/* Do nothing. */
+}
+
 static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
 static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
@@ -576,7 +581,7 @@ struct iommu_table_ops iommu_table_lpar_multi_ops;
 
 struct iommu_table_ops iommu_table_pseries_ops = {
 	.set = tce_build_pSeries,
-	.clear = tce_free_pSeries,
+	.clear = tce_clear_pSeries,
 	.get = tce_get_pseries
 };
 
@@ -685,15 +690,23 @@ static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned
 
 	return rc;
 }
+
+static __be64 *tce_useraddr_pSeriesLP(struct iommu_table *tbl, long index,
+				      bool __always_unused alloc)
+{
+	return tbl->it_userspace ? &tbl->it_userspace[index - tbl->it_offset] : NULL;
+}
 #endif
 
 struct iommu_table_ops iommu_table_lpar_multi_ops = {
 	.set = tce_buildmulti_pSeriesLP,
 #ifdef CONFIG_IOMMU_API
 	.xchg_no_kill = tce_exchange_pseries,
+	.useraddrptr = tce_useraddr_pSeriesLP,
 #endif
 	.clear = tce_freemulti_pSeriesLP,
-	.get = tce_get_pSeriesLP
+	.get = tce_get_pSeriesLP,
+	.free = tce_free_pSeries
 };
 
 /*
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a94ec6225d31..1cf36d687559 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -177,6 +177,50 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	return ret;
 }
 
+static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
+		struct mm_struct *mm)
+{
+	unsigned long cb = ALIGN(sizeof(tbl->it_userspace[0]) *
+			tbl->it_size, PAGE_SIZE);
+	unsigned long *uas;
+	long ret;
+
+	if (tbl->it_indirect_levels)
+		return 0;
+
+	WARN_ON(tbl->it_userspace);
+
+	ret = account_locked_vm(mm, cb >> PAGE_SHIFT, true);
+	if (ret)
+		return ret;
+
+	uas = vzalloc(cb);
+	if (!uas) {
+		account_locked_vm(mm, cb >> PAGE_SHIFT, false);
+		return -ENOMEM;
+	}
+	tbl->it_userspace = (__be64 *) uas;
+
+	return 0;
+}
+
+static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
+		struct mm_struct *mm)
+{
+	unsigned long cb = ALIGN(sizeof(tbl->it_userspace[0]) *
+			tbl->it_size, PAGE_SIZE);
+
+	if (!tbl->it_userspace)
+		return;
+
+	if (tbl->it_indirect_levels)
+		return;
+
+	vfree(tbl->it_userspace);
+	tbl->it_userspace = NULL;
+	account_locked_vm(mm, cb >> PAGE_SHIFT, false);
+}
+
 static bool tce_page_is_contained(struct mm_struct *mm, unsigned long hpa,
 		unsigned int it_page_shift)
 {
@@ -554,6 +598,12 @@ static long tce_iommu_build_v2(struct tce_container *container,
 	unsigned long hpa;
 	enum dma_data_direction dirtmp;
 
+	if (!tbl->it_userspace) {
+		ret = tce_iommu_userspace_view_alloc(tbl, container->mm);
+		if (ret)
+			return ret;
+	}
+
 	for (i = 0; i < pages; ++i) {
 		struct mm_iommu_table_group_mem_t *mem = NULL;
 		__be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry + i);
@@ -637,6 +687,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 {
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
+	tce_iommu_userspace_view_free(tbl, container->mm);
 	iommu_tce_table_put(tbl);
 	account_locked_vm(container->mm, pages, false);
 }



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH 2/3] powerpc/iommu: Move pSeries specific functions to pseries/iommu.c
  2024-03-12 18:14 ` Shivaprasad G Bhat
@ 2024-03-12 18:14   ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-12 18:14 UTC (permalink / raw)
  To: tpearson, alex.williamson, linuxppc-dev
  Cc: mpe, npiggin, christophe.leroy, aneesh.kumar, naveen.n.rao,
	gbatra, brking, sbhat, aik, jgg, robh, linux-kernel, kvm, aik,
	msuchanek, jroedel, vaibhav, svaidy

The PowerNV specific table_group_ops are defined in powernv/pci-ioda.c.
The pSeries specific table_group_ops are sitting in the generic powerpc
file. Move it to where it actually belong(pseries/iommu.c).

Only code movement, no functional changes intended.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 arch/powerpc/include/asm/iommu.h       |    4 +
 arch/powerpc/kernel/iommu.c            |  149 --------------------------------
 arch/powerpc/platforms/pseries/iommu.c |  145 +++++++++++++++++++++++++++++++
 3 files changed, 150 insertions(+), 148 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 026695943550..744cc5fc22d3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -156,6 +156,9 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
 extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
 		int nid, unsigned long res_start, unsigned long res_end);
 bool iommu_table_in_use(struct iommu_table *tbl);
+extern void iommu_table_reserve_pages(struct iommu_table *tbl,
+		unsigned long res_start, unsigned long res_end);
+extern void iommu_table_clear(struct iommu_table *tbl);
 
 #define IOMMU_TABLE_GROUP_MAX_TABLES	2
 
@@ -218,7 +221,6 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm,
 extern void iommu_tce_kill(struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages);
 
-extern struct iommu_table_group_ops spapr_tce_table_group_ops;
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 1185efebf032..aa11b2acf24f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -642,7 +642,7 @@ void ppc_iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
 		tbl->it_ops->flush(tbl);
 }
 
-static void iommu_table_clear(struct iommu_table *tbl)
+void iommu_table_clear(struct iommu_table *tbl)
 {
 	/*
 	 * In case of firmware assisted dump system goes through clean
@@ -683,7 +683,7 @@ static void iommu_table_clear(struct iommu_table *tbl)
 #endif
 }
 
-static void iommu_table_reserve_pages(struct iommu_table *tbl,
+void iommu_table_reserve_pages(struct iommu_table *tbl,
 		unsigned long res_start, unsigned long res_end)
 {
 	int i;
@@ -1101,59 +1101,6 @@ void iommu_tce_kill(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_kill);
 
-#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
-static int iommu_take_ownership(struct iommu_table *tbl)
-{
-	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
-	int ret = 0;
-
-	/*
-	 * VFIO does not control TCE entries allocation and the guest
-	 * can write new TCEs on top of existing ones so iommu_tce_build()
-	 * must be able to release old pages. This functionality
-	 * requires exchange() callback defined so if it is not
-	 * implemented, we disallow taking ownership over the table.
-	 */
-	if (!tbl->it_ops->xchg_no_kill)
-		return -EINVAL;
-
-	spin_lock_irqsave(&tbl->large_pool.lock, flags);
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
-
-	if (iommu_table_in_use(tbl)) {
-		pr_err("iommu_tce: it_map is not empty");
-		ret = -EBUSY;
-	} else {
-		memset(tbl->it_map, 0xff, sz);
-	}
-
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_unlock(&tbl->pools[i].lock);
-	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
-
-	return ret;
-}
-
-static void iommu_release_ownership(struct iommu_table *tbl)
-{
-	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
-
-	spin_lock_irqsave(&tbl->large_pool.lock, flags);
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
-
-	memset(tbl->it_map, 0, sz);
-
-	iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-			tbl->it_reserved_end);
-
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_unlock(&tbl->pools[i].lock);
-	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
-}
-#endif
-
 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
 	/*
@@ -1185,98 +1132,6 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
-/*
- * A simple iommu_table_group_ops which only allows reusing the existing
- * iommu_table. This handles VFIO for POWER7 or the nested KVM.
- * The ops does not allow creating windows and only allows reusing the existing
- * one if it matches table_group->tce32_start/tce32_size/page_shift.
- */
-static unsigned long spapr_tce_get_table_size(__u32 page_shift,
-					      __u64 window_size, __u32 levels)
-{
-	unsigned long size;
-
-	if (levels > 1)
-		return ~0U;
-	size = window_size >> (page_shift - 3);
-	return size;
-}
-
-static long spapr_tce_create_table(struct iommu_table_group *table_group, int num,
-				   __u32 page_shift, __u64 window_size, __u32 levels,
-				   struct iommu_table **ptbl)
-{
-	struct iommu_table *tbl = table_group->tables[0];
-
-	if (num > 0)
-		return -EPERM;
-
-	if (tbl->it_page_shift != page_shift ||
-	    tbl->it_size != (window_size >> page_shift) ||
-	    tbl->it_indirect_levels != levels - 1)
-		return -EINVAL;
-
-	*ptbl = iommu_tce_table_get(tbl);
-	return 0;
-}
-
-static long spapr_tce_set_window(struct iommu_table_group *table_group,
-				 int num, struct iommu_table *tbl)
-{
-	return tbl == table_group->tables[num] ? 0 : -EPERM;
-}
-
-static long spapr_tce_unset_window(struct iommu_table_group *table_group, int num)
-{
-	return 0;
-}
-
-static long spapr_tce_take_ownership(struct iommu_table_group *table_group)
-{
-	int i, j, rc = 0;
-
-	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
-		struct iommu_table *tbl = table_group->tables[i];
-
-		if (!tbl || !tbl->it_map)
-			continue;
-
-		rc = iommu_take_ownership(tbl);
-		if (!rc)
-			continue;
-
-		for (j = 0; j < i; ++j)
-			iommu_release_ownership(table_group->tables[j]);
-		return rc;
-	}
-	return 0;
-}
-
-static void spapr_tce_release_ownership(struct iommu_table_group *table_group)
-{
-	int i;
-
-	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
-		struct iommu_table *tbl = table_group->tables[i];
-
-		if (!tbl)
-			continue;
-
-		iommu_table_clear(tbl);
-		if (tbl->it_map)
-			iommu_release_ownership(tbl);
-	}
-}
-
-struct iommu_table_group_ops spapr_tce_table_group_ops = {
-	.get_table_size = spapr_tce_get_table_size,
-	.create_table = spapr_tce_create_table,
-	.set_window = spapr_tce_set_window,
-	.unset_window = spapr_tce_unset_window,
-	.take_ownership = spapr_tce_take_ownership,
-	.release_ownership = spapr_tce_release_ownership,
-};
-
 /*
  * A simple iommu_ops to allow less cruft in generic VFIO code.
  */
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 40de8d55faef..3d9865dadf73 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -54,6 +54,57 @@ enum {
 	DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
+static int iommu_take_ownership(struct iommu_table *tbl)
+{
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+	int ret = 0;
+
+	/*
+	 * VFIO does not control TCE entries allocation and the guest
+	 * can write new TCEs on top of existing ones so iommu_tce_build()
+	 * must be able to release old pages. This functionality
+	 * requires exchange() callback defined so if it is not
+	 * implemented, we disallow taking ownership over the table.
+	 */
+	if (!tbl->it_ops->xchg_no_kill)
+		return -EINVAL;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
+
+	if (iommu_table_in_use(tbl)) {
+		pr_err("iommu_tce: it_map is not empty");
+		ret = -EBUSY;
+	} else {
+		memset(tbl->it_map, 0xff, sz);
+	}
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+
+	return ret;
+}
+
+static void iommu_release_ownership(struct iommu_table *tbl)
+{
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
+
+	memset(tbl->it_map, 0, sz);
+
+	iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
+			tbl->it_reserved_end);
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+}
+
 static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
 	struct iommu_table *tbl;
@@ -67,6 +118,8 @@ static struct iommu_table *iommu_pseries_alloc_table(int node)
 	return tbl;
 }
 
+struct iommu_table_group_ops spapr_tce_table_group_ops;
+
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
 	struct iommu_table_group *table_group;
@@ -1656,6 +1709,98 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
 	return false;
 }
 
+/*
+ * A simple iommu_table_group_ops which only allows reusing the existing
+ * iommu_table. This handles VFIO for POWER7 or the nested KVM.
+ * The ops does not allow creating windows and only allows reusing the existing
+ * one if it matches table_group->tce32_start/tce32_size/page_shift.
+ */
+static unsigned long spapr_tce_get_table_size(__u32 page_shift,
+					      __u64 window_size, __u32 levels)
+{
+	unsigned long size;
+
+	if (levels > 1)
+		return ~0U;
+	size = window_size >> (page_shift - 3);
+	return size;
+}
+
+static long spapr_tce_create_table(struct iommu_table_group *table_group, int num,
+				   __u32 page_shift, __u64 window_size, __u32 levels,
+				   struct iommu_table **ptbl)
+{
+	struct iommu_table *tbl = table_group->tables[0];
+
+	if (num > 0)
+		return -EPERM;
+
+	if (tbl->it_page_shift != page_shift ||
+	    tbl->it_size != (window_size >> page_shift) ||
+	    tbl->it_indirect_levels != levels - 1)
+		return -EINVAL;
+
+	*ptbl = iommu_tce_table_get(tbl);
+	return 0;
+}
+
+static long spapr_tce_set_window(struct iommu_table_group *table_group,
+				 int num, struct iommu_table *tbl)
+{
+	return tbl == table_group->tables[num] ? 0 : -EPERM;
+}
+
+static long spapr_tce_unset_window(struct iommu_table_group *table_group, int num)
+{
+	return 0;
+}
+
+static long spapr_tce_take_ownership(struct iommu_table_group *table_group)
+{
+	int i, j, rc = 0;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = table_group->tables[i];
+
+		if (!tbl || !tbl->it_map)
+			continue;
+
+		rc = iommu_take_ownership(tbl);
+		if (!rc)
+			continue;
+
+		for (j = 0; j < i; ++j)
+			iommu_release_ownership(table_group->tables[j]);
+		return rc;
+	}
+	return 0;
+}
+
+static void spapr_tce_release_ownership(struct iommu_table_group *table_group)
+{
+	int i;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = table_group->tables[i];
+
+		if (!tbl)
+			continue;
+
+		iommu_table_clear(tbl);
+		if (tbl->it_map)
+			iommu_release_ownership(tbl);
+	}
+}
+
+struct iommu_table_group_ops spapr_tce_table_group_ops = {
+	.get_table_size = spapr_tce_get_table_size,
+	.create_table = spapr_tce_create_table,
+	.set_window = spapr_tce_set_window,
+	.unset_window = spapr_tce_unset_window,
+	.take_ownership = spapr_tce_take_ownership,
+	.release_ownership = spapr_tce_release_ownership,
+};
+
 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
 		void *data)
 {



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH 2/3] powerpc/iommu: Move pSeries specific functions to pseries/iommu.c
@ 2024-03-12 18:14   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-12 18:14 UTC (permalink / raw)
  To: tpearson, alex.williamson, linuxppc-dev
  Cc: robh, jroedel, sbhat, gbatra, jgg, aik, linux-kernel, svaidy,
	aneesh.kumar, brking, npiggin, kvm, naveen.n.rao, vaibhav,
	msuchanek, aik

The PowerNV specific table_group_ops are defined in powernv/pci-ioda.c.
The pSeries specific table_group_ops are sitting in the generic powerpc
file. Move it to where it actually belong(pseries/iommu.c).

Only code movement, no functional changes intended.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 arch/powerpc/include/asm/iommu.h       |    4 +
 arch/powerpc/kernel/iommu.c            |  149 --------------------------------
 arch/powerpc/platforms/pseries/iommu.c |  145 +++++++++++++++++++++++++++++++
 3 files changed, 150 insertions(+), 148 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 026695943550..744cc5fc22d3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -156,6 +156,9 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
 extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
 		int nid, unsigned long res_start, unsigned long res_end);
 bool iommu_table_in_use(struct iommu_table *tbl);
+extern void iommu_table_reserve_pages(struct iommu_table *tbl,
+		unsigned long res_start, unsigned long res_end);
+extern void iommu_table_clear(struct iommu_table *tbl);
 
 #define IOMMU_TABLE_GROUP_MAX_TABLES	2
 
@@ -218,7 +221,6 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm,
 extern void iommu_tce_kill(struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages);
 
-extern struct iommu_table_group_ops spapr_tce_table_group_ops;
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 1185efebf032..aa11b2acf24f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -642,7 +642,7 @@ void ppc_iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
 		tbl->it_ops->flush(tbl);
 }
 
-static void iommu_table_clear(struct iommu_table *tbl)
+void iommu_table_clear(struct iommu_table *tbl)
 {
 	/*
 	 * In case of firmware assisted dump system goes through clean
@@ -683,7 +683,7 @@ static void iommu_table_clear(struct iommu_table *tbl)
 #endif
 }
 
-static void iommu_table_reserve_pages(struct iommu_table *tbl,
+void iommu_table_reserve_pages(struct iommu_table *tbl,
 		unsigned long res_start, unsigned long res_end)
 {
 	int i;
@@ -1101,59 +1101,6 @@ void iommu_tce_kill(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_kill);
 
-#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
-static int iommu_take_ownership(struct iommu_table *tbl)
-{
-	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
-	int ret = 0;
-
-	/*
-	 * VFIO does not control TCE entries allocation and the guest
-	 * can write new TCEs on top of existing ones so iommu_tce_build()
-	 * must be able to release old pages. This functionality
-	 * requires exchange() callback defined so if it is not
-	 * implemented, we disallow taking ownership over the table.
-	 */
-	if (!tbl->it_ops->xchg_no_kill)
-		return -EINVAL;
-
-	spin_lock_irqsave(&tbl->large_pool.lock, flags);
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
-
-	if (iommu_table_in_use(tbl)) {
-		pr_err("iommu_tce: it_map is not empty");
-		ret = -EBUSY;
-	} else {
-		memset(tbl->it_map, 0xff, sz);
-	}
-
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_unlock(&tbl->pools[i].lock);
-	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
-
-	return ret;
-}
-
-static void iommu_release_ownership(struct iommu_table *tbl)
-{
-	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
-
-	spin_lock_irqsave(&tbl->large_pool.lock, flags);
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
-
-	memset(tbl->it_map, 0, sz);
-
-	iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-			tbl->it_reserved_end);
-
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_unlock(&tbl->pools[i].lock);
-	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
-}
-#endif
-
 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
 	/*
@@ -1185,98 +1132,6 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
-/*
- * A simple iommu_table_group_ops which only allows reusing the existing
- * iommu_table. This handles VFIO for POWER7 or the nested KVM.
- * The ops does not allow creating windows and only allows reusing the existing
- * one if it matches table_group->tce32_start/tce32_size/page_shift.
- */
-static unsigned long spapr_tce_get_table_size(__u32 page_shift,
-					      __u64 window_size, __u32 levels)
-{
-	unsigned long size;
-
-	if (levels > 1)
-		return ~0U;
-	size = window_size >> (page_shift - 3);
-	return size;
-}
-
-static long spapr_tce_create_table(struct iommu_table_group *table_group, int num,
-				   __u32 page_shift, __u64 window_size, __u32 levels,
-				   struct iommu_table **ptbl)
-{
-	struct iommu_table *tbl = table_group->tables[0];
-
-	if (num > 0)
-		return -EPERM;
-
-	if (tbl->it_page_shift != page_shift ||
-	    tbl->it_size != (window_size >> page_shift) ||
-	    tbl->it_indirect_levels != levels - 1)
-		return -EINVAL;
-
-	*ptbl = iommu_tce_table_get(tbl);
-	return 0;
-}
-
-static long spapr_tce_set_window(struct iommu_table_group *table_group,
-				 int num, struct iommu_table *tbl)
-{
-	return tbl == table_group->tables[num] ? 0 : -EPERM;
-}
-
-static long spapr_tce_unset_window(struct iommu_table_group *table_group, int num)
-{
-	return 0;
-}
-
-static long spapr_tce_take_ownership(struct iommu_table_group *table_group)
-{
-	int i, j, rc = 0;
-
-	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
-		struct iommu_table *tbl = table_group->tables[i];
-
-		if (!tbl || !tbl->it_map)
-			continue;
-
-		rc = iommu_take_ownership(tbl);
-		if (!rc)
-			continue;
-
-		for (j = 0; j < i; ++j)
-			iommu_release_ownership(table_group->tables[j]);
-		return rc;
-	}
-	return 0;
-}
-
-static void spapr_tce_release_ownership(struct iommu_table_group *table_group)
-{
-	int i;
-
-	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
-		struct iommu_table *tbl = table_group->tables[i];
-
-		if (!tbl)
-			continue;
-
-		iommu_table_clear(tbl);
-		if (tbl->it_map)
-			iommu_release_ownership(tbl);
-	}
-}
-
-struct iommu_table_group_ops spapr_tce_table_group_ops = {
-	.get_table_size = spapr_tce_get_table_size,
-	.create_table = spapr_tce_create_table,
-	.set_window = spapr_tce_set_window,
-	.unset_window = spapr_tce_unset_window,
-	.take_ownership = spapr_tce_take_ownership,
-	.release_ownership = spapr_tce_release_ownership,
-};
-
 /*
  * A simple iommu_ops to allow less cruft in generic VFIO code.
  */
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 40de8d55faef..3d9865dadf73 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -54,6 +54,57 @@ enum {
 	DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
+static int iommu_take_ownership(struct iommu_table *tbl)
+{
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+	int ret = 0;
+
+	/*
+	 * VFIO does not control TCE entries allocation and the guest
+	 * can write new TCEs on top of existing ones so iommu_tce_build()
+	 * must be able to release old pages. This functionality
+	 * requires exchange() callback defined so if it is not
+	 * implemented, we disallow taking ownership over the table.
+	 */
+	if (!tbl->it_ops->xchg_no_kill)
+		return -EINVAL;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
+
+	if (iommu_table_in_use(tbl)) {
+		pr_err("iommu_tce: it_map is not empty");
+		ret = -EBUSY;
+	} else {
+		memset(tbl->it_map, 0xff, sz);
+	}
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+
+	return ret;
+}
+
+static void iommu_release_ownership(struct iommu_table *tbl)
+{
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
+
+	memset(tbl->it_map, 0, sz);
+
+	iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
+			tbl->it_reserved_end);
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+}
+
 static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
 	struct iommu_table *tbl;
@@ -67,6 +118,8 @@ static struct iommu_table *iommu_pseries_alloc_table(int node)
 	return tbl;
 }
 
+struct iommu_table_group_ops spapr_tce_table_group_ops;
+
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
 	struct iommu_table_group *table_group;
@@ -1656,6 +1709,98 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
 	return false;
 }
 
+/*
+ * A simple iommu_table_group_ops which only allows reusing the existing
+ * iommu_table. This handles VFIO for POWER7 or the nested KVM.
+ * The ops does not allow creating windows and only allows reusing the existing
+ * one if it matches table_group->tce32_start/tce32_size/page_shift.
+ */
+static unsigned long spapr_tce_get_table_size(__u32 page_shift,
+					      __u64 window_size, __u32 levels)
+{
+	unsigned long size;
+
+	if (levels > 1)
+		return ~0U;
+	size = window_size >> (page_shift - 3);
+	return size;
+}
+
+static long spapr_tce_create_table(struct iommu_table_group *table_group, int num,
+				   __u32 page_shift, __u64 window_size, __u32 levels,
+				   struct iommu_table **ptbl)
+{
+	struct iommu_table *tbl = table_group->tables[0];
+
+	if (num > 0)
+		return -EPERM;
+
+	if (tbl->it_page_shift != page_shift ||
+	    tbl->it_size != (window_size >> page_shift) ||
+	    tbl->it_indirect_levels != levels - 1)
+		return -EINVAL;
+
+	*ptbl = iommu_tce_table_get(tbl);
+	return 0;
+}
+
+static long spapr_tce_set_window(struct iommu_table_group *table_group,
+				 int num, struct iommu_table *tbl)
+{
+	return tbl == table_group->tables[num] ? 0 : -EPERM;
+}
+
+static long spapr_tce_unset_window(struct iommu_table_group *table_group, int num)
+{
+	return 0;
+}
+
+static long spapr_tce_take_ownership(struct iommu_table_group *table_group)
+{
+	int i, j, rc = 0;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = table_group->tables[i];
+
+		if (!tbl || !tbl->it_map)
+			continue;
+
+		rc = iommu_take_ownership(tbl);
+		if (!rc)
+			continue;
+
+		for (j = 0; j < i; ++j)
+			iommu_release_ownership(table_group->tables[j]);
+		return rc;
+	}
+	return 0;
+}
+
+static void spapr_tce_release_ownership(struct iommu_table_group *table_group)
+{
+	int i;
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = table_group->tables[i];
+
+		if (!tbl)
+			continue;
+
+		iommu_table_clear(tbl);
+		if (tbl->it_map)
+			iommu_release_ownership(tbl);
+	}
+}
+
+struct iommu_table_group_ops spapr_tce_table_group_ops = {
+	.get_table_size = spapr_tce_get_table_size,
+	.create_table = spapr_tce_create_table,
+	.set_window = spapr_tce_set_window,
+	.unset_window = spapr_tce_unset_window,
+	.take_ownership = spapr_tce_take_ownership,
+	.release_ownership = spapr_tce_release_ownership,
+};
+
 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
 		void *data)
 {



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create
  2024-03-12 18:14 ` Shivaprasad G Bhat
@ 2024-03-12 18:14   ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-12 18:14 UTC (permalink / raw)
  To: tpearson, alex.williamson, linuxppc-dev
  Cc: mpe, npiggin, christophe.leroy, aneesh.kumar, naveen.n.rao,
	gbatra, brking, sbhat, aik, jgg, robh, linux-kernel, kvm, aik,
	msuchanek, jroedel, vaibhav, svaidy

The commit 9d67c9433509 ("powerpc/iommu: Add \"borrowing\"
iommu_table_group_ops") implemented the "borrow" mechanism for
the pSeries SPAPR TCE. It did implement this support partially
that it left out creating the DDW if not present already.

The patch here attempts to fix the missing gaps.
 - Expose the DDW info to user by collecting it during probe.
 - Create the window and the iommu table if not present during
   VFIO_SPAPR_TCE_CREATE.
 - Remove and recreate the window if the pageshift and window sizes
   do not match.
 - Restore the original window in enable_ddw() if the user had
   created/modified the DDW. As there is preference for DIRECT mapping
   on the host driver side, the user created window is removed.

The changes work only for the non-SRIOV-VF scenarios for PEs having
2 DMA windows.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 arch/powerpc/include/asm/iommu.h       |    3 
 arch/powerpc/kernel/iommu.c            |    7 -
 arch/powerpc/platforms/pseries/iommu.c |  362 +++++++++++++++++++++++++++++++-
 3 files changed, 360 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 744cc5fc22d3..fde174122844 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -110,6 +110,7 @@ struct iommu_table {
 	unsigned long  it_page_shift;/* table iommu page size */
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	__be64 *it_userspace; /* userspace view of the table */
+	bool reset_ddw;
 	struct iommu_table_ops *it_ops;
 	struct kref    it_kref;
 	int it_nid;
@@ -169,6 +170,8 @@ struct iommu_table_group_ops {
 			__u32 page_shift,
 			__u64 window_size,
 			__u32 levels);
+	void (*init_group)(struct iommu_table_group *table_group,
+			struct device *dev);
 	long (*create_table)(struct iommu_table_group *table_group,
 			int num,
 			__u32 page_shift,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index aa11b2acf24f..1cce2b8b8f2c 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -740,6 +740,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
 		return NULL;
 	}
 
+	tbl->it_nid = nid;
 	iommu_table_reserve_pages(tbl, res_start, res_end);
 
 	/* We only split the IOMMU table if we have 1GB or more of space */
@@ -1141,7 +1142,10 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain *platform_domain,
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
 	struct iommu_group *grp = iommu_group_get(dev);
-	struct iommu_table_group *table_group;
+	struct iommu_table_group *table_group = iommu_group_get_iommudata(grp);
+
+	/* This should have been in spapr_tce_iommu_probe_device() ?*/
+	table_group->ops->init_group(table_group, dev);
 
 	/* At first attach the ownership is already set */
 	if (!domain) {
@@ -1149,7 +1153,6 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain *platform_domain,
 		return 0;
 	}
 
-	table_group = iommu_group_get_iommudata(grp);
 	/*
 	 * The domain being set to PLATFORM from earlier
 	 * BLOCKED. The table_group ownership has to be released.
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 3d9865dadf73..7224107a0f60 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -630,6 +630,62 @@ static void iommu_table_setparms(struct pci_controller *phb,
 	phb->dma_window_base_cur += phb->dma_window_size;
 }
 
+static int iommu_table_reset(struct iommu_table *tbl, unsigned long busno,
+				   unsigned long liobn, unsigned long win_addr,
+				   unsigned long window_size, unsigned long page_shift,
+				   void *base, struct iommu_table_ops *table_ops)
+{
+	unsigned long sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
+	unsigned int i, oldsize = tbl->it_size;
+	struct iommu_pool *p;
+
+	WARN_ON(!tbl->it_ops);
+
+	if (oldsize != (window_size >> page_shift)) {
+		vfree(tbl->it_map);
+
+		tbl->it_map = vzalloc_node(sz, tbl->it_nid);
+		if (!tbl->it_map)
+			return -ENOMEM;
+
+		tbl->it_size = window_size >> page_shift;
+		if (oldsize < (window_size >> page_shift))
+			iommu_table_clear(tbl);
+	}
+	tbl->it_busno = busno;
+	tbl->it_index = liobn;
+	tbl->it_offset = win_addr >> page_shift;
+	tbl->it_blocksize = 16;
+	tbl->it_type = TCE_PCI;
+	tbl->it_ops = table_ops;
+	tbl->it_page_shift = page_shift;
+	tbl->it_base = (unsigned long)base;
+
+	if ((tbl->it_size << tbl->it_page_shift) >= (1UL * 1024 * 1024 * 1024))
+		tbl->nr_pools = IOMMU_NR_POOLS;
+	else
+		tbl->nr_pools = 1;
+
+	tbl->poolsize = (tbl->it_size * 3 / 4) / tbl->nr_pools;
+
+	for (i = 0; i < tbl->nr_pools; i++) {
+		p = &tbl->pools[i];
+		spin_lock_init(&(p->lock));
+		p->start = tbl->poolsize * i;
+		p->hint = p->start;
+		p->end = p->start + tbl->poolsize;
+	}
+
+	p = &tbl->large_pool;
+	spin_lock_init(&(p->lock));
+	p->start = tbl->poolsize * i;
+	p->hint = p->start;
+	p->end = tbl->it_size;
+	return 0;
+}
+
+
+
 struct iommu_table_ops iommu_table_lpar_multi_ops;
 
 struct iommu_table_ops iommu_table_pseries_ops = {
@@ -1016,8 +1072,8 @@ static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_
 	return 0;
 }
 
-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift,
-			      bool *direct_mapping)
+static bool find_existing_ddw(struct device_node *pdn, u32 *liobn, u64 *dma_addr,
+			      int *window_shift, bool *direct_mapping)
 {
 	struct dma_win *window;
 	const struct dynamic_dma_window_prop *dma64;
@@ -1031,6 +1087,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *windo
 			*dma_addr = be64_to_cpu(dma64->dma_base);
 			*window_shift = be32_to_cpu(dma64->window_shift);
 			*direct_mapping = window->direct;
+			*liobn = be32_to_cpu(dma64->liobn);
 			found = true;
 			break;
 		}
@@ -1315,6 +1372,23 @@ static int iommu_get_page_shift(u32 query_page_size)
 	return ret;
 }
 
+static __u64 query_page_size_to_mask(u32 query_page_size)
+{
+	const long shift[] = {
+		(SZ_4K),   (SZ_64K), (SZ_16M),
+		(SZ_32M),  (SZ_64M), (SZ_128M),
+		(SZ_256M), (SZ_16G), (SZ_2M)
+	};
+	int i, ret = 0;
+
+	for (i = 0; i < ARRAY_SIZE(shift); i++) {
+		if (query_page_size & (1 << i))
+			ret |= shift[i];
+	}
+
+	return ret;
+}
+
 static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,
 					    u32 page_shift, u32 window_shift)
 {
@@ -1344,6 +1418,9 @@ static struct property *ddw_property_create(const char *propname, u32 liobn, u64
 	return win64;
 }
 
+static long remove_dynamic_dma_windows_locked(struct iommu_table_group *table_group,
+					      struct pci_dev *pdev);
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1373,6 +1450,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	bool pmem_present;
 	struct pci_dn *pci = PCI_DN(pdn);
 	struct property *default_win = NULL;
+	u32 liobn;
 
 	dn = of_find_node_by_type(NULL, "ibm,pmemory");
 	pmem_present = dn != NULL;
@@ -1380,8 +1458,19 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 
 	mutex_lock(&dma_win_init_mutex);
 
-	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len, &direct_mapping))
-		goto out_unlock;
+	if (find_existing_ddw(pdn, &liobn, &dev->dev.archdata.dma_offset, &len, &direct_mapping)) {
+		struct iommu_table *tbl = pci->table_group->tables[1];
+
+		if (direct_mapping || (tbl && !tbl->reset_ddw))
+			goto out_unlock;
+		/* VFIO user created window has custom size/pageshift */
+		if (remove_dynamic_dma_windows_locked(pci->table_group, dev))
+			goto out_failed;
+
+		iommu_tce_table_put(tbl);
+		pci->table_group->tables[1] = NULL;
+		set_iommu_table_base(&dev->dev, pci->table_group->tables[0]);
+	}
 
 	/*
 	 * If we already went through this for a previous function of
@@ -1726,20 +1815,272 @@ static unsigned long spapr_tce_get_table_size(__u32 page_shift,
 	return size;
 }
 
+static void spapr_tce_init_group(struct iommu_table_group *table_group, struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct device_node *dn, *pdn;
+	u32 ddw_avail[DDW_APPLICABLE_SIZE];
+	struct ddw_query_response query;
+	int ret;
+
+	if (table_group->max_dynamic_windows_supported > 0)
+		return;
+
+	/* No need to insitialize for kdump kernel. */
+	if (is_kdump_kernel())
+		return;
+
+	dn = pci_device_to_OF_node(pdev);
+	pdn = pci_dma_find(dn, NULL);
+	if (!pdn || !PCI_DN(pdn)) {
+		table_group->max_dynamic_windows_supported = -1;
+		return;
+	}
+
+	/* TODO: Phyp sets VF default window base at 512PiB offset. Need
+	 * tce32_base set to the global offset and use the start as 0?
+	 */
+	ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
+			&ddw_avail[0], DDW_APPLICABLE_SIZE);
+	if (ret) {
+		table_group->max_dynamic_windows_supported = -1;
+		return;
+	}
+
+	ret = query_ddw(pdev, ddw_avail, &query, pdn);
+	if (ret) {
+		dev_err(&pdev->dev, "%s: query_ddw failed\n", __func__);
+		table_group->max_dynamic_windows_supported = -1;
+		return;
+	}
+
+	/* The SRIOV VFs have only 1 window, the default is removed
+	 * before creating the 64-bit window
+	 */
+	if (query.windows_available == 0)
+		table_group->max_dynamic_windows_supported = 1;
+	else
+		table_group->max_dynamic_windows_supported = 2;
+
+	table_group->max_levels = 1;
+	table_group->pgsizes |= query_page_size_to_mask(query.page_size);
+}
+
+
+static long remove_dynamic_dma_windows_locked(struct iommu_table_group *table_group,
+					      struct pci_dev *pdev)
+{
+	struct device_node *dn = pci_device_to_OF_node(pdev), *pdn;
+	bool direct_mapping;
+	struct dma_win *window;
+	u32 liobn;
+	int len;
+
+	pdn = pci_dma_find(dn, NULL);
+	if (!pdn || !PCI_DN(pdn)) { // Niether of 32s|64-bit exist!
+		return -ENODEV;
+	}
+
+	if (find_existing_ddw(pdn, &liobn, &pdev->dev.archdata.dma_offset, &len, &direct_mapping)) {
+		remove_ddw(pdn, true, direct_mapping ? DIRECT64_PROPNAME : DMA64_PROPNAME);
+		spin_lock(&dma_win_list_lock);
+		list_for_each_entry(window, &dma_win_list, list) {
+			if (window->device == pdn) {
+				list_del(&window->list);
+				kfree(window);
+				break;
+			}
+		}
+		spin_unlock(&dma_win_list_lock);
+	}
+
+	return 0;
+}
+
+static int dev_has_iommu_table(struct device *dev, void *data)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct pci_dev **ppdev = data;
+
+	if (!dev)
+		return 0;
+
+	if (device_iommu_mapped(dev)) {
+		*ppdev = pdev;
+		return 1;
+	}
+
+	return 0;
+}
+
+
+static struct pci_dev *iommu_group_get_first_pci_dev(struct iommu_group *group)
+{
+	struct pci_dev *pdev = NULL;
+	int ret;
+
+	/* No IOMMU group ? */
+	if (!group)
+		return NULL;
+
+	ret = iommu_group_for_each_dev(group, &pdev, dev_has_iommu_table);
+	if (!ret || !pdev)
+		return NULL;
+	return pdev;
+}
+
 static long spapr_tce_create_table(struct iommu_table_group *table_group, int num,
 				   __u32 page_shift, __u64 window_size, __u32 levels,
 				   struct iommu_table **ptbl)
 {
-	struct iommu_table *tbl = table_group->tables[0];
+	struct pci_dev *pdev = iommu_group_get_first_pci_dev(table_group->group);
+	struct device_node *dn = pci_device_to_OF_node(pdev), *pdn;
+	struct iommu_table *tbl = table_group->tables[num];
+	u32 window_shift = order_base_2(window_size);
+	u32 ddw_avail[DDW_APPLICABLE_SIZE];
+	struct ddw_create_response create;
+	struct ddw_query_response query;
+	unsigned long start = 0, end = 0;
+	struct failed_ddw_pdn *fpdn;
+	struct dma_win *window;
+	struct property *win64;
+	struct pci_dn *pci;
+	int len, ret = 0;
+	u64 win_addr;
 
-	if (num > 0)
+	if (num > 1)
 		return -EPERM;
 
-	if (tbl->it_page_shift != page_shift ||
-	    tbl->it_size != (window_size >> page_shift) ||
-	    tbl->it_indirect_levels != levels - 1)
-		return -EINVAL;
+	if (tbl && (tbl->it_page_shift == page_shift) &&
+		(tbl->it_size == (window_size >> page_shift)) &&
+		(tbl->it_indirect_levels == levels - 1))
+		goto exit;
+
+	if (num == 0)
+		return -EINVAL; /* Can't modify the default window. */
+
+	/* TODO: The SRIO-VFs have only 1 window. */
+	if (table_group->max_dynamic_windows_supported == 1)
+		return -EPERM;
+
+	mutex_lock(&dma_win_init_mutex);
+
+	ret = -ENODEV;
+	/* If the enable DDW failed for the pdn, dont retry! */
+	list_for_each_entry(fpdn, &failed_ddw_pdn_list, list) {
+		if (fpdn->pdn == pdn) {
+			pr_err("%s: %pOF in failed DDW device list\n", __func__, pdn);
+			goto out_unlock;
+		}
+	}
+
+	pdn = pci_dma_find(dn, NULL);
+	if (!pdn || !PCI_DN(pdn)) { /* Niether of 32s|64-bit exist! */
+		pr_err("%s: No dma-windows exist for the node %pOF\n", __func__, pdn);
+		goto out_failed;
+	}
+
+	/* The existing ddw didn't match the size/shift */
+	if (remove_dynamic_dma_windows_locked(table_group, pdev)) {
+		pr_err("%s: The existing DDW remova failed for node %pOF\n", __func__, pdn);
+		goto out_failed; /* Could not remove it either! */
+	}
+
+	pci = PCI_DN(pdn);
+	ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
+				&ddw_avail[0], DDW_APPLICABLE_SIZE);
+	if (ret) {
+		pr_err("%s: ibm,ddw-applicable not found\n", __func__);
+		goto out_failed;
+	}
+
+	ret = query_ddw(pdev, ddw_avail, &query, pdn);
+	if (ret)
+		goto out_failed;
+	ret = -ENODEV;
 
+	len = window_shift;
+	if (query.largest_available_block < (1ULL << (len - page_shift))) {
+		dev_dbg(&pdev->dev, "can't map window 0x%llx with %llu %llu-sized pages\n",
+				1ULL << len, query.largest_available_block,
+				1ULL << page_shift);
+		ret = -EINVAL; /* Retry with smaller window size */
+		goto out_unlock;
+	}
+
+	if (create_ddw(pdev, ddw_avail, &create, page_shift, len))
+		goto out_failed;
+
+	win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
+	win64 = ddw_property_create(DMA64_PROPNAME, create.liobn, win_addr, page_shift, len);
+	if (!win64)
+		goto remove_window;
+
+	ret = of_add_property(pdn, win64);
+	if (ret) {
+		dev_err(&pdev->dev, "unable to add DMA window property for %pOF: %d",
+			pdn, ret);
+		goto free_property;
+	}
+	ret = -ENODEV;
+
+	window = ddw_list_new_entry(pdn, win64->value);
+	if (!window)
+		goto remove_property;
+
+	window->direct = false;
+
+	if (tbl) {
+		iommu_table_reset(tbl, pci->phb->bus->number, create.liobn, win_addr,
+				  1UL << len, page_shift, NULL, &iommu_table_lpar_multi_ops);
+	} else {
+		tbl = iommu_pseries_alloc_table(pci->phb->node);
+		if (!tbl) {
+			dev_err(&pdev->dev, "couldn't create new IOMMU table\n");
+			goto free_window;
+		}
+		iommu_table_setparms_common(tbl, pci->phb->bus->number, create.liobn, win_addr,
+					    1UL << len, page_shift, NULL,
+					    &iommu_table_lpar_multi_ops);
+		iommu_init_table(tbl, pci->phb->node, start, end);
+	}
+
+	tbl->reset_ddw = true;
+	pci->table_group->tables[1] = tbl;
+	set_iommu_table_base(&pdev->dev, tbl);
+	pdev->dev.archdata.dma_offset = win_addr;
+
+	spin_lock(&dma_win_list_lock);
+	list_add(&window->list, &dma_win_list);
+	spin_unlock(&dma_win_list_lock);
+
+	mutex_unlock(&dma_win_init_mutex);
+
+	goto exit;
+
+free_window:
+	kfree(window);
+remove_property:
+	of_remove_property(pdn, win64);
+free_property:
+	kfree(win64->name);
+	kfree(win64->value);
+	kfree(win64);
+remove_window:
+	__remove_dma_window(pdn, ddw_avail, create.liobn);
+
+out_failed:
+	fpdn = kzalloc(sizeof(*fpdn), GFP_KERNEL);
+	if (!fpdn)
+		goto out_unlock;
+	fpdn->pdn = pdn;
+	list_add(&fpdn->list, &failed_ddw_pdn_list);
+
+out_unlock:
+	mutex_unlock(&dma_win_init_mutex);
+
+	return ret;
+exit:
 	*ptbl = iommu_tce_table_get(tbl);
 	return 0;
 }
@@ -1795,6 +2136,7 @@ static void spapr_tce_release_ownership(struct iommu_table_group *table_group)
 struct iommu_table_group_ops spapr_tce_table_group_ops = {
 	.get_table_size = spapr_tce_get_table_size,
 	.create_table = spapr_tce_create_table,
+	.init_group = spapr_tce_init_group,
 	.set_window = spapr_tce_set_window,
 	.unset_window = spapr_tce_unset_window,
 	.take_ownership = spapr_tce_take_ownership,



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create
@ 2024-03-12 18:14   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-12 18:14 UTC (permalink / raw)
  To: tpearson, alex.williamson, linuxppc-dev
  Cc: robh, jroedel, sbhat, gbatra, jgg, aik, linux-kernel, svaidy,
	aneesh.kumar, brking, npiggin, kvm, naveen.n.rao, vaibhav,
	msuchanek, aik

The commit 9d67c9433509 ("powerpc/iommu: Add \"borrowing\"
iommu_table_group_ops") implemented the "borrow" mechanism for
the pSeries SPAPR TCE. It did implement this support partially
that it left out creating the DDW if not present already.

The patch here attempts to fix the missing gaps.
 - Expose the DDW info to user by collecting it during probe.
 - Create the window and the iommu table if not present during
   VFIO_SPAPR_TCE_CREATE.
 - Remove and recreate the window if the pageshift and window sizes
   do not match.
 - Restore the original window in enable_ddw() if the user had
   created/modified the DDW. As there is preference for DIRECT mapping
   on the host driver side, the user created window is removed.

The changes work only for the non-SRIOV-VF scenarios for PEs having
2 DMA windows.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 arch/powerpc/include/asm/iommu.h       |    3 
 arch/powerpc/kernel/iommu.c            |    7 -
 arch/powerpc/platforms/pseries/iommu.c |  362 +++++++++++++++++++++++++++++++-
 3 files changed, 360 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 744cc5fc22d3..fde174122844 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -110,6 +110,7 @@ struct iommu_table {
 	unsigned long  it_page_shift;/* table iommu page size */
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	__be64 *it_userspace; /* userspace view of the table */
+	bool reset_ddw;
 	struct iommu_table_ops *it_ops;
 	struct kref    it_kref;
 	int it_nid;
@@ -169,6 +170,8 @@ struct iommu_table_group_ops {
 			__u32 page_shift,
 			__u64 window_size,
 			__u32 levels);
+	void (*init_group)(struct iommu_table_group *table_group,
+			struct device *dev);
 	long (*create_table)(struct iommu_table_group *table_group,
 			int num,
 			__u32 page_shift,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index aa11b2acf24f..1cce2b8b8f2c 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -740,6 +740,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
 		return NULL;
 	}
 
+	tbl->it_nid = nid;
 	iommu_table_reserve_pages(tbl, res_start, res_end);
 
 	/* We only split the IOMMU table if we have 1GB or more of space */
@@ -1141,7 +1142,10 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain *platform_domain,
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
 	struct iommu_group *grp = iommu_group_get(dev);
-	struct iommu_table_group *table_group;
+	struct iommu_table_group *table_group = iommu_group_get_iommudata(grp);
+
+	/* This should have been in spapr_tce_iommu_probe_device() ?*/
+	table_group->ops->init_group(table_group, dev);
 
 	/* At first attach the ownership is already set */
 	if (!domain) {
@@ -1149,7 +1153,6 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain *platform_domain,
 		return 0;
 	}
 
-	table_group = iommu_group_get_iommudata(grp);
 	/*
 	 * The domain being set to PLATFORM from earlier
 	 * BLOCKED. The table_group ownership has to be released.
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 3d9865dadf73..7224107a0f60 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -630,6 +630,62 @@ static void iommu_table_setparms(struct pci_controller *phb,
 	phb->dma_window_base_cur += phb->dma_window_size;
 }
 
+static int iommu_table_reset(struct iommu_table *tbl, unsigned long busno,
+				   unsigned long liobn, unsigned long win_addr,
+				   unsigned long window_size, unsigned long page_shift,
+				   void *base, struct iommu_table_ops *table_ops)
+{
+	unsigned long sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
+	unsigned int i, oldsize = tbl->it_size;
+	struct iommu_pool *p;
+
+	WARN_ON(!tbl->it_ops);
+
+	if (oldsize != (window_size >> page_shift)) {
+		vfree(tbl->it_map);
+
+		tbl->it_map = vzalloc_node(sz, tbl->it_nid);
+		if (!tbl->it_map)
+			return -ENOMEM;
+
+		tbl->it_size = window_size >> page_shift;
+		if (oldsize < (window_size >> page_shift))
+			iommu_table_clear(tbl);
+	}
+	tbl->it_busno = busno;
+	tbl->it_index = liobn;
+	tbl->it_offset = win_addr >> page_shift;
+	tbl->it_blocksize = 16;
+	tbl->it_type = TCE_PCI;
+	tbl->it_ops = table_ops;
+	tbl->it_page_shift = page_shift;
+	tbl->it_base = (unsigned long)base;
+
+	if ((tbl->it_size << tbl->it_page_shift) >= (1UL * 1024 * 1024 * 1024))
+		tbl->nr_pools = IOMMU_NR_POOLS;
+	else
+		tbl->nr_pools = 1;
+
+	tbl->poolsize = (tbl->it_size * 3 / 4) / tbl->nr_pools;
+
+	for (i = 0; i < tbl->nr_pools; i++) {
+		p = &tbl->pools[i];
+		spin_lock_init(&(p->lock));
+		p->start = tbl->poolsize * i;
+		p->hint = p->start;
+		p->end = p->start + tbl->poolsize;
+	}
+
+	p = &tbl->large_pool;
+	spin_lock_init(&(p->lock));
+	p->start = tbl->poolsize * i;
+	p->hint = p->start;
+	p->end = tbl->it_size;
+	return 0;
+}
+
+
+
 struct iommu_table_ops iommu_table_lpar_multi_ops;
 
 struct iommu_table_ops iommu_table_pseries_ops = {
@@ -1016,8 +1072,8 @@ static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_
 	return 0;
 }
 
-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift,
-			      bool *direct_mapping)
+static bool find_existing_ddw(struct device_node *pdn, u32 *liobn, u64 *dma_addr,
+			      int *window_shift, bool *direct_mapping)
 {
 	struct dma_win *window;
 	const struct dynamic_dma_window_prop *dma64;
@@ -1031,6 +1087,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *windo
 			*dma_addr = be64_to_cpu(dma64->dma_base);
 			*window_shift = be32_to_cpu(dma64->window_shift);
 			*direct_mapping = window->direct;
+			*liobn = be32_to_cpu(dma64->liobn);
 			found = true;
 			break;
 		}
@@ -1315,6 +1372,23 @@ static int iommu_get_page_shift(u32 query_page_size)
 	return ret;
 }
 
+static __u64 query_page_size_to_mask(u32 query_page_size)
+{
+	const long shift[] = {
+		(SZ_4K),   (SZ_64K), (SZ_16M),
+		(SZ_32M),  (SZ_64M), (SZ_128M),
+		(SZ_256M), (SZ_16G), (SZ_2M)
+	};
+	int i, ret = 0;
+
+	for (i = 0; i < ARRAY_SIZE(shift); i++) {
+		if (query_page_size & (1 << i))
+			ret |= shift[i];
+	}
+
+	return ret;
+}
+
 static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,
 					    u32 page_shift, u32 window_shift)
 {
@@ -1344,6 +1418,9 @@ static struct property *ddw_property_create(const char *propname, u32 liobn, u64
 	return win64;
 }
 
+static long remove_dynamic_dma_windows_locked(struct iommu_table_group *table_group,
+					      struct pci_dev *pdev);
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1373,6 +1450,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	bool pmem_present;
 	struct pci_dn *pci = PCI_DN(pdn);
 	struct property *default_win = NULL;
+	u32 liobn;
 
 	dn = of_find_node_by_type(NULL, "ibm,pmemory");
 	pmem_present = dn != NULL;
@@ -1380,8 +1458,19 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 
 	mutex_lock(&dma_win_init_mutex);
 
-	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len, &direct_mapping))
-		goto out_unlock;
+	if (find_existing_ddw(pdn, &liobn, &dev->dev.archdata.dma_offset, &len, &direct_mapping)) {
+		struct iommu_table *tbl = pci->table_group->tables[1];
+
+		if (direct_mapping || (tbl && !tbl->reset_ddw))
+			goto out_unlock;
+		/* VFIO user created window has custom size/pageshift */
+		if (remove_dynamic_dma_windows_locked(pci->table_group, dev))
+			goto out_failed;
+
+		iommu_tce_table_put(tbl);
+		pci->table_group->tables[1] = NULL;
+		set_iommu_table_base(&dev->dev, pci->table_group->tables[0]);
+	}
 
 	/*
 	 * If we already went through this for a previous function of
@@ -1726,20 +1815,272 @@ static unsigned long spapr_tce_get_table_size(__u32 page_shift,
 	return size;
 }
 
+static void spapr_tce_init_group(struct iommu_table_group *table_group, struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct device_node *dn, *pdn;
+	u32 ddw_avail[DDW_APPLICABLE_SIZE];
+	struct ddw_query_response query;
+	int ret;
+
+	if (table_group->max_dynamic_windows_supported > 0)
+		return;
+
+	/* No need to insitialize for kdump kernel. */
+	if (is_kdump_kernel())
+		return;
+
+	dn = pci_device_to_OF_node(pdev);
+	pdn = pci_dma_find(dn, NULL);
+	if (!pdn || !PCI_DN(pdn)) {
+		table_group->max_dynamic_windows_supported = -1;
+		return;
+	}
+
+	/* TODO: Phyp sets VF default window base at 512PiB offset. Need
+	 * tce32_base set to the global offset and use the start as 0?
+	 */
+	ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
+			&ddw_avail[0], DDW_APPLICABLE_SIZE);
+	if (ret) {
+		table_group->max_dynamic_windows_supported = -1;
+		return;
+	}
+
+	ret = query_ddw(pdev, ddw_avail, &query, pdn);
+	if (ret) {
+		dev_err(&pdev->dev, "%s: query_ddw failed\n", __func__);
+		table_group->max_dynamic_windows_supported = -1;
+		return;
+	}
+
+	/* The SRIOV VFs have only 1 window, the default is removed
+	 * before creating the 64-bit window
+	 */
+	if (query.windows_available == 0)
+		table_group->max_dynamic_windows_supported = 1;
+	else
+		table_group->max_dynamic_windows_supported = 2;
+
+	table_group->max_levels = 1;
+	table_group->pgsizes |= query_page_size_to_mask(query.page_size);
+}
+
+
+static long remove_dynamic_dma_windows_locked(struct iommu_table_group *table_group,
+					      struct pci_dev *pdev)
+{
+	struct device_node *dn = pci_device_to_OF_node(pdev), *pdn;
+	bool direct_mapping;
+	struct dma_win *window;
+	u32 liobn;
+	int len;
+
+	pdn = pci_dma_find(dn, NULL);
+	if (!pdn || !PCI_DN(pdn)) { // Niether of 32s|64-bit exist!
+		return -ENODEV;
+	}
+
+	if (find_existing_ddw(pdn, &liobn, &pdev->dev.archdata.dma_offset, &len, &direct_mapping)) {
+		remove_ddw(pdn, true, direct_mapping ? DIRECT64_PROPNAME : DMA64_PROPNAME);
+		spin_lock(&dma_win_list_lock);
+		list_for_each_entry(window, &dma_win_list, list) {
+			if (window->device == pdn) {
+				list_del(&window->list);
+				kfree(window);
+				break;
+			}
+		}
+		spin_unlock(&dma_win_list_lock);
+	}
+
+	return 0;
+}
+
+static int dev_has_iommu_table(struct device *dev, void *data)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct pci_dev **ppdev = data;
+
+	if (!dev)
+		return 0;
+
+	if (device_iommu_mapped(dev)) {
+		*ppdev = pdev;
+		return 1;
+	}
+
+	return 0;
+}
+
+
+static struct pci_dev *iommu_group_get_first_pci_dev(struct iommu_group *group)
+{
+	struct pci_dev *pdev = NULL;
+	int ret;
+
+	/* No IOMMU group ? */
+	if (!group)
+		return NULL;
+
+	ret = iommu_group_for_each_dev(group, &pdev, dev_has_iommu_table);
+	if (!ret || !pdev)
+		return NULL;
+	return pdev;
+}
+
 static long spapr_tce_create_table(struct iommu_table_group *table_group, int num,
 				   __u32 page_shift, __u64 window_size, __u32 levels,
 				   struct iommu_table **ptbl)
 {
-	struct iommu_table *tbl = table_group->tables[0];
+	struct pci_dev *pdev = iommu_group_get_first_pci_dev(table_group->group);
+	struct device_node *dn = pci_device_to_OF_node(pdev), *pdn;
+	struct iommu_table *tbl = table_group->tables[num];
+	u32 window_shift = order_base_2(window_size);
+	u32 ddw_avail[DDW_APPLICABLE_SIZE];
+	struct ddw_create_response create;
+	struct ddw_query_response query;
+	unsigned long start = 0, end = 0;
+	struct failed_ddw_pdn *fpdn;
+	struct dma_win *window;
+	struct property *win64;
+	struct pci_dn *pci;
+	int len, ret = 0;
+	u64 win_addr;
 
-	if (num > 0)
+	if (num > 1)
 		return -EPERM;
 
-	if (tbl->it_page_shift != page_shift ||
-	    tbl->it_size != (window_size >> page_shift) ||
-	    tbl->it_indirect_levels != levels - 1)
-		return -EINVAL;
+	if (tbl && (tbl->it_page_shift == page_shift) &&
+		(tbl->it_size == (window_size >> page_shift)) &&
+		(tbl->it_indirect_levels == levels - 1))
+		goto exit;
+
+	if (num == 0)
+		return -EINVAL; /* Can't modify the default window. */
+
+	/* TODO: The SRIO-VFs have only 1 window. */
+	if (table_group->max_dynamic_windows_supported == 1)
+		return -EPERM;
+
+	mutex_lock(&dma_win_init_mutex);
+
+	ret = -ENODEV;
+	/* If the enable DDW failed for the pdn, dont retry! */
+	list_for_each_entry(fpdn, &failed_ddw_pdn_list, list) {
+		if (fpdn->pdn == pdn) {
+			pr_err("%s: %pOF in failed DDW device list\n", __func__, pdn);
+			goto out_unlock;
+		}
+	}
+
+	pdn = pci_dma_find(dn, NULL);
+	if (!pdn || !PCI_DN(pdn)) { /* Niether of 32s|64-bit exist! */
+		pr_err("%s: No dma-windows exist for the node %pOF\n", __func__, pdn);
+		goto out_failed;
+	}
+
+	/* The existing ddw didn't match the size/shift */
+	if (remove_dynamic_dma_windows_locked(table_group, pdev)) {
+		pr_err("%s: The existing DDW remova failed for node %pOF\n", __func__, pdn);
+		goto out_failed; /* Could not remove it either! */
+	}
+
+	pci = PCI_DN(pdn);
+	ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
+				&ddw_avail[0], DDW_APPLICABLE_SIZE);
+	if (ret) {
+		pr_err("%s: ibm,ddw-applicable not found\n", __func__);
+		goto out_failed;
+	}
+
+	ret = query_ddw(pdev, ddw_avail, &query, pdn);
+	if (ret)
+		goto out_failed;
+	ret = -ENODEV;
 
+	len = window_shift;
+	if (query.largest_available_block < (1ULL << (len - page_shift))) {
+		dev_dbg(&pdev->dev, "can't map window 0x%llx with %llu %llu-sized pages\n",
+				1ULL << len, query.largest_available_block,
+				1ULL << page_shift);
+		ret = -EINVAL; /* Retry with smaller window size */
+		goto out_unlock;
+	}
+
+	if (create_ddw(pdev, ddw_avail, &create, page_shift, len))
+		goto out_failed;
+
+	win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
+	win64 = ddw_property_create(DMA64_PROPNAME, create.liobn, win_addr, page_shift, len);
+	if (!win64)
+		goto remove_window;
+
+	ret = of_add_property(pdn, win64);
+	if (ret) {
+		dev_err(&pdev->dev, "unable to add DMA window property for %pOF: %d",
+			pdn, ret);
+		goto free_property;
+	}
+	ret = -ENODEV;
+
+	window = ddw_list_new_entry(pdn, win64->value);
+	if (!window)
+		goto remove_property;
+
+	window->direct = false;
+
+	if (tbl) {
+		iommu_table_reset(tbl, pci->phb->bus->number, create.liobn, win_addr,
+				  1UL << len, page_shift, NULL, &iommu_table_lpar_multi_ops);
+	} else {
+		tbl = iommu_pseries_alloc_table(pci->phb->node);
+		if (!tbl) {
+			dev_err(&pdev->dev, "couldn't create new IOMMU table\n");
+			goto free_window;
+		}
+		iommu_table_setparms_common(tbl, pci->phb->bus->number, create.liobn, win_addr,
+					    1UL << len, page_shift, NULL,
+					    &iommu_table_lpar_multi_ops);
+		iommu_init_table(tbl, pci->phb->node, start, end);
+	}
+
+	tbl->reset_ddw = true;
+	pci->table_group->tables[1] = tbl;
+	set_iommu_table_base(&pdev->dev, tbl);
+	pdev->dev.archdata.dma_offset = win_addr;
+
+	spin_lock(&dma_win_list_lock);
+	list_add(&window->list, &dma_win_list);
+	spin_unlock(&dma_win_list_lock);
+
+	mutex_unlock(&dma_win_init_mutex);
+
+	goto exit;
+
+free_window:
+	kfree(window);
+remove_property:
+	of_remove_property(pdn, win64);
+free_property:
+	kfree(win64->name);
+	kfree(win64->value);
+	kfree(win64);
+remove_window:
+	__remove_dma_window(pdn, ddw_avail, create.liobn);
+
+out_failed:
+	fpdn = kzalloc(sizeof(*fpdn), GFP_KERNEL);
+	if (!fpdn)
+		goto out_unlock;
+	fpdn->pdn = pdn;
+	list_add(&fpdn->list, &failed_ddw_pdn_list);
+
+out_unlock:
+	mutex_unlock(&dma_win_init_mutex);
+
+	return ret;
+exit:
 	*ptbl = iommu_tce_table_get(tbl);
 	return 0;
 }
@@ -1795,6 +2136,7 @@ static void spapr_tce_release_ownership(struct iommu_table_group *table_group)
 struct iommu_table_group_ops spapr_tce_table_group_ops = {
 	.get_table_size = spapr_tce_get_table_size,
 	.create_table = spapr_tce_create_table,
+	.init_group = spapr_tce_init_group,
 	.set_window = spapr_tce_set_window,
 	.unset_window = spapr_tce_unset_window,
 	.take_ownership = spapr_tce_take_ownership,



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create
  2024-03-12 18:14   ` Shivaprasad G Bhat
@ 2024-03-13 12:53     ` Michael Ellerman
  -1 siblings, 0 replies; 21+ messages in thread
From: Michael Ellerman @ 2024-03-13 12:53 UTC (permalink / raw)
  To: Shivaprasad G Bhat, tpearson, alex.williamson, linuxppc-dev
  Cc: npiggin, christophe.leroy, aneesh.kumar, naveen.n.rao, gbatra,
	brking, sbhat, aik, jgg, robh, linux-kernel, kvm, aik, msuchanek,
	jroedel, vaibhav, svaidy

Hi Shivaprasad,

Shivaprasad G Bhat <sbhat@linux.ibm.com> writes:
> The commit 9d67c9433509 ("powerpc/iommu: Add \"borrowing\"
> iommu_table_group_ops") implemented the "borrow" mechanism for
> the pSeries SPAPR TCE. It did implement this support partially
> that it left out creating the DDW if not present already.
>
> The patch here attempts to fix the missing gaps.
>  - Expose the DDW info to user by collecting it during probe.
>  - Create the window and the iommu table if not present during
>    VFIO_SPAPR_TCE_CREATE.
>  - Remove and recreate the window if the pageshift and window sizes
>    do not match.
>  - Restore the original window in enable_ddw() if the user had
>    created/modified the DDW. As there is preference for DIRECT mapping
>    on the host driver side, the user created window is removed.
>
> The changes work only for the non-SRIOV-VF scenarios for PEs having
> 2 DMA windows.

This crashes on powernv.

Full log at https://github.com/linuxppc/linux-snowpatch/actions/runs/8253875566/job/22577897225.

[    0.958561][    T1] pci_bus 0002:01: Configuring PE for bus
[    0.959699][    T1] pci 0002:01     : [PE# fd] Secondary bus 0x0000000000000001 associated with PE#fd
[    0.961692][    T1] pci 0002:01:00.0: Configured PE#fd
[    0.962424][    T1] pci 0002:01     : [PE# fd] Setting up 32-bit TCE table at 0..80000000
[    0.966424][    T1] IOMMU table initialized, virtual merging enabled
[    0.967544][    T1] pci 0002:01     : [PE# fd] Setting up window#0 0..ffffffff pg=10000
[    0.969362][    T1] pci 0002:01     : [PE# fd] Enabling 64-bit DMA bypass
[    0.971386][    T1] pci 0002:01:00.0: Adding to iommu group 0
[    0.973481][    T1] BUG: Unable to handle kernel instruction fetch (NULL pointer?)
[    0.974388][    T1] Faulting instruction address: 0x00000000
[    0.975578][    T1] Oops: Kernel access of bad area, sig: 11 [#1]
[    0.976476][    T1] LE PAGE_SIZE=64K MMU=Hash SMP ERROR: Error: saw oops/warning etc. while expecting NR_CPUS=2048 NUMA PowerNV
[    0.977777][    T1] Modules linked in:
[    0.978570][    T1] CPU: 1 PID: 1 Comm: swapper/1 Not tainted 6.8.0-rc6-g80dcb4e6d0aa #1
[    0.979766][    T1] Hardware name: IBM PowerNV (emulated by qemu) POWER8 0x4d0200 opal:v6.8-104-g820d43c0 PowerNV
[    0.981197][    T1] NIP:  0000000000000000 LR: c00000000005653c CTR: 0000000000000000
[    0.982221][    T1] REGS: c000000003687420 TRAP: 0480   Not tainted  (6.8.0-rc6-g80dcb4e6d0aa)
[    0.983400][    T1] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44004422  XER: 00000000
[    0.984742][    T1] CFAR: c000000000056538 IRQMASK: 0 
[    0.984742][    T1] GPR00: c000000000056520 c0000000036876c0 c0000000015b9800 c00000000363ae58 
[    0.984742][    T1] GPR04: c00000000352f0a0 c0000000026d4748 0000000000000001 0000000000000000 
[    0.984742][    T1] GPR08: 0000000000000000 c000000002716668 0000000000000003 0000000000008000 
[    0.984742][    T1] GPR12: 0000000000000000 c000000002be0000 c0000000000110cc 0000000000000000 
[    0.984742][    T1] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[    0.984742][    T1] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000001 
[    0.984742][    T1] GPR24: c0000000014681d8 0000000000000000 c000000003068a00 0000000000000001 
[    0.984742][    T1] GPR28: c000000003068a00 0000000000000000 c00000000363ae58 c00000000352f0a0 
[    0.994647][    T1] NIP [0000000000000000] 0x0
[    0.995699][    T1] LR [c00000000005653c] spapr_tce_platform_iommu_attach_dev+0x74/0xc8
[    0.997399][    T1] Call Trace:
[    0.997897][    T1] [c0000000036876c0] [c000000000056514] spapr_tce_platform_iommu_attach_dev+0x4c/0xc8 (unreliable)
[    0.999383][    T1] [c000000003687700] [c000000000b383dc] __iommu_attach_device+0x44/0xfc
[    1.000476][    T1] [c000000003687730] [c000000000b38574] __iommu_device_set_domain+0xe0/0x170
[    1.001728][    T1] [c0000000036877c0] [c000000000b3869c] __iommu_group_set_domain_internal+0x98/0x1c0
[    1.003014][    T1] [c000000003687820] [c000000000b3bb10] iommu_setup_default_domain+0x544/0x650
[    1.004306][    T1] [c0000000036878e0] [c000000000b3d3b4] __iommu_probe_device+0x5b0/0x604
[    1.005500][    T1] [c000000003687950] [c000000000b3d454] iommu_probe_device+0x4c/0xb0
[    1.006563][    T1] [c000000003687990] [c00000000005648c] iommu_add_device+0x3c/0x78
[    1.007590][    T1] [c0000000036879b0] [c0000000000db920] pnv_pci_ioda_dma_dev_setup+0x168/0x73c
[    1.008918][    T1] [c000000003687a60] [c0000000000729f4] pcibios_bus_add_device+0x80/0x328
[    1.010077][    T1] [c000000003687ac0] [c000000000a49fa0] pci_bus_add_device+0x30/0x11c
[    1.011169][    T1] [c000000003687b30] [c000000000a4a0e4] pci_bus_add_devices+0x58/0xb4
[    1.012230][    T1] [c000000003687b70] [c000000000a4a118] pci_bus_add_devices+0x8c/0xb4
[    1.013301][    T1] [c000000003687bb0] [c00000000201a3c8] pcibios_init+0xd8/0x140
[    1.014314][    T1] [c000000003687c30] [c000000000010d58] do_one_initcall+0x80/0x2f8
[    1.015349][    T1] [c000000003687d00] [c000000002005b0c] kernel_init_freeable+0x31c/0x510
[    1.016470][    T1] [c000000003687de0] [c0000000000110f8] kernel_init+0x34/0x25c
[    1.017527][    T1] [c000000003687e50] [c00000000000debc] ret_from_kernel_user_thread+0x14/0x1c
[    1.018778][    T1] --- interrupt: 0 at 0x0
[    1.019525][    T1] Code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[    1.022234][    T1] ---[ end trace 0000000000000000 ]---
[    1.022983][    T1] 
[    2.023819][    T1] note: swapper/1[1] exited with irqs disabled
[    2.025051][    T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    2.027371][    T1] Rebooting in 10 seconds.


cheers

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create
@ 2024-03-13 12:53     ` Michael Ellerman
  0 siblings, 0 replies; 21+ messages in thread
From: Michael Ellerman @ 2024-03-13 12:53 UTC (permalink / raw)
  To: Shivaprasad G Bhat, tpearson, alex.williamson, linuxppc-dev
  Cc: robh, jroedel, sbhat, gbatra, jgg, aik, linux-kernel, svaidy,
	aneesh.kumar, brking, npiggin, kvm, naveen.n.rao, vaibhav,
	msuchanek, aik

Hi Shivaprasad,

Shivaprasad G Bhat <sbhat@linux.ibm.com> writes:
> The commit 9d67c9433509 ("powerpc/iommu: Add \"borrowing\"
> iommu_table_group_ops") implemented the "borrow" mechanism for
> the pSeries SPAPR TCE. It did implement this support partially
> that it left out creating the DDW if not present already.
>
> The patch here attempts to fix the missing gaps.
>  - Expose the DDW info to user by collecting it during probe.
>  - Create the window and the iommu table if not present during
>    VFIO_SPAPR_TCE_CREATE.
>  - Remove and recreate the window if the pageshift and window sizes
>    do not match.
>  - Restore the original window in enable_ddw() if the user had
>    created/modified the DDW. As there is preference for DIRECT mapping
>    on the host driver side, the user created window is removed.
>
> The changes work only for the non-SRIOV-VF scenarios for PEs having
> 2 DMA windows.

This crashes on powernv.

Full log at https://github.com/linuxppc/linux-snowpatch/actions/runs/8253875566/job/22577897225.

[    0.958561][    T1] pci_bus 0002:01: Configuring PE for bus
[    0.959699][    T1] pci 0002:01     : [PE# fd] Secondary bus 0x0000000000000001 associated with PE#fd
[    0.961692][    T1] pci 0002:01:00.0: Configured PE#fd
[    0.962424][    T1] pci 0002:01     : [PE# fd] Setting up 32-bit TCE table at 0..80000000
[    0.966424][    T1] IOMMU table initialized, virtual merging enabled
[    0.967544][    T1] pci 0002:01     : [PE# fd] Setting up window#0 0..ffffffff pg=10000
[    0.969362][    T1] pci 0002:01     : [PE# fd] Enabling 64-bit DMA bypass
[    0.971386][    T1] pci 0002:01:00.0: Adding to iommu group 0
[    0.973481][    T1] BUG: Unable to handle kernel instruction fetch (NULL pointer?)
[    0.974388][    T1] Faulting instruction address: 0x00000000
[    0.975578][    T1] Oops: Kernel access of bad area, sig: 11 [#1]
[    0.976476][    T1] LE PAGE_SIZE=64K MMU=Hash SMP ERROR: Error: saw oops/warning etc. while expecting NR_CPUS=2048 NUMA PowerNV
[    0.977777][    T1] Modules linked in:
[    0.978570][    T1] CPU: 1 PID: 1 Comm: swapper/1 Not tainted 6.8.0-rc6-g80dcb4e6d0aa #1
[    0.979766][    T1] Hardware name: IBM PowerNV (emulated by qemu) POWER8 0x4d0200 opal:v6.8-104-g820d43c0 PowerNV
[    0.981197][    T1] NIP:  0000000000000000 LR: c00000000005653c CTR: 0000000000000000
[    0.982221][    T1] REGS: c000000003687420 TRAP: 0480   Not tainted  (6.8.0-rc6-g80dcb4e6d0aa)
[    0.983400][    T1] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44004422  XER: 00000000
[    0.984742][    T1] CFAR: c000000000056538 IRQMASK: 0 
[    0.984742][    T1] GPR00: c000000000056520 c0000000036876c0 c0000000015b9800 c00000000363ae58 
[    0.984742][    T1] GPR04: c00000000352f0a0 c0000000026d4748 0000000000000001 0000000000000000 
[    0.984742][    T1] GPR08: 0000000000000000 c000000002716668 0000000000000003 0000000000008000 
[    0.984742][    T1] GPR12: 0000000000000000 c000000002be0000 c0000000000110cc 0000000000000000 
[    0.984742][    T1] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[    0.984742][    T1] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000001 
[    0.984742][    T1] GPR24: c0000000014681d8 0000000000000000 c000000003068a00 0000000000000001 
[    0.984742][    T1] GPR28: c000000003068a00 0000000000000000 c00000000363ae58 c00000000352f0a0 
[    0.994647][    T1] NIP [0000000000000000] 0x0
[    0.995699][    T1] LR [c00000000005653c] spapr_tce_platform_iommu_attach_dev+0x74/0xc8
[    0.997399][    T1] Call Trace:
[    0.997897][    T1] [c0000000036876c0] [c000000000056514] spapr_tce_platform_iommu_attach_dev+0x4c/0xc8 (unreliable)
[    0.999383][    T1] [c000000003687700] [c000000000b383dc] __iommu_attach_device+0x44/0xfc
[    1.000476][    T1] [c000000003687730] [c000000000b38574] __iommu_device_set_domain+0xe0/0x170
[    1.001728][    T1] [c0000000036877c0] [c000000000b3869c] __iommu_group_set_domain_internal+0x98/0x1c0
[    1.003014][    T1] [c000000003687820] [c000000000b3bb10] iommu_setup_default_domain+0x544/0x650
[    1.004306][    T1] [c0000000036878e0] [c000000000b3d3b4] __iommu_probe_device+0x5b0/0x604
[    1.005500][    T1] [c000000003687950] [c000000000b3d454] iommu_probe_device+0x4c/0xb0
[    1.006563][    T1] [c000000003687990] [c00000000005648c] iommu_add_device+0x3c/0x78
[    1.007590][    T1] [c0000000036879b0] [c0000000000db920] pnv_pci_ioda_dma_dev_setup+0x168/0x73c
[    1.008918][    T1] [c000000003687a60] [c0000000000729f4] pcibios_bus_add_device+0x80/0x328
[    1.010077][    T1] [c000000003687ac0] [c000000000a49fa0] pci_bus_add_device+0x30/0x11c
[    1.011169][    T1] [c000000003687b30] [c000000000a4a0e4] pci_bus_add_devices+0x58/0xb4
[    1.012230][    T1] [c000000003687b70] [c000000000a4a118] pci_bus_add_devices+0x8c/0xb4
[    1.013301][    T1] [c000000003687bb0] [c00000000201a3c8] pcibios_init+0xd8/0x140
[    1.014314][    T1] [c000000003687c30] [c000000000010d58] do_one_initcall+0x80/0x2f8
[    1.015349][    T1] [c000000003687d00] [c000000002005b0c] kernel_init_freeable+0x31c/0x510
[    1.016470][    T1] [c000000003687de0] [c0000000000110f8] kernel_init+0x34/0x25c
[    1.017527][    T1] [c000000003687e50] [c00000000000debc] ret_from_kernel_user_thread+0x14/0x1c
[    1.018778][    T1] --- interrupt: 0 at 0x0
[    1.019525][    T1] Code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[    1.022234][    T1] ---[ end trace 0000000000000000 ]---
[    1.022983][    T1] 
[    2.023819][    T1] note: swapper/1[1] exited with irqs disabled
[    2.025051][    T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    2.027371][    T1] Rebooting in 10 seconds.


cheers

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create
  2024-03-12 18:14   ` Shivaprasad G Bhat
  (?)
  (?)
@ 2024-03-14 15:10   ` kernel test robot
  -1 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2024-03-14 15:10 UTC (permalink / raw)
  To: Shivaprasad G Bhat; +Cc: llvm, oe-kbuild-all

Hi Shivaprasad,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on v6.8]
[also build test WARNING on linus/master next-20240314]
[cannot apply to powerpc/next powerpc/fixes awilliam-vfio/next awilliam-vfio/for-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Shivaprasad-G-Bhat/powerpc-pseries-iommu-Bring-back-userspace-view-for-single-level-TCE-tables/20240313-022030
base:   v6.8
patch link:    https://lore.kernel.org/r/171026728072.8367.13581504605624115205.stgit%40linux.ibm.com
patch subject: [RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create
config: powerpc-allyesconfig (https://download.01.org/0day-ci/archive/20240314/202403142228.8FfmsLis-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 503c55e17037436dcd45ac69dea8967e67e3f5e8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240314/202403142228.8FfmsLis-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202403142228.8FfmsLis-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from arch/powerpc/platforms/pseries/iommu.c:16:
   In file included from include/linux/mm.h:2188:
   include/linux/vmstat.h:508:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     508 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     509 |                            item];
         |                            ~~~~
   include/linux/vmstat.h:515:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     515 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     516 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:522:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     522 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   include/linux/vmstat.h:527:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     527 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     528 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:536:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     536 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     537 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
>> arch/powerpc/platforms/pseries/iommu.c:1971:20: warning: variable 'pdn' is uninitialized when used here [-Wuninitialized]
    1971 |                 if (fpdn->pdn == pdn) {
         |                                  ^~~
   arch/powerpc/platforms/pseries/iommu.c:1937:60: note: initialize the variable 'pdn' to silence this warning
    1937 |         struct device_node *dn = pci_device_to_OF_node(pdev), *pdn;
         |                                                                   ^
         |                                                                    = NULL
   6 warnings generated.


vim +/pdn +1971 arch/powerpc/platforms/pseries/iommu.c

  1931	
  1932	static long spapr_tce_create_table(struct iommu_table_group *table_group, int num,
  1933					   __u32 page_shift, __u64 window_size, __u32 levels,
  1934					   struct iommu_table **ptbl)
  1935	{
  1936		struct pci_dev *pdev = iommu_group_get_first_pci_dev(table_group->group);
  1937		struct device_node *dn = pci_device_to_OF_node(pdev), *pdn;
  1938		struct iommu_table *tbl = table_group->tables[num];
  1939		u32 window_shift = order_base_2(window_size);
  1940		u32 ddw_avail[DDW_APPLICABLE_SIZE];
  1941		struct ddw_create_response create;
  1942		struct ddw_query_response query;
  1943		unsigned long start = 0, end = 0;
  1944		struct failed_ddw_pdn *fpdn;
  1945		struct dma_win *window;
  1946		struct property *win64;
  1947		struct pci_dn *pci;
  1948		int len, ret = 0;
  1949		u64 win_addr;
  1950	
  1951		if (num > 1)
  1952			return -EPERM;
  1953	
  1954		if (tbl && (tbl->it_page_shift == page_shift) &&
  1955			(tbl->it_size == (window_size >> page_shift)) &&
  1956			(tbl->it_indirect_levels == levels - 1))
  1957			goto exit;
  1958	
  1959		if (num == 0)
  1960			return -EINVAL; /* Can't modify the default window. */
  1961	
  1962		/* TODO: The SRIO-VFs have only 1 window. */
  1963		if (table_group->max_dynamic_windows_supported == 1)
  1964			return -EPERM;
  1965	
  1966		mutex_lock(&dma_win_init_mutex);
  1967	
  1968		ret = -ENODEV;
  1969		/* If the enable DDW failed for the pdn, dont retry! */
  1970		list_for_each_entry(fpdn, &failed_ddw_pdn_list, list) {
> 1971			if (fpdn->pdn == pdn) {
  1972				pr_err("%s: %pOF in failed DDW device list\n", __func__, pdn);
  1973				goto out_unlock;
  1974			}
  1975		}
  1976	
  1977		pdn = pci_dma_find(dn, NULL);
  1978		if (!pdn || !PCI_DN(pdn)) { /* Niether of 32s|64-bit exist! */
  1979			pr_err("%s: No dma-windows exist for the node %pOF\n", __func__, pdn);
  1980			goto out_failed;
  1981		}
  1982	
  1983		/* The existing ddw didn't match the size/shift */
  1984		if (remove_dynamic_dma_windows_locked(table_group, pdev)) {
  1985			pr_err("%s: The existing DDW remova failed for node %pOF\n", __func__, pdn);
  1986			goto out_failed; /* Could not remove it either! */
  1987		}
  1988	
  1989		pci = PCI_DN(pdn);
  1990		ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
  1991					&ddw_avail[0], DDW_APPLICABLE_SIZE);
  1992		if (ret) {
  1993			pr_err("%s: ibm,ddw-applicable not found\n", __func__);
  1994			goto out_failed;
  1995		}
  1996	
  1997		ret = query_ddw(pdev, ddw_avail, &query, pdn);
  1998		if (ret)
  1999			goto out_failed;
  2000		ret = -ENODEV;
  2001	
  2002		len = window_shift;
  2003		if (query.largest_available_block < (1ULL << (len - page_shift))) {
  2004			dev_dbg(&pdev->dev, "can't map window 0x%llx with %llu %llu-sized pages\n",
  2005					1ULL << len, query.largest_available_block,
  2006					1ULL << page_shift);
  2007			ret = -EINVAL; /* Retry with smaller window size */
  2008			goto out_unlock;
  2009		}
  2010	
  2011		if (create_ddw(pdev, ddw_avail, &create, page_shift, len))
  2012			goto out_failed;
  2013	
  2014		win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
  2015		win64 = ddw_property_create(DMA64_PROPNAME, create.liobn, win_addr, page_shift, len);
  2016		if (!win64)
  2017			goto remove_window;
  2018	
  2019		ret = of_add_property(pdn, win64);
  2020		if (ret) {
  2021			dev_err(&pdev->dev, "unable to add DMA window property for %pOF: %d",
  2022				pdn, ret);
  2023			goto free_property;
  2024		}
  2025		ret = -ENODEV;
  2026	
  2027		window = ddw_list_new_entry(pdn, win64->value);
  2028		if (!window)
  2029			goto remove_property;
  2030	
  2031		window->direct = false;
  2032	
  2033		if (tbl) {
  2034			iommu_table_reset(tbl, pci->phb->bus->number, create.liobn, win_addr,
  2035					  1UL << len, page_shift, NULL, &iommu_table_lpar_multi_ops);
  2036		} else {
  2037			tbl = iommu_pseries_alloc_table(pci->phb->node);
  2038			if (!tbl) {
  2039				dev_err(&pdev->dev, "couldn't create new IOMMU table\n");
  2040				goto free_window;
  2041			}
  2042			iommu_table_setparms_common(tbl, pci->phb->bus->number, create.liobn, win_addr,
  2043						    1UL << len, page_shift, NULL,
  2044						    &iommu_table_lpar_multi_ops);
  2045			iommu_init_table(tbl, pci->phb->node, start, end);
  2046		}
  2047	
  2048		tbl->reset_ddw = true;
  2049		pci->table_group->tables[1] = tbl;
  2050		set_iommu_table_base(&pdev->dev, tbl);
  2051		pdev->dev.archdata.dma_offset = win_addr;
  2052	
  2053		spin_lock(&dma_win_list_lock);
  2054		list_add(&window->list, &dma_win_list);
  2055		spin_unlock(&dma_win_list_lock);
  2056	
  2057		mutex_unlock(&dma_win_init_mutex);
  2058	
  2059		goto exit;
  2060	
  2061	free_window:
  2062		kfree(window);
  2063	remove_property:
  2064		of_remove_property(pdn, win64);
  2065	free_property:
  2066		kfree(win64->name);
  2067		kfree(win64->value);
  2068		kfree(win64);
  2069	remove_window:
  2070		__remove_dma_window(pdn, ddw_avail, create.liobn);
  2071	
  2072	out_failed:
  2073		fpdn = kzalloc(sizeof(*fpdn), GFP_KERNEL);
  2074		if (!fpdn)
  2075			goto out_unlock;
  2076		fpdn->pdn = pdn;
  2077		list_add(&fpdn->list, &failed_ddw_pdn_list);
  2078	
  2079	out_unlock:
  2080		mutex_unlock(&dma_win_init_mutex);
  2081	
  2082		return ret;
  2083	exit:
  2084		*ptbl = iommu_tce_table_get(tbl);
  2085		return 0;
  2086	}
  2087	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
  2024-03-12 18:14   ` Shivaprasad G Bhat
@ 2024-03-19 14:32     ` Jason Gunthorpe
  -1 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2024-03-19 14:32 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: tpearson, alex.williamson, linuxppc-dev, mpe, npiggin,
	christophe.leroy, aneesh.kumar, naveen.n.rao, gbatra, brking,
	aik, robh, linux-kernel, kvm, aik, msuchanek, jroedel, vaibhav,
	svaidy

On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:
> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
> it_userspace") which implemented the tce indirect levels
> support for PowerNV ended up removing the single level support
> which existed by default(generic tce_iommu_userspace_view_alloc/free()
> calls). On pSeries the TCEs are single level, and the allocation
> of userspace view is lost with the removal of generic code.

:( :(

If this has been broken since 2018 and nobody cared till now can we
please go in a direction of moving this code to the new iommu APIs
instead of doubling down on more of this old stuff that apparently
almost nobody cares about ??

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
@ 2024-03-19 14:32     ` Jason Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2024-03-19 14:32 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: svaidy, robh, jroedel, kvm, gbatra, aik, alex.williamson,
	linux-kernel, aneesh.kumar, brking, tpearson, npiggin,
	naveen.n.rao, vaibhav, msuchanek, linuxppc-dev, aik

On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:
> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
> it_userspace") which implemented the tce indirect levels
> support for PowerNV ended up removing the single level support
> which existed by default(generic tce_iommu_userspace_view_alloc/free()
> calls). On pSeries the TCEs are single level, and the allocation
> of userspace view is lost with the removal of generic code.

:( :(

If this has been broken since 2018 and nobody cared till now can we
please go in a direction of moving this code to the new iommu APIs
instead of doubling down on more of this old stuff that apparently
almost nobody cares about ??

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
  2024-03-19 14:32     ` Jason Gunthorpe
@ 2024-03-19 18:36       ` Timothy Pearson
  -1 siblings, 0 replies; 21+ messages in thread
From: Timothy Pearson @ 2024-03-19 18:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: svaidy, robh, jroedel, gbatra, Shivaprasad G Bhat,
	Alexey Kardashevskiy, Alex Williamson, npiggin, linux-kernel,
	aneesh kumar, brking, Timothy Pearson, kvm, naveen n rao,
	vaibhav, msuchanek, linuxppc-dev, aik



----- Original Message -----
> From: "Jason Gunthorpe" <jgg@ziepe.ca>
> To: "Shivaprasad G Bhat" <sbhat@linux.ibm.com>
> Cc: "Timothy Pearson" <tpearson@raptorengineering.com>, "Alex Williamson" <alex.williamson@redhat.com>, "linuxppc-dev"
> <linuxppc-dev@lists.ozlabs.org>, "Michael Ellerman" <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>, "christophe
> leroy" <christophe.leroy@csgroup.eu>, "aneesh kumar" <aneesh.kumar@kernel.org>, "naveen n rao"
> <naveen.n.rao@linux.ibm.com>, "gbatra" <gbatra@linux.vnet.ibm.com>, brking@linux.vnet.ibm.com, "Alexey Kardashevskiy"
> <aik@ozlabs.ru>, robh@kernel.org, "linux-kernel" <linux-kernel@vger.kernel.org>, "kvm" <kvm@vger.kernel.org>, "aik"
> <aik@amd.com>, msuchanek@suse.de, "jroedel" <jroedel@suse.de>, "vaibhav" <vaibhav@linux.ibm.com>, svaidy@linux.ibm.com
> Sent: Tuesday, March 19, 2024 9:32:02 AM
> Subject: Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables

> On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:
>> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
>> it_userspace") which implemented the tce indirect levels
>> support for PowerNV ended up removing the single level support
>> which existed by default(generic tce_iommu_userspace_view_alloc/free()
>> calls). On pSeries the TCEs are single level, and the allocation
>> of userspace view is lost with the removal of generic code.
> 
> :( :(
> 
> If this has been broken since 2018 and nobody cared till now can we
> please go in a direction of moving this code to the new iommu APIs
> instead of doubling down on more of this old stuff that apparently
> almost nobody cares about ??
> 
> Jason

Just FYI Raptor is working on porting things over to the new APIs.  RFC patches should be posted in the next week or two.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
@ 2024-03-19 18:36       ` Timothy Pearson
  0 siblings, 0 replies; 21+ messages in thread
From: Timothy Pearson @ 2024-03-19 18:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shivaprasad G Bhat, Timothy Pearson, Alex Williamson,
	linuxppc-dev, Michael Ellerman, npiggin, christophe leroy,
	aneesh kumar, naveen n rao, gbatra, brking, Alexey Kardashevskiy,
	robh, linux-kernel, kvm, aik, msuchanek, jroedel, vaibhav,
	svaidy



----- Original Message -----
> From: "Jason Gunthorpe" <jgg@ziepe.ca>
> To: "Shivaprasad G Bhat" <sbhat@linux.ibm.com>
> Cc: "Timothy Pearson" <tpearson@raptorengineering.com>, "Alex Williamson" <alex.williamson@redhat.com>, "linuxppc-dev"
> <linuxppc-dev@lists.ozlabs.org>, "Michael Ellerman" <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>, "christophe
> leroy" <christophe.leroy@csgroup.eu>, "aneesh kumar" <aneesh.kumar@kernel.org>, "naveen n rao"
> <naveen.n.rao@linux.ibm.com>, "gbatra" <gbatra@linux.vnet.ibm.com>, brking@linux.vnet.ibm.com, "Alexey Kardashevskiy"
> <aik@ozlabs.ru>, robh@kernel.org, "linux-kernel" <linux-kernel@vger.kernel.org>, "kvm" <kvm@vger.kernel.org>, "aik"
> <aik@amd.com>, msuchanek@suse.de, "jroedel" <jroedel@suse.de>, "vaibhav" <vaibhav@linux.ibm.com>, svaidy@linux.ibm.com
> Sent: Tuesday, March 19, 2024 9:32:02 AM
> Subject: Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables

> On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:
>> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
>> it_userspace") which implemented the tce indirect levels
>> support for PowerNV ended up removing the single level support
>> which existed by default(generic tce_iommu_userspace_view_alloc/free()
>> calls). On pSeries the TCEs are single level, and the allocation
>> of userspace view is lost with the removal of generic code.
> 
> :( :(
> 
> If this has been broken since 2018 and nobody cared till now can we
> please go in a direction of moving this code to the new iommu APIs
> instead of doubling down on more of this old stuff that apparently
> almost nobody cares about ??
> 
> Jason

Just FYI Raptor is working on porting things over to the new APIs.  RFC patches should be posted in the next week or two.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
  2024-03-19 14:32     ` Jason Gunthorpe
@ 2024-03-20 15:29       ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-20 15:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: tpearson, alex.williamson, linuxppc-dev, mpe, npiggin,
	christophe.leroy, aneesh.kumar, naveen.n.rao, gbatra, brking,
	aik, robh, linux-kernel, kvm, aik, msuchanek, jroedel, vaibhav,
	svaidy

Hi Jason,

On 3/19/24 20:02, Jason Gunthorpe wrote:
> On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:
>> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
>> it_userspace") which implemented the tce indirect levels
>> support for PowerNV ended up removing the single level support
>> which existed by default(generic tce_iommu_userspace_view_alloc/free()
>> calls). On pSeries the TCEs are single level, and the allocation
>> of userspace view is lost with the removal of generic code.
> :( :(
>
> If this has been broken since 2018 and nobody cared till now can we
> please go in a direction of moving this code to the new iommu APIs
> instead of doubling down on more of this old stuff that apparently
> almost nobody cares about ??

We have existing software stack deployments using VFIO userspace
device assignment running on Power platform. We have to enable
similar software stack on newer generation Power10 platform and
also in a pSeries lpar environment. These distros rely on VFIO enabled
in kernel and currently have IOMMUFD disabled. This patch series is
a simpler low risk enablement that functionally get the software stack
working while we continue to enable and move to IOMMUFD in phases.
We have to fix the older APIs in order to stage the functional enablement
in small increments.

We are working on iommufd support for pSeries and looking forward
to Timothy's patches.


-Thanks

Shivaprasad

> Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
@ 2024-03-20 15:29       ` Shivaprasad G Bhat
  0 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-20 15:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: svaidy, robh, jroedel, kvm, gbatra, aik, alex.williamson,
	linux-kernel, aneesh.kumar, brking, tpearson, npiggin,
	naveen.n.rao, vaibhav, msuchanek, linuxppc-dev, aik

Hi Jason,

On 3/19/24 20:02, Jason Gunthorpe wrote:
> On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:
>> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
>> it_userspace") which implemented the tce indirect levels
>> support for PowerNV ended up removing the single level support
>> which existed by default(generic tce_iommu_userspace_view_alloc/free()
>> calls). On pSeries the TCEs are single level, and the allocation
>> of userspace view is lost with the removal of generic code.
> :( :(
>
> If this has been broken since 2018 and nobody cared till now can we
> please go in a direction of moving this code to the new iommu APIs
> instead of doubling down on more of this old stuff that apparently
> almost nobody cares about ??

We have existing software stack deployments using VFIO userspace
device assignment running on Power platform. We have to enable
similar software stack on newer generation Power10 platform and
also in a pSeries lpar environment. These distros rely on VFIO enabled
in kernel and currently have IOMMUFD disabled. This patch series is
a simpler low risk enablement that functionally get the software stack
working while we continue to enable and move to IOMMUFD in phases.
We have to fix the older APIs in order to stage the functional enablement
in small increments.

We are working on iommufd support for pSeries and looking forward
to Timothy's patches.


-Thanks

Shivaprasad

> Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
  2024-03-19 14:32     ` Jason Gunthorpe
@ 2024-03-22  5:49       ` Michael Ellerman
  -1 siblings, 0 replies; 21+ messages in thread
From: Michael Ellerman @ 2024-03-22  5:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Shivaprasad G Bhat
  Cc: tpearson, alex.williamson, linuxppc-dev, npiggin,
	christophe.leroy, aneesh.kumar, naveen.n.rao, gbatra, brking,
	aik, robh, linux-kernel, kvm, aik, msuchanek, jroedel, vaibhav,
	svaidy

Jason Gunthorpe <jgg@ziepe.ca> writes:
> On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:
>> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
>> it_userspace") which implemented the tce indirect levels
>> support for PowerNV ended up removing the single level support
>> which existed by default(generic tce_iommu_userspace_view_alloc/free()
>> calls). On pSeries the TCEs are single level, and the allocation
>> of userspace view is lost with the removal of generic code.
>
> :( :(
>
> If this has been broken since 2018 and nobody cared till now can we
> please go in a direction of moving this code to the new iommu APIs
> instead of doubling down on more of this old stuff that apparently
> almost nobody cares about ??

It's broken *on pseries* (Linux as a guest), but it works fine on
powernv (aka bare metal, aka Linux as Hypervisor).

What's changed is folks are now testing it on pseries with Linux as a
nested hypervisor.

cheers

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
@ 2024-03-22  5:49       ` Michael Ellerman
  0 siblings, 0 replies; 21+ messages in thread
From: Michael Ellerman @ 2024-03-22  5:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Shivaprasad G Bhat
  Cc: svaidy, robh, jroedel, kvm, gbatra, aik, alex.williamson,
	linux-kernel, aneesh.kumar, brking, tpearson, npiggin,
	naveen.n.rao, vaibhav, msuchanek, linuxppc-dev, aik

Jason Gunthorpe <jgg@ziepe.ca> writes:
> On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:
>> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
>> it_userspace") which implemented the tce indirect levels
>> support for PowerNV ended up removing the single level support
>> which existed by default(generic tce_iommu_userspace_view_alloc/free()
>> calls). On pSeries the TCEs are single level, and the allocation
>> of userspace view is lost with the removal of generic code.
>
> :( :(
>
> If this has been broken since 2018 and nobody cared till now can we
> please go in a direction of moving this code to the new iommu APIs
> instead of doubling down on more of this old stuff that apparently
> almost nobody cares about ??

It's broken *on pseries* (Linux as a guest), but it works fine on
powernv (aka bare metal, aka Linux as Hypervisor).

What's changed is folks are now testing it on pseries with Linux as a
nested hypervisor.

cheers

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create
  2024-03-13 12:53     ` Michael Ellerman
@ 2024-03-26  4:56       ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-26  4:56 UTC (permalink / raw)
  To: Michael Ellerman, tpearson, alex.williamson, linuxppc-dev
  Cc: npiggin, christophe.leroy, aneesh.kumar, naveen.n.rao, gbatra,
	brking, aik, jgg, robh, linux-kernel, kvm, aik, msuchanek,
	jroedel, vaibhav, svaidy

Hi Michael,

On 3/13/24 18:23, Michael Ellerman wrote:
> Hi Shivaprasad,
>
> Shivaprasad G Bhat <sbhat@linux.ibm.com> writes:
>> The commit 9d67c9433509 ("powerpc/iommu: Add \"borrowing\"
>> iommu_table_group_ops") implemented the "borrow" mechanism for
>> the pSeries SPAPR TCE. It did implement this support partially
>> that it left out creating the DDW if not present already.
>>
>> The patch here attempts to fix the missing gaps.
>>   - Expose the DDW info to user by collecting it during probe.
>>   - Create the window and the iommu table if not present during
>>     VFIO_SPAPR_TCE_CREATE.
>>   - Remove and recreate the window if the pageshift and window sizes
>>     do not match.
>>   - Restore the original window in enable_ddw() if the user had
>>     created/modified the DDW. As there is preference for DIRECT mapping
>>     on the host driver side, the user created window is removed.
>>
>> The changes work only for the non-SRIOV-VF scenarios for PEs having
>> 2 DMA windows.
> This crashes on powernv.


Thanks for pointing this out.  I will take care of this in v2 of this RFC.


Regards,

Shivaprasad


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create
@ 2024-03-26  4:56       ` Shivaprasad G Bhat
  0 siblings, 0 replies; 21+ messages in thread
From: Shivaprasad G Bhat @ 2024-03-26  4:56 UTC (permalink / raw)
  To: Michael Ellerman, tpearson, alex.williamson, linuxppc-dev
  Cc: robh, jroedel, kvm, gbatra, jgg, aik, linux-kernel, svaidy,
	aneesh.kumar, brking, npiggin, naveen.n.rao, vaibhav, msuchanek,
	aik

Hi Michael,

On 3/13/24 18:23, Michael Ellerman wrote:
> Hi Shivaprasad,
>
> Shivaprasad G Bhat <sbhat@linux.ibm.com> writes:
>> The commit 9d67c9433509 ("powerpc/iommu: Add \"borrowing\"
>> iommu_table_group_ops") implemented the "borrow" mechanism for
>> the pSeries SPAPR TCE. It did implement this support partially
>> that it left out creating the DDW if not present already.
>>
>> The patch here attempts to fix the missing gaps.
>>   - Expose the DDW info to user by collecting it during probe.
>>   - Create the window and the iommu table if not present during
>>     VFIO_SPAPR_TCE_CREATE.
>>   - Remove and recreate the window if the pageshift and window sizes
>>     do not match.
>>   - Restore the original window in enable_ddw() if the user had
>>     created/modified the DDW. As there is preference for DIRECT mapping
>>     on the host driver side, the user created window is removed.
>>
>> The changes work only for the non-SRIOV-VF scenarios for PEs having
>> 2 DMA windows.
> This crashes on powernv.


Thanks for pointing this out.  I will take care of this in v2 of this RFC.


Regards,

Shivaprasad


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-03-26  4:57 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-12 18:14 [RFC PATCH 0/3] powerpc: pSeries: vfio: iommu: Re-enable support for SPAPR TCE VFIO Shivaprasad G Bhat
2024-03-12 18:14 ` Shivaprasad G Bhat
2024-03-12 18:14 ` [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables Shivaprasad G Bhat
2024-03-12 18:14   ` Shivaprasad G Bhat
2024-03-19 14:32   ` Jason Gunthorpe
2024-03-19 14:32     ` Jason Gunthorpe
2024-03-19 18:36     ` Timothy Pearson
2024-03-19 18:36       ` Timothy Pearson
2024-03-20 15:29     ` Shivaprasad G Bhat
2024-03-20 15:29       ` Shivaprasad G Bhat
2024-03-22  5:49     ` Michael Ellerman
2024-03-22  5:49       ` Michael Ellerman
2024-03-12 18:14 ` [RFC PATCH 2/3] powerpc/iommu: Move pSeries specific functions to pseries/iommu.c Shivaprasad G Bhat
2024-03-12 18:14   ` Shivaprasad G Bhat
2024-03-12 18:14 ` [RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create Shivaprasad G Bhat
2024-03-12 18:14   ` Shivaprasad G Bhat
2024-03-13 12:53   ` Michael Ellerman
2024-03-13 12:53     ` Michael Ellerman
2024-03-26  4:56     ` Shivaprasad G Bhat
2024-03-26  4:56       ` Shivaprasad G Bhat
2024-03-14 15:10   ` kernel test robot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.