* [PATCH kernel v2 0/7] powerpc/powernv/ioda: Allow huge DMA window at 4GB
From: Alexey Kardashevskiy @ 2020-03-23  7:53 UTC
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Alistair Popple, Fabiano Rosas,
	Alexey Kardashevskiy

Here is an attempt to support a bigger DMA space for devices
supporting DMA masks of less than 59 bits (GPUs come to mind
first). POWER9 PHBs have an option to map 2 windows at 0
and select a window based on the DMA address being below or above
4GB.
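
To illustrate (a pseudocode sketch only; neither the names nor the code
are from the actual hardware/firmware interface), the window selection
in this mode is roughly:

	/* hypothetical: how the PHB picks a TVE with the 4GB option on */
	tve_index = (dma_addr < SZ_4G) ? 0 : 1; /* below 4GB: window 0 */
	tve = pe_tves[pe_number][tve_index];    /* above 4GB: huge window */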

This adds the "iommu=iommu_bypass" kernel parameter and
supports the VFIO+pseries machine case - currently this requires telling
upstream, unmodified QEMU about it via
-global spapr-pci-host-bridge.dma64_win_addr=0x100000000
or a per-phb property. Patch 7/7 advertises the new option but
there is no automation around it in QEMU (should there be?).

For now it is either the 1<<59 or the 4GB mode; dynamic switching is
not supported (it could be done via sysfs).

This is a rebased version of
https://lore.kernel.org/kvm/20191202015953.127902-1-aik@ozlabs.ru/

The main change since v1 is that the series is now split into 7 patches
with a clearer separation of steps.


This is based on 6c90b86a745a "Merge tag 'mmc-v5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc"

Please comment. Thanks.



Alexey Kardashevskiy (7):
  powerpc/powernv/ioda: Move TCE bypass base to PE
  powerpc/powernv/ioda: Rework for huge DMA window at 4GB
  powerpc/powernv/ioda: Allow smaller TCE table levels
  powerpc/powernv/phb4: Use IOMMU instead of bypassing
  powerpc/iommu: Add a window number to
    iommu_table_group_ops::get_table_size
  powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
  vfio/spapr_tce: Advertise and allow a huge DMA window at 4GB

 arch/powerpc/include/asm/iommu.h              |   3 +
 arch/powerpc/include/asm/opal-api.h           |   9 +-
 arch/powerpc/include/asm/opal.h               |   2 +
 arch/powerpc/platforms/powernv/pci.h          |   4 +-
 include/uapi/linux/vfio.h                     |   2 +
 arch/powerpc/platforms/powernv/npu-dma.c      |   1 +
 arch/powerpc/platforms/powernv/opal-call.c    |   2 +
 arch/powerpc/platforms/powernv/pci-ioda-tce.c |   4 +-
 arch/powerpc/platforms/powernv/pci-ioda.c     | 234 ++++++++++++++----
 drivers/vfio/vfio_iommu_spapr_tce.c           |  17 +-
 10 files changed, 213 insertions(+), 65 deletions(-)

-- 
2.17.1


* [PATCH kernel v2 1/7] powerpc/powernv/ioda: Move TCE bypass base to PE
From: Alexey Kardashevskiy @ 2020-03-23  7:53 UTC
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Alistair Popple, Fabiano Rosas,
	Alexey Kardashevskiy

We are about to allow another location for the second DMA window and
we will need to advertise it outside of the powernv platform code.

This moves the bypass base address to iommu_table_group so that drivers
such as VFIO SPAPR TCE can see it.
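
Once the base is in iommu_table_group, an external consumer could read
it along these lines (an illustrative sketch only; the actual VFIO
change comes later in the series):

	struct iommu_table_group *table_group = iommu_group_get_iommudata(grp);

	if (table_group)
		pr_info("64-bit window starts at 0x%llx\n",
				table_group->tce64_start);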

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          |  1 +
 arch/powerpc/platforms/powernv/pci.h      |  1 -
 arch/powerpc/platforms/powernv/npu-dma.c  |  1 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 ++++++++++-------------
 4 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 350101e11ddb..479439ef003e 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -194,6 +194,7 @@ struct iommu_table_group {
 	__u64 pgsizes; /* Bitmap of supported page sizes */
 	__u32 max_dynamic_windows_supported;
 	__u32 max_levels;
+	__u64 tce64_start;
 
 	struct iommu_group *group;
 	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index d3bbdeab3a32..a808dd396522 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -67,7 +67,6 @@ struct pnv_ioda_pe {
 
 	/* 64-bit TCE bypass region */
 	bool			tce_bypass_enabled;
-	uint64_t		tce_bypass_base;
 
 	/* MSIs. MVE index is identical for for 32 and 64 bit MSI
 	 * and -1 if not supported. (It's actually identical to the
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index b95b9e3c4c98..97a479848003 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -469,6 +469,7 @@ struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
 	table_group->tce32_start = pe->table_group.tce32_start;
 	table_group->tce32_size = pe->table_group.tce32_size;
 	table_group->max_levels = pe->table_group.max_levels;
+	table_group->tce64_start = pe->table_group.tce64_start;
 	if (!table_group->pgsizes)
 		table_group->pgsizes = pe->table_group.pgsizes;
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 22c22cd7bd82..52db10ab4fef 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1777,7 +1777,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pci_dev *pdev)
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	pdev->dev.archdata.dma_offset = pe->tce_bypass_base;
+	pdev->dev.archdata.dma_offset = pe->table_group.tce64_start;
 	set_iommu_table_base(&pdev->dev, pe->table_group.tables[0]);
 	/*
 	 * Note: iommu_add_device() will fail here as
@@ -1869,7 +1869,8 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev,
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	if (pe->tce_bypass_enabled) {
-		u64 top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1;
+		u64 top = pe->table_group.tce64_start +
+			memblock_end_of_DRAM() - 1;
 		if (dma_mask >= top)
 			return true;
 	}
@@ -1903,7 +1904,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
-		dev->dev.archdata.dma_offset = pe->tce_bypass_base;
+		dev->dev.archdata.dma_offset = pe->table_group.tce64_start;
 
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
@@ -2361,16 +2362,12 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 
 		top = roundup_pow_of_two(top);
 		rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
-						     pe->pe_number,
-						     window_id,
-						     pe->tce_bypass_base,
-						     top);
+				pe->pe_number, window_id,
+				pe->table_group.tce64_start, top);
 	} else {
 		rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
-						     pe->pe_number,
-						     window_id,
-						     pe->tce_bypass_base,
-						     0);
+				pe->pe_number, window_id,
+				pe->table_group.tce64_start, 0);
 	}
 	if (rc)
 		pe_err(pe, "OPAL error %lld configuring bypass window\n", rc);
@@ -2385,7 +2382,8 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 	struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
 			table_group);
 	int nid = pe->phb->hose->node;
-	__u64 bus_offset = num ? pe->tce_bypass_base : table_group->tce32_start;
+	__u64 bus_offset = num ?
+		pe->table_group.tce64_start : table_group->tce32_start;
 	long ret;
 	struct iommu_table *tbl;
 
@@ -2735,9 +2733,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	if (!pnv_pci_ioda_pe_dma_weight(pe))
 		return;
 
-	/* TVE #1 is selected by PCI address bit 59 */
-	pe->tce_bypass_base = 1ull << 59;
-
 	/* The PE will reserve all possible 32-bits space */
 	pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
 		phb->ioda.m32_pci_base);
@@ -2745,6 +2740,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	/* Setup linux iommu table */
 	pe->table_group.tce32_start = 0;
 	pe->table_group.tce32_size = phb->ioda.m32_pci_base;
+	pe->table_group.tce64_start = 1UL << 59;
 	pe->table_group.max_dynamic_windows_supported =
 			IOMMU_TABLE_GROUP_MAX_TABLES;
 	pe->table_group.max_levels = POWERNV_IOMMU_MAX_LEVELS;
-- 
2.17.1


* [PATCH kernel v2 2/7] powerpc/powernv/ioda: Rework for huge DMA window at 4GB
From: Alexey Kardashevskiy @ 2020-03-23  7:53 UTC
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Alistair Popple, Fabiano Rosas,
	Alexey Kardashevskiy

This moves code around to make the next patches look simpler. In particular:

1. Separate the locals declaration as we will be allocating a smaller DMA
window if the TVE1_4GB option (which allows a huge DMA window at 4GB) is
enabled;

2. Pass the bypass offset directly to pnv_pci_ioda2_create_table()
as it is the only information needed from @pe;

3. Use PAGE_SHIFT for the it_map allocation estimate and @tceshift for
the IOMMU page size; this makes the distinction clear and allows
easy switching between different IOMMU page sizes.

These changes should not cause any behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

I really need 1); 2) makes the code less dependent on the PE struct member
value (i.e. easier to follow); 3) is to enable 2MB IOMMU pages quickly for
the default DMA window for debugging/performance testing (see the sketch
below).
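
For example, with 3) in place, switching the default window to 2MB IOMMU
pages for a quick test should be a one-line change along these lines
(a sketch only, assuming the PHB advertises 2MB pages; not part of this
series):

	-	const unsigned int tceshift = PAGE_SHIFT;
	+	const unsigned int tceshift = 21; /* 2MB IOMMU pages = 1 << 21 */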
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 38 ++++++++++++-----------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 52db10ab4fef..f5f1b4e25530 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2375,15 +2375,10 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 		pe->tce_bypass_enabled = enable;
 }
 
-static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
-		int num, __u32 page_shift, __u64 window_size, __u32 levels,
+static long pnv_pci_ioda2_create_table(int nid, int num, __u64 bus_offset,
+		__u32 page_shift, __u64 window_size, __u32 levels,
 		bool alloc_userspace_copy, struct iommu_table **ptbl)
 {
-	struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
-			table_group);
-	int nid = pe->phb->hose->node;
-	__u64 bus_offset = num ?
-		pe->table_group.tce64_start : table_group->tce32_start;
 	long ret;
 	struct iommu_table *tbl;
 
@@ -2410,21 +2405,23 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 {
 	struct iommu_table *tbl = NULL;
 	long rc;
-	unsigned long res_start, res_end;
+	u64 max_memory, maxblock, window_size;
+	const unsigned int tceshift = PAGE_SHIFT;
+	unsigned long res_start, res_end, tces_order, tcelevel_order, levels;
 
 	/*
 	 * crashkernel= specifies the kdump kernel's maximum memory at
 	 * some offset and there is no guaranteed the result is a power
 	 * of 2, which will cause errors later.
 	 */
-	const u64 max_memory = __rounddown_pow_of_two(memory_hotplug_max());
+	max_memory = __rounddown_pow_of_two(memory_hotplug_max());
 
 	/*
 	 * In memory constrained environments, e.g. kdump kernel, the
 	 * DMA window can be larger than available memory, which will
 	 * cause errors later.
 	 */
-	const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER - 1);
+	maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER - 1);
 
 	/*
 	 * We create the default window as big as we can. The constraint is
@@ -2434,11 +2431,11 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	 * to support crippled devices (i.e. not fully 64bit DMAble) only.
 	 */
 	/* iommu_table::it_map uses 1 bit per IOMMU page, hence 8 */
-	const u64 window_size = min((maxblock * 8) << PAGE_SHIFT, max_memory);
+	window_size = min((maxblock * 8) << tceshift, max_memory);
 	/* Each TCE level cannot exceed maxblock so go multilevel if needed */
-	unsigned long tces_order = ilog2(window_size >> PAGE_SHIFT);
-	unsigned long tcelevel_order = ilog2(maxblock >> 3);
-	unsigned int levels = tces_order / tcelevel_order;
+	tces_order = ilog2(window_size >> tceshift);
+	tcelevel_order = ilog2(maxblock >> 3);
+	levels = tces_order / tcelevel_order;
 
 	if (tces_order % tcelevel_order)
 		levels += 1;
@@ -2448,8 +2445,8 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	 */
 	levels = max_t(unsigned int, levels, POWERNV_IOMMU_DEFAULT_LEVELS);
 
-	rc = pnv_pci_ioda2_create_table(&pe->table_group, 0, PAGE_SHIFT,
-			window_size, levels, false, &tbl);
+	rc = pnv_pci_ioda2_create_table(pe->phb->hose->node,
+			0, 0, tceshift, window_size, levels, false, &tbl);
 	if (rc) {
 		pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
 				rc);
@@ -2551,8 +2548,13 @@ static long pnv_pci_ioda2_create_table_userspace(
 		int num, __u32 page_shift, __u64 window_size, __u32 levels,
 		struct iommu_table **ptbl)
 {
-	long ret = pnv_pci_ioda2_create_table(table_group,
-			num, page_shift, window_size, levels, true, ptbl);
+	struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+			table_group);
+	__u64 bus_offset = num ?
+		pe->table_group.tce64_start : table_group->tce32_start;
+	long ret = pnv_pci_ioda2_create_table(pe->phb->hose->node,
+			num, bus_offset, page_shift, window_size, levels, true,
+			ptbl);
 
 	if (!ret)
 		(*ptbl)->it_allocated_size = pnv_pci_ioda2_get_table_size(
-- 
2.17.1



* [PATCH kernel v2 3/7] powerpc/powernv/ioda: Allow smaller TCE table levels
From: Alexey Kardashevskiy @ 2020-03-23  7:53 UTC
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Alistair Popple, Fabiano Rosas,
	Alexey Kardashevskiy

Currently the minimum allocation size for a TCE table level is PAGE_SIZE
(64K) as this is the minimum for alloc_pages(). The limit was set for
POWER8 where we did not have sparse RAM so we did not need sparse TCE
tables. On POWER9 we have gaps in the physical address space for which
using multilevel TCE tables makes sense. The problem is that 64K per
level is too much for 2 levels and 1GB pages as it exceeds the hardware
limit of 55 bits, so we need smaller levels.

This drops the minimum level size to 4K.
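
To spell that constraint out against the check in the diff below
((level_shift - 3) * levels + page_shift must stay under 55): with 64K
levels (level_shift = 16), 2 levels and 1GB pages (page_shift = 30) we
get (16 - 3) * 2 + 30 = 56, which is rejected, while 4K levels
(level_shift = 12) give (12 - 3) * 2 + 30 = 48, which fits.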

For a machine with 2 CPUs, a top RAM address of 0x4000.0000.0000
(each node gets 32TiB) and 1GiB IOMMU pages:
Before the patch: 512KiB, i.e. 8 pages.
After the patch: 3 pages: one level1 table plus 2 level2 tables, each of
which can map up to 64K>>3<<30 = 8TiB of physical space.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index 5dc6847d5f4c..82e680da9d94 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -37,7 +37,7 @@ static __be64 *pnv_alloc_tce_level(int nid, unsigned int shift)
 	__be64 *addr;
 
 	tce_mem = alloc_pages_node(nid, GFP_ATOMIC | __GFP_NOWARN,
-			shift - PAGE_SHIFT);
+			shift > PAGE_SHIFT ? shift - PAGE_SHIFT : 0);
 	if (!tce_mem) {
 		pr_err("Failed to allocate a TCE memory, level shift=%d\n",
 				shift);
@@ -282,7 +282,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
 	/* Adjust direct table size from window_size and levels */
 	entries_shift = (entries_shift + levels - 1) / levels;
 	level_shift = entries_shift + 3;
-	level_shift = max_t(unsigned int, level_shift, PAGE_SHIFT);
+	level_shift = max_t(unsigned int, level_shift, 12); /* 4K is minimum */
 
 	if ((level_shift - 3) * levels + page_shift >= 55)
 		return -EINVAL;
-- 
2.17.1



* [PATCH kernel v2 4/7] powerpc/powernv/phb4: Use IOMMU instead of bypassing
From: Alexey Kardashevskiy @ 2020-03-23  7:53 UTC
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Alistair Popple, Fabiano Rosas,
	Alexey Kardashevskiy

At the moment IODA2 systems do 64bit DMA by bypassing the IOMMU, which
maps PCI space to system space at a fixed offset (1<<59).
The bypass is controlled via the "iommu" kernel parameter.

This adds an "iommu_bypass" mode which maps PCI space to system space
using an actual TCE table with the biggest IOMMU page size available
(256MB or 1GB) and 2 levels, so in a typical case only about 4 to 6
system pages per PHB are allocated.
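
For example, booting with:

	iommu=iommu_bypass

selects the new mode, while the pre-existing

	iommu=nobypass

continues to disable the bypass window entirely.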

This creates a single TCE table per PHB which is shared among devices
under the same PHB.

With this enabled, all DMA goes via IOMMU. Tests on 100GBit ethernet
did not show any regression.

A later patch in this series allows using a special PHB4 hack which
moves the 64bit DMA window from 1<<59 to 4GB to improve support for
devices with limited DMA masks.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci.h      |   1 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 128 ++++++++++++++++++----
 2 files changed, 107 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index a808dd396522..ce00278185b0 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -100,6 +100,7 @@ struct pnv_phb {
 	int			has_dbgfs;
 	struct dentry		*dbgfs;
 #endif
+	struct iommu_table	*bypass_tbl; /* PNV_IOMMU_TCE_BYPASS only */
 
 	unsigned int		msi_base;
 	unsigned int		msi32_support;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f5f1b4e25530..9928a1618a8b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -51,6 +51,10 @@ static const char * const pnv_phb_names[] = { "IODA1", "IODA2", "NPU_NVLINK",
 					      "NPU_OCAPI" };
 
 static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
+static unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb);
+static long pnv_pci_ioda2_create_table(int nid, int num, __u64 bus_offset,
+		__u32 page_shift, __u64 window_size, __u32 levels,
+		bool alloc_userspace_copy, struct iommu_table **ptbl);
 
 void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 			    const char *fmt, ...)
@@ -83,7 +87,14 @@ void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 	va_end(args);
 }
 
-static bool pnv_iommu_bypass_disabled __read_mostly;
+enum pnv_iommu_bypass_mode {
+	PNV_IOMMU_NO_TRANSLATE,
+	PNV_IOMMU_BYPASS_DISABLED,
+	PNV_IOMMU_TCE_BYPASS
+};
+
+static enum pnv_iommu_bypass_mode pnv_iommu_bypass_mode __read_mostly =
+		PNV_IOMMU_NO_TRANSLATE;
 static bool pci_reset_phbs __read_mostly;
 
 static int __init iommu_setup(char *str)
@@ -93,9 +104,13 @@ static int __init iommu_setup(char *str)
 
 	while (*str) {
 		if (!strncmp(str, "nobypass", 8)) {
-			pnv_iommu_bypass_disabled = true;
+			pnv_iommu_bypass_mode = PNV_IOMMU_BYPASS_DISABLED;
 			pr_info("PowerNV: IOMMU bypass window disabled.\n");
 			break;
+		} else if (!strncmp(str, "iommu_bypass", 12)) {
+			pnv_iommu_bypass_mode = PNV_IOMMU_TCE_BYPASS;
+			pr_info("PowerNV: IOMMU TCE bypass window selected.\n");
+			break;
 		}
 		str += strcspn(str, ",");
 		if (*str == ',')
@@ -2351,28 +2366,99 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
 	return 0;
 }
 
+static long pnv_pci_ioda2_set_bypass_iommu(struct pnv_ioda_pe *pe,
+		unsigned long bus_offset)
+{
+	struct pnv_phb *phb = pe->phb;
+	long rc;
+	struct memblock_region *r;
+	unsigned long pgsizes;
+
+
+	pgsizes = pnv_ioda_parse_tce_sizes(phb);
+	if (!pgsizes)
+		return -1;
+
+	if (!phb->bypass_tbl) {
+		struct iommu_table *tbl = NULL;
+
+		rc = pnv_pci_ioda2_create_table(phb->hose->node,
+				1 /* window number */,
+				bus_offset,
+				__fls(pgsizes),
+				roundup_pow_of_two(memory_hotplug_max()),
+				2 /* levels */,
+				false /* userspace cache */,
+				&tbl);
+		if (rc)
+			return -1;
+
+		for_each_memblock(memory, r)
+			pnv_ioda2_tce_build(tbl,
+					    (r->base >> tbl->it_page_shift) +
+					    tbl->it_offset,
+					    r->size >> tbl->it_page_shift,
+					    (unsigned long) __va(r->base),
+					    DMA_BIDIRECTIONAL,
+					    0);
+		phb->bypass_tbl = tbl;
+		pe_info(pe, "Created 64-bit bypass TCE table\n");
+	} else {
+		iommu_tce_table_get(phb->bypass_tbl);
+	}
+
+	rc = pnv_pci_ioda2_set_window(&pe->table_group, 1, phb->bypass_tbl);
+	if (rc) {
+		iommu_tce_table_put(phb->bypass_tbl);
+		return -1;
+	}
+
+	pe->tce_bypass_enabled = true;
+
+	return 0;
+}
+
 static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
+	struct pnv_phb *phb = pe->phb;
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
+	phys_addr_t top;
 
-	pe_info(pe, "%sabling 64-bit DMA bypass\n", enable ? "En" : "Dis");
-	if (enable) {
-		phys_addr_t top = memblock_end_of_DRAM();
+	if (!enable) {
+		pe_info(pe, "Disabling 64-bit bypass\n");
+		rc = opal_pci_map_pe_dma_window_real(phb->opal_id,
+				pe->pe_number, window_id, 0, 0);
+		if (rc)
+			pe_err(pe, "OPAL error %lld configuring bypass window\n",
+				rc);
 
-		top = roundup_pow_of_two(top);
-		rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
-				pe->pe_number, window_id,
-				pe->table_group.tce64_start, top);
-	} else {
-		rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
-				pe->pe_number, window_id,
-				pe->table_group.tce64_start, 0);
+		pe->tce_bypass_enabled = false;
+		return;
+	}
+
+	if (pnv_iommu_bypass_mode == PNV_IOMMU_TCE_BYPASS) {
+		if (!pnv_pci_ioda2_set_bypass_iommu(pe,
+				pe->table_group.tce64_start)) {
+			pe->tce_bypass_enabled = true;
+			pe_info(pe, "Enabled 64-bit IOMMU bypass at %llx\n",
+				pe->table_group.tce64_start);
+			return;
+		}
+		/* IOMMU bypass failed, fallback to direct bypass */
+		pnv_iommu_bypass_mode = PNV_IOMMU_NO_TRANSLATE;
+	}
+
+	if (pnv_iommu_bypass_mode == PNV_IOMMU_NO_TRANSLATE) {
+		top = roundup_pow_of_two(memblock_end_of_DRAM());
+		if (!opal_pci_map_pe_dma_window_real(phb->opal_id,
+					pe->pe_number, window_id,
+					pe->table_group.tce64_start, top)) {
+			pe->tce_bypass_enabled = true;
+			pe_info(pe, "Enabled 64-bit direct bypass at %llx\n",
+					pe->table_group.tce64_start);
+		}
 	}
-	if (rc)
-		pe_err(pe, "OPAL error %lld configuring bypass window\n", rc);
-	else
-		pe->tce_bypass_enabled = enable;
 }
 
 static long pnv_pci_ioda2_create_table(int nid, int num, __u64 bus_offset,
@@ -2409,6 +2495,9 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	const unsigned int tceshift = PAGE_SHIFT;
 	unsigned long res_start, res_end, tces_order, tcelevel_order, levels;
 
+	if (pnv_iommu_bypass_mode != PNV_IOMMU_BYPASS_DISABLED)
+		pnv_pci_ioda2_set_bypass(pe, true);
+
 	/*
 	 * crashkernel= specifies the kdump kernel's maximum memory at
 	 * some offset and there is no guaranteed the result is a power
@@ -2470,9 +2559,6 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 		return rc;
 	}
 
-	if (!pnv_iommu_bypass_disabled)
-		pnv_pci_ioda2_set_bypass(pe, true);
-
 	/*
 	 * Set table base for the case of IOMMU DMA use. Usually this is done
 	 * from dma_dev_setup() which is not called when a device is returned
@@ -2624,8 +2710,6 @@ static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe,
 				bus);
 }
 
-static unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb);
-
 static void pnv_pci_ioda_setup_iommu_api(void)
 {
 	struct pci_controller *hose;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH kernel v2 5/7] powerpc/iommu: Add a window number to iommu_table_group_ops::get_table_size
@ 2020-03-23  7:53   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2020-03-23  7:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Alistair Popple, Fabiano Rosas,
	Alexey Kardashevskiy

We are about to add an additional offset within a TCE table which will
increase the size of the window; prepare for this by passing the window
number to iommu_table_group_ops::get_table_size().

This should cause no behavioral change.

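For illustration only - a minimal sketch (not part of this patch) of how
a get_table_size() implementation could use the new window number once
per-window offsets exist; padding window 1 by 4GB mirrors what a later
patch in this series does, the function name is hypothetical, and
'levels' is ignored in this single-level estimate:

static unsigned long example_get_table_size(int num, __u32 page_shift,
		__u64 window_size, __u32 levels)
{
	/* Window 1 may carry a fixed 4GB offset, budget TCEs for it */
	if (num == 1)
		window_size = roundup_pow_of_two(window_size + SZ_4G);

	/* Single-level estimate: one 8-byte TCE per IOMMU page */
	return (window_size >> page_shift) * sizeof(__be64);
}
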
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          | 1 +
 arch/powerpc/platforms/powernv/pci.h      | 2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++--
 drivers/vfio/vfio_iommu_spapr_tce.c       | 4 ++--
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 479439ef003e..acf64a73ead1 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -161,6 +161,7 @@ struct iommu_table_group;
 
 struct iommu_table_group_ops {
 	unsigned long (*get_table_size)(
+			int num,
 			__u32 page_shift,
 			__u64 window_size,
 			__u32 levels);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index ce00278185b0..d8fa2f65517e 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -192,7 +192,7 @@ extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type);
 extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
 extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
-extern unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+extern unsigned long pnv_pci_ioda2_get_table_size(int num, __u32 page_shift,
 		__u64 window_size, __u32 levels);
 extern int pnv_eeh_post_init(void);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9928a1618a8b..27a505a5edb4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2597,7 +2597,7 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
 #endif
 
 #ifdef CONFIG_IOMMU_API
-unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+unsigned long pnv_pci_ioda2_get_table_size(int num, __u32 page_shift,
 		__u64 window_size, __u32 levels)
 {
 	unsigned long bytes = 0;
@@ -2644,7 +2644,7 @@ static long pnv_pci_ioda2_create_table_userspace(
 
 	if (!ret)
 		(*ptbl)->it_allocated_size = pnv_pci_ioda2_get_table_size(
-				page_shift, window_size, levels);
+				num, page_shift, window_size, levels);
 	return ret;
 }
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 16b3adc508db..750a0676e9b7 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -613,8 +613,8 @@ static long tce_iommu_create_table(struct tce_container *container,
 {
 	long ret, table_size;
 
-	table_size = table_group->ops->get_table_size(page_shift, window_size,
-			levels);
+	table_size = table_group->ops->get_table_size(num, page_shift,
+			window_size, levels);
 	if (!table_size)
 		return -EINVAL;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH kernel v2 6/7] powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
@ 2020-03-23  7:53   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2020-03-23  7:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Alistair Popple, Fabiano Rosas,
	Alexey Kardashevskiy

IODA2 systems (POWER8/9) allow DMA windows at two fixed locations - 0 and
0x800.0000.0000.0000 (1<<59), stored in the TVT as TVE0/TVE1. PHB4 on
POWER9 has a "TVT Select 'GTE4GB' Option" which allows mapping both
windows at 0 and selecting one based on the IOBA: accesses below 4GB go
via TVE0 and accesses above 4GB via TVE1. Note that TVE1's window still
has to allocate TCEs for the range below 4GB.

This changes the iommu=iommu_bypass mode to move the second window to
4GB if possible. When TVE1_4GB is enabled, this creates a small
(typically 2GB) 32 bit window as there is no need to cover as much of
the lower DMA space - the 4GB+ window does it better anyway.

As the physical TCE table behind TVE1 maps PCI space from 0 and we want
it to look like a 1:1 mapping with a fixed 4GB offset, this adds an
iommu_table::it_tceoff field which is the number of reserved TCEs
covering the first 4GB of DMA space.

This keeps the existing behavior by default: the TVE1_4GB flag is set
per PHB while device assignment is done on a per-PE basis, and managing
both modes dynamically might get nasty.

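To make the it_tceoff arithmetic concrete, here is an illustrative
sketch (not in the patch) of how a bus address resolves through a
TVE1_4GB table: the window is advertised at bus address 4GB but its TCE
table maps PCI space from 0 with the first it_tceoff entries reserved.
Assuming 1GB IOMMU pages:

static unsigned long example_backing_addr(unsigned long bus_addr)
{
	const unsigned int page_shift = 30;	/* 1GB IOMMU pages */
	const unsigned long it_tceoff = SZ_4G >> page_shift; /* 4 TCEs */
	unsigned long tce_index = bus_addr >> page_shift;

	/*
	 * 1:1 mapping with a fixed 4GB offset: bus 0x1.0000.0000 hits
	 * TCE #4, the first entry backed by system memory (address 0).
	 */
	return (tce_index - it_tceoff) << page_shift;
}
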
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h           |  1 +
 arch/powerpc/include/asm/opal-api.h        |  9 ++-
 arch/powerpc/include/asm/opal.h            |  2 +
 arch/powerpc/platforms/powernv/opal-call.c |  2 +
 arch/powerpc/platforms/powernv/pci-ioda.c  | 66 +++++++++++++++++++---
 5 files changed, 70 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index acf64a73ead1..b9c4af9f129c 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -97,6 +97,7 @@ struct iommu_table {
 	unsigned long  it_level_size;
 	unsigned long  it_allocated_size;
 	unsigned long  it_offset;    /* Offset into global table */
+	unsigned long  it_tceoff;
 	unsigned long  it_base;      /* mapped address of tce table */
 	unsigned long  it_index;     /* which iommu table this is */
 	unsigned long  it_type;      /* type: PCI or Virtual Bus */
diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index c1f25a760eb1..7873754f5ea6 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -214,7 +214,9 @@
 #define OPAL_SECVAR_GET				176
 #define OPAL_SECVAR_GET_NEXT			177
 #define OPAL_SECVAR_ENQUEUE_UPDATE		178
-#define OPAL_LAST				178
+#define OPAL_PHB_SET_OPTION			179
+#define OPAL_PHB_GET_OPTION			180
+#define OPAL_LAST				180
 
 #define QUIESCE_HOLD			1 /* Spin all calls at entry */
 #define QUIESCE_REJECT			2 /* Fail all calls with OPAL_BUSY */
@@ -437,6 +439,11 @@ enum OpalSlotLedState {
 	OPAL_SLOT_LED_STATE_ON = 1	/* LED is ON */
 };
 
+enum OpalPhbOption {
+	OPAL_PHB_OPTION_TVE1_4GB = 0x1,
+	OPAL_PHB_OPTION_MMIO_EEH_DISABLE = 0x2,
+};
+
 /*
  * Address cycle types for LPC accesses. These also correspond
  * to the content of the first cell of the "reg" property for
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 9986ac34b8e2..89b712288cdd 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -142,6 +142,8 @@ int64_t opal_pci_map_pe_dma_window(uint64_t phb_id, uint16_t pe_number, uint16_t
 int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id, uint16_t pe_number,
 					uint16_t dma_window_number, uint64_t pci_start_addr,
 					uint64_t pci_mem_size);
+int64_t opal_phb_set_option(uint64_t phb_id, uint64_t opt, uint64_t setting);
+int64_t opal_phb_get_option(uint64_t phb_id, uint64_t opt, uint64_t *setting);
 int64_t opal_pci_reset(uint64_t id, uint8_t reset_scope, uint8_t assert_state);
 
 int64_t opal_pci_get_hub_diag_data(uint64_t hub_id, void *diag_buffer,
diff --git a/arch/powerpc/platforms/powernv/opal-call.c b/arch/powerpc/platforms/powernv/opal-call.c
index 5cd0f52d258f..3130d5a41570 100644
--- a/arch/powerpc/platforms/powernv/opal-call.c
+++ b/arch/powerpc/platforms/powernv/opal-call.c
@@ -293,3 +293,5 @@ OPAL_CALL(opal_mpipl_query_tag,			OPAL_MPIPL_QUERY_TAG);
 OPAL_CALL(opal_secvar_get,			OPAL_SECVAR_GET);
 OPAL_CALL(opal_secvar_get_next,			OPAL_SECVAR_GET_NEXT);
 OPAL_CALL(opal_secvar_enqueue_update,		OPAL_SECVAR_ENQUEUE_UPDATE);
+OPAL_CALL(opal_phb_set_option,			OPAL_PHB_SET_OPTION);
+OPAL_CALL(opal_phb_get_option,			OPAL_PHB_GET_OPTION);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 27a505a5edb4..cba2cb2e1119 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2367,7 +2367,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
 }
 
 static long pnv_pci_ioda2_set_bypass_iommu(struct pnv_ioda_pe *pe,
-		unsigned long bus_offset)
+		unsigned long bus_offset, unsigned long tbl_offset)
 {
 	struct pnv_phb *phb = pe->phb;
 	long rc;
@@ -2376,6 +2376,8 @@ static long pnv_pci_ioda2_set_bypass_iommu(struct pnv_ioda_pe *pe,
 
 
 	pgsizes = pnv_ioda_parse_tce_sizes(phb);
+	/* Filter sizes to have round number of TCEs to cover 0..tbl_offset */
+	pgsizes &= tbl_offset | (tbl_offset - 1);
 	if (!pgsizes)
 		return -1;
 
@@ -2386,17 +2388,19 @@ static long pnv_pci_ioda2_set_bypass_iommu(struct pnv_ioda_pe *pe,
 				1 /* window number */,
 				bus_offset,
 				__fls(pgsizes),
-				roundup_pow_of_two(memory_hotplug_max()),
+				roundup_pow_of_two(memory_hotplug_max() +
+						   tbl_offset),
 				2 /* levels */,
 				false /* userspace cache */,
 				&tbl);
 		if (rc)
 			return -1;
 
+		tbl->it_tceoff = tbl_offset >> tbl->it_page_shift;
 		for_each_memblock(memory, r)
 			pnv_ioda2_tce_build(tbl,
 					    (r->base >> tbl->it_page_shift) +
-					    tbl->it_offset,
+					    tbl->it_offset + tbl->it_tceoff,
 					    r->size >> tbl->it_page_shift,
 					    (unsigned long) __va(r->base),
 					    DMA_BIDIRECTIONAL,
@@ -2438,8 +2442,22 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 	}
 
 	if (pnv_iommu_bypass_mode == PNV_IOMMU_TCE_BYPASS) {
+		if (!opal_phb_set_option(phb->opal_id,
+					OPAL_PHB_OPTION_TVE1_4GB, 1)) {
+			pe->table_group.tce64_start = SZ_4G;
+			if (!pnv_pci_ioda2_set_bypass_iommu(pe,
+					pe->table_group.tce64_start, SZ_4G)) {
+				pe->tce_bypass_enabled = true;
+				pe_info(pe, "Enabled 64-bit IOMMU bypass at %llx\n",
+						pe->table_group.tce64_start);
+				return;
+			}
+			pe_err(pe, "Enabled TVE1_4GB but failed to configure TCE table\n");
+			opal_phb_set_option(phb->opal_id,
+					    OPAL_PHB_OPTION_TVE1_4GB, 0);
+		}
 		if (!pnv_pci_ioda2_set_bypass_iommu(pe,
-				pe->table_group.tce64_start)) {
+				pe->table_group.tce64_start, 0)) {
 			pe->tce_bypass_enabled = true;
 			pe_info(pe, "Enabled 64-bit IOMMU bypass at %llx\n",
 				pe->table_group.tce64_start);
@@ -2450,6 +2468,10 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 	}
 
 	if (pnv_iommu_bypass_mode == PNV_IOMMU_NO_TRANSLATE) {
+		/*
+		 * FIXME: if we enable dynamic switching, we need to disable
+		 * OPAL_PHB_OPTION_TVE1_4GB here
+		 */
 		top = roundup_pow_of_two(memblock_end_of_DRAM());
 		if (!opal_pci_map_pe_dma_window_real(phb->opal_id,
 					pe->pe_number, window_id,
@@ -2521,6 +2543,15 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	 */
 	/* iommu_table::it_map uses 1 bit per IOMMU page, hence 8 */
 	window_size = min((maxblock * 8) << tceshift, max_memory);
+
+	/*
+	 * If we get TVE#1_4GB on, there is no point in having a huge default
+	 * DMA window.
+	 */
+	if (pnv_iommu_bypass_mode == PNV_IOMMU_TCE_BYPASS)
+		window_size = min_t(u64, pe->table_group.tce32_size,
+				window_size);
+
 	/* Each TCE level cannot exceed maxblock so go multilevel if needed */
 	tces_order = ilog2(window_size >> tceshift);
 	tcelevel_order = ilog2(maxblock >> 3);
@@ -2611,6 +2642,9 @@ unsigned long pnv_pci_ioda2_get_table_size(int num, __u32 page_shift,
 			!is_power_of_2(window_size))
 		return 0;
 
+	if (pnv_iommu_bypass_mode == PNV_IOMMU_TCE_BYPASS && num == 1)
+		window_size = roundup_pow_of_two(window_size + SZ_4G);
+
 	/* Calculate a direct table size from window_size and levels */
 	entries_shift = (entries_shift + levels - 1) / levels;
 	table_shift = entries_shift + 3;
@@ -2636,15 +2670,29 @@ static long pnv_pci_ioda2_create_table_userspace(
 {
 	struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
 			table_group);
-	__u64 bus_offset = num ?
-		pe->table_group.tce64_start : table_group->tce32_start;
-	long ret = pnv_pci_ioda2_create_table(pe->phb->hose->node,
-			num, bus_offset, page_shift, window_size, levels, true,
+	__u64 bus_offset, tce_offset = 0, win_size = window_size;
+	long ret;
+
+	if (num == 0) {
+		bus_offset = table_group->tce32_start;
+	} else if (table_group->tce64_start == SZ_4G) {
+		bus_offset = table_group->tce32_start;
+		tce_offset = SZ_4G;
+		win_size = roundup_pow_of_two(window_size + tce_offset);
+	} else {
+		bus_offset = table_group->tce64_start;
+	}
+
+	ret = pnv_pci_ioda2_create_table(pe->phb->hose->node,
+			num, bus_offset, page_shift, win_size, levels, true,
 			ptbl);
 
-	if (!ret)
+	if (!ret) {
 		(*ptbl)->it_allocated_size = pnv_pci_ioda2_get_table_size(
 				num, page_shift, window_size, levels);
+		(*ptbl)->it_tceoff = tce_offset >> page_shift;
+	}
+
 	return ret;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH kernel v2 7/7] vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB
@ 2020-03-23  7:53   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2020-03-23  7:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Alistair Popple, Fabiano Rosas,
	Alexey Kardashevskiy

So far the only option for a big 64bit DMA window was a window located
at 0x800.0000.0000.0000 (1<<59), which creates problems for devices
supporting smaller DMA masks.

This exploits a POWER9 PHB option to allow the second DMA window to be
mapped at 0 and advertises it with a 4GB offset to avoid overlap with
the default 32bit window.

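As a usage illustration - a hedged userspace sketch of how a VFIO client
could consume the new field (container setup is assumed to have happened
already; the function name is made up):

#include <sys/ioctl.h>
#include <linux/vfio.h>

static __u64 example_query_dma64_start(int container_fd)
{
	struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };

	if (ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info) < 0)
		return 0;

	/* Older kernels do not set the flag, assume the legacy offset */
	if (!(info.flags & VFIO_IOMMU_SPAPR_INFO_DDW_START))
		return 1ULL << 59;

	return info.dma64_window_start;	/* 4GB when TVE1_4GB is active */
}
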
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/uapi/linux/vfio.h           |  2 ++
 drivers/vfio/vfio_iommu_spapr_tce.c | 13 +++++++------
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..c7f89d47335a 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -831,9 +831,11 @@ struct vfio_iommu_spapr_tce_info {
 	__u32 argsz;
 	__u32 flags;
 #define VFIO_IOMMU_SPAPR_INFO_DDW	(1 << 0)	/* DDW supported */
+#define VFIO_IOMMU_SPAPR_INFO_DDW_START	(1 << 1)	/* DDW offset */
 	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
 	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
 	struct vfio_iommu_spapr_tce_ddw_info ddw;
+	__u64 dma64_window_start;
 };
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 750a0676e9b7..315fd56e51a7 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -691,7 +691,7 @@ static long tce_iommu_create_window(struct tce_container *container,
 	container->tables[num] = tbl;
 
 	/* Return start address assigned by platform in create_table() */
-	*start_addr = tbl->it_offset << tbl->it_page_shift;
+	*start_addr = (tbl->it_offset + tbl->it_tceoff) << tbl->it_page_shift;
 
 	return 0;
 
@@ -777,7 +777,7 @@ static long tce_iommu_ioctl(void *iommu_data,
 				 unsigned int cmd, unsigned long arg)
 {
 	struct tce_container *container = iommu_data;
-	unsigned long minsz, ddwsz;
+	unsigned long minsz;
 	long ret;
 
 	switch (cmd) {
@@ -842,12 +842,13 @@ static long tce_iommu_ioctl(void *iommu_data,
 			info.ddw.levels = table_group->max_levels;
 		}
 
-		ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info, ddw);
+		info.flags |= VFIO_IOMMU_SPAPR_INFO_DDW_START;
+		info.dma64_window_start = table_group->tce64_start;
 
-		if (info.argsz >= ddwsz)
-			minsz = ddwsz;
+		if (info.argsz > sizeof(info))
+			info.argsz = sizeof(info);
 
-		if (copy_to_user((void __user *)arg, &info, minsz))
+		if (copy_to_user((void __user *)arg, &info, info.argsz))
 			return -EFAULT;
 
 		return 0;
-- 
2.17.1
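For completeness, here is a minimal userspace sketch of how the new field
would be consumed, built against the patched uapi header; "container" is
assumed to be an open /dev/vfio/vfio fd with the sPAPR TCE IOMMU already
set, and error handling is elided.

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void print_ddw_start(int container)
{
	struct vfio_iommu_spapr_tce_info info;

	memset(&info, 0, sizeof(info));
	info.argsz = sizeof(info);	/* ask for the extended struct */

	if (ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info))
		return;

	if (info.flags & VFIO_IOMMU_SPAPR_INFO_DDW_START)
		printf("64bit DMA window starts at 0x%llx\n",
		       (unsigned long long)info.dma64_window_start);
	else
		printf("kernel does not advertise the DDW start\n");
}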


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-03-23  7:53 ` Alexey Kardashevskiy
@ 2020-04-08  9:43   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2020-04-08  9:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: David Gibson, kvm-ppc, kvm, Alistair Popple, Fabiano Rosas



On 23/03/2020 18:53, Alexey Kardashevskiy wrote:
> Here is an attempt to support bigger DMA space for devices
> supporting DMA masks less than 59 bits (GPUs come into mind
> first). POWER9 PHBs have an option to map 2 windows at 0
> and select a windows based on DMA address being below or above
> 4GB.
> 
> This adds the "iommu=iommu_bypass" kernel parameter and
> supports VFIO+pseries machine - current this requires telling
> upstream+unmodified QEMU about this via
> -global spapr-pci-host-bridge.dma64_win_addr=0x100000000
> or per-phb property. 4/4 advertises the new option but
> there is no automation around it in QEMU (should it be?).
> 
> For now it is either 1<<59 or 4GB mode; dynamic switching is
> not supported (could be via sysfs).
> 
> This is a rebased version of
> https://lore.kernel.org/kvm/20191202015953.127902-1-aik@ozlabs.ru/
> 
> The main change since v1 is that now it is 7 patches with
> clearer separation of steps.
> 
> 
> This is based on 6c90b86a745a "Merge tag 'mmc-v5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc"
> 
> Please comment. Thanks.

Ping?


> 
> 
> 
> Alexey Kardashevskiy (7):
>   powerpc/powernv/ioda: Move TCE bypass base to PE
>   powerpc/powernv/ioda: Rework for huge DMA window at 4GB
>   powerpc/powernv/ioda: Allow smaller TCE table levels
>   powerpc/powernv/phb4: Use IOMMU instead of bypassing
>   powerpc/iommu: Add a window number to
>     iommu_table_group_ops::get_table_size
>   powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
>   vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB
> 
>  arch/powerpc/include/asm/iommu.h              |   3 +
>  arch/powerpc/include/asm/opal-api.h           |   9 +-
>  arch/powerpc/include/asm/opal.h               |   2 +
>  arch/powerpc/platforms/powernv/pci.h          |   4 +-
>  include/uapi/linux/vfio.h                     |   2 +
>  arch/powerpc/platforms/powernv/npu-dma.c      |   1 +
>  arch/powerpc/platforms/powernv/opal-call.c    |   2 +
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c |   4 +-
>  arch/powerpc/platforms/powernv/pci-ioda.c     | 234 ++++++++++++++----
>  drivers/vfio/vfio_iommu_spapr_tce.c           |  17 +-
>  10 files changed, 213 insertions(+), 65 deletions(-)
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-04-08  9:43   ` Alexey Kardashevskiy
@ 2020-04-16  1:27     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2020-04-16  1:27 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: David Gibson, kvm-ppc, kvm, Alistair Popple, Fabiano Rosas,
	Oliver, Michael Ellerman

Anyone? Is it totally useless or wrong approach? Thanks,


On 08/04/2020 19:43, Alexey Kardashevskiy wrote:
> 
> 
> On 23/03/2020 18:53, Alexey Kardashevskiy wrote:
>> Here is an attempt to support bigger DMA space for devices
>> supporting DMA masks less than 59 bits (GPUs come into mind
>> first). POWER9 PHBs have an option to map 2 windows at 0
>> and select a windows based on DMA address being below or above
>> 4GB.
>>
>> This adds the "iommu=iommu_bypass" kernel parameter and
>> supports VFIO+pseries machine - current this requires telling
>> upstream+unmodified QEMU about this via
>> -global spapr-pci-host-bridge.dma64_win_addr=0x100000000
>> or per-phb property. 4/4 advertises the new option but
>> there is no automation around it in QEMU (should it be?).
>>
>> For now it is either 1<<59 or 4GB mode; dynamic switching is
>> not supported (could be via sysfs).
>>
>> This is a rebased version of
>> https://lore.kernel.org/kvm/20191202015953.127902-1-aik@ozlabs.ru/
>>
>> The main change since v1 is that now it is 7 patches with
>> clearer separation of steps.
>>
>>
>> This is based on 6c90b86a745a "Merge tag 'mmc-v5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc"
>>
>> Please comment. Thanks.
> 
> Ping?
> 
> 
>>
>>
>>
>> Alexey Kardashevskiy (7):
>>   powerpc/powernv/ioda: Move TCE bypass base to PE
>>   powerpc/powernv/ioda: Rework for huge DMA window at 4GB
>>   powerpc/powernv/ioda: Allow smaller TCE table levels
>>   powerpc/powernv/phb4: Use IOMMU instead of bypassing
>>   powerpc/iommu: Add a window number to
>>     iommu_table_group_ops::get_table_size
>>   powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
>>   vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB
>>
>>  arch/powerpc/include/asm/iommu.h              |   3 +
>>  arch/powerpc/include/asm/opal-api.h           |   9 +-
>>  arch/powerpc/include/asm/opal.h               |   2 +
>>  arch/powerpc/platforms/powernv/pci.h          |   4 +-
>>  include/uapi/linux/vfio.h                     |   2 +
>>  arch/powerpc/platforms/powernv/npu-dma.c      |   1 +
>>  arch/powerpc/platforms/powernv/opal-call.c    |   2 +
>>  arch/powerpc/platforms/powernv/pci-ioda-tce.c |   4 +-
>>  arch/powerpc/platforms/powernv/pci-ioda.c     | 234 ++++++++++++++----
>>  drivers/vfio/vfio_iommu_spapr_tce.c           |  17 +-
>>  10 files changed, 213 insertions(+), 65 deletions(-)
>>
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-04-16  1:27     ` Alexey Kardashevskiy
@ 2020-04-16  2:34       ` Oliver O'Halloran
  0 siblings, 0 replies; 57+ messages in thread
From: Oliver O'Halloran @ 2020-04-16  2:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, KVM list, Alistair Popple,
	Fabiano Rosas, Michael Ellerman

On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> Anyone? Is it totally useless or wrong approach? Thanks,

I wouldn't say it's either, but I still hate it.

The 4GB mode being per-PHB makes it difficult to use unless we force
that mode on 100% of the time which I'd prefer not to do. Ideally
devices that actually support 64bit addressing (which is most of them)
should be able to use no-translate mode when possible since a) It's
faster, and b) It frees up room in the TCE cache devices that actually
need them. I know you've done some testing with 100G NICs and found
the overhead was fine, but IMO that's a bad test since it's pretty
much the best-case scenario since all the devices on the PHB are in
the same PE. The PHB's TCE cache only hits when the TCE matches the
DMA bus address and the PE number for the device so in a multi-PE
environment there's a lot of potential for TCE cache thrashing. If
there was one or two PEs under that PHB it's probably not going to
matter, but if you have an NVMe rack with 20 drives it starts to look
a bit ugly.

That all said, it might be worth doing this anyway since we probably
want the software infrastructure in place to take advantage of it.
Maybe expand the command line parameters to allow it to be enabled on
a per-PHB basis rather than globally.

Oliver
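The cache behaviour described above can be illustrated with a toy model:
a TCE cache entry only hits when both the bus address and the PE number
match, so many PEs DMAing to the same bus addresses evict each other's
entries. Everything below (slot count, indexing scheme) is invented
purely for illustration.

#include <stdio.h>
#include <stdint.h>

#define CACHE_SLOTS 8	/* made-up size */

struct tce_entry {
	uint64_t addr;
	int pe;
	int valid;
};

static struct tce_entry cache[CACHE_SLOTS];

static int tce_lookup(uint64_t addr, int pe)
{
	unsigned int slot = (unsigned int)(addr >> 16) % CACHE_SLOTS;
	struct tce_entry *e = &cache[slot];

	if (e->valid && e->addr == addr && e->pe == pe)
		return 1;			/* hit */

	e->addr = addr;				/* miss: refill the slot */
	e->pe = pe;
	e->valid = 1;
	return 0;
}

int main(void)
{
	int pe, hits = 0;

	/* 20 PEs (the NVMe rack case) hammering the same bus address:
	 * the address matches but the PE number never does */
	for (pe = 0; pe < 20; pe++)
		hits += tce_lookup(0x100000000ULL, pe);

	printf("hits: %d of 20\n", hits);
	return 0;
}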

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-04-16  2:34       ` Oliver O'Halloran
@ 2020-04-16  2:53         ` Oliver O'Halloran
  0 siblings, 0 replies; 57+ messages in thread
From: Oliver O'Halloran @ 2020-04-16  2:53 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, KVM list, Alistair Popple,
	Fabiano Rosas, Michael Ellerman, Russell Currey

On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall@gmail.com> wrote:
>
> On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >
> > Anyone? Is it totally useless or wrong approach? Thanks,
>
> I wouldn't say it's either, but I still hate it.
>
> The 4GB mode being per-PHB makes it difficult to use unless we force
> that mode on 100% of the time which I'd prefer not to do. Ideally
> devices that actually support 64bit addressing (which is most of them)
> should be able to use no-translate mode when possible since a) It's
> faster, and b) It frees up room in the TCE cache for devices that
> actually need them. I know you've done some testing with 100G NICs and found
> the overhead was fine, but IMO that's a bad test since it's pretty
> much the best-case scenario since all the devices on the PHB are in
> the same PE. The PHB's TCE cache only hits when the TCE matches the
> DMA bus address and the PE number for the device so in a multi-PE
> environment there's a lot of potential for TCE cache thrashing. If
> there was one or two PEs under that PHB it's probably not going to
> matter, but if you have an NVMe rack with 20 drives it starts to look
> a bit ugly.
>
> That all said, it might be worth doing this anyway since we probably
> want the software infrastructure in place to take advantage of it.
> Maybe expand the command line parameters to allow it to be enabled on
> a per-PHB basis rather than globally.

Since we're on the topic

I've been thinking the real issue we have is that we're trying to pick
an "optimal" IOMMU config at a point where we don't have enough
information to work out what's actually optimal. The IOMMU config is
done on a per-PE basis, but since PEs may contain devices with
different DMA masks (looking at you, weird AMD audio function) we're
always going to have to pick something conservative as the default
config for TVE#0 (64k, no bypass mapping) since the driver will tell
us what the device actually supports long after the IOMMU configuration
is done. What we really want is to be able to have separate IOMMU
contexts for each device, or at the very least a separate context for
the crippled devices.

We could allow a per-device IOMMU context by extending the Master /
Slave PE thing to cover DMA in addition to MMIO. Right now we only use
slave PEs when a device's MMIO BARs extend over multiple m64 segments.
When that happens an MMIO error causes the PHB to freeze the PE
corresponding to one of those segments, but not any of the others. To
present a single "PE" to the EEH core we check the freeze status of
each of the slave PEs when the EEH core does a PE status check and if
any of them are frozen, we freeze the rest of them too. When a driver
sets a limited DMA mask we could move that device to a separate slave
PE so that it has its own IOMMU context tailored to its DMA
addressing limits.

Thoughts?

Oliver
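A toy model of the idea, to make the decision rule concrete: devices
whose DMA mask covers the 1<<59 bypass window keep the shared master PE
in no-translate mode, while crippled devices get their own slave PE with
a tailored IOMMU context. Every name and PE number below is invented;
this is a sketch of the proposal, not kernel code.

#include <stdio.h>

struct toy_dev {
	const char *name;
	int dma_bits;
	int pe;
};

static int next_slave_pe = 100;	/* pretend PE allocator */

static void toy_set_dma_mask(struct toy_dev *dev, int master_pe)
{
	if (dev->dma_bits >= 59)
		dev->pe = master_pe;		/* shared PE, no-translate */
	else
		dev->pe = next_slave_pe++;	/* own context, e.g. a 4GB window */

	printf("%s (%d-bit mask) -> PE#%d\n", dev->name, dev->dma_bits, dev->pe);
}

int main(void)
{
	struct toy_dev nic   = { "100G NIC",       64, -1 };
	struct toy_dev audio = { "AMD audio func", 40, -1 };

	toy_set_dma_mask(&nic, 1);
	toy_set_dma_mask(&audio, 1);
	return 0;
}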

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-04-16  2:53         ` Oliver O'Halloran
@ 2020-04-17  1:26           ` Russell Currey
  0 siblings, 0 replies; 57+ messages in thread
From: Russell Currey @ 2020-04-17  1:26 UTC (permalink / raw)
  To: Oliver O'Halloran, Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, KVM list, Alistair Popple,
	Fabiano Rosas, Michael Ellerman

On Thu, 2020-04-16 at 12:53 +1000, Oliver O'Halloran wrote:
> On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall@gmail.com>
> wrote:
> > On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <
> > aik@ozlabs.ru> wrote:
> > > Anyone? Is it totally useless or wrong approach? Thanks,
> > 
> > I wouldn't say it's either, but I still hate it.
> > 
> > The 4GB mode being per-PHB makes it difficult to use unless we
> > force
> > that mode on 100% of the time which I'd prefer not to do. Ideally
> > devices that actually support 64bit addressing (which is most of
> > them)
> > should be able to use no-translate mode when possible since a) It's
> > faster, and b) It frees up room in the TCE cache for devices that
> > actually
> > need them. I know you've done some testing with 100G NICs and found
> > the overhead was fine, but IMO that's a bad test since it's pretty
> > much the best-case scenario since all the devices on the PHB are in
> > the same PE. The PHB's TCE cache only hits when the TCE matches the
> > DMA bus address and the PE number for the device so in a multi-PE
> > environment there's a lot of potential for TCE cache thrashing. If
> > there was one or two PEs under that PHB it's probably not going to
> > matter, but if you have an NVMe rack with 20 drives it starts to
> > look
> > a bit ugly.
> > 
> > That all said, it might be worth doing this anyway since we
> > probably
> > want the software infrastructure in place to take advantage of it.
> > Maybe expand the command line parameters to allow it to be enabled
> > on
> > a per-PHB basis rather than globally.
> 
> Since we're on the topic
> 
> I've been thinking the real issue we have is that we're trying to
> pick
> an "optimal" IOMMU config at a point where we don't have enough
> information to work out what's actually optimal. The IOMMU config is
> done on a per-PE basis, but since PEs may contain devices with
> different DMA masks (looking at you, weird AMD audio function) we're
> always going to have to pick something conservative as the default
> config for TVE#0 (64k, no bypass mapping) since the driver will tell
> us what the device actually supports long after the IOMMU
> configuration
> is done. What we really want is to be able to have separate IOMMU
> contexts for each device, or at the very least a separate context for
> the crippled devices.
> 
> We could allow a per-device IOMMU context by extending the Master /
> Slave PE thing to cover DMA in addition to MMIO. Right now we only
> use
> slave PEs when a device's MMIO BARs extend over multiple m64
> segments.
> When that happens an MMIO error causes the PHB to freeze the PE
> corresponding to one of those segments, but not any of the others. To
> present a single "PE" to the EEH core we check the freeze status of
> each of the slave PEs when the EEH core does a PE status check and if
> any of them are frozen, we freeze the rest of them too. When a driver
> sets a limited DMA mask we could move that device to a separate slave
> PE so that it has its own IOMMU context tailored to its DMA
> addressing limits.
> 
> Thoughts?

For what it's worth this sounds like a good idea to me, it just sounds
tricky to implement.  You're adding another layer of complexity on top
of EEH (well, making things look simple to the EEH core and doing your
own freezing on top of it) in addition to the DMA handling.

If it works then great, it just has a high potential to become a new bug
haven.

> 
> Oliver


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-04-17  1:26           ` Russell Currey
@ 2020-04-17  5:47             ` Alexey Kardashevskiy
  0 siblings, 0 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2020-04-17  5:47 UTC (permalink / raw)
  To: Russell Currey, Oliver O'Halloran
  Cc: linuxppc-dev, David Gibson, kvm-ppc, KVM list, Alistair Popple,
	Fabiano Rosas, Michael Ellerman



On 17/04/2020 11:26, Russell Currey wrote:
> On Thu, 2020-04-16 at 12:53 +1000, Oliver O'Halloran wrote:
>> On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall@gmail.com>
>> wrote:
>>> On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <
>>> aik@ozlabs.ru> wrote:
>>>> Anyone? Is it totally useless or wrong approach? Thanks,
>>>
>>> I wouldn't say it's either, but I still hate it.
>>>
>>> The 4GB mode being per-PHB makes it difficult to use unless we
>>> force
>>> that mode on 100% of the time which I'd prefer not to do. Ideally
>>> devices that actually support 64bit addressing (which is most of
>>> them)
>>> should be able to use no-translate mode when possible since a) It's
>>> faster, and b) It frees up room in the TCE cache for devices that
>>> actually
>>> need them. I know you've done some testing with 100G NICs and found
>>> the overhead was fine, but IMO that's a bad test since it's pretty
>>> much the best-case scenario since all the devices on the PHB are in
>>> the same PE. The PHB's TCE cache only hits when the TCE matches the
>>> DMA bus address and the PE number for the device so in a multi-PE
>>> environment there's a lot of potential for TCE cache thrashing. If
>>> there was one or two PEs under that PHB it's probably not going to
>>> matter, but if you have an NVMe rack with 20 drives it starts to
>>> look
>>> a bit ugly.
>>>
>>> That all said, it might be worth doing this anyway since we
>>> probably
>>> want the software infrastructure in place to take advantage of it.
>>> Maybe expand the command line parameters to allow it to be enabled
>>> on
>>> a per-PHB basis rather than globally.
>>
>> Since we're on the topic
>>
>> I've been thinking the real issue we have is that we're trying to
>> pick
>> an "optimal" IOMMU config at a point where we don't have enough
>> information to work out what's actually optimal. The IOMMU config is
>> done on a per-PE basis, but since PEs may contain devices with
>> different DMA masks (looking at you, weird AMD audio function) we're
>> always going to have to pick something conservative as the default
>> config for TVE#0 (64k, no bypass mapping) since the driver will tell
>> us what the device actually supports long after the IOMMU
>> configuration
>> is done. What we really want is to be able to have separate IOMMU
>> contexts for each device, or at the very least a separate context for
>> the crippled devices.
>>
>> We could allow a per-device IOMMU context by extending the Master /
>> Slave PE thing to cover DMA in addition to MMIO. Right now we only
>> use
>> slave PEs when a device's MMIO BARs extend over multiple m64
>> segments.
>> When that happens an MMIO error causes the PHB to freeze the PE
>> corresponding to one of those segments, but not any of the others. To
>> present a single "PE" to the EEH core we check the freeze status of
>> each of the slave PEs when the EEH core does a PE status check and if
>> any of them are frozen, we freeze the rest of them too. When a driver
>> sets a limited DMA mask we could move that device to a separate slave
>> PE so that it has its own IOMMU context tailored to its DMA
>> addressing limits.
>>
>> Thoughts?
> 
> For what it's worth this sounds like a good idea to me, it just sounds
> tricky to implement.  You're adding another layer of complexity on top
> of EEH (well, making things look simple to the EEH core and doing your
> own freezing on top of it) in addition to the DMA handling.
> 
> If it works then great, just has a high potential to become a new bug
> haven.


imho putting every PCI function into a separate PE is the right thing to
do here but I've been told it is not that simple, and I believe that.
Reusing slave PEs seems unreliable - the configuration would depend on
whether a device occupied enough segments to get a unique PE for each
PCI function, and my little brain explodes.

So this is not happening soon.

For the time being, this patchset is good for:
1. weird hardware which has a limited DMA mask (this is why the patchset
was written in the first place);
2. debugging DMA by routing it via the IOMMU (even when the 4GB hack is
not enabled).



-- 
Alexey

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-04-17  5:47             ` Alexey Kardashevskiy
@ 2020-04-20 14:04               ` Oliver O'Halloran
  -1 siblings, 0 replies; 57+ messages in thread
From: Oliver O'Halloran @ 2020-04-20 14:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Russell Currey
  Cc: linuxppc-dev, David Gibson, kvm-ppc, KVM list, Alistair Popple,
	Fabiano Rosas, Michael Ellerman

On Fri, 2020-04-17 at 15:47 +1000, Alexey Kardashevskiy wrote:
> 
> On 17/04/2020 11:26, Russell Currey wrote:
> > 
> > For what it's worth this sounds like a good idea to me, it just sounds
> > tricky to implement.  You're adding another layer of complexity on top
> > of EEH (well, making things look simple to the EEH core and doing your
> > own freezing on top of it) in addition to the DMA handling.
> > 
> > If it works then great, just has a high potential to become a new bug
> > haven.
> 
> imho putting every PCI function into a separate PE is the right thing
> to do here, but I've been told it is not that simple, and I believe
> that. Reusing slave PEs seems unreliable - the configuration would
> depend on whether a PE occupies enough segments to give a unique PE to
> a PCI function, and my little brain explodes.

You're overthinking it.

If a bus has no m64 MMIO space we're free to assign whatever PE number
we want since the PE for the bus isn't fixed by the m64 segment its
BARs were placed in. For those buses we assign a PE number starting
from the max and counting down (0xff, 0xfe, etc). For example, with
this PHB:

# lspci -s 1:: -v | egrep '0001:|Memory at'
0001:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
0001:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
        Memory at 600c081000000 (32-bit, non-prefetchable) [size=256K]
0001:02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
0001:02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
0001:02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
0001:03:00.0 Non-Volatile memory controller: PMC-Sierra Inc. Device
f117 (rev 06) (prog-if 02 [NVM Express])
        Memory at 600c080000000 (64-bit, non-prefetchable) [size=16K]
        Memory at 6004000000000 (64-bit, prefetchable) [size=1M]
0001:09:00.0 Ethernet controller: Intel Corporation Ethernet Controller
X710/X557-AT 10GBASE-T (rev 02)
        Memory at 6004048000000 (64-bit, prefetchable) [size=8M]
        Memory at 600404a000000 (64-bit, prefetchable) [size=32K]
(redundant functions removed)

We get these PE assignments:

0001:00:00.0 - 0xfe # Root port
0001:01:00.0 - 0xfc # upstream port
0001:02:01.0 - 0xfd # downstream port bus
0001:02:08.0 - 0xfd
0001:02:09.0 - 0xfd
0001:03:00.0 - 0x0  # NVMe
0001:09:00.0 - 0x1  # Ethernet

All the switch ports either have 32bit BARs or no BARs so they get
assigned PEs starting from the top. The Ethernet and the NVMe have some
prefetchable 64bit BARs so they have to be in PE 0x0 and 0x1
respectively since that's where their m64 BARs are located. For our
DMA-only slave PEs, any MMIO space would remain in the master PE, so we
can allocate a fresh PE number for the DMA-PE (our IOMMU context).
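
A sketch of that top-down numbering, with made-up names and PE count -
not the actual pci-ioda code:

  /* sketch: m64-less buses take PE numbers from the top of the range */
  #include <linux/bitmap.h>
  #include <linux/bitops.h>
  #include <linux/errno.h>

  #define PHB_MAX_PE_NUM  256

  static DECLARE_BITMAP(pe_alloc, PHB_MAX_PE_NUM);

  static int phb_alloc_pe_from_top(void)
  {
          int pe;

          for (pe = PHB_MAX_PE_NUM - 1; pe >= 0; pe--)
                  if (!test_and_set_bit(pe, pe_alloc))
                          return pe;      /* 0xff, then 0xfe, ... */
          return -ENOSPC;
  }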

I think the key thing to realise is that we'd only be using the DMA-PE
when a crippled DMA mask is set by the driver. In all other cases we
can just use the "native PE", and when the driver unbinds we can
de-allocate our DMA-PE and return the device to the PE containing its
MMIO BARs. I think we can keep things relatively sane that way; the
real issue is detecting EEH events on the DMA-PE.

On P9 we don't have PHB error interrupts enabled in firmware so we're
completely reliant on seeing a 0xFF response to an MMIO and manually
checking the PE status to see if it's due to a PE freeze. Our DMA-PE
could be frozen (due to a bad DMA) and we'd never notice unless we
explicitly check its status, since there's no corresponding MMIO space
to trigger the usual detection path.

On P8 we had PHB error interrupts so you would notice that *something*
happened, then go check for frozen PEs, at which point the master-slave
grouping logic would see that one PE in the group was frozen and freeze
the rest of them. We can re-use that logic here, but we still need
something to actually notice that a freeze occurred. A background
poller checking for freezes on each PE might do the trick.
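
Roughly like this - a sketch where pe_is_frozen() and freeze_pe_group()
are stand-ins for the real OPAL freeze-status query and the
master/slave grouping logic, not real APIs:

  /* sketch: periodically check every PE for an unnoticed freeze */
  #include <linux/workqueue.h>

  static struct delayed_work pe_freeze_poller;

  static bool pe_is_frozen(int pe);       /* stand-in, not a real API */
  static void freeze_pe_group(int pe);    /* stand-in, not a real API */

  static void pe_freeze_poll_fn(struct work_struct *work)
  {
          int pe;

          for (pe = 0; pe < 256; pe++)
                  if (pe_is_frozen(pe))
                          freeze_pe_group(pe);

          /* INIT_DELAYED_WORK(&pe_freeze_poller, pe_freeze_poll_fn)
           * is assumed to have run at PHB init time */
          schedule_delayed_work(&pe_freeze_poller, msecs_to_jiffies(1000));
  }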

> So this is not happening soon.

Oh ye of little faith.

> For the time being, this patchset is good for:
> 1. weird hardware which has a limited DMA mask (this is why the
> patchset was written in the first place);
> 2. debugging DMA by routing it via the IOMMU (even when the 4GB hack
> is not enabled).

Sure, but it's still dependent on having firmware which supports the
4GB hack, and I don't think that's in any official firmware releases yet.

Oliver


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-04-20 14:04               ` Oliver O'Halloran
@ 2020-04-21  5:11                 ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2020-04-21  5:11 UTC (permalink / raw)
  To: Oliver O'Halloran, Russell Currey
  Cc: linuxppc-dev, David Gibson, kvm-ppc, KVM list, Alistair Popple,
	Fabiano Rosas, Michael Ellerman



On 21/04/2020 00:04, Oliver O'Halloran wrote:
> On Fri, 2020-04-17 at 15:47 +1000, Alexey Kardashevskiy wrote:
>>
>> On 17/04/2020 11:26, Russell Currey wrote:
>>>
>>> For what it's worth this sounds like a good idea to me, it just sounds
>>> tricky to implement.  You're adding another layer of complexity on top
>>> of EEH (well, making things look simple to the EEH core and doing your
>>> own freezing on top of it) in addition to the DMA handling.
>>>
>>> If it works then great, just has a high potential to become a new bug
>>> haven.
>>
>> imho putting every PCI function into a separate PE is the right thing
>> to do here, but I've been told it is not that simple, and I believe
>> that. Reusing slave PEs seems unreliable - the configuration would
>> depend on whether a PE occupies enough segments to give a unique PE
>> to a PCI function, and my little brain explodes.
> 
> You're overthinking it.
> 
> If a bus has no m64 MMIO space we're free to assign whatever PE number
> we want since the PE for the bus isn't fixed by the m64 segment its
> BARs were placed in. For those buses we assign a PE number starting
> from the max and counting down (0xff, 0xfe, etc). For example, with
> this PHB:
> 
> # lspci -s 1:: -v | egrep '0001:|Memory at'
> 0001:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
> 0001:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
>         Memory at 600c081000000 (32-bit, non-prefetchable) [size=256K]
> 0001:02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:03:00.0 Non-Volatile memory controller: PMC-Sierra Inc. Device
> f117 (rev 06) (prog-if 02 [NVM Express])
>         Memory at 600c080000000 (64-bit, non-prefetchable) [size=16K]
>         Memory at 6004000000000 (64-bit, prefetchable) [size=1M]
> 0001:09:00.0 Ethernet controller: Intel Corporation Ethernet Controller
> X710/X557-AT 10GBASE-T (rev 02)
>         Memory at 6004048000000 (64-bit, prefetchable) [size=8M]
>         Memory at 600404a000000 (64-bit, prefetchable) [size=32K]
> (redundant functions removed)
> 
> We get these PE assignments:
> 
> 0001:00:00.0 - 0xfe # Root port
> 0001:01:00.0 - 0xfc # upstream port
> 0001:02:01.0 - 0xfd # downstream port bus
> 0001:02:08.0 - 0xfd
> 0001:02:09.0 - 0xfd
> 0001:03:00.0 - 0x0  # NVMe
> 0001:09:00.0 - 0x1  # Ethernet
> 
> All the switch ports either have 32bit BARs or no BARs so they get
> assigned PEs starting from the top. The Ethernet and the NVMe have some
> prefetchable 64bit BARs so they have to be in PE 0x0 and 0x1
> respectively since that's where their m64 BARs are located. For our
> DMA-only slave PEs, any MMIO space would remain in the master PE, so
> we can allocate a fresh PE number for the DMA-PE (our IOMMU context).


One example of a problem device is an AMD GPU with a 64bit video PCI
function and a 32bit audio function, no?

What PEs will they get assigned to now? Where will the audio's MMIO go?
It cannot be in the same 64bit MMIO segment, right? If so, it is a
separate PE already. If not, then I do not understand "we're free to
assign whatever PE number we want".


> I think the key thing to realise is that we'd only be using the DMA-PE
> when a crippled DMA mask is set by the driver. In all other cases we
> can just use the "native PE", and when the driver unbinds we can
> de-allocate our DMA-PE and return the device to the PE containing its
> MMIO BARs. I think we can keep things relatively sane that way; the
> real issue is detecting EEH events on the DMA-PE.


Oooor we could just have one DMA window (or, more precisely, a single
"TVE", as it is either a window or bypass) per PE, give every function
its own PE, and create a window or a table when a device sets a DMA
mask. I feel I am missing something here though.
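
In other words, roughly this in the dma-mask path (all of the names
here are hypothetical, not the existing pci-ioda functions):

  /* sketch: one TVE per PE, chosen when the driver sets its DMA mask */
  #include <linux/dma-mapping.h>

  struct pnv_pe;                                        /* hypothetical */
  int pe_enable_bypass(struct pnv_pe *pe);              /* hypothetical */
  int pe_create_tce_window(struct pnv_pe *pe,
                           u64 start, u64 size);        /* hypothetical */

  static int pe_setup_dma(struct pnv_pe *pe, u64 dma_mask)
  {
          if (dma_mask >= DMA_BIT_MASK(59))
                  return pe_enable_bypass(pe);  /* TVE in bypass mode */

          /* crippled mask: back the TVE with a TCE table for 0..mask */
          return pe_create_tce_window(pe, 0, dma_mask + 1);
  }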


> 
> On P9 we don't have PHB error interrupts enabled in firmware so we're
> completely reliant on seeing a 0xFF response to an MMIO and manually
> checking the PE status to see if it's due to a PE freeze. Our DMA-PE
> could be frozen (due to a bad DMA) and we'd never notice unless we
> explicitly check its status, since there's no corresponding MMIO space
> to trigger the usual detection path.
> 
> On P8 we had PHB error interrupts so you would notice that *something*
> happened, then go check for frozen PEs, at which point the master-slave
> grouping logic would see that one PE in the group was frozen and freeze
> the rest of them. We can re-use that logic here, but we still need
> something to actually notice that a freeze occurred. A background
> poller checking for freezes on each PE might do the trick.
> 
>> So this is not happening soon.
> 
> Oh ye of little faith.
> 
>> For the time being, this patchset is good for:
>> 1. weird hardware which has a limited DMA mask (this is why the
>> patchset was written in the first place);
>> 2. debugging DMA by routing it via the IOMMU (even when the 4GB hack
>> is not enabled).
> 
> Sure, but it's still dependent on having firmware which supports the
> 4GB hack, and I don't think that's in any official firmware releases yet.

It's been a while :-/


-- 
Alexey

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-04-21  5:11                 ` Alexey Kardashevskiy
@ 2020-04-21  6:35                   ` Oliver O'Halloran
  -1 siblings, 0 replies; 57+ messages in thread
From: Oliver O'Halloran @ 2020-04-21  6:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Russell Currey, linuxppc-dev, David Gibson, kvm-ppc, KVM list,
	Alistair Popple, Fabiano Rosas, Michael Ellerman

On Tue, Apr 21, 2020 at 3:11 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> One example of a problem device is an AMD GPU with a 64bit video PCI
> function and a 32bit audio function, no?
>
> What PEs will they get assigned to now? Where will the audio's MMIO
> go? It cannot be in the same 64bit MMIO segment, right? If so, it is
> a separate PE already. If not, then I do not understand "we're free
> to assign whatever PE number we want".

The BARs stay in the same place and as far as MMIO is concerned
nothing has changed. For MMIO the PHB uses the MMIO address to find a
PE via the M64 BAR table, but for DMA it uses a *completely* different
mechanism. Instead it takes the BDFN (included in the DMA packet
header) and the Requester Translation Table (RTT) to map the BDFN to a
PE. Normally you would configure the PHB so the same PE is used for
MMIO and DMA, but you don't have to.
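
Conceptually the RTT is just a flat table indexed by the 16-bit RID; a
toy model (entry count assumed here, not taken from the PHB spec):

  /* toy model: the RTT maps a requester's bus/dev/fn to a PE number */
  #include <stdint.h>

  #define RTT_ENTRIES     (1u << 16)      /* one entry per possible RID */

  static uint16_t rtt[RTT_ENTRIES];

  static inline uint16_t pe_for_dma(uint8_t bus, uint8_t devfn)
  {
          return rtt[((uint16_t)bus << 8) | devfn];
  }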

> > I think the key thing to realise is that we'd only be using the DMA-PE
> > when a crippled DMA mask is set by the driver. In all other cases we
> > can just use the "native PE", and when the driver unbinds we can
> > de-allocate our DMA-PE and return the device to the PE containing its
> > MMIO BARs. I think we can keep things relatively sane that way; the
> > real issue is detecting EEH events on the DMA-PE.
>
>
> Oooor we could just have one DMA window (or, more precisely, a single
> "TVE", as it is either a window or bypass) per PE, give every function
> its own PE, and create a window or a table when a device sets a DMA
> mask. I feel I am missing something here though.

Yes, we could do that, but do we want to?

I was thinking we should try to minimise the number of DMA-only PEs
since they complicate the EEH freeze handling. When MMIO and DMA are
mapped to the same PE, an error on either will cause the hardware to
stop both. When separate PEs are used for DMA and MMIO you lose that
atomicity. It's not a big deal if DMA is stopped and MMIO allowed,
since PAPR (sort-of) allows that, but having MMIO frozen with DMA
unfrozen is a bit sketchy.

> >> For the time being, this patchset is good for:
> >> 1. weird hardware which has a limited DMA mask (this is why the
> >> patchset was written in the first place);
> >> 2. debugging DMA by routing it via the IOMMU (even when the 4GB hack
> >> is not enabled).
> >
> > Sure, but it's still dependent on having firmware which supports the
> > 4GB hack, and I don't think that's in any official firmware releases
> > yet.
>
> It's been a while :-/

There have been no official FW releases with a skiboot that supports
the PHB get/set-option OPAL calls, so the only systems that can
actually take advantage of it are our lab systems. It might still be
useful for future systems, but I'd rather have something that doesn't
depend on FW support.


Oliver

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-04-21  6:35                   ` Oliver O'Halloran
@ 2020-04-22  6:49                     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2020-04-22  6:49 UTC (permalink / raw)
  To: Oliver O'Halloran
  Cc: Russell Currey, linuxppc-dev, David Gibson, kvm-ppc, KVM list,
	Alistair Popple, Fabiano Rosas, Michael Ellerman



On 21/04/2020 16:35, Oliver O'Halloran wrote:
> On Tue, Apr 21, 2020 at 3:11 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>> One example of a problem device is an AMD GPU with a 64bit video PCI
>> function and a 32bit audio function, no?
>>
>> What PEs will they get assigned to now? Where will the audio's MMIO
>> go? It cannot be in the same 64bit MMIO segment, right? If so, it is
>> a separate PE already. If not, then I do not understand "we're free
>> to assign whatever PE number we want".
> 
> The BARs stay in the same place and as far as MMIO is concerned
> nothing has changed. For MMIO the PHB uses the MMIO address to find a
> PE via the M64 BAR table, but for DMA it uses a *completely* different
> mechanism. Instead it takes the BDFN (included in the DMA packet
> header) and the Requester Translation Table (RTT) to map the BDFN to a
> PE. Normally you would configure the PHB so the same PE is used for
> MMIO and DMA, but you don't have to.

32bit MMIO is what puzzles me in this picture: how does it work?


>>> I think the key thing to realise is that we'd only be using the DMA-PE
>>> when a crippled DMA mask is set by the driver. In all other cases we
>>> can just use the "native PE", and when the driver unbinds we can
>>> de-allocate our DMA-PE and return the device to the PE containing its
>>> MMIO BARs. I think we can keep things relatively sane that way; the
>>> real issue is detecting EEH events on the DMA-PE.
>>
>>
>> Oooor we could just have one DMA window (or, more precisely, a single
>> "TVE", as it is either a window or bypass) per PE, give every function
>> its own PE, and create a window or a table when a device sets a DMA
>> mask. I feel I am missing something here though.
> 
> Yes, we could do that, but do we want to?
> 
> I was thinking we should try to minimise the number of DMA-only PEs
> since they complicate the EEH freeze handling. When MMIO and DMA are
> mapped to the same PE, an error on either will cause the hardware to
> stop both. When separate PEs are used for DMA and MMIO you lose that
> atomicity. It's not a big deal if DMA is stopped and MMIO allowed,
> since PAPR (sort-of) allows that, but having MMIO frozen with DMA
> unfrozen is a bit sketchy.


You suggested using slave PEs for crippled functions - won't we have
the same problem then? And is this "slave PE" something the hardware
supports, or is it a software concept?


>>>> For the time being, this patchset is good for:
>>>> 1. weird hardware which has a limited DMA mask (this is why the
>>>> patchset was written in the first place);
>>>> 2. debugging DMA by routing it via the IOMMU (even when the 4GB hack
>>>> is not enabled).
>>>
>>> Sure, but it's still dependent on having firmware which supports the
>>> 4GB hack, and I don't think that's in any official firmware releases
>>> yet.
>>
>> It's been a while :-/
> 
> There have been no official FW releases with a skiboot that supports
> the PHB get/set-option OPAL calls, so the only systems that can
> actually take advantage of it are our lab systems. It might still be
> useful for future systems, but I'd rather have something that doesn't
> depend on FW support.

Pensando folks use it ;)


-- 
Alexey

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  2020-04-22  6:49                     ` Alexey Kardashevskiy
  (?)
@ 2020-04-22  9:11                       ` Oliver O'Halloran
  -1 siblings, 0 replies; 57+ messages in thread
From: Oliver O'Halloran @ 2020-04-22  9:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Russell Currey, linuxppc-dev, David Gibson, kvm-ppc, KVM list,
	Alistair Popple, Fabiano Rosas, Michael Ellerman

On Wed, Apr 22, 2020 at 4:49 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> 32bit MMIO is what puzzles me in this picture, how does it work?

For devices with no m64 we allocate a PE number as described above. In
the 32bit MMIO window we have a segment-to-PE remapping table, so any
m32 segment can be assigned to any PE. As a result the slave PE concept
isn't really needed. If the BARs of a device span multiple m32
segments then we can set up the remapping table so that all the
segments point to the same PE.
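
As a sketch, assuming an illustrative segment count and made-up names (the
real remap table sits in PHB registers):

    #include <stdint.h>

    #define M32_SEGMENTS 256                       /* illustrative */

    static uint16_t m32_seg_to_pe[M32_SEGMENTS];   /* segment -> PE number */

    static void map_m32_segments(unsigned int first, unsigned int count,
                                 uint16_t pe)
    {
            /* BARs spanning several m32 segments all remap to one PE */
            for (unsigned int i = first; i < first + count; i++)
                    m32_seg_to_pe[i] = pe;
    }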

> > I was thinking we should try minimise the number of DMA-only PEs since
> > it complicates the EEH freeze handling. When MMIO and DMA are mapped
> > to the same PE an error on either will cause the hardware to stop
> > both. When seperate PEs are used for DMA and MMIO you lose that
> > atomicity. It's not a big deal if DMA is stopped and MMIO allowed
> > since PAPR (sort-of) allows that, but having MMIO frozen with DMA
> > unfrozen is a bit sketch.
>
> You suggested using slave PEs for crippled functions - won't we have the
> same problem then?

Yes, but I think it's probably worth doing in that case. You get
slightly janky EEH in exchange for better DMA performance.

> And is this "slave PE" something the hardware supports or it is a
> software concept?

It's all in software. The hardware does have the PELT-V, which allows
you to specify a group of PEs to additionally freeze when a PE is
frozen, but the PELT-V is only used when handling AER messages. All
other error sources (DMAs, MMIOs, etc.) will only freeze one PE (or all
of them in very rare cases).
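
In other words the PELT-V is a per-PE set of extra PEs to freeze; a minimal
sketch, assuming kernel bitmap helpers and a made-up PE count:

    #define SKETCH_MAX_PES 512                     /* illustrative */

    /* pelt_v[pe] = PEs to additionally freeze when 'pe' freezes */
    static unsigned long pelt_v[SKETCH_MAX_PES][BITS_TO_LONGS(SKETCH_MAX_PES)];

    static void pelt_v_link(u16 pe, u16 dep_pe)
    {
            set_bit(dep_pe, pelt_v[pe]);   /* consulted only for AER messages */
    }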

> > There's been no official FW releases with a skiboot that supports the
> > phb get/set option opal calls so the only systems that can actually
> > take advantage of it are our lab systems. It might still be useful for
> > future systems, but I'd rather something that doesn't depend on FW
> > support.
>
> Pensando folks use it ;)

the what folks

Oliver

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2020-04-22  9:13 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-23  7:53 [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 1/7] powerpc/powernv/ioda: Move TCE bypass base to PE Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 2/7] powerpc/powernv/ioda: Rework for huge DMA window at 4GB Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 3/7] powerpc/powernv/ioda: Allow smaller TCE table levels Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 4/7] powerpc/powernv/phb4: Use IOMMU instead of bypassing Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 5/7] powerpc/iommu: Add a window number to iommu_table_group_ops::get_table_size Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 6/7] powerpc/powernv/phb4: Add 4GB IOMMU bypass mode Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 7/7] vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB Alexey Kardashevskiy
2020-04-08  9:43 ` [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window " Alexey Kardashevskiy
2020-04-16  1:27   ` Alexey Kardashevskiy
2020-04-16  2:34     ` Oliver O'Halloran
2020-04-16  2:53       ` Oliver O'Halloran
2020-04-17  1:26         ` Russell Currey
2020-04-17  5:47           ` Alexey Kardashevskiy
2020-04-20 14:04             ` Oliver O'Halloran
2020-04-21  5:11               ` Alexey Kardashevskiy
2020-04-21  6:35                 ` Oliver O'Halloran
2020-04-22  6:49                   ` Alexey Kardashevskiy
2020-04-22  9:11                     ` Oliver O'Halloran
