* [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
@ 2019-05-30  7:03 Alexey Kardashevskiy
  2019-05-30  7:03 ` [PATCH kernel v3 1/3] powerpc/iommu: Allow bypass-only for DMA Alexey Kardashevskiy
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-05-30  7:03 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Oliver O'Halloran,
	Sam Bobroff, David Gibson

This is an attempt to allow DMA masks between 32..59 which are not large
enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
on the max order, up to 40 is usually available.
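
As a rough illustration of where "up to 40" comes from (a sketch using the
window size formula from patch 3, assuming 64k system pages and a 16MB
maximum single allocation, i.e. MAX_ORDER=9; the exact numbers depend on
the kernel config):

    maxblock    = 1UL << (PAGE_SHIFT + MAX_ORDER - 1)           = 16MB
    window_size = min((maxblock * 8) << PAGE_SHIFT, max_memory)
                = min(1UL << 43, max_memory)

so on a machine with, say, 1TB of installed memory the default window
tops out at 1<<40 and a 40bit mask is what such devices actually get.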


This is based on v5.2-rc2.

Please comment. Thanks.



Alexey Kardashevskiy (3):
  powerpc/iommu: Allow bypass-only for DMA
  powerpc/powernv/ioda2: Allocate TCE table levels on demand for default
    DMA window
  powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU
    pages

 arch/powerpc/include/asm/iommu.h              |  8 ++-
 arch/powerpc/platforms/powernv/pci.h          |  2 +-
 arch/powerpc/kernel/dma-iommu.c               | 11 ++--
 arch/powerpc/kernel/iommu.c                   | 58 +++++++++++++------
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 20 +++----
 arch/powerpc/platforms/powernv/pci-ioda.c     | 40 +++++++++++--
 6 files changed, 95 insertions(+), 44 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH kernel v3 1/3] powerpc/iommu: Allow bypass-only for DMA
  2019-05-30  7:03 [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59 Alexey Kardashevskiy
@ 2019-05-30  7:03 ` Alexey Kardashevskiy
  2019-06-03  2:03   ` David Gibson
  2019-05-30  7:03 ` [PATCH kernel v3 2/3] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-05-30  7:03 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Oliver O'Halloran,
	Sam Bobroff, David Gibson

POWER8 and newer support a bypass mode which maps all host memory to
PCI buses so an IOMMU table is not always required. However, if we fail to
create such a table, the DMA setup fails and the kernel does not boot.

This skips the 32bit DMA setup check if the bypass can be selected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This minor thing helped me debug the next 2 patches so it can help
somebody else too.
---
 arch/powerpc/kernel/dma-iommu.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index 09231ef06d01..809c1dc01edf 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -118,18 +118,17 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask)
 {
 	struct iommu_table *tbl = get_iommu_table_base(dev);
 
-	if (!tbl) {
-		dev_info(dev, "Warning: IOMMU dma not supported: mask 0x%08llx"
-			", table unavailable\n", mask);
-		return 0;
-	}
-
 	if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) {
 		dev->archdata.iommu_bypass = true;
 		dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
 		return 1;
 	}
 
+	if (!tbl) {
+		dev_err(dev, "Warning: IOMMU dma not supported: mask 0x%08llx, table unavailable\n", mask);
+		return 0;
+	}
+
 	if (tbl->it_offset > (mask >> tbl->it_page_shift)) {
 		dev_info(dev, "Warning: IOMMU offset too big for device mask\n");
 		dev_info(dev, "mask: 0x%08llx, table offset: 0x%08lx\n",
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH kernel v3 2/3] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window
  2019-05-30  7:03 [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59 Alexey Kardashevskiy
  2019-05-30  7:03 ` [PATCH kernel v3 1/3] powerpc/iommu: Allow bypass-only for DMA Alexey Kardashevskiy
@ 2019-05-30  7:03 ` Alexey Kardashevskiy
  2019-07-08  7:01   ` alistair
  2019-05-30  7:03 ` [PATCH kernel v3 3/3] powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages Alexey Kardashevskiy
  2019-06-06  4:11 ` [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59 Shawn Anastasio
  3 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-05-30  7:03 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Oliver O'Halloran,
	Sam Bobroff, David Gibson

We allocate only the first level of multilevel TCE tables for KVM
already (alloc_userspace_copy==true), and the rest is allocated on demand.
This is not enabled for baremetal, though.

This removes the KVM limitation (implicit, via the alloc_userspace_copy
parameter) and always allocates just the first level. The on-demand
allocation of missing levels is already implemented.

Since DMA mapping may now happen with interrupts disabled, this
allocates TCEs with GFP_ATOMIC.

To save time when creating a new clean table, this skips non-allocated
indirect TCE entries in pnv_tce_free just like we already do in
the VFIO IOMMU TCE driver.

This changes the default level number from 1 to 2 to reduce the amount
of memory required for the default 32bit DMA window at boot time.
The default window size is up to 2GB, which requires 4MB of TCEs; that
is unlikely to be used entirely, or at all, as most devices these days
are 64bit capable, so by switching to 2 levels by default we save
4032KB of RAM per device.
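
To make the 4032KB figure concrete, a rough worked example (assuming 4K
IOMMU pages, 64K system pages, and each level allocation being rounded
up to at least one page):

    2GB window / 4K IOMMU pages               = 512K TCEs
    1 level:  512K TCEs * 8 bytes             = 4096KB up front
    2 levels: first level only at boot,
              1K TCEs * 8 bytes, rounded up   = 64KB (one 64K page)

    4096KB - 64KB                             = 4032KB saved per device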

While at it, this adds __GFP_NOWARN to alloc_pages_node() as userspace
can trigger this path via VFIO, see the failure, and try creating a
table again with different parameters which might succeed.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* added __GFP_NOWARN to alloc_pages_node
---
 arch/powerpc/platforms/powernv/pci.h          |  2 +-
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 20 +++++++++----------
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 1a51e7bfc541..a55dabc8d057 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -225,7 +225,7 @@ extern struct iommu_table_group *pnv_npu_compound_attach(
 		struct pnv_ioda_pe *pe);
 
 /* pci-ioda-tce.c */
-#define POWERNV_IOMMU_DEFAULT_LEVELS	1
+#define POWERNV_IOMMU_DEFAULT_LEVELS	2
 #define POWERNV_IOMMU_MAX_LEVELS	5
 
 extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index e28f03e1eb5e..c75ec37bf0cd 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -36,7 +36,8 @@ static __be64 *pnv_alloc_tce_level(int nid, unsigned int shift)
 	struct page *tce_mem = NULL;
 	__be64 *addr;
 
-	tce_mem = alloc_pages_node(nid, GFP_KERNEL, shift - PAGE_SHIFT);
+	tce_mem = alloc_pages_node(nid, GFP_ATOMIC | __GFP_NOWARN,
+			shift - PAGE_SHIFT);
 	if (!tce_mem) {
 		pr_err("Failed to allocate a TCE memory, level shift=%d\n",
 				shift);
@@ -161,6 +162,9 @@ void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
 
 		if (ptce)
 			*ptce = cpu_to_be64(0);
+		else
+			/* Skip the rest of the level */
+			i |= tbl->it_level_size - 1;
 	}
 }
 
@@ -260,7 +264,6 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
 	unsigned int table_shift = max_t(unsigned int, entries_shift + 3,
 			PAGE_SHIFT);
 	const unsigned long tce_table_size = 1UL << table_shift;
-	unsigned int tmplevels = levels;
 
 	if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
 		return -EINVAL;
@@ -268,9 +271,6 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
 	if (!is_power_of_2(window_size))
 		return -EINVAL;
 
-	if (alloc_userspace_copy && (window_size > (1ULL << 32)))
-		tmplevels = 1;
-
 	/* Adjust direct table size from window_size and levels */
 	entries_shift = (entries_shift + levels - 1) / levels;
 	level_shift = entries_shift + 3;
@@ -281,7 +281,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
 
 	/* Allocate TCE table */
 	addr = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
-			tmplevels, tce_table_size, &offset, &total_allocated);
+			1, tce_table_size, &offset, &total_allocated);
 
 	/* addr==NULL means that the first level allocation failed */
 	if (!addr)
@@ -292,18 +292,18 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
 	 * we did not allocate as much as we wanted,
 	 * release partially allocated table.
 	 */
-	if (tmplevels == levels && offset < tce_table_size)
+	if (levels == 1 && offset < tce_table_size)
 		goto free_tces_exit;
 
 	/* Allocate userspace view of the TCE table */
 	if (alloc_userspace_copy) {
 		offset = 0;
 		uas = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
-				tmplevels, tce_table_size, &offset,
+				1, tce_table_size, &offset,
 				&total_allocated_uas);
 		if (!uas)
 			goto free_tces_exit;
-		if (tmplevels == levels && (offset < tce_table_size ||
+		if (levels == 1 && (offset < tce_table_size ||
 				total_allocated_uas != total_allocated))
 			goto free_uas_exit;
 	}
@@ -318,7 +318,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
 
 	pr_debug("Created TCE table: ws=%08llx ts=%lx @%08llx base=%lx uas=%p levels=%d/%d\n",
 			window_size, tce_table_size, bus_offset, tbl->it_base,
-			tbl->it_userspace, tmplevels, levels);
+			tbl->it_userspace, 1, levels);
 
 	return 0;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH kernel v3 3/3] powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages
  2019-05-30  7:03 [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59 Alexey Kardashevskiy
  2019-05-30  7:03 ` [PATCH kernel v3 1/3] powerpc/iommu: Allow bypass-only for DMA Alexey Kardashevskiy
  2019-05-30  7:03 ` [PATCH kernel v3 2/3] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window Alexey Kardashevskiy
@ 2019-05-30  7:03 ` Alexey Kardashevskiy
  2019-07-08  7:01   ` alistair
  2019-06-06  4:11 ` [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59 Shawn Anastasio
  3 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-05-30  7:03 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Oliver O'Halloran,
	Sam Bobroff, David Gibson

At the moment we create a small window only for 32bit devices; the window
maps only the 0..2GB range of the PCI space. For other devices we use
either a sketchy bypass or the hardware bypass, but the former can only
work if the amount of RAM fits within the device's DMA mask, and the
latter requires devices to support at least 59bit DMA.

This extends the default DMA window to the maximum size possible to allow
a wider DMA mask than just 32bit. The default window size is now limited
by the iommu_table::it_map allocation bitmap, which is a contiguous
array with 1 bit per IOMMU page.
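
As a back-of-the-envelope check of that limit (same assumptions as the
cover letter: 64K pages, 16MB maximum single allocation for it_map):

    it_map capacity = 16MB * 8 bits/byte      = 2^27 IOMMU pages
    window bound    = 2^27 pages * 64K/page   = 1 << 43 bytes

which matches the default window size quoted below.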

This increases the default IOMMU page size from hard coded 4K to
the system page size to allow wider DMA masks.

This increases the number of levels so that each TCE level stays within
the max order allocation limit. At the same time, it keeps the minimum
number of levels at 2 in order to save memory.

As the extended window now overlaps the 32bit MMIO region, this adds
an area reservation to iommu_init_table().

After this change the default window size is 0x80000000000==1<<43 so
devices limited to a DMA mask smaller than the amount of system RAM can
still use more than just 2GB of memory for DMA.

With the on-demand allocation of indirect TCE table levels enabled and
2 levels, the first TCE level size is just
1<<ceil((log2(0x7ffffffffff+1)-16)/2)=16384 TCEs or 2 system pages.
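
Spelling that calculation out (same 64K system page assumption):

    window          = 0x80000000000           = 2^43 bytes
    TCEs            = 2^43 >> 16              = 2^27
    over 2 levels   = 2^ceil(27/2)            = 2^14 = 16384 TCEs/level
    first level     = 16384 * 8 bytes         = 128K = 2 x 64K pages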

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* fixed tce levels calculation

v2:
* adjusted level number to the max order
---
 arch/powerpc/include/asm/iommu.h          |  8 +++-
 arch/powerpc/kernel/iommu.c               | 58 +++++++++++++++--------
 arch/powerpc/platforms/powernv/pci-ioda.c | 40 +++++++++++++---
 3 files changed, 79 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 0ac52392ed99..5ea782e04803 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -124,6 +124,8 @@ struct iommu_table {
 	struct iommu_table_ops *it_ops;
 	struct kref    it_kref;
 	int it_nid;
+	unsigned long it_reserved_start; /* Start of not-DMA-able (MMIO) area */
+	unsigned long it_reserved_end;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry) \
@@ -162,8 +164,10 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
  */
-extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
-					    int nid);
+extern struct iommu_table *iommu_init_table_res(struct iommu_table *tbl,
+		int nid, unsigned long res_start, unsigned long res_end);
+#define iommu_init_table(tbl, nid) iommu_init_table_res((tbl), (nid), 0, 0)
+
 #define IOMMU_TABLE_GROUP_MAX_TABLES	2
 
 struct iommu_table_group;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 33bbd59cff79..209306ce7f4b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -646,11 +646,43 @@ static void iommu_table_clear(struct iommu_table *tbl)
 #endif
 }
 
+static void iommu_table_reserve_pages(struct iommu_table *tbl)
+{
+	int i;
+
+	/*
+	 * Reserve page 0 so it will not be used for any mappings.
+	 * This avoids buggy drivers that consider page 0 to be invalid
+	 * to crash the machine or even lose data.
+	 */
+	if (tbl->it_offset == 0)
+		set_bit(0, tbl->it_map);
+
+	for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
+		set_bit(i, tbl->it_map);
+}
+
+static void iommu_table_release_pages(struct iommu_table *tbl)
+{
+	int i;
+
+	/*
+	 * In case we have reserved the first bit, we should not emit
+	 * the warning below.
+	 */
+	if (tbl->it_offset == 0)
+		clear_bit(0, tbl->it_map);
+
+	for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
+		clear_bit(i, tbl->it_map);
+}
+
 /*
  * Build a iommu_table structure.  This contains a bit map which
  * is used to manage allocation of the tce space.
  */
-struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
+struct iommu_table *iommu_init_table_res(struct iommu_table *tbl, int nid,
+		unsigned long res_start, unsigned long res_end)
 {
 	unsigned long sz;
 	static int welcomed = 0;
@@ -669,13 +701,9 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	tbl->it_map = page_address(page);
 	memset(tbl->it_map, 0, sz);
 
-	/*
-	 * Reserve page 0 so it will not be used for any mappings.
-	 * This avoids buggy drivers that consider page 0 to be invalid
-	 * to crash the machine or even lose data.
-	 */
-	if (tbl->it_offset == 0)
-		set_bit(0, tbl->it_map);
+	tbl->it_reserved_start = res_start;
+	tbl->it_reserved_end = res_end;
+	iommu_table_reserve_pages(tbl);
 
 	/* We only split the IOMMU table if we have 1GB or more of space */
 	if ((tbl->it_size << tbl->it_page_shift) >= (1UL * 1024 * 1024 * 1024))
@@ -727,12 +755,7 @@ static void iommu_table_free(struct kref *kref)
 		return;
 	}
 
-	/*
-	 * In case we have reserved the first bit, we should not emit
-	 * the warning below.
-	 */
-	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+	iommu_table_release_pages(tbl);
 
 	/* verify that table contains no entries */
 	if (!bitmap_empty(tbl->it_map, tbl->it_size))
@@ -1037,8 +1060,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	for (i = 0; i < tbl->nr_pools; i++)
 		spin_lock(&tbl->pools[i].lock);
 
-	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+	iommu_table_reserve_pages(tbl);
 
 	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
 		pr_err("iommu_tce: it_map is not empty");
@@ -1068,9 +1090,7 @@ void iommu_release_ownership(struct iommu_table *tbl)
 
 	memset(tbl->it_map, 0, sz);
 
-	/* Restore bit#0 set by iommu_init_table() */
-	if (tbl->it_offset == 0)
-		set_bit(0, tbl->it_map);
+	iommu_table_release_pages(tbl);
 
 	for (i = 0; i < tbl->nr_pools; i++)
 		spin_unlock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 126602b4e399..ce2efdb3900d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2422,6 +2422,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 {
 	struct iommu_table *tbl = NULL;
 	long rc;
+	unsigned long res_start, res_end;
 
 	/*
 	 * crashkernel= specifies the kdump kernel's maximum memory at
@@ -2435,19 +2436,46 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	 * DMA window can be larger than available memory, which will
 	 * cause errors later.
 	 */
-	const u64 window_size = min((u64)pe->table_group.tce32_size, max_memory);
+	const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER - 1);
 
-	rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
-			IOMMU_PAGE_SHIFT_4K,
-			window_size,
-			POWERNV_IOMMU_DEFAULT_LEVELS, false, &tbl);
+	/*
+	 * We create the default window as big as we can. The constraint is
+	 * the max order of allocation possible. The TCE table is likely to
+	 * end up being multilevel and with on-demand allocation in place,
+	 * the initial use is not going to be huge as the default window aims
+	 * to support crippled devices (i.e. not fully 64bit DMAble) only.
+	 */
+	/* iommu_table::it_map uses 1 bit per IOMMU page, hence 8 */
+	const u64 window_size = min((maxblock * 8) << PAGE_SHIFT, max_memory);
+	/* Each TCE level cannot exceed maxblock so go multilevel if needed */
+	unsigned long tces_order = ilog2(window_size >> PAGE_SHIFT);
+	unsigned long tcelevel_order = ilog2(maxblock >> 3);
+	unsigned int levels = tces_order / tcelevel_order;
+
+	if (tces_order % tcelevel_order)
+		levels += 1;
+	/*
+	 * We try to stick to default levels (which is >1 at the moment) in
+	 * order to save memory by relying on on-demand TCE level allocation.
+	 */
+	levels = max_t(unsigned int, levels, POWERNV_IOMMU_DEFAULT_LEVELS);
+
+	rc = pnv_pci_ioda2_create_table(&pe->table_group, 0, PAGE_SHIFT,
+			window_size, levels, false, &tbl);
 	if (rc) {
 		pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
 				rc);
 		return rc;
 	}
 
-	iommu_init_table(tbl, pe->phb->hose->node);
+	/* We use top part of 32bit space for MMIO so exclude it from DMA */
+	res_start = 0;
+	res_end = 0;
+	if (window_size > pe->phb->ioda.m32_pci_base) {
+		res_start = pe->phb->ioda.m32_pci_base >> tbl->it_page_shift;
+		res_end = min(window_size, SZ_4G) >> tbl->it_page_shift;
+	}
+	iommu_init_table_res(tbl, pe->phb->hose->node, res_start, res_end);
 
 	rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
 	if (rc) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 1/3] powerpc/iommu: Allow bypass-only for DMA
  2019-05-30  7:03 ` [PATCH kernel v3 1/3] powerpc/iommu: Allow bypass-only for DMA Alexey Kardashevskiy
@ 2019-06-03  2:03   ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2019-06-03  2:03 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Sam Bobroff, Alistair Popple, Oliver O'Halloran, linuxppc-dev

On Thu, May 30, 2019 at 05:03:53PM +1000, Alexey Kardashevskiy wrote:
> POWER8 and newer support a bypass mode which maps all host memory to
> PCI buses so an IOMMU table is not always required. However, if we fail to
> create such a table, the DMA setup fails and the kernel does not boot.
> 
> This skips the 32bit DMA setup check if the bypass can be selected.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> 
> This minor thing helped me debug the next 2 patches so it can help
> somebody else too.
> ---
>  arch/powerpc/kernel/dma-iommu.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
> index 09231ef06d01..809c1dc01edf 100644
> --- a/arch/powerpc/kernel/dma-iommu.c
> +++ b/arch/powerpc/kernel/dma-iommu.c
> @@ -118,18 +118,17 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask)
>  {
>  	struct iommu_table *tbl = get_iommu_table_base(dev);
>  
> -	if (!tbl) {
> -		dev_info(dev, "Warning: IOMMU dma not supported: mask 0x%08llx"
> -			", table unavailable\n", mask);
> -		return 0;
> -	}
> -
>  	if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) {
>  		dev->archdata.iommu_bypass = true;
>  		dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
>  		return 1;
>  	}
>  
> +	if (!tbl) {
> +		dev_err(dev, "Warning: IOMMU dma not supported: mask 0x%08llx, table unavailable\n", mask);
> +		return 0;
> +	}
> +
>  	if (tbl->it_offset > (mask >> tbl->it_page_shift)) {
>  		dev_info(dev, "Warning: IOMMU offset too big for device mask\n");
>  		dev_info(dev, "mask: 0x%08llx, table offset: 0x%08lx\n",

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-05-30  7:03 [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59 Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2019-05-30  7:03 ` [PATCH kernel v3 3/3] powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages Alexey Kardashevskiy
@ 2019-06-06  4:11 ` Shawn Anastasio
  2019-06-06  7:17   ` Alistair Popple
  2019-06-12  5:05   ` Shawn Anastasio
  3 siblings, 2 replies; 21+ messages in thread
From: Shawn Anastasio @ 2019-06-06  4:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Sam Bobroff, Alistair Popple, Oliver O'Halloran, David Gibson

On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
> This is an attempt to allow DMA masks between 32..59 which are not large
> enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
> on the max order, up to 40 is usually available.
> 
> 
> This is based on v5.2-rc2.
> 
> Please comment. Thanks.

I have tested this patch set with an AMD GPU that's limited to <64bit
DMA (I believe it's 40 or 42 bit). It successfully allows the card to
operate without falling back to 32-bit DMA mode as it does without
the patches.

Relevant kernel log message:
```
[    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
```

Tested-by: Shawn Anastasio <shawn@anastas.io>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-06  4:11 ` [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59 Shawn Anastasio
@ 2019-06-06  7:17   ` Alistair Popple
  2019-06-06 12:07     ` Oliver
  2019-06-12  5:05   ` Shawn Anastasio
  1 sibling, 1 reply; 21+ messages in thread
From: Alistair Popple @ 2019-06-06  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Sam Bobroff, Oliver O'Halloran,
	Shawn Anastasio, David Gibson

I have been hitting EEH address errors testing this with some network
cards which map/unmap DMA addresses more frequently. For example:

PHB4 PHB#5 Diag-data (Version: 1)
brdgCtl:    00000002
RootSts:    00060020 00402000 a0220008 00100107 00000800
PhbSts:     0000001c00000000 0000001c00000000
Lem:        0000000100000080 0000000000000000 0000000000000080
PhbErr:     0000028000000000 0000020000000000 2148000098000240 a008400000000000
RxeTceErr:  2000000000000000 2000000000000000 c000000000000000 0000000000000000
PblErr:     0000000000020000 0000000000020000 0000000000000000 0000000000000000
RegbErr:    0000004000000000 0000004000000000 61000c4800000000 0000000000000000
PE[000] A/B: 8300b03800000000 8000000000000000

Interestingly the PE[000] A/B data is the same across different cards
and drivers.

- Alistair

On Wednesday, 5 June 2019 11:11:06 PM AEST Shawn Anastasio wrote:
> On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
> > This is an attempt to allow DMA masks between 32..59 which are not large
> > enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
> > on the max order, up to 40 is usually available.
> > 
> > 
> > This is based on v5.2-rc2.
> > 
> > Please comment. Thanks.
> 
> I have tested this patch set with an AMD GPU that's limited to <64bit
> DMA (I believe it's 40 or 42 bit). It successfully allows the card to
> operate without falling back to 32-bit DMA mode as it does without
> the patches.
> 
> Relevant kernel log message:
> ```
> [    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
> ```
> 
> Tested-by: Shawn Anastasio <shawn@anastas.io>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-06  7:17   ` Alistair Popple
@ 2019-06-06 12:07     ` Oliver
  2019-06-07  1:41       ` Alistair Popple
  0 siblings, 1 reply; 21+ messages in thread
From: Oliver @ 2019-06-06 12:07 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Alexey Kardashevskiy, Shawn Anastasio, David Gibson,
	linuxppc-dev, Sam Bobroff

On Thu, Jun 6, 2019 at 5:17 PM Alistair Popple <alistair@popple.id.au> wrote:
>
> I have been hitting EEH address errors testing this with some network
> cards which map/unmap DMA addresses more frequently. For example:
>
> PHB4 PHB#5 Diag-data (Version: 1)
> brdgCtl:    00000002
> RootSts:    00060020 00402000 a0220008 00100107 00000800
> PhbSts:     0000001c00000000 0000001c00000000
> Lem:        0000000100000080 0000000000000000 0000000000000080
> PhbErr:     0000028000000000 0000020000000000 2148000098000240 a008400000000000
> RxeTceErr:  2000000000000000 2000000000000000 c000000000000000 0000000000000000
> PblErr:     0000000000020000 0000000000020000 0000000000000000 0000000000000000
> RegbErr:    0000004000000000 0000004000000000 61000c4800000000 0000000000000000
> PE[000] A/B: 8300b03800000000 8000000000000000
>
> Interestingly the PE[000] A/B data is the same across different cards
> and drivers.

TCE page fault due to permissions so odds are the DMA address was unmapped.

What cards did you get this with? I tried with one of the common
BCM5719 NICs and generated network traffic by using rsync to copy a
linux git tree to the system and it worked fine.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-06 12:07     ` Oliver
@ 2019-06-07  1:41       ` Alistair Popple
  2019-06-10  5:19         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Alistair Popple @ 2019-06-07  1:41 UTC (permalink / raw)
  To: Oliver
  Cc: Alexey Kardashevskiy, Shawn Anastasio, David Gibson,
	linuxppc-dev, Sam Bobroff

On Thursday, 6 June 2019 10:07:54 PM AEST Oliver wrote:
> On Thu, Jun 6, 2019 at 5:17 PM Alistair Popple <alistair@popple.id.au> wrote:
> > I have been hitting EEH address errors testing this with some network
> > cards which map/unmap DMA addresses more frequently. For example:
> > 
> > PHB4 PHB#5 Diag-data (Version: 1)
> > brdgCtl:    00000002
> > RootSts:    00060020 00402000 a0220008 00100107 00000800
> > PhbSts:     0000001c00000000 0000001c00000000
> > Lem:        0000000100000080 0000000000000000 0000000000000080
> > PhbErr:     0000028000000000 0000020000000000 2148000098000240 a008400000000000
> > RxeTceErr:  2000000000000000 2000000000000000 c000000000000000 0000000000000000
> > PblErr:     0000000000020000 0000000000020000 0000000000000000 0000000000000000
> > RegbErr:    0000004000000000 0000004000000000 61000c4800000000 0000000000000000
> > PE[000] A/B: 8300b03800000000 8000000000000000
> > 
> > Interestingly the PE[000] A/B data is the same across different cards
> > and drivers.
> 
> TCE page fault due to permissions so odds are the DMA address was unmapped.
> 
> What cards did you get this with? I tried with one of the common
> BCM5719 NICs and generated network traffic by using rsync to copy a
> linux git tree to the system and it worked fine.

Personally I've seen it with the BCM5719 with the driver modified to set a DMA 
mask of 48 bits instead of 64 and using scp to copy a random 1GB file to the 
system repeatedly until it crashes. 
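
For the record, a minimal sketch of that kind of test hack (the function
below is hypothetical; the only real API used is
dma_set_mask_and_coherent()):

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/*
 * Illustrative probe-path hack: cap the device at a 48-bit DMA mask so
 * it cannot take the 64-bit direct bypass and must go through the
 * (now larger) default IOMMU window instead.
 */
static int example_probe_dma_setup(struct pci_dev *pdev)
{
	/* the unmodified driver would ask for DMA_BIT_MASK(64) here */
	int rc = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));

	if (rc)
		dev_err(&pdev->dev, "no usable 48-bit DMA mask\n");
	return rc;
}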

I have also had reports of someone hitting the same error using a Mellanox 
CX-5 adaptor with a similar driver modification.

- Alistair


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-07  1:41       ` Alistair Popple
@ 2019-06-10  5:19         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-06-10  5:19 UTC (permalink / raw)
  To: Alistair Popple, Oliver
  Cc: Sam Bobroff, Shawn Anastasio, linuxppc-dev, David Gibson



On 07/06/2019 11:41, Alistair Popple wrote:
> On Thursday, 6 June 2019 10:07:54 PM AEST Oliver wrote:
>> On Thu, Jun 6, 2019 at 5:17 PM Alistair Popple <alistair@popple.id.au> wrote:
>>> I have been hitting EEH address errors testing this with some network
>>> cards which map/unmap DMA addresses more frequently. For example:
>>>
>>> PHB4 PHB#5 Diag-data (Version: 1)
>>> brdgCtl:    00000002
>>> RootSts:    00060020 00402000 a0220008 00100107 00000800
>>> PhbSts:     0000001c00000000 0000001c00000000
>>> Lem:        0000000100000080 0000000000000000 0000000000000080
>>> PhbErr:     0000028000000000 0000020000000000 2148000098000240 a008400000000000
>>> RxeTceErr:  2000000000000000 2000000000000000 c000000000000000 0000000000000000
>>> PblErr:     0000000000020000 0000000000020000 0000000000000000 0000000000000000
>>> RegbErr:    0000004000000000 0000004000000000 61000c4800000000 0000000000000000
>>> PE[000] A/B: 8300b03800000000 8000000000000000
>>>
>>> Interestingly the PE[000] A/B data is the same across different cards
>>> and drivers.
>>
>> TCE page fault due to permissions so odds are the DMA address was unmapped.
>>
>> What cards did you get this with? I tried with one of the common
>> BCM5719 NICs and generated network traffic by using rsync to copy a
>> linux git tree to the system and it worked fine.
> 
> Personally I've seen it with the BCM5719 with the driver modified to set a DMA 
> mask of 48 bits instead of 64 and using scp to copy a random 1GB file to the 
> system repeatedly until it crashes. 
> 
> I have also had reports of someone hitting the same error using a Mellanox 
> CX-5 adaptor with a similar driver modification.
> 
> - Alistair


This seems to be a race (I could see broken TCEs); the following helps on my setup:


diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f920697c03ef..215351e16ae8 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -113,6 +113,7 @@ struct iommu_table {
        int it_nid;
        unsigned long it_reserved_start; /* Start of not-DMA-able (MMIO) area */
        unsigned long it_reserved_end;
+       spinlock_t lock;
 };

 #define IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry) \
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index c75ec37bf0cd..7045b6518243 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -29,6 +29,7 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
        tbl->it_size = tce_size >> 3;
        tbl->it_busno = 0;
        tbl->it_type = TCE_PCI;
+       spin_lock_init(&tbl->lock);
 }

 static __be64 *pnv_alloc_tce_level(int nid, unsigned int shift)
@@ -65,14 +66,17 @@ static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx, bool alloc)

                        if (!alloc)
                                return NULL;
-
-                       tmp2 = pnv_alloc_tce_level(tbl->it_nid,
-                                       ilog2(tbl->it_level_size) + 3);
-                       if (!tmp2)
-                               return NULL;
-
-                       tmp[n] = cpu_to_be64(__pa(tmp2) |
-                                       TCE_PCI_READ | TCE_PCI_WRITE);
+                       }
+                       spin_lock(&tbl->lock);
+                       if (tmp[n] == 0) {
+                               tmp2 = pnv_alloc_tce_level(tbl->it_nid,
+                                               ilog2(tbl->it_level_size) + 3);
+                               if (!tmp2)
+                                       return NULL;
+                               tmp[n] = cpu_to_be64(__pa(tmp2) |
+                                               TCE_PCI_READ | TCE_PCI_WRITE);
+                       }
+                       spin_unlock(&tbl->lock);
                }
                tce = be64_to_cpu(tmp[n]);




-- 
Alexey

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-06  4:11 ` [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59 Shawn Anastasio
  2019-06-06  7:17   ` Alistair Popple
@ 2019-06-12  5:05   ` Shawn Anastasio
  2019-06-12  6:16     ` Oliver O'Halloran
  2019-06-12  7:07     ` Alexey Kardashevskiy
  1 sibling, 2 replies; 21+ messages in thread
From: Shawn Anastasio @ 2019-06-12  5:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Sam Bobroff, Alistair Popple, Oliver O'Halloran, David Gibson

On 6/5/19 11:11 PM, Shawn Anastasio wrote:
> On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
>> This is an attempt to allow DMA masks between 32..59 which are not large
>> enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
>> on the max order, up to 40 is usually available.
>>
>>
>> This is based on v5.2-rc2.
>>
>> Please comment. Thanks.
> 
> I have tested this patch set with an AMD GPU that's limited to <64bit
> DMA (I believe it's 40 or 42 bit). It successfully allows the card to
> operate without falling back to 32-bit DMA mode as it does without
> the patches.
> 
> Relevant kernel log message:
> ```
> [    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
> ```
> 
> Tested-by: Shawn Anastasio <shawn@anastas.io>

After a few days of further testing, I've started to run into stability
issues with the patch applied and used with an AMD GPU. Specifically,
the system sometimes spontaneously crashes. Not just EEH errors either,
the whole system shuts down in what looks like a checkstop.

Perhaps some subtle corruption is occurring?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-12  5:05   ` Shawn Anastasio
@ 2019-06-12  6:16     ` Oliver O'Halloran
  2019-06-12 19:14       ` Shawn Anastasio
  2019-06-12  7:07     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 21+ messages in thread
From: Oliver O'Halloran @ 2019-06-12  6:16 UTC (permalink / raw)
  To: Shawn Anastasio
  Cc: Alexey Kardashevskiy, Alistair Popple, David Gibson,
	linuxppc-dev, Sam Bobroff

On Wed, Jun 12, 2019 at 3:06 PM Shawn Anastasio <shawn@anastas.io> wrote:
>
> On 6/5/19 11:11 PM, Shawn Anastasio wrote:
> > On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
> >> This is an attempt to allow DMA masks between 32..59 which are not large
> >> enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
> >> on the max order, up to 40 is usually available.
> >>
> >>
> >> This is based on v5.2-rc2.
> >>
> >> Please comment. Thanks.
> >
> > I have tested this patch set with an AMD GPU that's limited to <64bit
> > DMA (I believe it's 40 or 42 bit). It successfully allows the card to
> > operate without falling back to 32-bit DMA mode as it does without
> > the patches.
> >
> > Relevant kernel log message:
> > ```
> > [    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
> > ```
> >
> > Tested-by: Shawn Anastasio <shawn@anastas.io>
>
> After a few days of further testing, I've started to run into stability
> issues with the patch applied and used with an AMD GPU. Specifically,
> the system sometimes spontaneously crashes. Not just EEH errors either,
> the whole system shuts down in what looks like a checkstop.

Any specific workload? Checkstops are harder to debug without a system
in the failed state so we'd need to replicate that locally to get a
decent idea what's up.

> Perhaps some subtle corruption is occurring?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-12  5:05   ` Shawn Anastasio
  2019-06-12  6:16     ` Oliver O'Halloran
@ 2019-06-12  7:07     ` Alexey Kardashevskiy
  2019-06-12 19:15       ` Shawn Anastasio
  1 sibling, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-06-12  7:07 UTC (permalink / raw)
  To: Shawn Anastasio, linuxppc-dev
  Cc: Sam Bobroff, Alistair Popple, Oliver O'Halloran, David Gibson



On 12/06/2019 15:05, Shawn Anastasio wrote:
> On 6/5/19 11:11 PM, Shawn Anastasio wrote:
>> On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
>>> This is an attempt to allow DMA masks between 32..59 which are not large
>>> enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
>>> on the max order, up to 40 is usually available.
>>>
>>>
>>> This is based on v5.2-rc2.
>>>
>>> Please comment. Thanks.
>>
>> I have tested this patch set with an AMD GPU that's limited to <64bit
>> DMA (I believe it's 40 or 42 bit). It successfully allows the card to
>> operate without falling back to 32-bit DMA mode as it does without
>> the patches.
>>
>> Relevant kernel log message:
>> ```
>> [    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
>> ```
>>
>> Tested-by: Shawn Anastasio <shawn@anastas.io>
> 
> After a few days of further testing, I've started to run into stability
> issues with the patch applied and used with an AMD GPU. Specifically,
> the system sometimes spontaneously crashes. Not just EEH errors either,
> the whole system shuts down in what looks like a checkstop.
> 
> Perhaps some subtle corruption is occurring?

Have you tried this?

https://patchwork.ozlabs.org/patch/1113506/



-- 
Alexey

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-12  6:16     ` Oliver O'Halloran
@ 2019-06-12 19:14       ` Shawn Anastasio
  0 siblings, 0 replies; 21+ messages in thread
From: Shawn Anastasio @ 2019-06-12 19:14 UTC (permalink / raw)
  To: Oliver O'Halloran
  Cc: Alexey Kardashevskiy, Alistair Popple, Sam Bobroff, linuxppc-dev,
	David Gibson

On 6/12/19 1:16 AM, Oliver O'Halloran wrote:
> On Wed, Jun 12, 2019 at 3:06 PM Shawn Anastasio <shawn@anastas.io> wrote:
>>
>> On 6/5/19 11:11 PM, Shawn Anastasio wrote:
>>> On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
>>>> This is an attempt to allow DMA masks between 32..59 which are not large
>>>> enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
>>>> on the max order, up to 40 is usually available.
>>>>
>>>>
>>>> This is based on v5.2-rc2.
>>>>
>>>> Please comment. Thanks.
>>>
>>> I have tested this patch set with an AMD GPU that's limited to <64bit
>>> DMA (I believe it's 40 or 42 bit). It successfully allows the card to
>>> operate without falling back to 32-bit DMA mode as it does without
>>> the patches.
>>>
>>> Relevant kernel log message:
>>> ```
>>> [    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
>>> ```
>>>
>>> Tested-by: Shawn Anastasio <shawn@anastas.io>
>>
>> After a few days of further testing, I've started to run into stability
>> issues with the patch applied and used with an AMD GPU. Specifically,
>> the system sometimes spontaneously crashes. Not just EEH errors either,
>> the whole system shuts down in what looks like a checkstop.
> 
> Any specific workload? Checkstops are harder to debug without a system
> in the failed state so we'd need to replicate that locally to get a
> decent idea what's up.

I haven't been able to pinpoint the exact cause. The first time it
happened was after about 4 days of uptime while playing a 1080p
video in mpv. The second time was about 5 minutes after booting up
while restoring a firefox session.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-12  7:07     ` Alexey Kardashevskiy
@ 2019-06-12 19:15       ` Shawn Anastasio
  2019-06-18  4:26         ` Shawn Anastasio
  0 siblings, 1 reply; 21+ messages in thread
From: Shawn Anastasio @ 2019-06-12 19:15 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Sam Bobroff, Alistair Popple, Oliver O'Halloran, David Gibson

On 6/12/19 2:07 AM, Alexey Kardashevskiy wrote:
> 
> 
> On 12/06/2019 15:05, Shawn Anastasio wrote:
>> On 6/5/19 11:11 PM, Shawn Anastasio wrote:
>>> On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
>>>> This is an attempt to allow DMA masks between 32..59 which are not large
>>>> enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
>>>> on the max order, up to 40 is usually available.
>>>>
>>>>
>>>> This is based on v5.2-rc2.
>>>>
>>>> Please comment. Thanks.
>>>
>>> I have tested this patch set with an AMD GPU that's limited to <64bit
>>> DMA (I believe it's 40 or 42 bit). It successfully allows the card to
>>> operate without falling back to 32-bit DMA mode as it does without
>>> the patches.
>>>
>>> Relevant kernel log message:
>>> ```
>>> [    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
>>> ```
>>>
>>> Tested-by: Shawn Anastasio <shawn@anastas.io>
>>
>> After a few days of further testing, I've started to run into stability
>> issues with the patch applied and used with an AMD GPU. Specifically,
>> the system sometimes spontaneously crashes. Not just EEH errors either,
>> the whole system shuts down in what looks like a checkstop.
>>
>> Perhaps some subtle corruption is occurring?
> 
> Have you tried this?
> 
> https://patchwork.ozlabs.org/patch/1113506/

I have not. I'll give it a shot and try it out for a few days to see
if I'm able to reproduce the crashes.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-12 19:15       ` Shawn Anastasio
@ 2019-06-18  4:26         ` Shawn Anastasio
  2019-06-18  6:39           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Shawn Anastasio @ 2019-06-18  4:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Sam Bobroff, Alistair Popple, Oliver O'Halloran, David Gibson

On 6/12/19 2:15 PM, Shawn Anastasio wrote:
> On 6/12/19 2:07 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 12/06/2019 15:05, Shawn Anastasio wrote:
>>> On 6/5/19 11:11 PM, Shawn Anastasio wrote:
>>>> On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
>>>>> This is an attempt to allow DMA masks between 32..59 which are not 
>>>>> large
>>>>> enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
>>>>> on the max order, up to 40 is usually available.
>>>>>
>>>>>
>>>>> This is based on v5.2-rc2.
>>>>>
>>>>> Please comment. Thanks.
>>>>
>>>> I have tested this patch set with an AMD GPU that's limited to <64bit
>>>> DMA (I believe it's 40 or 42 bit). It successfully allows the card to
>>>> operate without falling back to 32-bit DMA mode as it does without
>>>> the patches.
>>>>
>>>> Relevant kernel log message:
>>>> ```
>>>> [    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
>>>> ```
>>>>
>>>> Tested-by: Shawn Anastasio <shawn@anastas.io>
>>>
>>> After a few days of further testing, I've started to run into stability
>>> issues with the patch applied and used with an AMD GPU. Specifically,
>>> the system sometimes spontaneously crashes. Not just EEH errors either,
>>> the whole system shuts down in what looks like a checkstop.
>>>
>>> Perhaps some subtle corruption is occurring?
>>
>> Have you tried this?
>>
>> https://patchwork.ozlabs.org/patch/1113506/
> 
> I have not. I'll give it a shot and try it out for a few days to see
> if I'm able to reproduce the crashes.

A few days later and I was able to reproduce the checkstop while
watching a video in mpv. At this point the system had ~4 day
uptime and this wasn't the first video I watched during that time.

This is with https://patchwork.ozlabs.org/patch/1113506/ applied, too.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-18  4:26         ` Shawn Anastasio
@ 2019-06-18  6:39           ` Alexey Kardashevskiy
  2019-06-18  7:00             ` Shawn Anastasio
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-06-18  6:39 UTC (permalink / raw)
  To: Shawn Anastasio, linuxppc-dev
  Cc: Sam Bobroff, Alistair Popple, Oliver O'Halloran, David Gibson



On 18/06/2019 14:26, Shawn Anastasio wrote:
> On 6/12/19 2:15 PM, Shawn Anastasio wrote:
>> On 6/12/19 2:07 AM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 12/06/2019 15:05, Shawn Anastasio wrote:
>>>> On 6/5/19 11:11 PM, Shawn Anastasio wrote:
>>>>> On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
>>>>>> This is an attempt to allow DMA masks between 32..59 which are not
>>>>>> large
>>>>>> enough to use either a PHB3 bypass mode or a sketchy bypass.
>>>>>> Depending
>>>>>> on the max order, up to 40 is usually available.
>>>>>>
>>>>>>
>>>>>> This is based on v5.2-rc2.
>>>>>>
>>>>>> Please comment. Thanks.
>>>>>
>>>>> I have tested this patch set with an AMD GPU that's limited to <64bit
>>>>> DMA (I believe it's 40 or 42 bit). It successfully allows the card to
>>>>> operate without falling back to 32-bit DMA mode as it does without
>>>>> the patches.
>>>>>
>>>>> Relevant kernel log message:
>>>>> ```
>>>>> [    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
>>>>> ```
>>>>>
>>>>> Tested-by: Shawn Anastasio <shawn@anastas.io>
>>>>
>>>> After a few days of further testing, I've started to run into stability
>>>> issues with the patch applied and used with an AMD GPU. Specifically,
>>>> the system sometimes spontaneously crashes. Not just EEH errors either,
>>>> the whole system shuts down in what looks like a checkstop.
>>>>
>>>> Perhaps some subtle corruption is occurring?
>>>
>>> Have you tried this?
>>>
>>> https://patchwork.ozlabs.org/patch/1113506/
>>
>> I have not. I'll give it a shot and try it out for a few days to see
>> if I'm able to reproduce the crashes.
> 
> A few days later and I was able to reproduce the checkstop while
> watching a video in mpv. At this point the system had ~4 day
> uptime and this wasn't the first video I watched during that time.
> 
> This is with https://patchwork.ozlabs.org/patch/1113506/ applied, too.


Any logs left? What was the reason for the checkstop and what is the
hardware? "lscpu" and "lspci -vv" for starters would help. Thanks,


-- 
Alexey

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
  2019-06-18  6:39           ` Alexey Kardashevskiy
@ 2019-06-18  7:00             ` Shawn Anastasio
  0 siblings, 0 replies; 21+ messages in thread
From: Shawn Anastasio @ 2019-06-18  7:00 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Sam Bobroff, Alistair Popple, Oliver O'Halloran, David Gibson

On 6/18/19 1:39 AM, Alexey Kardashevskiy wrote:
> 
> 
> On 18/06/2019 14:26, Shawn Anastasio wrote:
>> On 6/12/19 2:15 PM, Shawn Anastasio wrote:
>>> On 6/12/19 2:07 AM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 12/06/2019 15:05, Shawn Anastasio wrote:
>>>>> On 6/5/19 11:11 PM, Shawn Anastasio wrote:
>>>>>> On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
>>>>>>> This is an attempt to allow DMA masks between 32..59 which are not
>>>>>>> large
>>>>>>> enough to use either a PHB3 bypass mode or a sketchy bypass.
>>>>>>> Depending
>>>>>>> on the max order, up to 40 is usually available.
>>>>>>>
>>>>>>>
>>>>>>> This is based on v5.2-rc2.
>>>>>>>
>>>>>>> Please comment. Thanks.
>>>>>>
>>>>>> I have tested this patch set with an AMD GPU that's limited to <64bit
>>>>>> DMA (I believe it's 40 or 42 bit). It successfully allows the card to
>>>>>> operate without falling back to 32-bit DMA mode as it does without
>>>>>> the patches.
>>>>>>
>>>>>> Relevant kernel log message:
>>>>>> ```
>>>>>> [    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
>>>>>> ```
>>>>>>
>>>>>> Tested-by: Shawn Anastasio <shawn@anastas.io>
>>>>>
>>>>> After a few days of further testing, I've started to run into stability
>>>>> issues with the patch applied and used with an AMD GPU. Specifically,
>>>>> the system sometimes spontaneously crashes. Not just EEH errors either,
>>>>> the whole system shuts down in what looks like a checkstop.
>>>>>
>>>>> Perhaps some subtle corruption is occurring?
>>>>
>>>> Have you tried this?
>>>>
>>>> https://patchwork.ozlabs.org/patch/1113506/
>>>
>>> I have not. I'll give it a shot and try it out for a few days to see
>>> if I'm able to reproduce the crashes.
>>
>> A few days later and I was able to reproduce the checkstop while
>> watching a video in mpv. At this point the system had ~4 day
>> uptime and this wasn't the first video I watched during that time.
>>
>> This is with https://patchwork.ozlabs.org/patch/1113506/ applied, too.
> 
> 
> Any logs left? What was the reason for the checkstop and what is the
> hardware? "lscpu" and "lspci -vv" for starters would help. Thanks,

The machine is a Talos II with 2x 8 core DD2.2 Sforza modules.
I've added the output of lscpu and lspci below. As for logs,
it doesn't seem there are any kernel logs of the event.
The opal-gard utility shows some error records which I have
also included below.

opal-gard:
```
$ sudo ./opal-gard show 1
Record ID:    0x00000001
========================
Error ID:     0x9000000b
Error Type:   Fatal (0xe3)
Path Type: physical
 >Sys, Instance #0
  >Node, Instance #0
   >Proc, Instance #1
    >EQ, Instance #0
     >EX, Instance #0

$ sudo ./opal-gard show 2
Record ID:    0x00000002
========================
Error ID:     0x90000021
Error Type:   Fatal (0xe3)
Path Type: physical
 >Sys, Instance #0
  >Node, Instance #0
   >Proc, Instance #1
    >EQ, Instance #2
     >EX, Instance #1

```

lscpu:
```
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              52
On-line CPU(s) list: 0-3,8-31,36-47,52-63
Thread(s) per core:  4
Core(s) per socket:  6
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2154.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-3,8-31
NUMA node8 CPU(s):   36-47,52-63

```

lspci -vv:
Output at: https://upaste.anastas.io/IwVQzt

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 2/3] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window
  2019-05-30  7:03 ` [PATCH kernel v3 2/3] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window Alexey Kardashevskiy
@ 2019-07-08  7:01   ` alistair
  0 siblings, 0 replies; 21+ messages in thread
From: alistair @ 2019-07-08  7:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Sam Bobroff, Oliver O'Halloran, linuxppc-dev, David Gibson

It seems this is mostly just enabling, for baremetal as well, the already
existing code used by KVM for on-demand TCE level allocation. Given that,
I suppose the implementation of the on-demand allocation itself is already
used and therefore somewhat tested by KVM. I took a look at pnv_tce()
which does the on-demand allocation, and the changes here seem like they
should work with that, so:

Reviewed-by: Alistair Popple <alistair@popple.id.au>

On Thursday, 30 May 2019 5:03:54 PM AEST Alexey Kardashevskiy wrote:
> We allocate only the first level of multilevel TCE tables for KVM
> already (alloc_userspace_copy==true), and the rest is allocated on 
> demand.
> This is not enabled though for baremetal.
> 
> This removes the KVM limitation (implicit, via the alloc_userspace_copy
> parameter) and always allocates just the first level. The on-demand
> allocation of missing levels is already implemented.
> 
> As from now on DMA map might happen with disabled interrupts, this
> allocates TCEs with GFP_ATOMIC.
> 
> To save time when creating a new clean table, this skips non-allocated
> indirect TCE entries in pnv_tce_free just like we already do in
> the VFIO IOMMU TCE driver.
> 
> This changes the default level number from 1 to 2 to reduce the amount
> of memory required for the default 32bit DMA window at the boot time.
> The default window size is up to 2GB which requires 4MB of TCEs which 
> is
> unlikely to be used entirely or at all as most devices these days are
> 64bit capable so by switching to 2 levels by default we save 4032KB of
> RAM per a device.
> 
> While at this, add __GFP_NOWARN to alloc_pages_node() as the userspace
> can trigger this path via VFIO, see the failure and try creating a 
> table
> again with different parameters which might succeed.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v2:
> * added __GFP_NOWARN to alloc_pages_node
> ---
>  arch/powerpc/platforms/powernv/pci.h          |  2 +-
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 20 +++++++++----------
>  2 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci.h
> b/arch/powerpc/platforms/powernv/pci.h index 1a51e7bfc541..a55dabc8d057
> 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -225,7 +225,7 @@ extern struct iommu_table_group
> *pnv_npu_compound_attach( struct pnv_ioda_pe *pe);
> 
>  /* pci-ioda-tce.c */
> -#define POWERNV_IOMMU_DEFAULT_LEVELS	1
> +#define POWERNV_IOMMU_DEFAULT_LEVELS	2
>  #define POWERNV_IOMMU_MAX_LEVELS	5
> 
>  extern int pnv_tce_build(struct iommu_table *tbl, long index, long 
> npages,
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> b/arch/powerpc/platforms/powernv/pci-ioda-tce.c index
> e28f03e1eb5e..c75ec37bf0cd 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> @@ -36,7 +36,8 @@ static __be64 *pnv_alloc_tce_level(int nid, unsigned 
> int
> shift) struct page *tce_mem = NULL;
>  	__be64 *addr;
> 
> -	tce_mem = alloc_pages_node(nid, GFP_KERNEL, shift - PAGE_SHIFT);
> +	tce_mem = alloc_pages_node(nid, GFP_ATOMIC | __GFP_NOWARN,
> +			shift - PAGE_SHIFT);
>  	if (!tce_mem) {
>  		pr_err("Failed to allocate a TCE memory, level shift=%d\n",
>  				shift);
> @@ -161,6 +162,9 @@ void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> 
>  		if (ptce)
>  			*ptce = cpu_to_be64(0);
> +		else
> +			/* Skip the rest of the level */
> +			i |= tbl->it_level_size - 1;
>  	}
>  }
> 
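(The skip works because it_level_size is a power of two: a NULL ptce
means the whole leaf-level block covering index i is not allocated, and
"i |= tbl->it_level_size - 1" advances i to the last index of that
block, so the loop's increment lands on the first index of the next
block. For example, with it_level_size = 512 and i = 1000, i becomes
1023 and the next iteration starts at 1024.)
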
> @@ -260,7 +264,6 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
>  	unsigned int table_shift = max_t(unsigned int, entries_shift + 3,
>  			PAGE_SHIFT);
>  	const unsigned long tce_table_size = 1UL << table_shift;
> -	unsigned int tmplevels = levels;
> 
>  	if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
>  		return -EINVAL;
> @@ -268,9 +271,6 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
>  	if (!is_power_of_2(window_size))
>  		return -EINVAL;
> 
> -	if (alloc_userspace_copy && (window_size > (1ULL << 32)))
> -		tmplevels = 1;
> -
>  	/* Adjust direct table size from window_size and levels */
>  	entries_shift = (entries_shift + levels - 1) / levels;
>  	level_shift = entries_shift + 3;
> @@ -281,7 +281,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
> 
>  	/* Allocate TCE table */
>  	addr = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
> -			tmplevels, tce_table_size, &offset, &total_allocated);
> +			1, tce_table_size, &offset, &total_allocated);
> 
>  	/* addr==NULL means that the first level allocation failed */
>  	if (!addr)
> @@ -292,18 +292,18 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
>  	 * we did not allocate as much as we wanted,
>  	 * release partially allocated table.
>  	 */
> -	if (tmplevels == levels && offset < tce_table_size)
> +	if (levels == 1 && offset < tce_table_size)
>  		goto free_tces_exit;
> 
>  	/* Allocate userspace view of the TCE table */
>  	if (alloc_userspace_copy) {
>  		offset = 0;
>  		uas = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
> -				tmplevels, tce_table_size, &offset,
> +				1, tce_table_size, &offset,
>  				&total_allocated_uas);
>  		if (!uas)
>  			goto free_tces_exit;
> -		if (tmplevels == levels && (offset < tce_table_size ||
> +		if (levels == 1 && (offset < tce_table_size ||
>  				total_allocated_uas != total_allocated))
>  			goto free_uas_exit;
>  	}
> @@ -318,7 +318,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
> 
>  	pr_debug("Created TCE table: ws=%08llx ts=%lx @%08llx base=%lx uas=%p levels=%d/%d\n",
>  			window_size, tce_table_size, bus_offset, tbl->it_base,
> -			tbl->it_userspace, tmplevels, levels);
> +			tbl->it_userspace, 1, levels);
> 
>  	return 0;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 3/3] powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages
  2019-05-30  7:03 ` [PATCH kernel v3 3/3] powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages Alexey Kardashevskiy
@ 2019-07-08  7:01   ` alistair
  2019-07-09  0:57     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 21+ messages in thread
From: alistair @ 2019-07-08  7:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Sam Bobroff, Oliver O'Halloran, linuxppc-dev, David Gibson

Hi Alexey,

A couple of comments/questions below.

> -	/*
> -	 * Reserve page 0 so it will not be used for any mappings.
> -	 * This avoids buggy drivers that consider page 0 to be invalid
> -	 * to crash the machine or even lose data.
> -	 */
> -	if (tbl->it_offset == 0)
> -		set_bit(0, tbl->it_map);
> +	tbl->it_reserved_start = res_start;
> +	tbl->it_reserved_end = res_end;
> +	iommu_table_reserve_pages(tbl);

Personally I think it would be cleaner to set tbl->it_reserved_start/end
in iommu_table_reserve_pages() rather than separating the setup like
this; see the sketch below.
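
Something like this, perhaps (just a sketch of the interface I mean,
written against this patch and not tested):

	static void iommu_table_reserve_pages(struct iommu_table *tbl,
			unsigned long res_start, unsigned long res_end)
	{
		unsigned long i;

		tbl->it_reserved_start = res_start;
		tbl->it_reserved_end = res_end;

		/* Page 0 stays reserved to catch buggy drivers */
		if (tbl->it_offset == 0)
			set_bit(0, tbl->it_map);

		for (i = res_start; i < res_end; ++i)
			set_bit(i - tbl->it_offset, tbl->it_map);
	}

That way the caller passes res_start/res_end in, and the helper owns
both the field assignment and the bitmap update.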

> 
>  	/* We only split the IOMMU table if we have 1GB or more of space */
>  	if ((tbl->it_size << tbl->it_page_shift) >= (1UL * 1024 * 1024 * 1024))
> @@ -727,12 +755,7 @@ static void iommu_table_free(struct kref *kref)
>  		return;
>  	}
> 
> -	/*
> -	 * In case we have reserved the first bit, we should not emit
> -	 * the warning below.
> -	 */
> -	if (tbl->it_offset == 0)
> -		clear_bit(0, tbl->it_map);
> +	iommu_table_release_pages(tbl);
> 
>  	/* verify that table contains no entries */
>  	if (!bitmap_empty(tbl->it_map, tbl->it_size))
> @@ -1037,8 +1060,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
>  	for (i = 0; i < tbl->nr_pools; i++)
>  		spin_lock(&tbl->pools[i].lock);
> 
> -	if (tbl->it_offset == 0)
> -		clear_bit(0, tbl->it_map);
> +	iommu_table_reserve_pages(tbl);
> 
>  	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
>  		pr_err("iommu_tce: it_map is not empty");
> @@ -1068,9 +1090,7 @@ void iommu_release_ownership(struct iommu_table *tbl)
> 
>  	memset(tbl->it_map, 0, sz);
> 
> -	/* Restore bit#0 set by iommu_init_table() */
> -	if (tbl->it_offset == 0)
> -		set_bit(0, tbl->it_map);
> +	iommu_table_release_pages(tbl);

Are the above two changes correct? From the perspective of code
behaviour this seems to swap the set/clear_bit() calls. For example,
above you are replacing set_bit(0, tbl->it_map) with a call to
iommu_table_release_pages(), which does clear_bit(0, tbl->it_map)
instead.

- Alistair

>  	for (i = 0; i < tbl->nr_pools; i++)
>  		spin_unlock(&tbl->pools[i].lock);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 126602b4e399..ce2efdb3900d 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2422,6 +2422,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>  {
>  	struct iommu_table *tbl = NULL;
>  	long rc;
> +	unsigned long res_start, res_end;
> 
>  	/*
>  	 * crashkernel= specifies the kdump kernel's maximum memory at
> @@ -2435,19 +2436,46 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>  	 * DMA window can be larger than available memory, which will
>  	 * cause errors later.
>  	 */
> -	const u64 window_size = min((u64)pe->table_group.tce32_size, max_memory);
> +	const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER - 1);
> 
> -	rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
> -			IOMMU_PAGE_SHIFT_4K,
> -			window_size,
> -			POWERNV_IOMMU_DEFAULT_LEVELS, false, &tbl);
> +	/*
> +	 * We create the default window as big as we can. The constraint is
> +	 * the max order of allocation possible. The TCE table is likely to
> +	 * end up being multilevel and with on-demand allocation in place,
> +	 * the initial use is not going to be huge as the default window aims
> +	 * to support crippled devices (i.e. not fully 64bit DMAble) only.
> +	 */
> +	/* iommu_table::it_map uses 1 bit per IOMMU page, hence 8 */
> +	const u64 window_size = min((maxblock * 8) << PAGE_SHIFT, max_memory);
> +	/* Each TCE level cannot exceed maxblock so go multilevel if needed */
> +	unsigned long tces_order = ilog2(window_size >> PAGE_SHIFT);
> +	unsigned long tcelevel_order = ilog2(maxblock >> 3);
> +	unsigned int levels = tces_order / tcelevel_order;
> +
> +	if (tces_order % tcelevel_order)
> +		levels += 1;
> +	/*
> +	 * We try to stick to default levels (which is >1 at the moment) in
> +	 * order to save memory by relying on on-demand TCE level allocation.
> +	 */
> +	levels = max_t(unsigned int, levels, POWERNV_IOMMU_DEFAULT_LEVELS);
> +
> +	rc = pnv_pci_ioda2_create_table(&pe->table_group, 0, PAGE_SHIFT,
> +			window_size, levels, false, &tbl);
>  	if (rc) {
>  		pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
>  				rc);
>  		return rc;
>  	}
> 
> -	iommu_init_table(tbl, pe->phb->hose->node);
> +	/* We use top part of 32bit space for MMIO so exclude it from DMA */
> +	res_start = 0;
> +	res_end = 0;
> +	if (window_size > pe->phb->ioda.m32_pci_base) {
> +		res_start = pe->phb->ioda.m32_pci_base >> tbl->it_page_shift;
> +		res_end = min(window_size, SZ_4G) >> tbl->it_page_shift;
> +	}
> +	iommu_init_table_res(tbl, pe->phb->hose->node, res_start, res_end);
> 
>  	rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
>  	if (rc) {
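
For the record, working through the sizing logic above with a common
ppc64 configuration (64K PAGE_SIZE and MAX_ORDER = 11 -- both are
assumptions, your kernel config may differ): maxblock = 1 << (16 + 11 -
1) = 64MB, so an it_map bitmap of at most 64MB covers 64M * 8 IOMMU
pages, giving window_size = min(64MB * 8 * 64KB, max_memory) =
min(32TB, max_memory). Then tces_order = ilog2(window_size >> 16) (29
for the full 32TB), tcelevel_order = ilog2(64MB >> 3) = 23, so even the
largest window needs only ceil(29 / 23) = 2 levels, i.e. exactly
POWERNV_IOMMU_DEFAULT_LEVELS, with everything past the first level
allocated on demand. The res_start/res_end exclusion then carves the
MMIO range [m32_pci_base, 4GB) out of the low 32 bits of that window.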


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v3 3/3] powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages
  2019-07-08  7:01   ` alistair
@ 2019-07-09  0:57     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-07-09  0:57 UTC (permalink / raw)
  To: alistair; +Cc: Sam Bobroff, Oliver O'Halloran, linuxppc-dev, David Gibson



On 08/07/2019 17:01, alistair@popple.id.au wrote:
> Hi Alexey,
> 
> Couple of comment/questions below.
> 
>> -    /*
>> -     * Reserve page 0 so it will not be used for any mappings.
>> -     * This avoids buggy drivers that consider page 0 to be invalid
>> -     * to crash the machine or even lose data.
>> -     */
>> -    if (tbl->it_offset == 0)
>> -        set_bit(0, tbl->it_map);
>> +    tbl->it_reserved_start = res_start;
>> +    tbl->it_reserved_end = res_end;
>> +    iommu_table_reserve_pages(tbl);
> 
> Personally I think it would be cleaner to set tbl->it_reserved_start/end in
> iommu_table_reserve_pages() rather than separating the setup like this.


Yeah I suppose...


> 
>>
>>      /* We only split the IOMMU table if we have 1GB or more of space */
>>      if ((tbl->it_size << tbl->it_page_shift) >= (1UL * 1024 * 1024 * 1024))
>> @@ -727,12 +755,7 @@ static void iommu_table_free(struct kref *kref)
>>          return;
>>      }
>>
>> -    /*
>> -     * In case we have reserved the first bit, we should not emit
>> -     * the warning below.
>> -     */
>> -    if (tbl->it_offset == 0)
>> -        clear_bit(0, tbl->it_map);
>> +    iommu_table_release_pages(tbl);
>>
>>      /* verify that table contains no entries */
>>      if (!bitmap_empty(tbl->it_map, tbl->it_size))
>> @@ -1037,8 +1060,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
>>      for (i = 0; i < tbl->nr_pools; i++)
>>          spin_lock(&tbl->pools[i].lock);
>>
>> -    if (tbl->it_offset == 0)
>> -        clear_bit(0, tbl->it_map);
>> +    iommu_table_reserve_pages(tbl);
>>
>>      if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
>>          pr_err("iommu_tce: it_map is not empty");
>> @@ -1068,9 +1090,7 @@ void iommu_release_ownership(struct iommu_table *tbl)
>>
>>      memset(tbl->it_map, 0, sz);
>>
>> -    /* Restore bit#0 set by iommu_init_table() */
>> -    if (tbl->it_offset == 0)
>> -        set_bit(0, tbl->it_map);
>> +    iommu_table_release_pages(tbl);
> 
> Are the above two changes correct? 


No they are not, this is a bug. Thanks for spotting, repost is coming.
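(Presumably the two helper calls just need swapping: then
iommu_take_ownership() would call iommu_table_release_pages() so the
bitmap-empty check can pass, and iommu_release_ownership() would
re-reserve the pages via iommu_table_reserve_pages().)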



> From the perspective of code behaviour this seems to swap the
> set/clear_bit() calls. For example, above you are replacing
> set_bit(0, tbl->it_map) with a call to iommu_table_release_pages(),
> which does clear_bit(0, tbl->it_map) instead.
> 
> - Alistair
> 
>>      for (i = 0; i < tbl->nr_pools; i++)
>>          spin_unlock(&tbl->pools[i].lock);
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 126602b4e399..ce2efdb3900d 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -2422,6 +2422,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>>  {
>>      struct iommu_table *tbl = NULL;
>>      long rc;
>> +    unsigned long res_start, res_end;
>>
>>      /*
>>       * crashkernel= specifies the kdump kernel's maximum memory at
>> @@ -2435,19 +2436,46 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>>       * DMA window can be larger than available memory, which will
>>       * cause errors later.
>>       */
>> -    const u64 window_size = min((u64)pe->table_group.tce32_size, max_memory);
>> +    const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER - 1);
>>
>> -    rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
>> -            IOMMU_PAGE_SHIFT_4K,
>> -            window_size,
>> -            POWERNV_IOMMU_DEFAULT_LEVELS, false, &tbl);
>> +    /*
>> +     * We create the default window as big as we can. The constraint is
>> +     * the max order of allocation possible. The TCE table is likely to
>> +     * end up being multilevel and with on-demand allocation in place,
>> +     * the initial use is not going to be huge as the default window aims
>> +     * to support crippled devices (i.e. not fully 64bit DMAble) only.
>> +     */
>> +    /* iommu_table::it_map uses 1 bit per IOMMU page, hence 8 */
>> +    const u64 window_size = min((maxblock * 8) << PAGE_SHIFT, max_memory);
>> +    /* Each TCE level cannot exceed maxblock so go multilevel if needed */
>> +    unsigned long tces_order = ilog2(window_size >> PAGE_SHIFT);
>> +    unsigned long tcelevel_order = ilog2(maxblock >> 3);
>> +    unsigned int levels = tces_order / tcelevel_order;
>> +
>> +    if (tces_order % tcelevel_order)
>> +        levels += 1;
>> +    /*
>> +     * We try to stick to default levels (which is >1 at the moment) in
>> +     * order to save memory by relying on on-demand TCE level allocation.
>> +     */
>> +    levels = max_t(unsigned int, levels, POWERNV_IOMMU_DEFAULT_LEVELS);
>> +
>> +    rc = pnv_pci_ioda2_create_table(&pe->table_group, 0, PAGE_SHIFT,
>> +            window_size, levels, false, &tbl);
>>      if (rc) {
>>          pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
>>                  rc);
>>          return rc;
>>      }
>>
>> -    iommu_init_table(tbl, pe->phb->hose->node);
>> +    /* We use top part of 32bit space for MMIO so exclude it from DMA */
>> +    res_start = 0;
>> +    res_end = 0;
>> +    if (window_size > pe->phb->ioda.m32_pci_base) {
>> +        res_start = pe->phb->ioda.m32_pci_base >> tbl->it_page_shift;
>> +        res_end = min(window_size, SZ_4G) >> tbl->it_page_shift;
>> +    }
>> +    iommu_init_table_res(tbl, pe->phb->hose->node, res_start, res_end);
>>
>>      rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
>>      if (rc) {
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-07-09  0:59 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-30  7:03 [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59 Alexey Kardashevskiy
2019-05-30  7:03 ` [PATCH kernel v3 1/3] powerpc/iommu: Allow bypass-only for DMA Alexey Kardashevskiy
2019-06-03  2:03   ` David Gibson
2019-05-30  7:03 ` [PATCH kernel v3 2/3] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window Alexey Kardashevskiy
2019-07-08  7:01   ` alistair
2019-05-30  7:03 ` [PATCH kernel v3 3/3] powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages Alexey Kardashevskiy
2019-07-08  7:01   ` alistair
2019-07-09  0:57     ` Alexey Kardashevskiy
2019-06-06  4:11 ` [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59 Shawn Anastasio
2019-06-06  7:17   ` Alistair Popple
2019-06-06 12:07     ` Oliver
2019-06-07  1:41       ` Alistair Popple
2019-06-10  5:19         ` Alexey Kardashevskiy
2019-06-12  5:05   ` Shawn Anastasio
2019-06-12  6:16     ` Oliver O'Halloran
2019-06-12 19:14       ` Shawn Anastasio
2019-06-12  7:07     ` Alexey Kardashevskiy
2019-06-12 19:15       ` Shawn Anastasio
2019-06-18  4:26         ` Shawn Anastasio
2019-06-18  6:39           ` Alexey Kardashevskiy
2019-06-18  7:00             ` Shawn Anastasio
