linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
@ 2015-10-23 20:54 Nishanth Aravamudan
  2015-10-23 20:56 ` [PATCH 1/7 v3] dma-mapping: add generic dma_get_page_shift API Nishanth Aravamudan
                   ` (8 more replies)
  0 siblings, 9 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-23 20:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Alexey Kardashevskiy, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.

This series consists of five patches:

1) add a generic dma_get_page_shift implementation that just returns
PAGE_SHIFT
2) override the generic implementation on Power to use the IOMMU table's
page shift if available
3) allow further specific overriding on power with machdep platform
overrides
4) use the machdep override on pseries, as the DDW code puts the TCE
shift in a special property and there is no IOMMU table available
5) move some sparc code around to make IOMMU_PAGE_SHIFT available in
include/asm
6) override the generic implementation on sparce to use IOMMU_PAGE_SHIFT
7) leverage the new API in the NVMe driver

With these patches, a NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

 arch/powerpc/include/asm/dma-mapping.h |  3 +++
 arch/powerpc/include/asm/machdep.h     |  3 ++-
 arch/powerpc/kernel/dma.c              | 11 +++++++++++
 arch/powerpc/platforms/pseries/iommu.c | 36 ++++++++++++++++++++++++++++++++++++
 arch/sparc/include/asm/dma-mapping.h   |  8 ++++++++
 arch/sparc/include/asm/iommu_common.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/sparc/kernel/iommu.c              |  2 +-
 arch/sparc/kernel/iommu_common.h       | 51 ---------------------------------------------------
 arch/sparc/kernel/pci_psycho.c         |  2 +-
 arch/sparc/kernel/pci_sabre.c          |  2 +-
 arch/sparc/kernel/pci_schizo.c         |  2 +-
 arch/sparc/kernel/pci_sun4v.c          |  2 +-
 arch/sparc/kernel/psycho_common.c      |  2 +-
 arch/sparc/kernel/sbus.c               |  3 +--
 drivers/block/nvme-core.c              |  3 ++-
 include/linux/dma-mapping.h            |  7 +++++++
 16 files changed, 127 insertions(+), 61 deletions(-)

v1 -> v2:
  Based upon feedback from Christoph Hellwig, rather than using an
  arch-specific hack, expose the DMA page shift via a generic DMA API and
  override it on Power as needed.
v2 -> v3:
  Based upon feedback from Christoph Hellwig, put the generic
  implementation in include/linux/dma-mapping.h, since not all archs use
  include/asm-generic/dma-mapping-common.h.
  Add sparc implementation, as that arch seems to have a different IOMMU
  page size.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/7 v3] dma-mapping: add generic dma_get_page_shift API
  2015-10-23 20:54 [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
@ 2015-10-23 20:56 ` Nishanth Aravamudan
  2015-10-23 20:57 ` [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift Nishanth Aravamudan
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-23 20:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Alexey Kardashevskiy, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

Drivers like NVMe need to be able to determine the page size used for
DMA transfers. Add a new API that defaults to return PAGE_SHIFT on all
architectures.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.
v2 -> v3:
  Based upon feedback from Christoph Hellwig that not all architectures
  have moved to dma-mapping-common.h, so move the #ifdef and default to
  linux/dma-mapping.h.
---
 include/linux/dma-mapping.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index ac07ff0..7eaba8d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -95,6 +95,13 @@ static inline u64 dma_get_mask(struct device *dev)
 	return DMA_BIT_MASK(32);
 }
 
+#ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+	return PAGE_SHIFT;
+}
+#endif
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 #else
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
  2015-10-23 20:54 [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
  2015-10-23 20:56 ` [PATCH 1/7 v3] dma-mapping: add generic dma_get_page_shift API Nishanth Aravamudan
@ 2015-10-23 20:57 ` Nishanth Aravamudan
  2015-10-27  6:02   ` Alexey Kardashevskiy
  2015-10-23 20:57 ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-23 20:57 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Matthew Wilcox, Keith Busch, Benjamin Herrenschmidt,
	Paul Mackerras, Alexey Kardashevskiy, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

On Power, the kernel's page size can differ from the IOMMU's page size,
so we need to override the generic implementation, which always returns
the kernel's page size. Lookup the IOMMU's page size from struct
iommu_table, if available. Fallback to the kernel's page size,
otherwise.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/dma-mapping.h | 3 +++
 arch/powerpc/kernel/dma.c              | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index 7f522c0..c5638f4 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
 #define HAVE_ARCH_DMA_SET_MASK 1
 extern int dma_set_mask(struct device *dev, u64 dma_mask);
 
+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+extern unsigned long dma_get_page_shift(struct device *dev);
+
 #include <asm-generic/dma-mapping-common.h>
 
 extern int __dma_set_mask(struct device *dev, u64 dma_mask);
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 59503ed..e805af2 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 }
 EXPORT_SYMBOL(dma_set_mask);
 
+unsigned long dma_get_page_shift(struct device *dev)
+{
+	struct iommu_table *tbl = get_iommu_table_base(dev);
+	if (tbl)
+		return tbl->it_page_shift;
+	return PAGE_SHIFT;
+}
+EXPORT_SYMBOL(dma_get_page_shift);
+
 u64 __dma_get_required_mask(struct device *dev)
 {
 	struct dma_map_ops *dma_ops = get_dma_ops(dev);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-23 20:54 [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
  2015-10-23 20:56 ` [PATCH 1/7 v3] dma-mapping: add generic dma_get_page_shift API Nishanth Aravamudan
  2015-10-23 20:57 ` [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift Nishanth Aravamudan
@ 2015-10-23 20:57 ` Nishanth Aravamudan
  2015-10-23 20:58 ` [PATCH 3/7 v2] powerpc/dma: implement per-platform dma_get_page_shift Nishanth Aravamudan
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-23 20:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Alexey Kardashevskiy, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

[Sorry, subject should have been 0/7!]

On 23.10.2015 [13:54:20 -0700], Nishanth Aravamudan wrote:
> We received a bug report recently when DDW (64-bit direct DMA on Power)
> is not enabled for NVMe devices. In that case, we fall back to 32-bit
> DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
> Entries).
> 
> The NVMe device driver, though, assumes that the DMA alignment for the
> PRP entries will match the device's page size, and that the DMA aligment
> matches the kernel's page aligment. On Power, the the IOMMU page size,
> as mentioned above, can be 4K, while the device can have a page size of
> 8K, while the kernel has a page size of 64K. This eventually trips the
> BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
> of 4K but not 8K (e.g., 0xF000).
> 
> In this particular case, and generally, we want to use the IOMMU's page
> size for the default device page size, rather than the kernel's page
> size.
> 
> This series consists of five patches:
> 
> 1) add a generic dma_get_page_shift implementation that just returns
> PAGE_SHIFT
> 2) override the generic implementation on Power to use the IOMMU table's
> page shift if available
> 3) allow further specific overriding on power with machdep platform
> overrides
> 4) use the machdep override on pseries, as the DDW code puts the TCE
> shift in a special property and there is no IOMMU table available
> 5) move some sparc code around to make IOMMU_PAGE_SHIFT available in
> include/asm
> 6) override the generic implementation on sparce to use IOMMU_PAGE_SHIFT
> 7) leverage the new API in the NVMe driver
> 
> With these patches, a NVMe device survives our internal hardware
> exerciser; the kernel BUGs within a few seconds without the patch.
> 
>  arch/powerpc/include/asm/dma-mapping.h |  3 +++
>  arch/powerpc/include/asm/machdep.h     |  3 ++-
>  arch/powerpc/kernel/dma.c              | 11 +++++++++++
>  arch/powerpc/platforms/pseries/iommu.c | 36 ++++++++++++++++++++++++++++++++++++
>  arch/sparc/include/asm/dma-mapping.h   |  8 ++++++++
>  arch/sparc/include/asm/iommu_common.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  arch/sparc/kernel/iommu.c              |  2 +-
>  arch/sparc/kernel/iommu_common.h       | 51 ---------------------------------------------------
>  arch/sparc/kernel/pci_psycho.c         |  2 +-
>  arch/sparc/kernel/pci_sabre.c          |  2 +-
>  arch/sparc/kernel/pci_schizo.c         |  2 +-
>  arch/sparc/kernel/pci_sun4v.c          |  2 +-
>  arch/sparc/kernel/psycho_common.c      |  2 +-
>  arch/sparc/kernel/sbus.c               |  3 +--
>  drivers/block/nvme-core.c              |  3 ++-
>  include/linux/dma-mapping.h            |  7 +++++++
>  16 files changed, 127 insertions(+), 61 deletions(-)
> 
> v1 -> v2:
>   Based upon feedback from Christoph Hellwig, rather than using an
>   arch-specific hack, expose the DMA page shift via a generic DMA API and
>   override it on Power as needed.
> v2 -> v3:
>   Based upon feedback from Christoph Hellwig, put the generic
>   implementation in include/linux/dma-mapping.h, since not all archs use
>   include/asm-generic/dma-mapping-common.h.
>   Add sparc implementation, as that arch seems to have a different IOMMU
>   page size.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 3/7 v2] powerpc/dma: implement per-platform dma_get_page_shift
  2015-10-23 20:54 [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
                   ` (2 preceding siblings ...)
  2015-10-23 20:57 ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
@ 2015-10-23 20:58 ` Nishanth Aravamudan
  2015-10-23 20:59 ` [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift Nishanth Aravamudan
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-23 20:58 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Matthew Wilcox, Keith Busch, Benjamin Herrenschmidt,
	Paul Mackerras, Alexey Kardashevskiy, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

The IOMMU page size is not always stored in struct iommu on Power.
Specifically if a device is configured for DDW (Dynamic DMA Windows aka.
64-bit direct DMA), the used TCE (Translation Control Entry) size is
stored in a special device property created at run-time by the DDW
configuration code. DDW is a pseries-specific feature, so allow
platforms to override the implementation of dma_get_page_shift if
desired.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h | 3 ++-
 arch/powerpc/kernel/dma.c          | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 3f191f5..9dd03cf 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -83,9 +83,10 @@ struct machdep_calls {
 #endif
 #endif /* CONFIG_PPC64 */
 
-	/* Platform set_dma_mask and dma_get_required_mask overrides */
+	/* Platform overrides */
 	int		(*dma_set_mask)(struct device *dev, u64 dma_mask);
 	u64		(*dma_get_required_mask)(struct device *dev);
+	unsigned long	(*dma_get_page_shift)(struct device *dev);
 
 	int		(*probe)(void);
 	void		(*setup_arch)(void); /* Optional, may be NULL */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e805af2..c363896 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -338,6 +338,8 @@ EXPORT_SYMBOL(dma_set_mask);
 unsigned long dma_get_page_shift(struct device *dev)
 {
 	struct iommu_table *tbl = get_iommu_table_base(dev);
+	if (ppc_md.dma_get_page_shift)
+		return ppc_md.dma_get_page_shift(dev);
 	if (tbl)
 		return tbl->it_page_shift;
 	return PAGE_SHIFT;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift
  2015-10-23 20:54 [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
                   ` (3 preceding siblings ...)
  2015-10-23 20:58 ` [PATCH 3/7 v2] powerpc/dma: implement per-platform dma_get_page_shift Nishanth Aravamudan
@ 2015-10-23 20:59 ` Nishanth Aravamudan
  2015-10-27  5:56   ` Alexey Kardashevskiy
  2015-10-23 21:00 ` [PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h Nishanth Aravamudan
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-23 20:59 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Matthew Wilcox, Keith Busch, Benjamin Herrenschmidt,
	Paul Mackerras, Alexey Kardashevskiy, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

When DDW (Dynamic DMA Windows) are present for a device, we have stored
the TCE (Translation Control Entry) size in a special device tree
property. Check if we have enabled DDW for the device and return the TCE
size from that property if present. If the property isn't present,
fallback to looking the value up in struct iommu_table. If we don't find
a iommu_table, fallback to the kernel's page size.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 36 ++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0946b98..1bf6471 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
 	return dma_iommu_ops.get_required_mask(dev);
 }
 
+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
+{
+	struct iommu_table *tbl;
+
+	if (!disable_ddw && dev_is_pci(dev)) {
+		struct pci_dev *pdev = to_pci_dev(dev);
+		struct device_node *dn;
+
+		dn = pci_device_to_OF_node(pdev);
+
+		/* search upwards for ibm,dma-window */
+		for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
+				dn = dn->parent)
+			if (of_get_property(dn, "ibm,dma-window", NULL))
+				break;
+		/*
+		 * if there is a DDW configuration, the TCE shift is stored in
+		 * the property
+		 */
+		if (dn && PCI_DN(dn)) {
+			const struct dynamic_dma_window_prop *direct64 =
+				of_get_property(dn, DIRECT64_PROPNAME, NULL);
+			if (direct64)
+				return be32_to_cpu(direct64->tce_shift);
+		}
+	}
+
+	tbl = get_iommu_table_base(dev);
+	if (tbl)
+		return tbl->it_page_shift;
+
+	return PAGE_SHIFT;
+}
+
 #else  /* CONFIG_PCI */
 #define pci_dma_bus_setup_pSeries	NULL
 #define pci_dma_dev_setup_pSeries	NULL
@@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
 #define pci_dma_dev_setup_pSeriesLP	NULL
 #define dma_set_mask_pSeriesLP		NULL
 #define dma_get_required_mask_pSeriesLP	NULL
+#define dma_get_page_shift_pSeriesLP	NULL
 #endif /* !CONFIG_PCI */
 
 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
@@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void)
 		pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
 		ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
 		ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
+		ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP;
 	} else {
 		pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries;
 		pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h
  2015-10-23 20:54 [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
                   ` (4 preceding siblings ...)
  2015-10-23 20:59 ` [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift Nishanth Aravamudan
@ 2015-10-23 21:00 ` Nishanth Aravamudan
  2015-10-23 21:02   ` Nishanth Aravamudan
  2015-10-23 21:01 ` [RFC PATCH 6/7] sparc/dma-mapping: override dma_get_page_shift Nishanth Aravamudan
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-23 21:00 UTC (permalink / raw)
  To: David S. Miller
  Cc: Matthew Wilcox, Keith Busch, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Alexey Kardashevskiy,
	David Gibson, Christoph Hellwig, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

In order to cleanly expose the desired IOMMU page shift via the new
dma_get_page_shift API, we need to have the sparc constants available in
a more typical location. There should be no functional impact to this
move, but it is untested.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
---
 arch/sparc/include/asm/iommu_common.h | 51 +++++++++++++++++++++++++++++++++++
 arch/sparc/kernel/iommu.c             |  2 +-
 arch/sparc/kernel/iommu_common.h      | 51 -----------------------------------
 arch/sparc/kernel/pci_psycho.c        |  2 +-
 arch/sparc/kernel/pci_sabre.c         |  2 +-
 arch/sparc/kernel/pci_schizo.c        |  2 +-
 arch/sparc/kernel/pci_sun4v.c         |  2 +-
 arch/sparc/kernel/psycho_common.c     |  2 +-
 arch/sparc/kernel/sbus.c              |  3 +--
 9 files changed, 58 insertions(+), 59 deletions(-)
 create mode 100644 arch/sparc/include/asm/iommu_common.h
 delete mode 100644 arch/sparc/kernel/iommu_common.h

diff --git a/arch/sparc/include/asm/iommu_common.h b/arch/sparc/include/asm/iommu_common.h
new file mode 100644
index 0000000..b40cec2
--- /dev/null
+++ b/arch/sparc/include/asm/iommu_common.h
@@ -0,0 +1,51 @@
+/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
+ *
+ * Copyright (C) 1999, 2008 David S. Miller (davem@davemloft.net)
+ */
+
+#ifndef _IOMMU_COMMON_H
+#define _IOMMU_COMMON_H
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scatterlist.h>
+#include <linux/device.h>
+#include <linux/iommu-helper.h>
+#include <linux/scatterlist.h>
+
+#include <asm/iommu.h>
+
+/*
+ * These give mapping size of each iommu pte/tlb.
+ */
+#define IO_PAGE_SHIFT			13
+#define IO_PAGE_SIZE			(1UL << IO_PAGE_SHIFT)
+#define IO_PAGE_MASK			(~(IO_PAGE_SIZE-1))
+#define IO_PAGE_ALIGN(addr)		ALIGN(addr, IO_PAGE_SIZE)
+
+#define IO_TSB_ENTRIES			(128*1024)
+#define IO_TSB_SIZE			(IO_TSB_ENTRIES * 8)
+
+/*
+ * This is the hardwired shift in the iotlb tag/data parts.
+ */
+#define IOMMU_PAGE_SHIFT		13
+
+#define SG_ENT_PHYS_ADDRESS(SG)	(__pa(sg_virt((SG))))
+
+static inline int is_span_boundary(unsigned long entry,
+				   unsigned long shift,
+				   unsigned long boundary_size,
+				   struct scatterlist *outs,
+				   struct scatterlist *sg)
+{
+	unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
+	int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
+				 IO_PAGE_SIZE);
+
+	return iommu_is_span_boundary(entry, nr, shift, boundary_size);
+}
+
+#endif /* _IOMMU_COMMON_H */
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index 5320689..f9b6077 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -20,8 +20,8 @@
 #endif
 
 #include <asm/iommu.h>
+#include <asm/iommu_common.h>
 
-#include "iommu_common.h"
 #include "kernel.h"
 
 #define STC_CTXMATCH_ADDR(STC, CTX)	\
diff --git a/arch/sparc/kernel/iommu_common.h b/arch/sparc/kernel/iommu_common.h
deleted file mode 100644
index b40cec2..0000000
--- a/arch/sparc/kernel/iommu_common.h
+++ /dev/null
@@ -1,51 +0,0 @@
-/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
- *
- * Copyright (C) 1999, 2008 David S. Miller (davem@davemloft.net)
- */
-
-#ifndef _IOMMU_COMMON_H
-#define _IOMMU_COMMON_H
-
-#include <linux/kernel.h>
-#include <linux/types.h>
-#include <linux/sched.h>
-#include <linux/mm.h>
-#include <linux/scatterlist.h>
-#include <linux/device.h>
-#include <linux/iommu-helper.h>
-#include <linux/scatterlist.h>
-
-#include <asm/iommu.h>
-
-/*
- * These give mapping size of each iommu pte/tlb.
- */
-#define IO_PAGE_SHIFT			13
-#define IO_PAGE_SIZE			(1UL << IO_PAGE_SHIFT)
-#define IO_PAGE_MASK			(~(IO_PAGE_SIZE-1))
-#define IO_PAGE_ALIGN(addr)		ALIGN(addr, IO_PAGE_SIZE)
-
-#define IO_TSB_ENTRIES			(128*1024)
-#define IO_TSB_SIZE			(IO_TSB_ENTRIES * 8)
-
-/*
- * This is the hardwired shift in the iotlb tag/data parts.
- */
-#define IOMMU_PAGE_SHIFT		13
-
-#define SG_ENT_PHYS_ADDRESS(SG)	(__pa(sg_virt((SG))))
-
-static inline int is_span_boundary(unsigned long entry,
-				   unsigned long shift,
-				   unsigned long boundary_size,
-				   struct scatterlist *outs,
-				   struct scatterlist *sg)
-{
-	unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
-	int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
-				 IO_PAGE_SIZE);
-
-	return iommu_is_span_boundary(entry, nr, shift, boundary_size);
-}
-
-#endif /* _IOMMU_COMMON_H */
diff --git a/arch/sparc/kernel/pci_psycho.c b/arch/sparc/kernel/pci_psycho.c
index 7dce27b..332fd6f 100644
--- a/arch/sparc/kernel/pci_psycho.c
+++ b/arch/sparc/kernel/pci_psycho.c
@@ -15,13 +15,13 @@
 #include <linux/of_device.h>
 
 #include <asm/iommu.h>
+#include <asm/iommu_common.h>
 #include <asm/irq.h>
 #include <asm/starfire.h>
 #include <asm/prom.h>
 #include <asm/upa.h>
 
 #include "pci_impl.h"
-#include "iommu_common.h"
 #include "psycho_common.h"
 
 #define DRIVER_NAME	"psycho"
diff --git a/arch/sparc/kernel/pci_sabre.c b/arch/sparc/kernel/pci_sabre.c
index 00a616f..6c2c616 100644
--- a/arch/sparc/kernel/pci_sabre.c
+++ b/arch/sparc/kernel/pci_sabre.c
@@ -16,12 +16,12 @@
 
 #include <asm/apb.h>
 #include <asm/iommu.h>
+#include <asm/iommu_common.h>
 #include <asm/irq.h>
 #include <asm/prom.h>
 #include <asm/upa.h>
 
 #include "pci_impl.h"
-#include "iommu_common.h"
 #include "psycho_common.h"
 
 #define DRIVER_NAME	"sabre"
diff --git a/arch/sparc/kernel/pci_schizo.c b/arch/sparc/kernel/pci_schizo.c
index c664d3e..ef23ab2 100644
--- a/arch/sparc/kernel/pci_schizo.c
+++ b/arch/sparc/kernel/pci_schizo.c
@@ -13,13 +13,13 @@
 #include <linux/of_device.h>
 
 #include <asm/iommu.h>
+#include <asm/iommu_common.h>
 #include <asm/irq.h>
 #include <asm/pstate.h>
 #include <asm/prom.h>
 #include <asm/upa.h>
 
 #include "pci_impl.h"
-#include "iommu_common.h"
 
 #define DRIVER_NAME	"schizo"
 #define PFX		DRIVER_NAME ": "
diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
index d2fe57d..1d72295 100644
--- a/arch/sparc/kernel/pci_sun4v.c
+++ b/arch/sparc/kernel/pci_sun4v.c
@@ -18,12 +18,12 @@
 #include <linux/iommu-common.h>
 
 #include <asm/iommu.h>
+#include <asm/iommu_common.h>
 #include <asm/irq.h>
 #include <asm/hypervisor.h>
 #include <asm/prom.h>
 
 #include "pci_impl.h"
-#include "iommu_common.h"
 
 #include "pci_sun4v.h"
 
diff --git a/arch/sparc/kernel/psycho_common.c b/arch/sparc/kernel/psycho_common.c
index 8db48e8..8eec09d 100644
--- a/arch/sparc/kernel/psycho_common.c
+++ b/arch/sparc/kernel/psycho_common.c
@@ -5,10 +5,10 @@
 #include <linux/kernel.h>
 #include <linux/interrupt.h>
 
+#include <asm/iommu_common.h>
 #include <asm/upa.h>
 
 #include "pci_impl.h"
-#include "iommu_common.h"
 #include "psycho_common.h"
 
 #define  PSYCHO_STRBUF_CTRL_DENAB	0x0000000000000002ULL
diff --git a/arch/sparc/kernel/sbus.c b/arch/sparc/kernel/sbus.c
index be5bdf9..a92cdf6 100644
--- a/arch/sparc/kernel/sbus.c
+++ b/arch/sparc/kernel/sbus.c
@@ -24,8 +24,7 @@
 #include <asm/prom.h>
 #include <asm/oplib.h>
 #include <asm/starfire.h>
-
-#include "iommu_common.h"
+#include <asm/iommu_common.h>
 
 #define MAP_BASE	((u32)0xc0000000)
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 6/7] sparc/dma-mapping: override dma_get_page_shift
  2015-10-23 20:54 [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
                   ` (5 preceding siblings ...)
  2015-10-23 21:00 ` [PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h Nishanth Aravamudan
@ 2015-10-23 21:01 ` Nishanth Aravamudan
  2015-10-23 21:02 ` [PATCH 7/7 v2] drivers/nvme: default to the IOMMU page size Nishanth Aravamudan
  2015-10-27  1:27 ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA David Miller
  8 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-23 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: Matthew Wilcox, Keith Busch, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Alexey Kardashevskiy,
	David Gibson, Christoph Hellwig, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

On sparc, the kernel's page size differs from the IOMMU's page size, so
override the generic implementation, which always returns the kernel's
page size, and return IOMMU_PAGE_SHIFT instead.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
I know very little about sparc, so please correct me if this patch is
invalid.

 arch/sparc/include/asm/dma-mapping.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/sparc/include/asm/dma-mapping.h b/arch/sparc/include/asm/dma-mapping.h
index a21da59..76ba470 100644
--- a/arch/sparc/include/asm/dma-mapping.h
+++ b/arch/sparc/include/asm/dma-mapping.h
@@ -52,6 +52,14 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
 	return -EINVAL;
 }
 
+/* for IOMMU_PAGE_SHIFT */
+#include <asm/iommu_common.h>
+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+	return IOMMU_PAGE_SHIFT;
+}
+
 #include <asm-generic/dma-mapping-common.h>
 
 #endif
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 7/7 v2] drivers/nvme: default to the IOMMU page size
  2015-10-23 20:54 [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
                   ` (6 preceding siblings ...)
  2015-10-23 21:01 ` [RFC PATCH 6/7] sparc/dma-mapping: override dma_get_page_shift Nishanth Aravamudan
@ 2015-10-23 21:02 ` Nishanth Aravamudan
  2015-10-27  1:27 ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA David Miller
  8 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-23 21:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Alexey Kardashevskiy, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size.

With this patch, a NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

 drivers/block/nvme-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 6f04771..5a79106 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -18,6 +18,7 @@
 #include <linux/blk-mq.h>
 #include <linux/cpu.h>
 #include <linux/delay.h>
+#include <linux/dma-mapping.h>
 #include <linux/errno.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
@@ -1711,7 +1712,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	u32 aqa;
 	u64 cap = readq(&dev->bar->cap);
 	struct nvme_queue *nvmeq;
-	unsigned page_shift = PAGE_SHIFT;
+	unsigned page_shift = dma_get_page_shift(dev->dev);
 	unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
 	unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h
  2015-10-23 21:00 ` [PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h Nishanth Aravamudan
@ 2015-10-23 21:02   ` Nishanth Aravamudan
  0 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-23 21:02 UTC (permalink / raw)
  To: David S. Miller
  Cc: Matthew Wilcox, Keith Busch, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Alexey Kardashevskiy,
	David Gibson, Christoph Hellwig, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

[Apologies for the subject line, should just have the [RFC PATCH 5/7]]

On 23.10.2015 [14:00:08 -0700], Nishanth Aravamudan wrote:
> In order to cleanly expose the desired IOMMU page shift via the new
> dma_get_page_shift API, we need to have the sparc constants available in
> a more typical location. There should be no functional impact to this
> move, but it is untested.
> 
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> ---
>  arch/sparc/include/asm/iommu_common.h | 51 +++++++++++++++++++++++++++++++++++
>  arch/sparc/kernel/iommu.c             |  2 +-
>  arch/sparc/kernel/iommu_common.h      | 51 -----------------------------------
>  arch/sparc/kernel/pci_psycho.c        |  2 +-
>  arch/sparc/kernel/pci_sabre.c         |  2 +-
>  arch/sparc/kernel/pci_schizo.c        |  2 +-
>  arch/sparc/kernel/pci_sun4v.c         |  2 +-
>  arch/sparc/kernel/psycho_common.c     |  2 +-
>  arch/sparc/kernel/sbus.c              |  3 +--
>  9 files changed, 58 insertions(+), 59 deletions(-)
>  create mode 100644 arch/sparc/include/asm/iommu_common.h
>  delete mode 100644 arch/sparc/kernel/iommu_common.h
> 
> diff --git a/arch/sparc/include/asm/iommu_common.h b/arch/sparc/include/asm/iommu_common.h
> new file mode 100644
> index 0000000..b40cec2
> --- /dev/null
> +++ b/arch/sparc/include/asm/iommu_common.h
> @@ -0,0 +1,51 @@
> +/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
> + *
> + * Copyright (C) 1999, 2008 David S. Miller (davem@davemloft.net)
> + */
> +
> +#ifndef _IOMMU_COMMON_H
> +#define _IOMMU_COMMON_H
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/scatterlist.h>
> +#include <linux/device.h>
> +#include <linux/iommu-helper.h>
> +#include <linux/scatterlist.h>
> +
> +#include <asm/iommu.h>
> +
> +/*
> + * These give mapping size of each iommu pte/tlb.
> + */
> +#define IO_PAGE_SHIFT			13
> +#define IO_PAGE_SIZE			(1UL << IO_PAGE_SHIFT)
> +#define IO_PAGE_MASK			(~(IO_PAGE_SIZE-1))
> +#define IO_PAGE_ALIGN(addr)		ALIGN(addr, IO_PAGE_SIZE)
> +
> +#define IO_TSB_ENTRIES			(128*1024)
> +#define IO_TSB_SIZE			(IO_TSB_ENTRIES * 8)
> +
> +/*
> + * This is the hardwired shift in the iotlb tag/data parts.
> + */
> +#define IOMMU_PAGE_SHIFT		13
> +
> +#define SG_ENT_PHYS_ADDRESS(SG)	(__pa(sg_virt((SG))))
> +
> +static inline int is_span_boundary(unsigned long entry,
> +				   unsigned long shift,
> +				   unsigned long boundary_size,
> +				   struct scatterlist *outs,
> +				   struct scatterlist *sg)
> +{
> +	unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
> +	int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
> +				 IO_PAGE_SIZE);
> +
> +	return iommu_is_span_boundary(entry, nr, shift, boundary_size);
> +}
> +
> +#endif /* _IOMMU_COMMON_H */
> diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
> index 5320689..f9b6077 100644
> --- a/arch/sparc/kernel/iommu.c
> +++ b/arch/sparc/kernel/iommu.c
> @@ -20,8 +20,8 @@
>  #endif
>  
>  #include <asm/iommu.h>
> +#include <asm/iommu_common.h>
>  
> -#include "iommu_common.h"
>  #include "kernel.h"
>  
>  #define STC_CTXMATCH_ADDR(STC, CTX)	\
> diff --git a/arch/sparc/kernel/iommu_common.h b/arch/sparc/kernel/iommu_common.h
> deleted file mode 100644
> index b40cec2..0000000
> --- a/arch/sparc/kernel/iommu_common.h
> +++ /dev/null
> @@ -1,51 +0,0 @@
> -/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
> - *
> - * Copyright (C) 1999, 2008 David S. Miller (davem@davemloft.net)
> - */
> -
> -#ifndef _IOMMU_COMMON_H
> -#define _IOMMU_COMMON_H
> -
> -#include <linux/kernel.h>
> -#include <linux/types.h>
> -#include <linux/sched.h>
> -#include <linux/mm.h>
> -#include <linux/scatterlist.h>
> -#include <linux/device.h>
> -#include <linux/iommu-helper.h>
> -#include <linux/scatterlist.h>
> -
> -#include <asm/iommu.h>
> -
> -/*
> - * These give mapping size of each iommu pte/tlb.
> - */
> -#define IO_PAGE_SHIFT			13
> -#define IO_PAGE_SIZE			(1UL << IO_PAGE_SHIFT)
> -#define IO_PAGE_MASK			(~(IO_PAGE_SIZE-1))
> -#define IO_PAGE_ALIGN(addr)		ALIGN(addr, IO_PAGE_SIZE)
> -
> -#define IO_TSB_ENTRIES			(128*1024)
> -#define IO_TSB_SIZE			(IO_TSB_ENTRIES * 8)
> -
> -/*
> - * This is the hardwired shift in the iotlb tag/data parts.
> - */
> -#define IOMMU_PAGE_SHIFT		13
> -
> -#define SG_ENT_PHYS_ADDRESS(SG)	(__pa(sg_virt((SG))))
> -
> -static inline int is_span_boundary(unsigned long entry,
> -				   unsigned long shift,
> -				   unsigned long boundary_size,
> -				   struct scatterlist *outs,
> -				   struct scatterlist *sg)
> -{
> -	unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
> -	int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
> -				 IO_PAGE_SIZE);
> -
> -	return iommu_is_span_boundary(entry, nr, shift, boundary_size);
> -}
> -
> -#endif /* _IOMMU_COMMON_H */
> diff --git a/arch/sparc/kernel/pci_psycho.c b/arch/sparc/kernel/pci_psycho.c
> index 7dce27b..332fd6f 100644
> --- a/arch/sparc/kernel/pci_psycho.c
> +++ b/arch/sparc/kernel/pci_psycho.c
> @@ -15,13 +15,13 @@
>  #include <linux/of_device.h>
>  
>  #include <asm/iommu.h>
> +#include <asm/iommu_common.h>
>  #include <asm/irq.h>
>  #include <asm/starfire.h>
>  #include <asm/prom.h>
>  #include <asm/upa.h>
>  
>  #include "pci_impl.h"
> -#include "iommu_common.h"
>  #include "psycho_common.h"
>  
>  #define DRIVER_NAME	"psycho"
> diff --git a/arch/sparc/kernel/pci_sabre.c b/arch/sparc/kernel/pci_sabre.c
> index 00a616f..6c2c616 100644
> --- a/arch/sparc/kernel/pci_sabre.c
> +++ b/arch/sparc/kernel/pci_sabre.c
> @@ -16,12 +16,12 @@
>  
>  #include <asm/apb.h>
>  #include <asm/iommu.h>
> +#include <asm/iommu_common.h>
>  #include <asm/irq.h>
>  #include <asm/prom.h>
>  #include <asm/upa.h>
>  
>  #include "pci_impl.h"
> -#include "iommu_common.h"
>  #include "psycho_common.h"
>  
>  #define DRIVER_NAME	"sabre"
> diff --git a/arch/sparc/kernel/pci_schizo.c b/arch/sparc/kernel/pci_schizo.c
> index c664d3e..ef23ab2 100644
> --- a/arch/sparc/kernel/pci_schizo.c
> +++ b/arch/sparc/kernel/pci_schizo.c
> @@ -13,13 +13,13 @@
>  #include <linux/of_device.h>
>  
>  #include <asm/iommu.h>
> +#include <asm/iommu_common.h>
>  #include <asm/irq.h>
>  #include <asm/pstate.h>
>  #include <asm/prom.h>
>  #include <asm/upa.h>
>  
>  #include "pci_impl.h"
> -#include "iommu_common.h"
>  
>  #define DRIVER_NAME	"schizo"
>  #define PFX		DRIVER_NAME ": "
> diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
> index d2fe57d..1d72295 100644
> --- a/arch/sparc/kernel/pci_sun4v.c
> +++ b/arch/sparc/kernel/pci_sun4v.c
> @@ -18,12 +18,12 @@
>  #include <linux/iommu-common.h>
>  
>  #include <asm/iommu.h>
> +#include <asm/iommu_common.h>
>  #include <asm/irq.h>
>  #include <asm/hypervisor.h>
>  #include <asm/prom.h>
>  
>  #include "pci_impl.h"
> -#include "iommu_common.h"
>  
>  #include "pci_sun4v.h"
>  
> diff --git a/arch/sparc/kernel/psycho_common.c b/arch/sparc/kernel/psycho_common.c
> index 8db48e8..8eec09d 100644
> --- a/arch/sparc/kernel/psycho_common.c
> +++ b/arch/sparc/kernel/psycho_common.c
> @@ -5,10 +5,10 @@
>  #include <linux/kernel.h>
>  #include <linux/interrupt.h>
>  
> +#include <asm/iommu_common.h>
>  #include <asm/upa.h>
>  
>  #include "pci_impl.h"
> -#include "iommu_common.h"
>  #include "psycho_common.h"
>  
>  #define  PSYCHO_STRBUF_CTRL_DENAB	0x0000000000000002ULL
> diff --git a/arch/sparc/kernel/sbus.c b/arch/sparc/kernel/sbus.c
> index be5bdf9..a92cdf6 100644
> --- a/arch/sparc/kernel/sbus.c
> +++ b/arch/sparc/kernel/sbus.c
> @@ -24,8 +24,7 @@
>  #include <asm/prom.h>
>  #include <asm/oplib.h>
>  #include <asm/starfire.h>
> -
> -#include "iommu_common.h"
> +#include <asm/iommu_common.h>
>  
>  #define MAP_BASE	((u32)0xc0000000)
>  
> -- 
> 1.9.1
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-23 20:54 [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
                   ` (7 preceding siblings ...)
  2015-10-23 21:02 ` [PATCH 7/7 v2] drivers/nvme: default to the IOMMU page size Nishanth Aravamudan
@ 2015-10-27  1:27 ` David Miller
  2015-10-27 22:20   ` Nishanth Aravamudan
  8 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2015-10-27  1:27 UTC (permalink / raw)
  To: nacc
  Cc: willy, keith.busch, benh, paulus, mpe, aik, david, hch,
	linux-nvme, linux-kernel, linuxppc-dev, sparclinux

From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Date: Fri, 23 Oct 2015 13:54:20 -0700

> 1) add a generic dma_get_page_shift implementation that just returns
> PAGE_SHIFT

I won't object to this patch series, but if I had implemented this I
would have required the architectures to implement this explicitly,
one-by-one.  I think it is less error prone and more likely to end
up with all the architectures setting this correctly.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift
  2015-10-23 20:59 ` [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift Nishanth Aravamudan
@ 2015-10-27  5:56   ` Alexey Kardashevskiy
  2015-10-27 22:22     ` Nishanth Aravamudan
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Kardashevskiy @ 2015-10-27  5:56 UTC (permalink / raw)
  To: Nishanth Aravamudan, Michael Ellerman
  Cc: Matthew Wilcox, Keith Busch, Benjamin Herrenschmidt,
	Paul Mackerras, David Gibson, Christoph Hellwig, David S. Miller,
	linux-nvme, linux-kernel, linuxppc-dev, sparclinux

On 10/24/2015 07:59 AM, Nishanth Aravamudan wrote:
> When DDW (Dynamic DMA Windows) are present for a device, we have stored
> the TCE (Translation Control Entry) size in a special device tree
> property. Check if we have enabled DDW for the device and return the TCE
> size from that property if present. If the property isn't present,
> fallback to looking the value up in struct iommu_table. If we don't find
> a iommu_table, fallback to the kernel's page size.
>
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> ---
>   arch/powerpc/platforms/pseries/iommu.c | 36 ++++++++++++++++++++++++++++++++++
>   1 file changed, 36 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 0946b98..1bf6471 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
>   	return dma_iommu_ops.get_required_mask(dev);
>   }
>
> +static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
> +{
> +	struct iommu_table *tbl;
> +
> +	if (!disable_ddw && dev_is_pci(dev)) {
> +		struct pci_dev *pdev = to_pci_dev(dev);
> +		struct device_node *dn;
> +
> +		dn = pci_device_to_OF_node(pdev);
> +
> +		/* search upwards for ibm,dma-window */
> +		for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
> +				dn = dn->parent)
> +			if (of_get_property(dn, "ibm,dma-window", NULL))
> +				break;
> +		/*
> +		 * if there is a DDW configuration, the TCE shift is stored in
> +		 * the property
> +		 */
> +		if (dn && PCI_DN(dn)) {
> +			const struct dynamic_dma_window_prop *direct64 =
> +				of_get_property(dn, DIRECT64_PROPNAME, NULL);


This DIRECT64_PROPNAME property is only present under pHyp, QEMU/KVM does 
not set it as 64bit windows are dynamic there so something like 
find_existing_ddw() needs to be used here.



> +			if (direct64)
> +				return be32_to_cpu(direct64->tce_shift);
> +		}
> +	}
> +
> +	tbl = get_iommu_table_base(dev);
> +	if (tbl)
> +		return tbl->it_page_shift;
> +
> +	return PAGE_SHIFT;
> +}
> +
>   #else  /* CONFIG_PCI */
>   #define pci_dma_bus_setup_pSeries	NULL
>   #define pci_dma_dev_setup_pSeries	NULL
> @@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
>   #define pci_dma_dev_setup_pSeriesLP	NULL
>   #define dma_set_mask_pSeriesLP		NULL
>   #define dma_get_required_mask_pSeriesLP	NULL
> +#define dma_get_page_shift_pSeriesLP	NULL
>   #endif /* !CONFIG_PCI */
>
>   static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> @@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void)
>   		pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
>   		ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
>   		ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
> +		ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP;
>   	} else {
>   		pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries;
>   		pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries;
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
  2015-10-23 20:57 ` [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift Nishanth Aravamudan
@ 2015-10-27  6:02   ` Alexey Kardashevskiy
  2015-10-27 14:06     ` Busch, Keith
  2015-10-27 22:27     ` Nishanth Aravamudan
  0 siblings, 2 replies; 49+ messages in thread
From: Alexey Kardashevskiy @ 2015-10-27  6:02 UTC (permalink / raw)
  To: Nishanth Aravamudan, Michael Ellerman
  Cc: Matthew Wilcox, Keith Busch, Benjamin Herrenschmidt,
	Paul Mackerras, David Gibson, Christoph Hellwig, David S. Miller,
	linux-nvme, linux-kernel, linuxppc-dev, sparclinux

On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
> On Power, the kernel's page size can differ from the IOMMU's page size,
> so we need to override the generic implementation, which always returns
> the kernel's page size. Lookup the IOMMU's page size from struct
> iommu_table, if available. Fallback to the kernel's page size,
> otherwise.
>
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/dma-mapping.h | 3 +++
>   arch/powerpc/kernel/dma.c              | 9 +++++++++
>   2 files changed, 12 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> index 7f522c0..c5638f4 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
>   #define HAVE_ARCH_DMA_SET_MASK 1
>   extern int dma_set_mask(struct device *dev, u64 dma_mask);
>
> +#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
> +extern unsigned long dma_get_page_shift(struct device *dev);
> +
>   #include <asm-generic/dma-mapping-common.h>
>
>   extern int __dma_set_mask(struct device *dev, u64 dma_mask);
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 59503ed..e805af2 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
>   }
>   EXPORT_SYMBOL(dma_set_mask);
>
> +unsigned long dma_get_page_shift(struct device *dev)
> +{
> +	struct iommu_table *tbl = get_iommu_table_base(dev);
> +	if (tbl)
> +		return tbl->it_page_shift;


All PCI devices have this initialized on POWER (at least, our, IBM's POWER) 
so 4K will always be returned here while in the case of 
(get_dma_ops(dev)==&dma_direct_ops) it could actually return PAGE_SHIFT. Is 
4K still preferred value to return here?



> +	return PAGE_SHIFT;
> +}
> +EXPORT_SYMBOL(dma_get_page_shift);
> +
>   u64 __dma_get_required_mask(struct device *dev)
>   {
>   	struct dma_map_ops *dma_ops = get_dma_ops(dev);
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
  2015-10-27  6:02   ` Alexey Kardashevskiy
@ 2015-10-27 14:06     ` Busch, Keith
  2015-10-27 22:27     ` Nishanth Aravamudan
  1 sibling, 0 replies; 49+ messages in thread
From: Busch, Keith @ 2015-10-27 14:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Nishanth Aravamudan, Michael Ellerman, Matthew Wilcox,
	Benjamin Herrenschmidt, Paul Mackerras, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

On Tue, Oct 27, 2015 at 05:02:16PM +1100, Alexey Kardashevskiy wrote:
> >+unsigned long dma_get_page_shift(struct device *dev)
> >+{
> >+	struct iommu_table *tbl = get_iommu_table_base(dev);
> >+	if (tbl)
> >+		return tbl->it_page_shift;
> 
> 
> All PCI devices have this initialized on POWER (at least, our, IBM's
> POWER) so 4K will always be returned here while in the case of
> (get_dma_ops(dev)==&dma_direct_ops) it could actually return
> PAGE_SHIFT. Is 4K still preferred value to return here?

4k is always a safe option to return, but ideally you want to return the
highest guaranteed DMA address alignment. The driver just needs to know
which bits to mask from virtual addresses such that the offset is the
same as the DMA address.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-27  1:27 ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA David Miller
@ 2015-10-27 22:20   ` Nishanth Aravamudan
  2015-10-27 22:36     ` Busch, Keith
                       ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-27 22:20 UTC (permalink / raw)
  To: David Miller
  Cc: willy, keith.busch, benh, paulus, mpe, aik, david, hch,
	linux-nvme, linux-kernel, linuxppc-dev, sparclinux

On 26.10.2015 [18:27:46 -0700], David Miller wrote:
> From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Date: Fri, 23 Oct 2015 13:54:20 -0700
> 
> > 1) add a generic dma_get_page_shift implementation that just returns
> > PAGE_SHIFT
> 
> I won't object to this patch series, but if I had implemented this I
> would have required the architectures to implement this explicitly,
> one-by-one.  I think it is less error prone and more likely to end
> up with all the architectures setting this correctly.

Well, looks like I should spin up a v4 anyways for the powerpc changes.
So, to make sure I understand your point, should I make the generic
dma_get_page_shift a compile-error kind of thing? It will only fail on
architectures that actually build the NVME driver (as the only caller).
But I'm not sure how exactly to achieve that, if you could give a bit
more detail I'd appreciate it!

Thanks,
Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift
  2015-10-27  5:56   ` Alexey Kardashevskiy
@ 2015-10-27 22:22     ` Nishanth Aravamudan
  0 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-27 22:22 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Ellerman, Matthew Wilcox, Keith Busch,
	Benjamin Herrenschmidt, Paul Mackerras, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

On 27.10.2015 [16:56:10 +1100], Alexey Kardashevskiy wrote:
> On 10/24/2015 07:59 AM, Nishanth Aravamudan wrote:
> >When DDW (Dynamic DMA Windows) are present for a device, we have stored
> >the TCE (Translation Control Entry) size in a special device tree
> >property. Check if we have enabled DDW for the device and return the TCE
> >size from that property if present. If the property isn't present,
> >fallback to looking the value up in struct iommu_table. If we don't find
> >a iommu_table, fallback to the kernel's page size.
> >
> >Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> >---
> >  arch/powerpc/platforms/pseries/iommu.c | 36 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 36 insertions(+)
> >
> >diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> >index 0946b98..1bf6471 100644
> >--- a/arch/powerpc/platforms/pseries/iommu.c
> >+++ b/arch/powerpc/platforms/pseries/iommu.c
> >@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
> >  	return dma_iommu_ops.get_required_mask(dev);
> >  }
> >
> >+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
> >+{
> >+	struct iommu_table *tbl;
> >+
> >+	if (!disable_ddw && dev_is_pci(dev)) {
> >+		struct pci_dev *pdev = to_pci_dev(dev);
> >+		struct device_node *dn;
> >+
> >+		dn = pci_device_to_OF_node(pdev);
> >+
> >+		/* search upwards for ibm,dma-window */
> >+		for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
> >+				dn = dn->parent)
> >+			if (of_get_property(dn, "ibm,dma-window", NULL))
> >+				break;
> >+		/*
> >+		 * if there is a DDW configuration, the TCE shift is stored in
> >+		 * the property
> >+		 */
> >+		if (dn && PCI_DN(dn)) {
> >+			const struct dynamic_dma_window_prop *direct64 =
> >+				of_get_property(dn, DIRECT64_PROPNAME, NULL);
> 
> 
> This DIRECT64_PROPNAME property is only present under pHyp, QEMU/KVM
> does not set it as 64bit windows are dynamic there so something like
> find_existing_ddw() needs to be used here.

DIRECT64_PROPNAME is a Linux thing, not a pHyp or QEMU/KVM thing -- it's
created by the Linux DDW logic and left in the device-tree when we
successfully configure DDW.

You're right, though, that logically find_existing_ddw() would be better
to use here. I'll spin up a new version.

-Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
  2015-10-27  6:02   ` Alexey Kardashevskiy
  2015-10-27 14:06     ` Busch, Keith
@ 2015-10-27 22:27     ` Nishanth Aravamudan
  2015-10-28  1:00       ` Alexey Kardashevskiy
  1 sibling, 1 reply; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-27 22:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Ellerman, Matthew Wilcox, Keith Busch,
	Benjamin Herrenschmidt, Paul Mackerras, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:
> On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
> >On Power, the kernel's page size can differ from the IOMMU's page size,
> >so we need to override the generic implementation, which always returns
> >the kernel's page size. Lookup the IOMMU's page size from struct
> >iommu_table, if available. Fallback to the kernel's page size,
> >otherwise.
> >
> >Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> >---
> >  arch/powerpc/include/asm/dma-mapping.h | 3 +++
> >  arch/powerpc/kernel/dma.c              | 9 +++++++++
> >  2 files changed, 12 insertions(+)
> >
> >diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> >index 7f522c0..c5638f4 100644
> >--- a/arch/powerpc/include/asm/dma-mapping.h
> >+++ b/arch/powerpc/include/asm/dma-mapping.h
> >@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
> >  #define HAVE_ARCH_DMA_SET_MASK 1
> >  extern int dma_set_mask(struct device *dev, u64 dma_mask);
> >
> >+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
> >+extern unsigned long dma_get_page_shift(struct device *dev);
> >+
> >  #include <asm-generic/dma-mapping-common.h>
> >
> >  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
> >diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> >index 59503ed..e805af2 100644
> >--- a/arch/powerpc/kernel/dma.c
> >+++ b/arch/powerpc/kernel/dma.c
> >@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
> >  }
> >  EXPORT_SYMBOL(dma_set_mask);
> >
> >+unsigned long dma_get_page_shift(struct device *dev)
> >+{
> >+	struct iommu_table *tbl = get_iommu_table_base(dev);
> >+	if (tbl)
> >+		return tbl->it_page_shift;
> 
> 
> All PCI devices have this initialized on POWER (at least, our, IBM's
> POWER) so 4K will always be returned here while in the case of
> (get_dma_ops(dev)==&dma_direct_ops) it could actually return
> PAGE_SHIFT. Is 4K still preferred value to return here?

Right, so the logic of my series, goes like this:

a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is
PAGE_SHIFT everywhere, including Power.

b) After 2/7, the Power code will return either the IOMMU table's shift
value, if set, or PAGE_SHIFT (I guess this would be the case if
get_dma_ops(dev) == &dma_direct_ops, as you said). That is no different
than we have now, except we can return the accurate IOMMU value if
available.

3) After 3/7, the platform can override the generic Power
get_dma_page_shift().

4) After 4/7, pseries will return the DDW value, if available, then
fallback to the IOMMU table's value. I think in the case of
get_dma_ops(dev)==&dma_direct_ops, the only way that can happen is if we
are using DDW, right?

-Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-27 22:20   ` Nishanth Aravamudan
@ 2015-10-27 22:36     ` Busch, Keith
  2015-10-28  0:54       ` David Miller
  2015-10-27 22:57     ` Julian Calaby
  2015-10-28  0:53     ` David Miller
  2 siblings, 1 reply; 49+ messages in thread
From: Busch, Keith @ 2015-10-27 22:36 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Miller, willy, benh, paulus, mpe, aik, david, hch,
	linux-nvme, linux-kernel, linuxppc-dev, sparclinux

On Tue, Oct 27, 2015 at 03:20:10PM -0700, Nishanth Aravamudan wrote:
> On 26.10.2015 [18:27:46 -0700], David Miller wrote:
> > From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> > Date: Fri, 23 Oct 2015 13:54:20 -0700
> > 
> > > 1) add a generic dma_get_page_shift implementation that just returns
> > > PAGE_SHIFT
> > 
> > I won't object to this patch series, but if I had implemented this I
> > would have required the architectures to implement this explicitly,
> > one-by-one.  I think it is less error prone and more likely to end
> > up with all the architectures setting this correctly.
> 
> Well, looks like I should spin up a v4 anyways for the powerpc changes.
> So, to make sure I understand your point, should I make the generic
> dma_get_page_shift a compile-error kind of thing? It will only fail on
> architectures that actually build the NVME driver (as the only caller).
> But I'm not sure how exactly to achieve that, if you could give a bit
> more detail I'd appreciate it!

If you're suggesting to compile-time break architectures that currently
work just fine with NVMe, let me stop you right there.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-27 22:20   ` Nishanth Aravamudan
  2015-10-27 22:36     ` Busch, Keith
@ 2015-10-27 22:57     ` Julian Calaby
  2015-10-27 23:40       ` Nishanth Aravamudan
  2015-10-28  0:53     ` David Miller
  2 siblings, 1 reply; 49+ messages in thread
From: Julian Calaby @ 2015-10-27 22:57 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Miller, willy, keith.busch, benh, paulus, mpe, aik, david,
	Christoph Hellwig, linux-nvme, linux-kernel, linuxppc-dev,
	sparclinux

Hi Nishanth,

On Wed, Oct 28, 2015 at 9:20 AM, Nishanth Aravamudan
<nacc@linux.vnet.ibm.com> wrote:
> On 26.10.2015 [18:27:46 -0700], David Miller wrote:
>> From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
>> Date: Fri, 23 Oct 2015 13:54:20 -0700
>>
>> > 1) add a generic dma_get_page_shift implementation that just returns
>> > PAGE_SHIFT
>>
>> I won't object to this patch series, but if I had implemented this I
>> would have required the architectures to implement this explicitly,
>> one-by-one.  I think it is less error prone and more likely to end
>> up with all the architectures setting this correctly.
>
> Well, looks like I should spin up a v4 anyways for the powerpc changes.
> So, to make sure I understand your point, should I make the generic
> dma_get_page_shift a compile-error kind of thing? It will only fail on
> architectures that actually build the NVME driver (as the only caller).
> But I'm not sure how exactly to achieve that, if you could give a bit
> more detail I'd appreciate it!

He's suggesting that you _don't_ put a generic implementation in
/include/linux/dma-mapping.h and instead add it to _every_
architecture.

Thanks,

-- 
Julian Calaby

Email: julian.calaby@gmail.com
Profile: http://www.google.com/profiles/julian.calaby/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-27 22:57     ` Julian Calaby
@ 2015-10-27 23:40       ` Nishanth Aravamudan
  2015-10-27 23:43         ` Julian Calaby
  0 siblings, 1 reply; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-27 23:40 UTC (permalink / raw)
  To: Julian Calaby
  Cc: David Miller, willy, keith.busch, benh, paulus, mpe, aik, david,
	Christoph Hellwig, linux-nvme, linux-kernel, linuxppc-dev,
	sparclinux

On 28.10.2015 [09:57:48 +1100], Julian Calaby wrote:
> Hi Nishanth,
> 
> On Wed, Oct 28, 2015 at 9:20 AM, Nishanth Aravamudan
> <nacc@linux.vnet.ibm.com> wrote:
> > On 26.10.2015 [18:27:46 -0700], David Miller wrote:
> >> From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> >> Date: Fri, 23 Oct 2015 13:54:20 -0700
> >>
> >> > 1) add a generic dma_get_page_shift implementation that just returns
> >> > PAGE_SHIFT
> >>
> >> I won't object to this patch series, but if I had implemented this I
> >> would have required the architectures to implement this explicitly,
> >> one-by-one.  I think it is less error prone and more likely to end
> >> up with all the architectures setting this correctly.
> >
> > Well, looks like I should spin up a v4 anyways for the powerpc changes.
> > So, to make sure I understand your point, should I make the generic
> > dma_get_page_shift a compile-error kind of thing? It will only fail on
> > architectures that actually build the NVME driver (as the only caller).
> > But I'm not sure how exactly to achieve that, if you could give a bit
> > more detail I'd appreciate it!
> 
> He's suggesting that you _don't_ put a generic implementation in
> /include/linux/dma-mapping.h and instead add it to _every_
> architecture.

Ah, I see! Well, I don't know much about the DMA internals of most
architectures -- and my approach kept things functionally the same
everywhere (using PAGE_SHIFT) except:

a) Power, where I know it doesn't work as-is
and
b) sparc, where the code implied that a different value than PAGE_SHIFT
should be used.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-27 23:40       ` Nishanth Aravamudan
@ 2015-10-27 23:43         ` Julian Calaby
  2015-10-28  0:29           ` Benjamin Herrenschmidt
  2015-10-28  1:00           ` David Miller
  0 siblings, 2 replies; 49+ messages in thread
From: Julian Calaby @ 2015-10-27 23:43 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Miller, willy, keith.busch, benh, paulus, mpe, aik, david,
	Christoph Hellwig, linux-nvme, linux-kernel, linuxppc-dev,
	sparclinux

Hi Nishanth,

On Wed, Oct 28, 2015 at 10:40 AM, Nishanth Aravamudan
<nacc@linux.vnet.ibm.com> wrote:
> On 28.10.2015 [09:57:48 +1100], Julian Calaby wrote:
>> Hi Nishanth,
>>
>> On Wed, Oct 28, 2015 at 9:20 AM, Nishanth Aravamudan
>> <nacc@linux.vnet.ibm.com> wrote:
>> > On 26.10.2015 [18:27:46 -0700], David Miller wrote:
>> >> From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
>> >> Date: Fri, 23 Oct 2015 13:54:20 -0700
>> >>
>> >> > 1) add a generic dma_get_page_shift implementation that just returns
>> >> > PAGE_SHIFT
>> >>
>> >> I won't object to this patch series, but if I had implemented this I
>> >> would have required the architectures to implement this explicitly,
>> >> one-by-one.  I think it is less error prone and more likely to end
>> >> up with all the architectures setting this correctly.
>> >
>> > Well, looks like I should spin up a v4 anyways for the powerpc changes.
>> > So, to make sure I understand your point, should I make the generic
>> > dma_get_page_shift a compile-error kind of thing? It will only fail on
>> > architectures that actually build the NVME driver (as the only caller).
>> > But I'm not sure how exactly to achieve that, if you could give a bit
>> > more detail I'd appreciate it!
>>
>> He's suggesting that you _don't_ put a generic implementation in
>> /include/linux/dma-mapping.h and instead add it to _every_
>> architecture.
>
> Ah, I see! Well, I don't know much about the DMA internals of most
> architectures -- and my approach kept things functionally the same
> everywhere (using PAGE_SHIFT) except:
>
> a) Power, where I know it doesn't work as-is
> and
> b) sparc, where the code implied that a different value than PAGE_SHIFT
> should be used.

You'll be CCing the maintainers of each architecture on the patches to
add the functions, so if they do have specific requirements, I'm sure
they'll let you know or provide patches.

Thanks,

-- 
Julian Calaby

Email: julian.calaby@gmail.com
Profile: http://www.google.com/profiles/julian.calaby/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-27 23:43         ` Julian Calaby
@ 2015-10-28  0:29           ` Benjamin Herrenschmidt
  2015-10-28  1:00           ` David Miller
  1 sibling, 0 replies; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2015-10-28  0:29 UTC (permalink / raw)
  To: Julian Calaby, Nishanth Aravamudan
  Cc: David Miller, willy, keith.busch, paulus, mpe, aik, david,
	Christoph Hellwig, linux-nvme, linux-kernel, linuxppc-dev,
	sparclinux

On Wed, 2015-10-28 at 10:43 +1100, Julian Calaby wrote:
> Hi Nishanth,
> You'll be CCing the maintainers of each architecture on the patches
> to
> add the functions, so if they do have specific requirements, I'm sure
> they'll let you know or provide patches.

That sort of accross-all-arch change tend to take forever. I'd rather
get the existing series in to fix the problem, we can look into
improving things later but I tend to think that the default of using
PAGE_SHIFT in asm-generic will be fine for most archs.

Ben.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-27 22:20   ` Nishanth Aravamudan
  2015-10-27 22:36     ` Busch, Keith
  2015-10-27 22:57     ` Julian Calaby
@ 2015-10-28  0:53     ` David Miller
  2015-10-28  1:52       ` Nishanth Aravamudan
  2 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2015-10-28  0:53 UTC (permalink / raw)
  To: nacc
  Cc: willy, keith.busch, benh, paulus, mpe, aik, david, hch,
	linux-nvme, linux-kernel, linuxppc-dev, sparclinux

From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Date: Tue, 27 Oct 2015 15:20:10 -0700

> Well, looks like I should spin up a v4 anyways for the powerpc changes.
> So, to make sure I understand your point, should I make the generic
> dma_get_page_shift a compile-error kind of thing? It will only fail on
> architectures that actually build the NVME driver (as the only caller).
> But I'm not sure how exactly to achieve that, if you could give a bit
> more detail I'd appreciate it!

Yes, I am basically suggesting to simply not provide a default at all.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-27 22:36     ` Busch, Keith
@ 2015-10-28  0:54       ` David Miller
  2015-10-28 13:59         ` Busch, Keith
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2015-10-28  0:54 UTC (permalink / raw)
  To: keith.busch
  Cc: nacc, willy, benh, paulus, mpe, aik, david, hch, linux-nvme,
	linux-kernel, linuxppc-dev, sparclinux

From: "Busch, Keith" <keith.busch@intel.com>
Date: Tue, 27 Oct 2015 22:36:43 +0000

> If you're suggesting to compile-time break architectures that currently
> work just fine with NVMe, let me stop you right there.

Silently "working" without the architecture maintainer having to explicity
look at the new interface and make sure his platform is implementing it
properly is an extremely bad practice.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-27 23:43         ` Julian Calaby
  2015-10-28  0:29           ` Benjamin Herrenschmidt
@ 2015-10-28  1:00           ` David Miller
  1 sibling, 0 replies; 49+ messages in thread
From: David Miller @ 2015-10-28  1:00 UTC (permalink / raw)
  To: julian.calaby
  Cc: nacc, willy, keith.busch, benh, paulus, mpe, aik, david, hch,
	linux-nvme, linux-kernel, linuxppc-dev, sparclinux

From: Julian Calaby <julian.calaby@gmail.com>
Date: Wed, 28 Oct 2015 10:43:35 +1100

> You'll be CCing the maintainers of each architecture on the patches to
> add the functions, so if they do have specific requirements, I'm sure
> they'll let you know or provide patches.

People miss things, maintainers get busy, so while a CC: is important
and appreciated, it doesn't ensure a correct implementation is likely
to result.

Personally, I'd much rather my arch stop building so I have to go in
there to explicitly look at the reason why and make sure I fill in the
missing pieces properly and accurately.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
  2015-10-27 22:27     ` Nishanth Aravamudan
@ 2015-10-28  1:00       ` Alexey Kardashevskiy
  2015-10-28  1:54         ` Nishanth Aravamudan
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Kardashevskiy @ 2015-10-28  1:00 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Michael Ellerman, Matthew Wilcox, Keith Busch,
	Benjamin Herrenschmidt, Paul Mackerras, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

On 10/28/2015 09:27 AM, Nishanth Aravamudan wrote:
> On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:
>> On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
>>> On Power, the kernel's page size can differ from the IOMMU's page size,
>>> so we need to override the generic implementation, which always returns
>>> the kernel's page size. Lookup the IOMMU's page size from struct
>>> iommu_table, if available. Fallback to the kernel's page size,
>>> otherwise.
>>>
>>> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
>>> ---
>>>   arch/powerpc/include/asm/dma-mapping.h | 3 +++
>>>   arch/powerpc/kernel/dma.c              | 9 +++++++++
>>>   2 files changed, 12 insertions(+)
>>>
>>> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
>>> index 7f522c0..c5638f4 100644
>>> --- a/arch/powerpc/include/asm/dma-mapping.h
>>> +++ b/arch/powerpc/include/asm/dma-mapping.h
>>> @@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
>>>   #define HAVE_ARCH_DMA_SET_MASK 1
>>>   extern int dma_set_mask(struct device *dev, u64 dma_mask);
>>>
>>> +#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
>>> +extern unsigned long dma_get_page_shift(struct device *dev);
>>> +
>>>   #include <asm-generic/dma-mapping-common.h>
>>>
>>>   extern int __dma_set_mask(struct device *dev, u64 dma_mask);
>>> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
>>> index 59503ed..e805af2 100644
>>> --- a/arch/powerpc/kernel/dma.c
>>> +++ b/arch/powerpc/kernel/dma.c
>>> @@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
>>>   }
>>>   EXPORT_SYMBOL(dma_set_mask);
>>>
>>> +unsigned long dma_get_page_shift(struct device *dev)
>>> +{
>>> +	struct iommu_table *tbl = get_iommu_table_base(dev);
>>> +	if (tbl)
>>> +		return tbl->it_page_shift;
>>
>>
>> All PCI devices have this initialized on POWER (at least, our, IBM's
>> POWER) so 4K will always be returned here while in the case of
>> (get_dma_ops(dev)==&dma_direct_ops) it could actually return
>> PAGE_SHIFT. Is 4K still preferred value to return here?
>
> Right, so the logic of my series, goes like this:
>
> a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is
> PAGE_SHIFT everywhere, including Power.
>
> b) After 2/7, the Power code will return either the IOMMU table's shift
> value, if set, or PAGE_SHIFT (I guess this would be the case if
> get_dma_ops(dev) == &dma_direct_ops, as you said). That is no different
> than we have now, except we can return the accurate IOMMU value if
> available.

If it is not available, then something went wrong and BUG_ON(!tbl || 
!tbl->it_page_shift) make more sense here than pretending that this 
function can ever return PAGE_SHIFT. imho.


>
> 3) After 3/7, the platform can override the generic Power
> get_dma_page_shift().
>
> 4) After 4/7, pseries will return the DDW value, if available, then
> fallback to the IOMMU table's value. I think in the case of
> get_dma_ops(dev)==&dma_direct_ops, the only way that can happen is if we
> are using DDW, right?

This is for pseries guests; for the powernv host it is a "bypass" mode 
which does 64bit direct DMA mapping and there is no additional window for 
that (i.e. DIRECT64_PROPNAME, etc).



-- 
Alexey

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-28  0:53     ` David Miller
@ 2015-10-28  1:52       ` Nishanth Aravamudan
  0 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-28  1:52 UTC (permalink / raw)
  To: David Miller
  Cc: willy, keith.busch, benh, paulus, mpe, aik, david, hch,
	linux-nvme, linux-kernel, linuxppc-dev, sparclinux

On 27.10.2015 [17:53:22 -0700], David Miller wrote:
> From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Date: Tue, 27 Oct 2015 15:20:10 -0700
> 
> > Well, looks like I should spin up a v4 anyways for the powerpc changes.
> > So, to make sure I understand your point, should I make the generic
> > dma_get_page_shift a compile-error kind of thing? It will only fail on
> > architectures that actually build the NVME driver (as the only caller).
> > But I'm not sure how exactly to achieve that, if you could give a bit
> > more detail I'd appreciate it!
> 
> Yes, I am basically suggesting to simply not provide a default at all.

For my own edification -- what is the way that gets resolved? I guess I
mean it seems like linux-next would cease to compile because of my new
series. Would my patches just get kicked out of -next for introducing
that (or even via the 0-day notifications), or should I put something
into the commit message indicating it is an API introduction?

Sorry for the tentativeness, I have not introduce a cross-architecture
API like this before.

Thanks,
Nish

> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
  2015-10-28  1:00       ` Alexey Kardashevskiy
@ 2015-10-28  1:54         ` Nishanth Aravamudan
  2015-10-28  2:20           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-28  1:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Ellerman, Matthew Wilcox, Keith Busch,
	Benjamin Herrenschmidt, Paul Mackerras, David Gibson,
	Christoph Hellwig, David S. Miller, linux-nvme, linux-kernel,
	linuxppc-dev, sparclinux

On 28.10.2015 [12:00:20 +1100], Alexey Kardashevskiy wrote:
> On 10/28/2015 09:27 AM, Nishanth Aravamudan wrote:
> >On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:
> >>On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
> >>>On Power, the kernel's page size can differ from the IOMMU's page size,
> >>>so we need to override the generic implementation, which always returns
> >>>the kernel's page size. Lookup the IOMMU's page size from struct
> >>>iommu_table, if available. Fallback to the kernel's page size,
> >>>otherwise.
> >>>
> >>>Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> >>>---
> >>>  arch/powerpc/include/asm/dma-mapping.h | 3 +++
> >>>  arch/powerpc/kernel/dma.c              | 9 +++++++++
> >>>  2 files changed, 12 insertions(+)
> >>>
> >>>diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> >>>index 7f522c0..c5638f4 100644
> >>>--- a/arch/powerpc/include/asm/dma-mapping.h
> >>>+++ b/arch/powerpc/include/asm/dma-mapping.h
> >>>@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
> >>>  #define HAVE_ARCH_DMA_SET_MASK 1
> >>>  extern int dma_set_mask(struct device *dev, u64 dma_mask);
> >>>
> >>>+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
> >>>+extern unsigned long dma_get_page_shift(struct device *dev);
> >>>+
> >>>  #include <asm-generic/dma-mapping-common.h>
> >>>
> >>>  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
> >>>diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> >>>index 59503ed..e805af2 100644
> >>>--- a/arch/powerpc/kernel/dma.c
> >>>+++ b/arch/powerpc/kernel/dma.c
> >>>@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
> >>>  }
> >>>  EXPORT_SYMBOL(dma_set_mask);
> >>>
> >>>+unsigned long dma_get_page_shift(struct device *dev)
> >>>+{
> >>>+	struct iommu_table *tbl = get_iommu_table_base(dev);
> >>>+	if (tbl)
> >>>+		return tbl->it_page_shift;
> >>
> >>
> >>All PCI devices have this initialized on POWER (at least, our, IBM's
> >>POWER) so 4K will always be returned here while in the case of
> >>(get_dma_ops(dev)==&dma_direct_ops) it could actually return
> >>PAGE_SHIFT. Is 4K still preferred value to return here?
> >
> >Right, so the logic of my series, goes like this:
> >
> >a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is
> >PAGE_SHIFT everywhere, including Power.
> >
> >b) After 2/7, the Power code will return either the IOMMU table's shift
> >value, if set, or PAGE_SHIFT (I guess this would be the case if
> >get_dma_ops(dev) == &dma_direct_ops, as you said). That is no different
> >than we have now, except we can return the accurate IOMMU value if
> >available.
> 
> If it is not available, then something went wrong and BUG_ON(!tbl ||
> !tbl->it_page_shift) make more sense here than pretending that this
> function can ever return PAGE_SHIFT. imho.

That's a good point, thanks!

> >3) After 3/7, the platform can override the generic Power
> >get_dma_page_shift().
> >
> >4) After 4/7, pseries will return the DDW value, if available, then
> >fallback to the IOMMU table's value. I think in the case of
> >get_dma_ops(dev)==&dma_direct_ops, the only way that can happen is if we
> >are using DDW, right?
> 
> This is for pseries guests; for the powernv host it is a "bypass"
> mode which does 64bit direct DMA mapping and there is no additional
> window for that (i.e. DIRECT64_PROPNAME, etc).

You're right! I should update the code to handle both cases.

In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K?

Seems like this would be a different platform implentation I'd put in
for 'powernv', is that right?

My apologies for missing that, and thank you for the review!

-Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
  2015-10-28  1:54         ` Nishanth Aravamudan
@ 2015-10-28  2:20           ` Benjamin Herrenschmidt
  2015-10-28  2:30             ` Nishanth Aravamudan
  0 siblings, 1 reply; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2015-10-28  2:20 UTC (permalink / raw)
  To: Nishanth Aravamudan, Alexey Kardashevskiy
  Cc: Michael Ellerman, Matthew Wilcox, Keith Busch, Paul Mackerras,
	David Gibson, Christoph Hellwig, David S. Miller, linux-nvme,
	linux-kernel, linuxppc-dev, sparclinux

On Tue, 2015-10-27 at 18:54 -0700, Nishanth Aravamudan wrote:
> 
> In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K?

None :-) The TCEs are completely bypassed. You get a N:M linear mapping
of all memory starting at 1<<59 PCI side.

> Seems like this would be a different platform implentation I'd put in
> for 'powernv', is that right?
> 
> My apologies for missing that, and thank you for the review!

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
  2015-10-28  2:20           ` Benjamin Herrenschmidt
@ 2015-10-28  2:30             ` Nishanth Aravamudan
  2015-10-28  3:20               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-28  2:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, Michael Ellerman, Matthew Wilcox,
	Keith Busch, Paul Mackerras, David Gibson, Christoph Hellwig,
	David S. Miller, linux-nvme, linux-kernel, linuxppc-dev,
	sparclinux

On 28.10.2015 [11:20:05 +0900], Benjamin Herrenschmidt wrote:
> On Tue, 2015-10-27 at 18:54 -0700, Nishanth Aravamudan wrote:
> > 
> > In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K?
> 
> None :-) The TCEs are completely bypassed. You get a N:M linear mapping
> of all memory starting at 1<<59 PCI side.

Err, duh, sorry! Ok, so in that case, DMA page shift is PAGE_SHIFT,
then?

-Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift
  2015-10-28  2:30             ` Nishanth Aravamudan
@ 2015-10-28  3:20               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2015-10-28  3:20 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Alexey Kardashevskiy, Michael Ellerman, Matthew Wilcox,
	Keith Busch, Paul Mackerras, David Gibson, Christoph Hellwig,
	David S. Miller, linux-nvme, linux-kernel, linuxppc-dev,
	sparclinux

On Tue, 2015-10-27 at 19:30 -0700, Nishanth Aravamudan wrote:
> On 28.10.2015 [11:20:05 +0900], Benjamin Herrenschmidt wrote:
> > On Tue, 2015-10-27 at 18:54 -0700, Nishanth Aravamudan wrote:
> > > 
> > > In "bypass" mode, what TCE size is used? Is it guaranteed to be
> > > 4K?
> > 
> > None :-) The TCEs are completely bypassed. You get a N:M linear
> > mapping
> > of all memory starting at 1<<59 PCI side.
> 
> Err, duh, sorry! Ok, so in that case, DMA page shift is PAGE_SHIFT,
> then?

I think so.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-28  0:54       ` David Miller
@ 2015-10-28 13:59         ` Busch, Keith
  2015-10-29 11:55           ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Busch, Keith @ 2015-10-28 13:59 UTC (permalink / raw)
  To: David Miller
  Cc: nacc, willy, benh, paulus, mpe, aik, david, hch, linux-nvme,
	linux-kernel, linuxppc-dev, sparclinux

On Tue, Oct 27, 2015 at 05:54:43PM -0700, David Miller wrote:
> From: "Busch, Keith" <keith.busch@intel.com>
> Date: Tue, 27 Oct 2015 22:36:43 +0000
> 
> > If you're suggesting to compile-time break architectures that currently
> > work just fine with NVMe, let me stop you right there.
> 
> Silently "working" without the architecture maintainer having to explicity
> look at the new interface and make sure his platform is implementing it
> properly is an extremely bad practice.

It won't work if it's wrong. It'll BUG_ON, and I'll be assigned to help
fix it, like what's happened here (on a private bugzilla).

The "new" interface for all the other architectures is the same as the
old one we've been using for the last 5 years.

I welcome x86 maintainer feedback to confirm virtual and DMA addresses
have the same offset at 4k alignment, but I have to insist we don't
break my currently working hardware to force their attention.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-28 13:59         ` Busch, Keith
@ 2015-10-29 11:55           ` Christoph Hellwig
  2015-10-29 15:57             ` Nishanth Aravamudan
  0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2015-10-29 11:55 UTC (permalink / raw)
  To: Busch, Keith
  Cc: David Miller, aik, mpe, nacc, linux-kernel, linux-nvme, hch,
	paulus, benh, sparclinux, willy, linuxppc-dev, david

On Wed, Oct 28, 2015 at 01:59:23PM +0000, Busch, Keith wrote:
> The "new" interface for all the other architectures is the same as the
> old one we've been using for the last 5 years.
> 
> I welcome x86 maintainer feedback to confirm virtual and DMA addresses
> have the same offset at 4k alignment, but I have to insist we don't
> break my currently working hardware to force their attention.

We had a quick cht about this issue and I think we simply should
default to a NVMe controler page size of 4k everywhere as that's the
safe default.  This is also what we do for RDMA Memory reigstrations and
it works fine there for SRP and iSER.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-29 11:55           ` Christoph Hellwig
@ 2015-10-29 15:57             ` Nishanth Aravamudan
  2015-10-29 17:20               ` Busch, Keith
  2015-10-30  1:49               ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA David Miller
  0 siblings, 2 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-29 15:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Busch, Keith, aik, linux-kernel, linux-nvme, paulus, sparclinux,
	willy, linuxppc-dev, David Miller, david

On 29.10.2015 [04:55:36 -0700], Christoph Hellwig wrote:
> On Wed, Oct 28, 2015 at 01:59:23PM +0000, Busch, Keith wrote:
> > The "new" interface for all the other architectures is the same as the
> > old one we've been using for the last 5 years.
> > 
> > I welcome x86 maintainer feedback to confirm virtual and DMA addresses
> > have the same offset at 4k alignment, but I have to insist we don't
> > break my currently working hardware to force their attention.
> 
> We had a quick cht about this issue and I think we simply should
> default to a NVMe controler page size of 4k everywhere as that's the
> safe default.  This is also what we do for RDMA Memory reigstrations and
> it works fine there for SRP and iSER.

So, would that imply changing just the NVMe driver code rather than
adding the dma_page_shift API at all? What about
architectures that can support the larger page sizes? There is an
implied performance impact, at least, of shifting the IO size down.

Sorry for the continuing questions -- I got lots of conflicting feedback
on the last series and want to make sure v4 is more acceptable.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-29 15:57             ` Nishanth Aravamudan
@ 2015-10-29 17:20               ` Busch, Keith
  2015-10-30 21:35                 ` [PATCH 1/1 v3] drivers/nvme: default to 4k device page size Nishanth Aravamudan
  2015-10-30  1:49               ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA David Miller
  1 sibling, 1 reply; 49+ messages in thread
From: Busch, Keith @ 2015-10-29 17:20 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Hellwig, aik, linux-kernel, linux-nvme, paulus,
	sparclinux, willy, linuxppc-dev, David Miller, david

On Thu, Oct 29, 2015 at 08:57:01AM -0700, Nishanth Aravamudan wrote:
> On 29.10.2015 [04:55:36 -0700], Christoph Hellwig wrote:
> > We had a quick cht about this issue and I think we simply should
> > default to a NVMe controler page size of 4k everywhere as that's the
> > safe default.  This is also what we do for RDMA Memory reigstrations and
> > it works fine there for SRP and iSER.
> 
> So, would that imply changing just the NVMe driver code rather than
> adding the dma_page_shift API at all? What about
> architectures that can support the larger page sizes? There is an
> implied performance impact, at least, of shifting the IO size down.

It is the safe option, but you're right that it might have a measurable
performance impact (can you run an experiment?). Maybe we should just
change the driver to always use MPSMIN for the moment in the interest
of time, and you can flush out the new API before the next merge window.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-29 15:57             ` Nishanth Aravamudan
  2015-10-29 17:20               ` Busch, Keith
@ 2015-10-30  1:49               ` David Miller
  2015-10-30 21:35                 ` Nishanth Aravamudan
  1 sibling, 1 reply; 49+ messages in thread
From: David Miller @ 2015-10-30  1:49 UTC (permalink / raw)
  To: nacc
  Cc: hch, keith.busch, aik, linux-kernel, linux-nvme, paulus,
	sparclinux, willy, linuxppc-dev, david

From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Date: Thu, 29 Oct 2015 08:57:01 -0700

> So, would that imply changing just the NVMe driver code rather than
> adding the dma_page_shift API at all? What about
> architectures that can support the larger page sizes? There is an
> implied performance impact, at least, of shifting the IO size down.
> 
> Sorry for the continuing questions -- I got lots of conflicting feedback
> on the last series and want to make sure v4 is more acceptable.

In the long term I would be very happy to see us having a real interface
for this stuff, just my opinion...

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/1 v3] drivers/nvme: default to 4k device page size
  2015-10-29 17:20               ` Busch, Keith
@ 2015-10-30 21:35                 ` Nishanth Aravamudan
  2015-10-30 21:48                   ` Keith Busch
  2015-11-03 13:18                   ` Christoph Hellwig
  0 siblings, 2 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-30 21:35 UTC (permalink / raw)
  To: Busch, Keith
  Cc: Christoph Hellwig, aik, linux-kernel, linux-nvme, paulus,
	sparclinux, willy, linuxppc-dev, David Miller, david

On 29.10.2015 [17:20:43 +0000], Busch, Keith wrote:
> On Thu, Oct 29, 2015 at 08:57:01AM -0700, Nishanth Aravamudan wrote:
> > On 29.10.2015 [04:55:36 -0700], Christoph Hellwig wrote:
> > > We had a quick cht about this issue and I think we simply should
> > > default to a NVMe controler page size of 4k everywhere as that's the
> > > safe default.  This is also what we do for RDMA Memory reigstrations and
> > > it works fine there for SRP and iSER.
> > 
> > So, would that imply changing just the NVMe driver code rather than
> > adding the dma_page_shift API at all? What about
> > architectures that can support the larger page sizes? There is an
> > implied performance impact, at least, of shifting the IO size down.
> 
> It is the safe option, but you're right that it might have a
> measurable performance impact (can you run an experiment?). Maybe we
> should just change the driver to always use MPSMIN for the moment in
> the interest of time, and you can flush out the new API before the
> next merge window.

Given that it's 4K just about everywhere by default (and sort of
implicitly expected to be, I guess), I think I'd prefer we default to
4K. That should mitigate the performance impact (I'll ask our IO team to
do some runs, but since this impacts functionality on some hardware, I
don't think it's too relevant for now). Unless there are NVMe devcies
with a MPSMAX < 4K? 

Something like the following?



We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size. There is not currently an
API to obtain the IOMMU's page size across all architectures and in the
interest of a stop-gap fix to this functional issue, default the NVMe
device page size to 4K, with the intent of adding such an API and
implementation across all architectures in the next merge window.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

v2 -> v3:
  In the interest of fixing the functional problem in the short-term,
  just force the device page size to 4K and work on adding the new API
  in the next merge window.

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index ccc0c1f93daa..a9a5285bdb39 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	u32 aqa;
 	u64 cap = readq(&dev->bar->cap);
 	struct nvme_queue *nvmeq;
-	unsigned page_shift = PAGE_SHIFT;
+	/*
+	 * default to a 4K page size, with the intention to update this
+	 * path in the future to accomodate architectures with differing
+	 * kernel and IO page sizes.
+	 */
+	unsigned page_shift = 12;
 	unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
 	unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 




-Nish


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA
  2015-10-30  1:49               ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA David Miller
@ 2015-10-30 21:35                 ` Nishanth Aravamudan
  0 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-30 21:35 UTC (permalink / raw)
  To: David Miller
  Cc: hch, keith.busch, aik, linux-kernel, linux-nvme, paulus,
	sparclinux, willy, linuxppc-dev, david

On 29.10.2015 [18:49:55 -0700], David Miller wrote:
> From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Date: Thu, 29 Oct 2015 08:57:01 -0700
> 
> > So, would that imply changing just the NVMe driver code rather than
> > adding the dma_page_shift API at all? What about
> > architectures that can support the larger page sizes? There is an
> > implied performance impact, at least, of shifting the IO size down.
> > 
> > Sorry for the continuing questions -- I got lots of conflicting feedback
> > on the last series and want to make sure v4 is more acceptable.
> 
> In the long term I would be very happy to see us having a real interface
> for this stuff, just my opinion...

Yep, I think I'll try and balance the two -- fix NVMe for now with a 4K
page size as suggested by Christoph, and then work on the more complete
API for the next merge.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/1 v3] drivers/nvme: default to 4k device page size
  2015-10-30 21:35                 ` [PATCH 1/1 v3] drivers/nvme: default to 4k device page size Nishanth Aravamudan
@ 2015-10-30 21:48                   ` Keith Busch
  2015-10-30 22:13                     ` Nishanth Aravamudan
  2015-11-03 13:18                   ` Christoph Hellwig
  1 sibling, 1 reply; 49+ messages in thread
From: Keith Busch @ 2015-10-30 21:48 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Hellwig, aik, linux-kernel, linux-nvme, paulus,
	sparclinux, willy, linuxppc-dev, David Miller, david

On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> Given that it's 4K just about everywhere by default (and sort of
> implicitly expected to be, I guess), I think I'd prefer we default to
> 4K. That should mitigate the performance impact (I'll ask our IO team to
> do some runs, but since this impacts functionality on some hardware, I
> don't think it's too relevant for now). Unless there are NVMe devcies
> with a MPSMAX < 4K? 

Right, I assumed MPSMIN was always 4k for the same reason you mentioned,
but you can hard code it like you've done in your patch.

The spec defines MPSMAX such that it's impossible to find a device
with MPSMAX < 4k.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/1 v3] drivers/nvme: default to 4k device page size
  2015-10-30 21:48                   ` Keith Busch
@ 2015-10-30 22:13                     ` Nishanth Aravamudan
  0 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-10-30 22:13 UTC (permalink / raw)
  To: Keith Busch
  Cc: aik, linux-kernel, linux-nvme, Christoph Hellwig, paulus,
	sparclinux, willy, linuxppc-dev, David Miller, david

On 30.10.2015 [21:48:48 +0000], Keith Busch wrote:
> On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> > Given that it's 4K just about everywhere by default (and sort of
> > implicitly expected to be, I guess), I think I'd prefer we default to
> > 4K. That should mitigate the performance impact (I'll ask our IO team to
> > do some runs, but since this impacts functionality on some hardware, I
> > don't think it's too relevant for now). Unless there are NVMe devcies
> > with a MPSMAX < 4K? 
> 
> Right, I assumed MPSMIN was always 4k for the same reason you mentioned,
> but you can hard code it like you've done in your patch.
> 
> The spec defines MPSMAX such that it's impossible to find a device
> with MPSMAX < 4k.

Great, thanks!

-Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/1 v3] drivers/nvme: default to 4k device page size
  2015-10-30 21:35                 ` [PATCH 1/1 v3] drivers/nvme: default to 4k device page size Nishanth Aravamudan
  2015-10-30 21:48                   ` Keith Busch
@ 2015-11-03 13:18                   ` Christoph Hellwig
  2015-11-03 13:46                     ` Keith Busch
  1 sibling, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2015-11-03 13:18 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Busch, Keith, aik, linux-kernel, linux-nvme, Christoph Hellwig,
	paulus, sparclinux, willy, linuxppc-dev, David Miller, david

On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index ccc0c1f93daa..a9a5285bdb39 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
>  	u32 aqa;
>  	u64 cap = readq(&dev->bar->cap);
>  	struct nvme_queue *nvmeq;
> -	unsigned page_shift = PAGE_SHIFT;
> +	/*
> +	 * default to a 4K page size, with the intention to update this
> +	 * path in the future to accomodate architectures with differing
> +	 * kernel and IO page sizes.
> +	 */
> +	unsigned page_shift = 12;
>  	unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
>  	unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;

Looks good as a start.  Note that all the MPSMIN/MAX checking could
be removed as NVMe devices must support 4k pages.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/1 v3] drivers/nvme: default to 4k device page size
  2015-11-03 13:18                   ` Christoph Hellwig
@ 2015-11-03 13:46                     ` Keith Busch
  2015-11-05 17:01                       ` [PATCH 1/1 v4] " Nishanth Aravamudan
  0 siblings, 1 reply; 49+ messages in thread
From: Keith Busch @ 2015-11-03 13:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nishanth Aravamudan, aik, linux-kernel, linux-nvme, paulus,
	sparclinux, willy, linuxppc-dev, David Miller, david

On Tue, Nov 03, 2015 at 05:18:24AM -0800, Christoph Hellwig wrote:
> On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> > diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> > index ccc0c1f93daa..a9a5285bdb39 100644
> > --- a/drivers/block/nvme-core.c
> > +++ b/drivers/block/nvme-core.c
> > @@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
> >  	u32 aqa;
> >  	u64 cap = readq(&dev->bar->cap);
> >  	struct nvme_queue *nvmeq;
> > -	unsigned page_shift = PAGE_SHIFT;
> > +	/*
> > +	 * default to a 4K page size, with the intention to update this
> > +	 * path in the future to accomodate architectures with differing
> > +	 * kernel and IO page sizes.
> > +	 */
> > +	unsigned page_shift = 12;
> >  	unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
> >  	unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
> 
> Looks good as a start.  Note that all the MPSMIN/MAX checking could
> be removed as NVMe devices must support 4k pages.

MAX can go, and while it's probably the case that all devices support 4k,
it's not a spec requirement, so we should keep the dev_page_min check.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/1 v4] drivers/nvme: default to 4k device page size
  2015-11-03 13:46                     ` Keith Busch
@ 2015-11-05 17:01                       ` Nishanth Aravamudan
  2015-11-05 19:58                         ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-11-05 17:01 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, aik, linux-kernel, linux-nvme, paulus,
	sparclinux, willy, linuxppc-dev, David Miller, david

On 03.11.2015 [13:46:25 +0000], Keith Busch wrote:
> On Tue, Nov 03, 2015 at 05:18:24AM -0800, Christoph Hellwig wrote:
> > On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> > > diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> > > index ccc0c1f93daa..a9a5285bdb39 100644
> > > --- a/drivers/block/nvme-core.c
> > > +++ b/drivers/block/nvme-core.c
> > > @@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
> > >  	u32 aqa;
> > >  	u64 cap = readq(&dev->bar->cap);
> > >  	struct nvme_queue *nvmeq;
> > > -	unsigned page_shift = PAGE_SHIFT;
> > > +	/*
> > > +	 * default to a 4K page size, with the intention to update this
> > > +	 * path in the future to accomodate architectures with differing
> > > +	 * kernel and IO page sizes.
> > > +	 */
> > > +	unsigned page_shift = 12;
> > >  	unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
> > >  	unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
> > 
> > Looks good as a start.  Note that all the MPSMIN/MAX checking could
> > be removed as NVMe devices must support 4k pages.
> 
> MAX can go, and while it's probably the case that all devices support 4k,
> it's not a spec requirement, so we should keep the dev_page_min check.

Ok, here's an updated patch.

We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size. There is not currently an
API to obtain the IOMMU's page size across all architectures and in the
interest of a stop-gap fix to this functional issue, default the NVMe
device page size to 4K, with the intent of adding such an API and
implementation across all architectures in the next merge window.

With the functionally equivalent v3 of this patch, our hardware test
exerciser survives when using 32-bit DMA; without the patch, the kernel
will BUG within a few minutes.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

v2 -> v3:
  In the interest of fixing the functional problem in the short-term,
  just force the device page size to 4K and work on adding the new API
  in the next merge window.

v3 -> v4:
  Rebase to the 4.3, including the new code locations.
  Based upon feedback from Keith Busch and Christoph Hellwig, remove the
  device max check, as the spec requires MPSMAX >= 4K.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e878590e71b6..00ca45bb0bc0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1701,9 +1701,13 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	u32 aqa;
 	u64 cap = readq(&dev->bar->cap);
 	struct nvme_queue *nvmeq;
-	unsigned page_shift = PAGE_SHIFT;
+	/*
+	 * default to a 4K page size, with the intention to update this
+	 * path in the future to accomodate architectures with differing
+	 * kernel and IO page sizes.
+	 */
+	unsigned page_shift = 12;
 	unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
-	unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 
 	if (page_shift < dev_page_min) {
 		dev_err(dev->dev,
@@ -1712,13 +1716,6 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 				1 << page_shift);
 		return -ENODEV;
 	}
-	if (page_shift > dev_page_max) {
-		dev_info(dev->dev,
-				"Device maximum page size (%u) smaller than "
-				"host (%u); enabling work-around\n",
-				1 << dev_page_max, 1 << page_shift);
-		page_shift = dev_page_max;
-	}
 
 	dev->subsystem = readl(&dev->bar->vs) >= NVME_VS(1, 1) ?
 						NVME_CAP_NSSRC(cap) : 0;


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size
  2015-11-05 17:01                       ` [PATCH 1/1 v4] " Nishanth Aravamudan
@ 2015-11-05 19:58                         ` Christoph Hellwig
  2015-11-05 21:54                           ` Nishanth Aravamudan
                                             ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Christoph Hellwig @ 2015-11-05 19:58 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Keith Busch, aik, linux-kernel, linux-nvme, paulus, sparclinux,
	willy, linuxppc-dev, David Miller, david

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

... but I doubt we'll ever bother updating it.  Most architectures
with arger page sizes also have iommus and would need different settings
for different iommus vs direct mapping for very little gain.  There's a
reason why we never bothered for RDMA either.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size
  2015-11-05 19:58                         ` Christoph Hellwig
@ 2015-11-05 21:54                           ` Nishanth Aravamudan
  2015-11-06 16:13                           ` Nishanth Aravamudan
  2015-11-13  7:37                           ` Christoph Hellwig
  2 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-11-05 21:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, aik, linux-kernel, linux-nvme, paulus, sparclinux,
	willy, linuxppc-dev, David Miller, david

On 05.11.2015 [11:58:39 -0800], Christoph Hellwig wrote:
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> ... but I doubt we'll ever bother updating it.  Most architectures
> with arger page sizes also have iommus and would need different settings
> for different iommus vs direct mapping for very little gain.  There's a
> reason why we never bothered for RDMA either.

Fair enough :) Thanks for all your reviews and comments.

-Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size
  2015-11-05 19:58                         ` Christoph Hellwig
  2015-11-05 21:54                           ` Nishanth Aravamudan
@ 2015-11-06 16:13                           ` Nishanth Aravamudan
  2015-11-13  7:37                           ` Christoph Hellwig
  2 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2015-11-06 16:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, aik, linux-kernel, linux-nvme, paulus, sparclinux,
	willy, linuxppc-dev, David Miller, david

On 05.11.2015 [11:58:39 -0800], Christoph Hellwig wrote:
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> ... but I doubt we'll ever bother updating it.  Most architectures
> with arger page sizes also have iommus and would need different settings
> for different iommus vs direct mapping for very little gain.  There's a
> reason why we never bothered for RDMA either.

FWIW, whose tree should this go through? The bug only appears on Power,
afaik, but the patch is now just an NVMe change.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size
  2015-11-05 19:58                         ` Christoph Hellwig
  2015-11-05 21:54                           ` Nishanth Aravamudan
  2015-11-06 16:13                           ` Nishanth Aravamudan
@ 2015-11-13  7:37                           ` Christoph Hellwig
  2015-11-13 15:08                             ` Keith Busch
  2 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2015-11-13  7:37 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: aik, linux-kernel, linux-nvme, paulus, sparclinux, willy,
	linuxppc-dev, David Miller, david

Jens, Keith: any chance to get this to Linux for 4.4 (and -stable)?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size
  2015-11-13  7:37                           ` Christoph Hellwig
@ 2015-11-13 15:08                             ` Keith Busch
  2015-11-18 14:42                               ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Keith Busch @ 2015-11-13 15:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nishanth Aravamudan, aik, linux-kernel, linux-nvme, paulus,
	sparclinux, willy, linuxppc-dev, David Miller, david

On Thu, Nov 12, 2015 at 11:37:54PM -0800, Christoph Hellwig wrote:
> Jens, Keith: any chance to get this to Linux for 4.4 (and -stable)?

I agreed, looks good to me.

Acked-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size
  2015-11-13 15:08                             ` Keith Busch
@ 2015-11-18 14:42                               ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2015-11-18 14:42 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, aik, Nishanth Aravamudan, linux-kernel,
	linux-nvme, paulus, sparclinux, willy, linuxppc-dev,
	David Miller, david

On Fri, Nov 13, 2015 at 03:08:11PM +0000, Keith Busch wrote:
> On Thu, Nov 12, 2015 at 11:37:54PM -0800, Christoph Hellwig wrote:
> > Jens, Keith: any chance to get this to Linux for 4.4 (and -stable)?
> 
> I agreed, looks good to me.
> 
> Acked-by: Keith Busch <keith.busch@intel.com>

Jens, can you pick this one for -rc2?

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2015-11-18 14:42 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-23 20:54 [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
2015-10-23 20:56 ` [PATCH 1/7 v3] dma-mapping: add generic dma_get_page_shift API Nishanth Aravamudan
2015-10-23 20:57 ` [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift Nishanth Aravamudan
2015-10-27  6:02   ` Alexey Kardashevskiy
2015-10-27 14:06     ` Busch, Keith
2015-10-27 22:27     ` Nishanth Aravamudan
2015-10-28  1:00       ` Alexey Kardashevskiy
2015-10-28  1:54         ` Nishanth Aravamudan
2015-10-28  2:20           ` Benjamin Herrenschmidt
2015-10-28  2:30             ` Nishanth Aravamudan
2015-10-28  3:20               ` Benjamin Herrenschmidt
2015-10-23 20:57 ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA Nishanth Aravamudan
2015-10-23 20:58 ` [PATCH 3/7 v2] powerpc/dma: implement per-platform dma_get_page_shift Nishanth Aravamudan
2015-10-23 20:59 ` [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift Nishanth Aravamudan
2015-10-27  5:56   ` Alexey Kardashevskiy
2015-10-27 22:22     ` Nishanth Aravamudan
2015-10-23 21:00 ` [PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h Nishanth Aravamudan
2015-10-23 21:02   ` Nishanth Aravamudan
2015-10-23 21:01 ` [RFC PATCH 6/7] sparc/dma-mapping: override dma_get_page_shift Nishanth Aravamudan
2015-10-23 21:02 ` [PATCH 7/7 v2] drivers/nvme: default to the IOMMU page size Nishanth Aravamudan
2015-10-27  1:27 ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA David Miller
2015-10-27 22:20   ` Nishanth Aravamudan
2015-10-27 22:36     ` Busch, Keith
2015-10-28  0:54       ` David Miller
2015-10-28 13:59         ` Busch, Keith
2015-10-29 11:55           ` Christoph Hellwig
2015-10-29 15:57             ` Nishanth Aravamudan
2015-10-29 17:20               ` Busch, Keith
2015-10-30 21:35                 ` [PATCH 1/1 v3] drivers/nvme: default to 4k device page size Nishanth Aravamudan
2015-10-30 21:48                   ` Keith Busch
2015-10-30 22:13                     ` Nishanth Aravamudan
2015-11-03 13:18                   ` Christoph Hellwig
2015-11-03 13:46                     ` Keith Busch
2015-11-05 17:01                       ` [PATCH 1/1 v4] " Nishanth Aravamudan
2015-11-05 19:58                         ` Christoph Hellwig
2015-11-05 21:54                           ` Nishanth Aravamudan
2015-11-06 16:13                           ` Nishanth Aravamudan
2015-11-13  7:37                           ` Christoph Hellwig
2015-11-13 15:08                             ` Keith Busch
2015-11-18 14:42                               ` Christoph Hellwig
2015-10-30  1:49               ` [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA David Miller
2015-10-30 21:35                 ` Nishanth Aravamudan
2015-10-27 22:57     ` Julian Calaby
2015-10-27 23:40       ` Nishanth Aravamudan
2015-10-27 23:43         ` Julian Calaby
2015-10-28  0:29           ` Benjamin Herrenschmidt
2015-10-28  1:00           ` David Miller
2015-10-28  0:53     ` David Miller
2015-10-28  1:52       ` Nishanth Aravamudan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).