linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/5] enhance DMA CMA on x86
@ 2014-04-15 13:08 Akinobu Mita
  2014-04-15 13:08 ` [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled Akinobu Mita
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: Akinobu Mita @ 2014-04-15 13:08 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: Akinobu Mita, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

This patch set enhances the DMA Contiguous Memory Allocator on x86.

Currently, DMA CMA is only supported with the pci-nommu dma_map_ops,
and furthermore it can't be enabled on x86_64.  But I would like to
allocate big contiguous memory with dma_alloc_coherent() and hand it
to a device that requires it, regardless of which dma mapping
implementation is actually used in the system.

So this series makes it work with the swiotlb and intel-iommu
dma_map_ops, too.  It also extends the "cma=" kernel parameter to
specify a placement constraint, i.e. the physical address range the
memory may be allocated from.  For example, "cma=64M@0-4G" makes CMA
allocate memory below 4GB, which is required for devices that only
support 32-bit addressing on 64-bit systems without an iommu.
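
As a rough sketch of the intended use (illustration only, not part of
the series; the device and buffer size are made up), a driver would
then simply do:

	/* 16MB coherent buffer: more than MAX_ORDER pages, so only a
	 * contiguous area reserved by CMA can satisfy it */
	void *buf;
	dma_addr_t dma_handle;

	buf = dma_alloc_coherent(&pdev->dev, SZ_16M, &dma_handle, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;
	/* ... use the buffer, then ... */
	dma_free_coherent(&pdev->dev, SZ_16M, buf, dma_handle);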

* Changes from v2
- Rebased on current Linus tree
- Add Acked-by line
- Fix gfp flags check for __GFP_ATOMIC, reported by Marek Szyprowski
- Avoid CMA area on highmem with cma= option, reported by Marek Szyprowski

* Changes from v1
- fix dma_alloc_coherent() with __GFP_ZERO
- add placement specifier for "cma=" kernel parameter

Akinobu Mita (5):
  x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled
  x86: enable DMA CMA with swiotlb
  intel-iommu: integrate DMA CMA
  memblock: introduce memblock_alloc_range()
  cma: add placement specifier for "cma=" kernel parameter

 Documentation/kernel-parameters.txt |  7 +++++--
 arch/x86/Kconfig                    |  2 +-
 arch/x86/include/asm/swiotlb.h      |  7 +++++++
 arch/x86/kernel/amd_gart_64.c       |  2 +-
 arch/x86/kernel/pci-dma.c           |  3 +--
 arch/x86/kernel/pci-swiotlb.c       |  9 +++++---
 arch/x86/kernel/setup.c             |  2 +-
 arch/x86/pci/sta2x11-fixup.c        |  6 ++----
 drivers/base/dma-contiguous.c       | 42 ++++++++++++++++++++++++++++---------
 drivers/iommu/intel-iommu.c         | 32 +++++++++++++++++++++-------
 include/linux/dma-contiguous.h      |  9 +++++---
 include/linux/memblock.h            |  2 ++
 include/linux/swiotlb.h             |  2 ++
 lib/swiotlb.c                       |  2 +-
 mm/memblock.c                       | 21 +++++++++++++++----
 15 files changed, 108 insertions(+), 40 deletions(-)

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
-- 
1.8.3.2


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled
  2014-04-15 13:08 [PATCH v3 0/5] enhance DMA CMA on x86 Akinobu Mita
@ 2014-04-15 13:08 ` Akinobu Mita
  2014-04-16 19:44   ` Andrew Morton
  2014-04-15 13:08 ` [PATCH v3 2/5] x86: enable DMA CMA with swiotlb Akinobu Mita
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: Akinobu Mita @ 2014-04-15 13:08 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: Akinobu Mita, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.

But when the contiguous memory allocator (CMA) is enabled on x86 and
the memory region is allocated by dma_alloc_from_contiguous(), it
doesn't return zeroed memory, because dma_generic_alloc_coherent()
forgets to fill the memory region with zero if it was allocated by
dma_alloc_from_contiguous().

Most implementations of dma_alloc_coherent() return zeroed memory
regardless of whether __GFP_ZERO is specified.  So fix this by
unconditionally zeroing the allocated memory region.

Alternatively, we could fix dma_alloc_from_contiguous() to return
zeroed memory and remove the memset() from all of its callers.  But we
can't simply remove the memset on arm, because __dma_clear_buffer() is
used there to ensure cache flushing and it is used in many places.  Of
course we could do a redundant memset in dma_alloc_from_contiguous(),
but I think this patch has less impact as a fix for this problem.

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
* Change from v2
- update commit log to describe a possible alternative fix

 arch/x86/kernel/pci-dma.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index f7d0672..a0ffe44 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -97,7 +97,6 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
 
 	dma_mask = dma_alloc_coherent_mask(dev, flag);
 
-	flag |= __GFP_ZERO;
 again:
 	page = NULL;
 	/* CMA can be used only in the context which permits sleeping */
@@ -120,7 +119,7 @@ again:
 
 		return NULL;
 	}
-
+	memset(page_address(page), 0, size);
 	*dma_addr = addr;
 	return page_address(page);
 }
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 2/5] x86: enable DMA CMA with swiotlb
  2014-04-15 13:08 [PATCH v3 0/5] enhance DMA CMA on x86 Akinobu Mita
  2014-04-15 13:08 ` [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled Akinobu Mita
@ 2014-04-15 13:08 ` Akinobu Mita
  2014-04-15 13:08 ` [PATCH v3 3/5] intel-iommu: integrate DMA CMA Akinobu Mita
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Akinobu Mita @ 2014-04-15 13:08 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: Akinobu Mita, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

Support for the DMA Contiguous Memory Allocator on x86 is disabled
when the swiotlb config option is enabled.  So DMA CMA is always
disabled on x86_64, because swiotlb is always enabled there.  This
patch makes DMA CMA usable even when the swiotlb config option is
enabled.

The contiguous memory allocator on x86 is integrated into
dma_generic_alloc_coherent(), the .alloc callback in nommu_dma_ops
used by dma_alloc_coherent().

x86_swiotlb_alloc_coherent(), the .alloc callback in swiotlb_dma_ops,
first tries to allocate with dma_generic_alloc_coherent() and then
falls back to swiotlb_alloc_coherent().

The main part of supporting DMA CMA with swiotlb is changing
x86_swiotlb_free_coherent(), the .free callback in swiotlb_dma_ops
used by dma_free_coherent(), so that it can distinguish memory
allocated by dma_generic_alloc_coherent() from memory allocated by
swiotlb_alloc_coherent(), and release the former with
dma_generic_free_coherent(), which can handle contiguous memory.  This
change requires making is_swiotlb_buffer() a global function.

This also requires changing the .free callback in the dma_map_ops for
amd_gart and sta2x11, because those dma_ops also use
dma_generic_alloc_coherent().

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
* Change from v2
- Add Acked-by line

 arch/x86/Kconfig               | 2 +-
 arch/x86/include/asm/swiotlb.h | 7 +++++++
 arch/x86/kernel/amd_gart_64.c  | 2 +-
 arch/x86/kernel/pci-swiotlb.c  | 9 ++++++---
 arch/x86/pci/sta2x11-fixup.c   | 6 ++----
 include/linux/swiotlb.h        | 2 ++
 lib/swiotlb.c                  | 2 +-
 7 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..7fa3f83 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -41,7 +41,7 @@ config X86
 	select ARCH_WANT_OPTIONAL_GPIOLIB
 	select ARCH_WANT_FRAME_POINTERS
 	select HAVE_DMA_ATTRS
-	select HAVE_DMA_CONTIGUOUS if !SWIOTLB
+	select HAVE_DMA_CONTIGUOUS
 	select HAVE_KRETPROBES
 	select GENERIC_EARLY_IOREMAP
 	select HAVE_OPTPROBES
diff --git a/arch/x86/include/asm/swiotlb.h b/arch/x86/include/asm/swiotlb.h
index 977f176..ab05d73 100644
--- a/arch/x86/include/asm/swiotlb.h
+++ b/arch/x86/include/asm/swiotlb.h
@@ -29,4 +29,11 @@ static inline void pci_swiotlb_late_init(void)
 
 static inline void dma_mark_clean(void *addr, size_t size) {}
 
+extern void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
+					dma_addr_t *dma_handle, gfp_t flags,
+					struct dma_attrs *attrs);
+extern void x86_swiotlb_free_coherent(struct device *dev, size_t size,
+					void *vaddr, dma_addr_t dma_addr,
+					struct dma_attrs *attrs);
+
 #endif /* _ASM_X86_SWIOTLB_H */
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index b574b29..8e3842f 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -512,7 +512,7 @@ gart_free_coherent(struct device *dev, size_t size, void *vaddr,
 		   dma_addr_t dma_addr, struct dma_attrs *attrs)
 {
 	gart_unmap_page(dev, dma_addr, size, DMA_BIDIRECTIONAL, NULL);
-	free_pages((unsigned long)vaddr, get_order(size));
+	dma_generic_free_coherent(dev, size, vaddr, dma_addr, attrs);
 }
 
 static int gart_mapping_error(struct device *dev, dma_addr_t dma_addr)
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index 6c483ba..77dd0ad 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -14,7 +14,7 @@
 #include <asm/iommu_table.h>
 int swiotlb __read_mostly;
 
-static void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
+void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
 					dma_addr_t *dma_handle, gfp_t flags,
 					struct dma_attrs *attrs)
 {
@@ -28,11 +28,14 @@ static void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
 	return swiotlb_alloc_coherent(hwdev, size, dma_handle, flags);
 }
 
-static void x86_swiotlb_free_coherent(struct device *dev, size_t size,
+void x86_swiotlb_free_coherent(struct device *dev, size_t size,
 				      void *vaddr, dma_addr_t dma_addr,
 				      struct dma_attrs *attrs)
 {
-	swiotlb_free_coherent(dev, size, vaddr, dma_addr);
+	if (is_swiotlb_buffer(dma_to_phys(dev, dma_addr)))
+		swiotlb_free_coherent(dev, size, vaddr, dma_addr);
+	else
+		dma_generic_free_coherent(dev, size, vaddr, dma_addr, attrs);
 }
 
 static struct dma_map_ops swiotlb_dma_ops = {
diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index 9d8a509..5ceda85 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -173,9 +173,7 @@ static void *sta2x11_swiotlb_alloc_coherent(struct device *dev,
 {
 	void *vaddr;
 
-	vaddr = dma_generic_alloc_coherent(dev, size, dma_handle, flags, attrs);
-	if (!vaddr)
-		vaddr = swiotlb_alloc_coherent(dev, size, dma_handle, flags);
+	vaddr = x86_swiotlb_alloc_coherent(dev, size, dma_handle, flags, attrs);
 	*dma_handle = p2a(*dma_handle, to_pci_dev(dev));
 	return vaddr;
 }
@@ -183,7 +181,7 @@ static void *sta2x11_swiotlb_alloc_coherent(struct device *dev,
 /* We have our own dma_ops: the same as swiotlb but from alloc (above) */
 static struct dma_map_ops sta2x11_dma_ops = {
 	.alloc = sta2x11_swiotlb_alloc_coherent,
-	.free = swiotlb_free_coherent,
+	.free = x86_swiotlb_free_coherent,
 	.map_page = swiotlb_map_page,
 	.unmap_page = swiotlb_unmap_page,
 	.map_sg = swiotlb_map_sg_attrs,
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index a5ffd32..e7a018e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -116,4 +116,6 @@ static inline void swiotlb_free(void) { }
 #endif
 
 extern void swiotlb_print_info(void);
+extern int is_swiotlb_buffer(phys_addr_t paddr);
+
 #endif /* __LINUX_SWIOTLB_H */
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 7f57f24..caaab5d 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -374,7 +374,7 @@ void __init swiotlb_free(void)
 	io_tlb_nslabs = 0;
 }
 
-static int is_swiotlb_buffer(phys_addr_t paddr)
+int is_swiotlb_buffer(phys_addr_t paddr)
 {
 	return paddr >= io_tlb_start && paddr < io_tlb_end;
 }
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 3/5] intel-iommu: integrate DMA CMA
  2014-04-15 13:08 [PATCH v3 0/5] enhance DMA CMA on x86 Akinobu Mita
  2014-04-15 13:08 ` [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled Akinobu Mita
  2014-04-15 13:08 ` [PATCH v3 2/5] x86: enable DMA CMA with swiotlb Akinobu Mita
@ 2014-04-15 13:08 ` Akinobu Mita
  2014-04-15 13:08 ` [PATCH v3 4/5] memblock: introduce memblock_alloc_range() Akinobu Mita
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Akinobu Mita @ 2014-04-15 13:08 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: Akinobu Mita, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

This adds support for the DMA Contiguous Memory Allocator to
intel-iommu.  The change enables dma_alloc_coherent() to allocate big
contiguous memory.

It is achieved in the same way as nommu_dma_ops does it today, i.e.
memory is first allocated with dma_alloc_from_contiguous(), and
alloc_pages() is used as a fallback.

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
* Changes from v2
- Fix gfp flags check for __GFP_ATOMIC, reported by Marek Szyprowski
- Rebased on current Linus tree

 drivers/iommu/intel-iommu.c | 32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index cdb97c4..78c68cb 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3185,7 +3185,7 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
 				  dma_addr_t *dma_handle, gfp_t flags,
 				  struct dma_attrs *attrs)
 {
-	void *vaddr;
+	struct page *page = NULL;
 	int order;
 
 	size = PAGE_ALIGN(size);
@@ -3200,17 +3200,31 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
 			flags |= GFP_DMA32;
 	}
 
-	vaddr = (void *)__get_free_pages(flags, order);
-	if (!vaddr)
+	if (flags & __GFP_WAIT) {
+		unsigned int count = size >> PAGE_SHIFT;
+
+		page = dma_alloc_from_contiguous(dev, count, order);
+		if (page && iommu_no_mapping(dev) &&
+		    page_to_phys(page) + size > dev->coherent_dma_mask) {
+			dma_release_from_contiguous(dev, page, count);
+			page = NULL;
+		}
+	}
+
+	if (!page)
+		page = alloc_pages(flags, order);
+	if (!page)
 		return NULL;
-	memset(vaddr, 0, size);
+	memset(page_address(page), 0, size);
 
-	*dma_handle = __intel_map_single(dev, virt_to_bus(vaddr), size,
+	*dma_handle = __intel_map_single(dev, page_to_phys(page), size,
 					 DMA_BIDIRECTIONAL,
 					 dev->coherent_dma_mask);
 	if (*dma_handle)
-		return vaddr;
-	free_pages((unsigned long)vaddr, order);
+		return page_address(page);
+	if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
+		__free_pages(page, order);
+
 	return NULL;
 }
 
@@ -3218,12 +3232,14 @@ static void intel_free_coherent(struct device *dev, size_t size, void *vaddr,
 				dma_addr_t dma_handle, struct dma_attrs *attrs)
 {
 	int order;
+	struct page *page = virt_to_page(vaddr);
 
 	size = PAGE_ALIGN(size);
 	order = get_order(size);
 
 	intel_unmap_page(dev, dma_handle, size, DMA_BIDIRECTIONAL, NULL);
-	free_pages((unsigned long)vaddr, order);
+	if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
+		__free_pages(page, order);
 }
 
 static void intel_unmap_sg(struct device *dev, struct scatterlist *sglist,
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 4/5] memblock: introduce memblock_alloc_range()
  2014-04-15 13:08 [PATCH v3 0/5] enhance DMA CMA on x86 Akinobu Mita
                   ` (2 preceding siblings ...)
  2014-04-15 13:08 ` [PATCH v3 3/5] intel-iommu: integrate DMA CMA Akinobu Mita
@ 2014-04-15 13:08 ` Akinobu Mita
  2014-04-15 13:08 ` [PATCH v3 5/5] cma: add placement specifier for "cma=" kernel parameter Akinobu Mita
  2014-09-27 14:30 ` [PATCH v3 0/5] enhance DMA CMA on x86 Peter Hurley
  5 siblings, 0 replies; 27+ messages in thread
From: Akinobu Mita @ 2014-04-15 13:08 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: Akinobu Mita, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

This introduces memblock_alloc_range(), which allocates memblock
memory from the specified range of physical addresses.  I would like
to use this function to specify the location of CMA.
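
For illustration only (a sketch, not part of this patch): a caller
reserving 64MB of physically contiguous memory below 4GB would look
roughly like this:

	phys_addr_t addr;

	/* returns the base address on success, 0 on failure */
	addr = memblock_alloc_range(SZ_64M, PAGE_SIZE, 0, DMA_BIT_MASK(32));
	if (!addr)
		pr_warn("failed to reserve 64MB below 4GB\n");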

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
* Change from v2
- Rebased on current Linus tree

 include/linux/memblock.h |  2 ++
 mm/memblock.c            | 21 +++++++++++++++++----
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 8a20a51..c5a61d9 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -221,6 +221,8 @@ static inline bool memblock_bottom_up(void) { return false; }
 #define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
 #define MEMBLOCK_ALLOC_ACCESSIBLE	0
 
+phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
+					phys_addr_t start, phys_addr_t end);
 phys_addr_t memblock_alloc_base(phys_addr_t size, phys_addr_t align,
 				phys_addr_t max_addr);
 phys_addr_t __memblock_alloc_base(phys_addr_t size, phys_addr_t align,
diff --git a/mm/memblock.c b/mm/memblock.c
index e9d6ca9..9a3bed0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -975,22 +975,35 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
 }
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
-static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
-					phys_addr_t align, phys_addr_t max_addr,
-					int nid)
+static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
+					phys_addr_t align, phys_addr_t start,
+					phys_addr_t end, int nid)
 {
 	phys_addr_t found;
 
 	if (!align)
 		align = SMP_CACHE_BYTES;
 
-	found = memblock_find_in_range_node(size, align, 0, max_addr, nid);
+	found = memblock_find_in_range_node(size, align, start, end, nid);
 	if (found && !memblock_reserve(found, size))
 		return found;
 
 	return 0;
 }
 
+phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
+					phys_addr_t start, phys_addr_t end)
+{
+	return memblock_alloc_range_nid(size, align, start, end, NUMA_NO_NODE);
+}
+
+static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
+					phys_addr_t align, phys_addr_t max_addr,
+					int nid)
+{
+	return memblock_alloc_range_nid(size, align, 0, max_addr, nid);
+}
+
 phys_addr_t __init memblock_alloc_nid(phys_addr_t size, phys_addr_t align, int nid)
 {
 	return memblock_alloc_base_nid(size, align, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 5/5] cma: add placement specifier for "cma=" kernel parameter
  2014-04-15 13:08 [PATCH v3 0/5] enhance DMA CMA on x86 Akinobu Mita
                   ` (3 preceding siblings ...)
  2014-04-15 13:08 ` [PATCH v3 4/5] memblock: introduce memblock_alloc_range() Akinobu Mita
@ 2014-04-15 13:08 ` Akinobu Mita
  2014-09-27 14:30 ` [PATCH v3 0/5] enhance DMA CMA on x86 Peter Hurley
  5 siblings, 0 replies; 27+ messages in thread
From: Akinobu Mita @ 2014-04-15 13:08 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: Akinobu Mita, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

Currently, "cma=" kernel parameter is used to specify the size of CMA,
but we can't specify where it is located.  We want to locate CMA below
4GB for devices only supporting 32-bit addressing on 64-bit systems
without iommu.

This enables to specify the placement of CMA by extending "cma=" kernel
parameter.

Examples:
1. locate 64MB CMA below 4GB by "cma=64M@0-4G"
2. locate 64MB CMA exact at 512MB by "cma=64M@512M"

Note that the DMA contiguous memory allocator on x86 assumes that
page_address() works for the pages to allocate.  So this change requires
to limit end address of contiguous memory area upto max_pfn_mapped to
prevent from locating it on highmem area by the argument of
dma_contiguous_reserve().
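
To make the parsing concrete (derived from the early_cma() change
below): "cma=64M@0-4G" yields size_cmdline = 64M, base_cmdline = 0 and
limit_cmdline = 4G, so the area may be placed anywhere in [0, 4G);
"cma=64M@512M" yields base_cmdline = 512M and limit_cmdline = base +
size = 576M, which dma_contiguous_reserve() treats as a fixed
placement and reserves exactly at 512MB.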

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
* Change from v2
- Avoid CMA area on highmem with cma= option, reported by Marek Szyprowski

 Documentation/kernel-parameters.txt |  7 +++++--
 arch/x86/kernel/setup.c             |  2 +-
 drivers/base/dma-contiguous.c       | 42 ++++++++++++++++++++++++++++---------
 include/linux/dma-contiguous.h      |  9 +++++---
 4 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 03e50b4..8488e68 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -617,8 +617,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			Also note the kernel might malfunction if you disable
 			some critical bits.
 
-	cma=nn[MG]	[ARM,KNL]
-			Sets the size of kernel global memory area for contiguous
+	cma=nn[MG]@[start[MG][-end[MG]]]
+			[ARM,X86,KNL]
+			Sets the size of kernel global memory area for
+			contiguous memory allocations and optionally the
+			placement constraint by the physical address range of
 			memory allocations. For more information, see
 			include/linux/dma-contiguous.h
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 09c76d2..78a0e62 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1119,7 +1119,7 @@ void __init setup_arch(char **cmdline_p)
 	setup_real_mode();
 
 	memblock_set_current_limit(get_max_mapped());
-	dma_contiguous_reserve(0);
+	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
 	/*
 	 * NOTE: On x86-32, only from this point on, fixmaps are ready for use.
diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 165c2c2..b056661 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -59,11 +59,22 @@ struct cma *dma_contiguous_default_area;
  */
 static const phys_addr_t size_bytes = CMA_SIZE_MBYTES * SZ_1M;
 static phys_addr_t size_cmdline = -1;
+static phys_addr_t base_cmdline;
+static phys_addr_t limit_cmdline;
 
 static int __init early_cma(char *p)
 {
 	pr_debug("%s(%s)\n", __func__, p);
 	size_cmdline = memparse(p, &p);
+	if (*p != '@')
+		return 0;
+	base_cmdline = memparse(p + 1, &p);
+	if (*p != '-') {
+		limit_cmdline = base_cmdline + size_cmdline;
+		return 0;
+	}
+	limit_cmdline = memparse(p + 1, &p);
+
 	return 0;
 }
 early_param("cma", early_cma);
@@ -107,11 +118,18 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void)
 void __init dma_contiguous_reserve(phys_addr_t limit)
 {
 	phys_addr_t selected_size = 0;
+	phys_addr_t selected_base = 0;
+	phys_addr_t selected_limit = limit;
+	bool fixed = false;
 
 	pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit);
 
 	if (size_cmdline != -1) {
 		selected_size = size_cmdline;
+		selected_base = base_cmdline;
+		selected_limit = min_not_zero(limit_cmdline, limit);
+		if (base_cmdline + size_cmdline == limit_cmdline)
+			fixed = true;
 	} else {
 #ifdef CONFIG_CMA_SIZE_SEL_MBYTES
 		selected_size = size_bytes;
@@ -128,10 +146,12 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
 		pr_debug("%s: reserving %ld MiB for global area\n", __func__,
 			 (unsigned long)selected_size / SZ_1M);
 
-		dma_contiguous_reserve_area(selected_size, 0, limit,
-					    &dma_contiguous_default_area);
+		dma_contiguous_reserve_area(selected_size, selected_base,
+					    selected_limit,
+					    &dma_contiguous_default_area,
+					    fixed);
 	}
-};
+}
 
 static DEFINE_MUTEX(cma_mutex);
 
@@ -187,15 +207,20 @@ core_initcall(cma_init_reserved_areas);
  * @base: Base address of the reserved area optional, use 0 for any
  * @limit: End address of the reserved memory (optional, 0 for any).
  * @res_cma: Pointer to store the created cma region.
+ * @fixed: hint about where to place the reserved area
  *
  * This function reserves memory from early allocator. It should be
  * called by arch specific code once the early allocator (memblock or bootmem)
  * has been activated and all other subsystems have already allocated/reserved
  * memory. This function allows to create custom reserved areas for specific
  * devices.
+ *
+ * If @fixed is true, reserve contiguous area at exactly @base.  If false,
+ * reserve in range from @base to @limit.
  */
 int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
-				       phys_addr_t limit, struct cma **res_cma)
+				       phys_addr_t limit, struct cma **res_cma,
+				       bool fixed)
 {
 	struct cma *cma = &cma_areas[cma_area_count];
 	phys_addr_t alignment;
@@ -221,18 +246,15 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
 	limit &= ~(alignment - 1);
 
 	/* Reserve memory */
-	if (base) {
+	if (base && fixed) {
 		if (memblock_is_region_reserved(base, size) ||
 		    memblock_reserve(base, size) < 0) {
 			ret = -EBUSY;
 			goto err;
 		}
 	} else {
-		/*
-		 * Use __memblock_alloc_base() since
-		 * memblock_alloc_base() panic()s.
-		 */
-		phys_addr_t addr = __memblock_alloc_base(size, alignment, limit);
+		phys_addr_t addr = memblock_alloc_range(size, alignment, base,
+							limit);
 		if (!addr) {
 			ret = -ENOMEM;
 			goto err;
diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
index 3b28f93..772eab5 100644
--- a/include/linux/dma-contiguous.h
+++ b/include/linux/dma-contiguous.h
@@ -88,7 +88,8 @@ static inline void dma_contiguous_set_default(struct cma *cma)
 void dma_contiguous_reserve(phys_addr_t addr_limit);
 
 int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
-				       phys_addr_t limit, struct cma **res_cma);
+				       phys_addr_t limit, struct cma **res_cma,
+				       bool fixed);
 
 /**
  * dma_declare_contiguous() - reserve area for contiguous memory handling
@@ -108,7 +109,7 @@ static inline int dma_declare_contiguous(struct device *dev, phys_addr_t size,
 {
 	struct cma *cma;
 	int ret;
-	ret = dma_contiguous_reserve_area(size, base, limit, &cma);
+	ret = dma_contiguous_reserve_area(size, base, limit, &cma, true);
 	if (ret == 0)
 		dev_set_cma_area(dev, cma);
 
@@ -136,7 +137,9 @@ static inline void dma_contiguous_set_default(struct cma *cma) { }
 static inline void dma_contiguous_reserve(phys_addr_t limit) { }
 
 static inline int dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
-				       phys_addr_t limit, struct cma **res_cma) {
+				       phys_addr_t limit, struct cma **res_cma,
+				       bool fixed)
+{
 	return -ENOSYS;
 }
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled
  2014-04-15 13:08 ` [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled Akinobu Mita
@ 2014-04-16 19:44   ` Andrew Morton
  2014-04-17 15:40     ` Akinobu Mita
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2014-04-16 19:44 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-kernel, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

On Tue, 15 Apr 2014 22:08:45 +0900 Akinobu Mita <akinobu.mita@gmail.com> wrote:

> Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.
>
> But when the contiguous memory allocator (CMA) is enabled on x86 and
> the memory region is allocated by dma_alloc_from_contiguous(), it
> doesn't return zeroed memory, because dma_generic_alloc_coherent()
> forgets to fill the memory region with zero if it was allocated by
> dma_alloc_from_contiguous().
>
> Most implementations of dma_alloc_coherent() return zeroed memory
> regardless of whether __GFP_ZERO is specified.  So fix this by
> unconditionally zeroing the allocated memory region.
>
> Alternatively, we could fix dma_alloc_from_contiguous() to return
> zeroed memory and remove the memset() from all of its callers.  But we
> can't simply remove the memset on arm, because __dma_clear_buffer() is
> used there to ensure cache flushing and it is used in many places.  Of
> course we could do a redundant memset in dma_alloc_from_contiguous(),
> but I think this patch has less impact as a fix for this problem.

But this patch does a duplicated memset if the page was allocated by
alloc_pages_node()?

Would it not be better to pass the gfp_t to dma_alloc_from_contiguous()
and have it implement __GFP_ZERO?  That will fix this inefficiency,
will be symmetrical with the other underlying allocators and should
permit the appropriate fixups in arm?
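
Something like the following sketch, perhaps (the internal helper name
is hypothetical; the existing allocation logic would be unchanged):

	struct page *dma_alloc_from_contiguous(struct device *dev, int count,
					       unsigned int align, gfp_t gfp)
	{
		/* existing CMA allocation path, unchanged */
		struct page *page = __cma_alloc(dev, count, align);

		/* arm would hook its cache-flushing clear here instead */
		if (page && (gfp & __GFP_ZERO))
			memset(page_address(page), 0, count << PAGE_SHIFT);
		return page;
	}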


> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -97,7 +97,6 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
>  
>  	dma_mask = dma_alloc_coherent_mask(dev, flag);
>  
> -	flag |= __GFP_ZERO;
>  again:
>  	page = NULL;
>  	/* CMA can be used only in the context which permits sleeping */
> @@ -120,7 +119,7 @@ again:
>  
>  		return NULL;
>  	}
> -
> +	memset(page_address(page), 0, size);
>  	*dma_addr = addr;
>  	return page_address(page);
>  }
> -- 
> 1.8.3.2

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled
  2014-04-16 19:44   ` Andrew Morton
@ 2014-04-17 15:40     ` Akinobu Mita
  0 siblings, 0 replies; 27+ messages in thread
From: Akinobu Mita @ 2014-04-17 15:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Marek Szyprowski, Konrad Rzeszutek Wilk, David Woodhouse,
	Don Dutile, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, x86, iommu

2014-04-17 4:44 GMT+09:00 Andrew Morton <akpm@linux-foundation.org>:
> On Tue, 15 Apr 2014 22:08:45 +0900 Akinobu Mita <akinobu.mita@gmail.com> wrote:
>
>> Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.
>>
>> But when the contiguous memory allocator (CMA) is enabled on x86 and
>> the memory region is allocated by dma_alloc_from_contiguous(), it
>> doesn't return zeroed memory, because dma_generic_alloc_coherent()
>> forgets to fill the memory region with zero if it was allocated by
>> dma_alloc_from_contiguous().
>>
>> Most implementations of dma_alloc_coherent() return zeroed memory
>> regardless of whether __GFP_ZERO is specified.  So fix this by
>> unconditionally zeroing the allocated memory region.
>>
>> Alternatively, we could fix dma_alloc_from_contiguous() to return
>> zeroed memory and remove the memset() from all of its callers.  But we
>> can't simply remove the memset on arm, because __dma_clear_buffer() is
>> used there to ensure cache flushing and it is used in many places.  Of
>> course we could do a redundant memset in dma_alloc_from_contiguous(),
>> but I think this patch has less impact as a fix for this problem.
>
> But this patch does a duplicated memset if the page was allocated by
> alloc_pages_node()?

You're right.  Clearing the __GFP_ZERO bit in the gfp flags before
allocating with alloc_pages_node() can fix this duplicated memset.

> Would it not be better to pass the gfp_t to dma_alloc_from_contiguous()
> and have it implement __GFP_ZERO?  That will fix this inefficiency,
> will be symmetrical with the other underlying allocators and should
> permit the appropriate fixups in arm?

Sounds good.  If it also handles __GFP_WAIT, we can remove the
__GFP_WAIT check that is almost always required before calling
dma_alloc_from_contiguous().
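
That is, the open-coded pattern every caller has today (as in this
series):

	if (flags & __GFP_WAIT)
		page = dma_alloc_from_contiguous(dev, count, order);

could collapse to a plain call, with the allocator itself returning
NULL early when __GFP_WAIT is absent (CMA allocation may sleep):

	page = dma_alloc_from_contiguous(dev, count, order, flags);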

>> --- a/arch/x86/kernel/pci-dma.c
>> +++ b/arch/x86/kernel/pci-dma.c
>> @@ -97,7 +97,6 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
>>
>>       dma_mask = dma_alloc_coherent_mask(dev, flag);
>>
>> -     flag |= __GFP_ZERO;

I'll soon prepare a follow-up patch to clear __GFP_ZERO like

+        flag &= ~__GFP_ZERO;

>>  again:
>>       page = NULL;
>>       /* CMA can be used only in the context which permits sleeping */
>> @@ -120,7 +119,7 @@ again:
>>
>>               return NULL;
>>       }
>> -
>> +     memset(page_address(page), 0, size);
>>       *dma_addr = addr;
>>       return page_address(page);
>>  }
>> --
>> 1.8.3.2

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-04-15 13:08 [PATCH v3 0/5] enhance DMA CMA on x86 Akinobu Mita
                   ` (4 preceding siblings ...)
  2014-04-15 13:08 ` [PATCH v3 5/5] cma: add placement specifier for "cma=" kernel parameter Akinobu Mita
@ 2014-09-27 14:30 ` Peter Hurley
  2014-09-28  0:31   ` Akinobu Mita
  5 siblings, 1 reply; 27+ messages in thread
From: Peter Hurley @ 2014-09-27 14:30 UTC (permalink / raw)
  To: Akinobu Mita, linux-kernel, akpm
  Cc: Marek Szyprowski, Konrad Rzeszutek Wilk, David Woodhouse,
	Don Dutile, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, x86, iommu

On 04/15/2014 09:08 AM, Akinobu Mita wrote:
> This patch set enhances the DMA Contiguous Memory Allocator on x86.
> 
> Currently, DMA CMA is only supported with the pci-nommu dma_map_ops,
> and furthermore it can't be enabled on x86_64.  But I would like to
> allocate big contiguous memory with dma_alloc_coherent() and hand it
> to a device that requires it, regardless of which dma mapping
> implementation is actually used in the system.
>
> So this series makes it work with the swiotlb and intel-iommu
> dma_map_ops, too.  It also extends the "cma=" kernel parameter to
> specify a placement constraint, i.e. the physical address range the
> memory may be allocated from.  For example, "cma=64M@0-4G" makes CMA
> allocate memory below 4GB, which is required for devices that only
> support 32-bit addressing on 64-bit systems without an iommu.
> 
> * Changes from v2
> - Rebased on current Linus tree
> - Add Acked-by line
> - Fix gfp flags check for __GFP_ATOMIC, reported by Marek Szyprowski
> - Avoid CMA area on highmem with cma= option, reported by Marek Szyprowski
> 
> * Changes from v1
> - fix dma_alloc_coherent() with __GFP_ZERO
> - add placement specifier for "cma=" kernel parameter
> 
> Akinobu Mita (5):
>   x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled
>   x86: enable DMA CMA with swiotlb
>   intel-iommu: integrate DMA CMA
>   memblock: introduce memblock_alloc_range()
>   cma: add placement specifier for "cma=" kernel parameter

This patchset breaks every x86 iommu configuration when CONFIG_DMA_CMA is
on, which is the base configuration for Ubuntu x86 and amd64 distro kernels.

Granted, the patchset leveraged existing code from the nommu configuration,
but that base (ie., calling dma_alloc_from_contiguous() in
dma_generic_alloc_config()) was an ill-conceived test configuration designed
to allow ARM developers to validate the CMA allocator on x86 boxen and
KVM guests, not as a general-purpose replacement for the existing page
allocator. The test code should have had a separate CONFIG_ knob.

What this patchset does is restrict all iommu configurations which can
map all of system memory to one _very_ small physical region, thus disabling
the whole point of an iommu.

Now I know why my GPU is causing paging to disk! And why my RAID controller
stalls for ages when I do a git log at the same time as a kernel build!

And the apparent goal of this patchset is to enable DMA allocation below
4GB, which is already supported in the existing page allocator with the
GFP_DMA32 flag?!
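
For reference, a sketch of what already works without CMA, subject to
the usual MAX_ORDER size limit:

	/* the page allocator already honors a below-4GB constraint */
	page = alloc_pages(GFP_KERNEL | GFP_DMA32, get_order(size));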

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-09-27 14:30 ` [PATCH v3 0/5] enhance DMA CMA on x86 Peter Hurley
@ 2014-09-28  0:31   ` Akinobu Mita
  2014-09-29 12:09     ` Peter Hurley
  0 siblings, 1 reply; 27+ messages in thread
From: Akinobu Mita @ 2014-09-28  0:31 UTC (permalink / raw)
  To: Peter Hurley
  Cc: LKML, Andrew Morton, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

2014-09-27 23:30 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
> On 04/15/2014 09:08 AM, Akinobu Mita wrote:
>> This patch set enhances the DMA Contiguous Memory Allocator on x86.
>>
>> Currently, DMA CMA is only supported with the pci-nommu dma_map_ops,
>> and furthermore it can't be enabled on x86_64.  But I would like to
>> allocate big contiguous memory with dma_alloc_coherent() and hand it
>> to a device that requires it, regardless of which dma mapping
>> implementation is actually used in the system.
>>
>> So this series makes it work with the swiotlb and intel-iommu
>> dma_map_ops, too.  It also extends the "cma=" kernel parameter to
>> specify a placement constraint, i.e. the physical address range the
>> memory may be allocated from.  For example, "cma=64M@0-4G" makes CMA
>> allocate memory below 4GB, which is required for devices that only
>> support 32-bit addressing on 64-bit systems without an iommu.
>>
>> * Changes from v2
>> - Rebased on current Linus tree
>> - Add Acked-by line
>> - Fix gfp flags check for __GFP_ATOMIC, reported by Marek Szyprowski
>> - Avoid CMA area on highmem with cma= option, reported by Marek Szyprowski
>>
>> * Changes from v1
>> - fix dma_alloc_coherent() with __GFP_ZERO
>> - add placement specifier for "cma=" kernel parameter
>>
>> Akinobu Mita (5):
>>   x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled
>>   x86: enable DMA CMA with swiotlb
>>   intel-iommu: integrate DMA CMA
>>   memblock: introduce memblock_alloc_range()
>>   cma: add placement specifier for "cma=" kernel parameter
>
> This patchset breaks every x86 iommu configuration when CONFIG_DMA_CMA is
> on, which is the base configuration for Ubuntu x86 and amd64 distro kernels.
>
> Granted, the patchset leveraged existing code from the nommu configuration,
> but that base (ie., calling dma_alloc_from_contiguous() in
> dma_generic_alloc_config()) was an ill-conceived test configuration designed
> to allow ARM developers to validate the CMA allocator on x86 boxen and
> KVM guests, not as a general-purpose replacement for the existing page
> allocator. The test code should have had a separate CONFIG_ knob.
>
> What this patchset does is restrict all iommu configurations which can
> map all of system memory to one _very_ small physical region, thus disabling
> the whole point of an iommu.
>
> Now I know why my GPU is causing paging to disk! And why my RAID controller
> stalls for ages when I do a git log at the same time as a kernel build!

The solution I have for this is that instead of trying
dma_alloc_from_contiguous() first, dma_alloc_coherent() should call
alloc_pages() first.  dma_alloc_from_contiguous() should be called
only when alloc_pages() fails or DMA_ATTR_FORCE_CONTIGUOUS is
specified in the dma_attrs.
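
In dma_generic_alloc_coherent() terms, the allocation order would
become roughly this (untested sketch; count is the size in pages):

	page = NULL;
	if (dma_get_attr(DMA_ATTR_FORCE_CONTIGUOUS, attrs))
		page = dma_alloc_from_contiguous(dev, count, get_order(size));
	if (!page)
		page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));
	if (!page && (flag & __GFP_WAIT))
		page = dma_alloc_from_contiguous(dev, count, get_order(size));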

> And the apparent goal of this patchset is to enable DMA allocation below
> 4GB, which is already supported in the existing page allocator with the
> GFP_DMA32 flag?!

The goal of this patchset is to enable huge DMA allocation which
alloc_pages() can't (> MAX_ORDER) for the devices that require it.

Thanks for the notification.  I'll prepare a patch described above.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-09-28  0:31   ` Akinobu Mita
@ 2014-09-29 12:09     ` Peter Hurley
  2014-09-29 14:32       ` Akinobu Mita
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Hurley @ 2014-09-29 12:09 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: LKML, Andrew Morton, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

On 09/27/2014 08:31 PM, Akinobu Mita wrote:
> 2014-09-27 23:30 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>> On 04/15/2014 09:08 AM, Akinobu Mita wrote:
>>> This patch set enhances the DMA Contiguous Memory Allocator on x86.

[...]

>> What this patchset does is restrict all iommu configurations which can
>> map all of system memory to one _very_ small physical region, thus disabling
>> the whole point of an iommu.
>>
>> Now I know why my GPU is causing paging to disk! And why my RAID controller
>> stalls for ages when I do a git log at the same time as a kernel build!
> 
> The solution I have for this is that instead of trying
> dma_alloc_from_contiguous() first, dma_alloc_coherent() should call
> alloc_pages() first.  dma_alloc_from_contiguous() should be called
> only when alloc_pages() fails or DMA_ATTR_FORCE_CONTIGUOUS is
> specified in the dma_attrs.

Why is all this extra complexity being added when there are no X86 users
of DMA_ATTR_FORCE_CONTIGUOUS?


>> And the apparent goal of this patchset is to enable DMA allocation below
>> 4GB, which is already supported in the existing page allocator with the
>> GFP_DMA32 flag?!
> 
> The goal of this patchset is to enable huge DMA allocation which
> alloc_pages() can't (> MAX_ORDER) for the devices that require it.

What x86 devices need > MAX_ORDER DMA allocation and why can't they allocate
directly from dma_alloc_from_contiguous()?

Regards,
Peter Hurley



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-09-29 12:09     ` Peter Hurley
@ 2014-09-29 14:32       ` Akinobu Mita
  2014-09-30 14:34         ` Peter Hurley
  0 siblings, 1 reply; 27+ messages in thread
From: Akinobu Mita @ 2014-09-29 14:32 UTC (permalink / raw)
  To: Peter Hurley
  Cc: LKML, Andrew Morton, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

2014-09-29 21:09 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
> On 09/27/2014 08:31 PM, Akinobu Mita wrote:
>> 2014-09-27 23:30 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>>> On 04/15/2014 09:08 AM, Akinobu Mita wrote:
>>>> This patch set enhances the DMA Contiguous Memory Allocator on x86.
>
> [...]
>
>>> What this patchset does is restrict all iommu configurations which can
>>> map all of system memory to one _very_ small physical region, thus disabling
>>> the whole point of an iommu.
>>>
>>> Now I know why my GPU is causing paging to disk! And why my RAID controller
>>> stalls for ages when I do a git log at the same time as a kernel build!
>>
>> The solution I have for this is that instead of trying
>> dma_alloc_from_contiguous() first, dma_alloc_coherent() should call
>> alloc_pages() first.  dma_alloc_from_contiguous() should be called
>> only when alloc_pages() fails or DMA_ATTR_FORCE_CONTIGUOUS is
>> specified in the dma_attrs.
>
> Why is all this extra complexity being added when there are no X86 users
> of DMA_ATTR_FORCE_CONTIGUOUS?

I misunderstood DMA_ATTR_FORCE_CONTIGUOUS.  It is specified to request
that the underlying DMA mapping be physically contiguous when an IOMMU
is used.  But the current dma_alloc_coherent() for intel-iommu always
returns physically contiguous memory, so it is ignored on x86.

>>> And the apparent goal of this patchset is to enable DMA allocation below
>>> 4GB, which is already supported in the existing page allocator with the
>>> GFP_DMA32 flag?!
>>
>> The goal of this patchset is to enable huge DMA allocation which
>> alloc_pages() can't (> MAX_ORDER) for the devices that require it.
>
> What x86 devices need > MAX_ORDER DMA allocation and why can't they allocate
> directly from dma_alloc_from_contiguous()?

I need this for the UFS unified memory extension, which is apparently
not in mainline for now.
http://www.jedec.org/standards-documents/docs/jesd220-1
http://www.jedec.org/sites/default/files/T_Fujisawa_MF_2013.pdf

But there must be some other use cases on x86, too, because I have
received several private emails from developers who care about its
status.

And allocating directly from dma_alloc_from_contiguous() in the driver
doesn't work with an IOMMU, as it just returns a memory region and
doesn't create a DMA mapping.
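
To illustrate the last point: a driver could do something like

	page = dma_alloc_from_contiguous(dev, count, get_order(size));
	dma_handle = dma_map_page(dev, page, 0, size, DMA_BIDIRECTIONAL);

but that is a streaming mapping the driver must manage itself, not the
coherent mapping that dma_alloc_coherent() is supposed to set up.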

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-09-29 14:32       ` Akinobu Mita
@ 2014-09-30 14:34         ` Peter Hurley
  2014-09-30 23:23           ` Akinobu Mita
  2014-09-30 23:45           ` Thomas Gleixner
  0 siblings, 2 replies; 27+ messages in thread
From: Peter Hurley @ 2014-09-30 14:34 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: LKML, Andrew Morton, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

On 09/29/2014 10:32 AM, Akinobu Mita wrote:
> 2014-09-29 21:09 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>> On 09/27/2014 08:31 PM, Akinobu Mita wrote:
>>> 2014-09-27 23:30 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>>>> On 04/15/2014 09:08 AM, Akinobu Mita wrote:
>>>>> This patch set enhances the DMA Contiguous Memory Allocator on x86.
>>
>> [...]
>>
>>>> What this patchset does is restrict all iommu configurations which can
>>>> map all of system memory to one _very_ small physical region, thus disabling
>>>> the whole point of an iommu.
>>>>
>>>> Now I know why my GPU is causing paging to disk! And why my RAID controller
>>>> stalls for ages when I do a git log at the same time as a kernel build!
>>>
>>> The solution I have for this is that instead of trying
>>> dma_alloc_from_contiguous() first, dma_alloc_coherent() should call
>>> alloc_pages() first.  dma_alloc_from_contiguous() should be called
>>> only when alloc_pages() fails or DMA_ATTR_FORCE_CONTIGUOUS is
>>> specified in the dma_attrs.
>>
>> Why is all this extra complexity being added when there are no X86 users
>> of DMA_ATTR_FORCE_CONTIGUOUS?
> 
> I misunderstood DMA_ATTR_FORCE_CONTIGUOUS.  It is specified to request
> that the underlying DMA mapping be physically contiguous when an IOMMU
> is used.  But the current dma_alloc_coherent() for intel-iommu always
> returns physically contiguous memory, so it is ignored on x86.
> 
>>>> And the apparent goal of this patchset is to enable DMA allocation below
>>>> 4GB, which is already supported in the existing page allocator with the
>>>> GFP_DMA32 flag?!
>>>
>>> The goal of this patchset is to enable huge DMA allocation which
>>> alloc_pages() can't (> MAX_ORDER) for the devices that require it.
>>
>> What x86 devices need > MAX_ORDER DMA allocation and why can't they allocate
>> directly from dma_alloc_from_contiguous()?
> 
> I need this for the UFS unified memory extension, which is apparently
> not in mainline for now.
> http://www.jedec.org/standards-documents/docs/jesd220-1
> http://www.jedec.org/sites/default/files/T_Fujisawa_MF_2013.pdf
>
> But there must be some other use cases on x86, too, because I have
> received several private emails from developers who care about its
> status.
>
> And allocating directly from dma_alloc_from_contiguous() in the driver
> doesn't work with an IOMMU, as it just returns a memory region and
> doesn't create a DMA mapping.


I read the UFS Unified Memory Extension v1.0 (JESD220-1) specification and
it is not clear to me that using DMA mapping is the right approach to
supporting UM, at least on x86.

And without a mainline user, the merits of this approach are not evident.
I cannot even find a production x86 UFS controller, much less one that
supports UME.

The only PCI UFS controller I could find (and that mainline supports) is
Samsung's x86 FPGA-based test unit for developing UFS devices in an x86 test
environment, and not a production x86 design.

Samsung's own roadmap (http://www.slideshare.net/linaroorg/next-gen-mobilestorageufs)
mentions nothing about bringing UFS to x86 designs.

Unless there's something else I've missed, I don't think these patches
belong in mainline.

Regards,
Peter Hurley




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-09-30 14:34         ` Peter Hurley
@ 2014-09-30 23:23           ` Akinobu Mita
  2014-09-30 23:45           ` Thomas Gleixner
  1 sibling, 0 replies; 27+ messages in thread
From: Akinobu Mita @ 2014-09-30 23:23 UTC (permalink / raw)
  To: Peter Hurley
  Cc: LKML, Andrew Morton, Marek Szyprowski, Konrad Rzeszutek Wilk,
	David Woodhouse, Don Dutile, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu

2014-09-30 23:34 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
> On 09/29/2014 10:32 AM, Akinobu Mita wrote:
>> 2014-09-29 21:09 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>>> On 09/27/2014 08:31 PM, Akinobu Mita wrote:
>>>> 2014-09-27 23:30 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>>>>> On 04/15/2014 09:08 AM, Akinobu Mita wrote:
>>>>>> This patch set enhances the DMA Contiguous Memory Allocator on x86.
>>>
>>> [...]
>>>
>>>>> What this patchset does is restrict all iommu configurations which can
>>>>> map all of system memory to one _very_ small physical region, thus disabling
>>>>> the whole point of an iommu.
>>>>>
>>>>> Now I know why my GPU is causing paging to disk! And why my RAID controller
>>>>> stalls for ages when I do a git log at the same time as a kernel build!
>>>>
>>>> The solution I have for this is that instead of trying
>>>> dma_alloc_from_contiguous() first, dma_alloc_coherent() should call
>>>> alloc_pages() first.  dma_alloc_from_contiguous() should be called
>>>> only when alloc_pages() fails or DMA_ATTR_FORCE_CONTIGUOUS is
>>>> specified in the dma_attrs.
>>>
>>> Why is all this extra complexity being added when there are no X86 users
>>> of DMA_ATTR_FORCE_CONTIGUOUS?
>>
>> I misunderstood DMA_ATTR_FORCE_CONTIGUOUS.  It is specified to request
>> that the underlying DMA mapping be physically contiguous when an IOMMU
>> is used.  But the current dma_alloc_coherent() for intel-iommu always
>> returns physically contiguous memory, so it is ignored on x86.
>>
>>>>> And the apparent goal of this patchset is to enable DMA allocation below
>>>>> 4GB, which is already supported in the existing page allocator with the
>>>>> GFP_DMA32 flag?!
>>>>
>>>> The goal of this patchset is to enable huge DMA allocation which
>>>> alloc_pages() can't (> MAX_ORDER) for the devices that require it.
>>>
>>> What x86 devices need > MAX_ORDER DMA allocation and why can't they allocate
>>> directly from dma_alloc_from_contiguous()?
>>
>> I need this for the UFS unified memory extension, which is apparently
>> not in mainline for now.
>> http://www.jedec.org/standards-documents/docs/jesd220-1
>> http://www.jedec.org/sites/default/files/T_Fujisawa_MF_2013.pdf
>>
>> But there must be some other use cases on x86, too, because I have
>> received several private emails from developers who care about its
>> status.
>>
>> And allocating directly from dma_alloc_from_contiguous() in the driver
>> doesn't work with an IOMMU, as it just returns a memory region and
>> doesn't create a DMA mapping.
>
>
> I read the UFS Unified Memory Extension v1.0 (JESD220-1) specification and
> it is not clear to me that using DMA mapping is the right approach to
> supporting UM, at least on x86.

Without a DMA mapping, there is no way for the device to access host
memory.  The unified memory extension requires a single contiguous
memory region instead of multiple scattered mappings.

> And without a mainline user, the merits of this approach are not evident.
> I cannot even find a production x86 UFS controller, much less one that
> supports UME.
>
> The only PCI UFS controller I could find (and that mainline supports) is
> Samsung's x86 FPGA-based test unit for developing UFS devices in an x86 test
> environment, and not a production x86 design.
>
> Samsung's own roadmap (http://www.slideshare.net/linaroorg/next-gen-mobilestorageufs)
> mentions nothing about bringing UFS to x86 designs.
>
> Unless there's something else I've missed, I don't think these patches
> belong in mainline.

Removing CONFIG_DMA_CMA support from x86_64 would disappoint me, but
that's a personal opinion.  FWIW, MIPS is also starting to support it
in linux-next.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-09-30 14:34         ` Peter Hurley
  2014-09-30 23:23           ` Akinobu Mita
@ 2014-09-30 23:45           ` Thomas Gleixner
  2014-09-30 23:49             ` Peter Hurley
  2014-10-01  1:49             ` Peter Hurley
  1 sibling, 2 replies; 27+ messages in thread
From: Thomas Gleixner @ 2014-09-30 23:45 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Akinobu Mita, LKML, Andrew Morton, Marek Szyprowski,
	Konrad Rzeszutek Wilk, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

On Tue, 30 Sep 2014, Peter Hurley wrote:
> I read the UFS Unified Memory Extension v1.0 (JESD220-1) specification and
> it is not clear to me that using DMA mapping is the right approach to
> supporting UM, at least on x86.
> 
> And without a mainline user, the merits of this approach are not evident.
> I cannot even find a production x86 UFS controller, much less one that
> supports UME.
> 
> The only PCI UFS controller I could find (and that mainline supports) is
> Samsung's x86 FPGA-based test unit for developing UFS devices in an x86 test
> environment, and not a production x86 design.

And how is that relevant? That device exists and you have no reason to
deny it to be supported just because you are not interested in it.
 
> Unless there's something else I've missed, I don't think these patches
> belong in mainline.

You missed that there is no reason WHY such a device should not be
supported in mainline.

> Samsung's own roadmap
> (http://www.slideshare.net/linaroorg/next-gen-mobilestorageufs)
> mentions nothing about bringing UFS to x86 designs.

And that's telling you what? 

   - That we should deny Samsung proper support for their obviously
     x86 based test card

   - That we should ignore a JEDEC Standard which is obviously never
     going to hit x86 land just because you decide it?

Your argumentation is just ass backwards. Linux wants to support the
full zoo of hardware including this particular PCI card. Period.

Whether the proposed patchset is the correct solution to support it is
a completely different question.

So either you stop this right now and help Akinobu to find the proper
solution or you just go back in your uncontaminated x86 cave and STFU.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-09-30 23:45           ` Thomas Gleixner
@ 2014-09-30 23:49             ` Peter Hurley
  2014-10-01  1:49             ` Peter Hurley
  1 sibling, 0 replies; 27+ messages in thread
From: Peter Hurley @ 2014-09-30 23:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Akinobu Mita, LKML, Andrew Morton, Marek Szyprowski,
	Konrad Rzeszutek Wilk, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
> On Tue, 30 Sep 2014, Peter Hurley wrote:
>> I read the UFS Unified Memory Extension v1.0 (JESD220-1) specification and
>> it is not clear to me that using DMA mapping is the right approach to
>> supporting UM, at least on x86.
>>
>> And without a mainline user, the merits of this approach are not evident.
>> I cannot even find a production x86 UFS controller, much less one that
>> supports UME.
>>
>> The only PCI UFS controller I could find (and that mainline supports) is
>> Samsung's x86 FPGA-based test unit for developing UFS devices in a x86 test
>> environment, and not a production x86 design.
> 
> And how is that relevant? That device exists and you have no reason to
> deny it to be supported just because you are not interested in it.
>  
>> Unless there's something else I've missed, I don't think these patches
>> belong in mainline.
> 
> You missed that there is no reason WHY such a device should not be
> supported in mainline.

Mainline already supports this card right now without these patches.

>> Samsung's own roadmap
>> (http://www.slideshare.net/linaroorg/next-gen-mobilestorageufs)
>> mentions nothing about bringing UFS to x86 designs.
> 
> And that's telling you what? 
> 
>    - That we should deny Samsung proper support for their obviously
>      x86 based test card
> 
>    - That we should ignore a JEDEC Standard which is obviously never
>      going to hit x86 land just because you decide it?
> 
> Your argumentation is just ass backwards. Linux wants to support the
> full zoo of hardware including this particular PCI card. Period.
> 
> Whether the proposed patchset is the correct solution to support it is
> a completely different question.

And there is currently no way to determine that because there is no
user in mainline that requires this support.

Which you would understand if you had read more carefully.

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-09-30 23:45           ` Thomas Gleixner
  2014-09-30 23:49             ` Peter Hurley
@ 2014-10-01  1:49             ` Peter Hurley
  2014-10-01  9:05               ` Thomas Gleixner
  2014-10-02 16:41               ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 27+ messages in thread
From: Peter Hurley @ 2014-10-01  1:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Akinobu Mita, LKML, Andrew Morton, Marek Szyprowski,
	Konrad Rzeszutek Wilk, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
> Whether the proposed patchset is the correct solution to support it is
> a completely different question.

This patchset has been in mainline since 3.16 and has already caused
regressions, so the question of whether this is the correct solution has
already been answered.

> So either you stop this right now and help Akinobu to find the proper
> solution 

If this is only a test platform for ARM parts, then I don't think it
unreasonable to suggest forking x86 swiotlb support into an iommu=cma
selector that gets DMA mapping working for this test platform and doesn't
cause a bunch of breakage.

Which is different from the case where the plan is to ship production
units for x86; then a general-purpose solution will be required.

As to the good design of a general purpose solution for allocating and
mapping huge order pages, you are certainly more qualified to help Akinobu
than I am.

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-10-01  1:49             ` Peter Hurley
@ 2014-10-01  9:05               ` Thomas Gleixner
  2014-10-02 16:41               ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 27+ messages in thread
From: Thomas Gleixner @ 2014-10-01  9:05 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Akinobu Mita, LKML, Andrew Morton, Marek Szyprowski,
	Konrad Rzeszutek Wilk, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

On Tue, 30 Sep 2014, Peter Hurley wrote:
> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
> > Whether the proposed patchset is the correct solution to support it is
> > a completely different question.
> 
> This patchset has been in mainline since 3.16 and has already caused
> regressions, so the question of whether this is the correct solution has
> already been answered.

Agreed.
 
> > So either you stop this right now and help Akinobu to find the proper
> > solution 
> 
> If this is only a test platform for ARM parts then I don't think it
> unreasonable to suggest forking x86 swiotlb support into a iommu=cma
> selector that gets DMA mapping working for this test platform and doesn't
> cause a bunch of breakage.

Breakage is not acceptable in any case.
 
> Which is different than if the plan is to ship production units for x86;
> then a general purpose solution will be required.
> 
> As to the good design of a general purpose solution for allocating and
> mapping huge order pages, you are certainly more qualified to help Akinobu
> than I am.

Fair enough.  Still, this does not make the case for outright rejecting
the idea of supporting that kind of device, even if it is an esoteric
case.  We deal with enough esoteric hardware in Linux and, if done
right, it does no harm to anyone.

I'll have a look at the technical details.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-10-01  1:49             ` Peter Hurley
  2014-10-01  9:05               ` Thomas Gleixner
@ 2014-10-02 16:41               ` Konrad Rzeszutek Wilk
  2014-10-02 22:03                 ` Peter Hurley
  1 sibling, 1 reply; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-02 16:41 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Thomas Gleixner, Akinobu Mita, LKML, Andrew Morton,
	Marek Szyprowski, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
> > Whether the proposed patchset is the correct solution to support it is
> > a completely different question.
> 
> This patchset has been in mainline since 3.16 and has already caused
> regressions, so the question of whether this is the correct solution has
> already been answered.
> 
> > So either you stop this right now and help Akinobu to find the proper
> > solution 
> 
> If this is only a test platform for ARM parts then I don't think it
> unreasonable to suggest forking x86 swiotlb support into a iommu=cma

Not sure what you mean by 'forking x86 swiotlb'?  As in having SWIOTLB
work under ARM?

> selector that gets DMA mapping working for this test platform and doesn't
> cause a bunch of breakage.

I think you might want to take a look at the IOMMU_DETECT macros
and enable CMA there only if certain devices are available.

That way the normal flow of detecting which IOMMU to use is still
present, and CMA will be turned off if there is no device that would
use it.
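
Something along these lines (a sketch only - the detection hook, the
helper, and the init function are made up, modelled on the existing
users of arch/x86/include/asm/iommu_table.h):

	#include <asm/iommu_table.h>

	static int __init detect_cma_iommu(void)
	{
		/* hypothetical check for a device needing huge CMA DMA */
		if (!cma_capable_device_present())
			return 0;

		return 1;	/* select the CMA-backed dma_ops */
	}
	/* cma_iommu_init is a hypothetical init routine */
	IOMMU_INIT_FINISH(detect_cma_iommu, NULL, cma_iommu_init, NULL);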

> 
> Which is different than if the plan is to ship production units for x86;
> then a general purpose solution will be required.
> 
> As to the good design of a general purpose solution for allocating and
> mapping huge order pages, you are certainly more qualified to help Akinobu
> than I am.
> 
> Regards,
> Peter Hurley
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-10-02 16:41               ` Konrad Rzeszutek Wilk
@ 2014-10-02 22:03                 ` Peter Hurley
  2014-10-02 23:08                   ` Akinobu Mita
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Hurley @ 2014-10-02 22:03 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Thomas Gleixner, Akinobu Mita, LKML, Andrew Morton,
	Marek Szyprowski, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>> Whether the proposed patchset is the correct solution to support it is
>>> a completely different question.
>>
>> This patchset has been in mainline since 3.16 and has already caused
>> regressions, so the question of whether this is the correct solution has
>> already been answered.
>>
>>> So either you stop this right now and help Akinobu to find the proper
>>> solution 
>>
>> If this is only a test platform for ARM parts then I don't think it
>> unreasonable to suggest forking x86 swiotlb support into a iommu=cma
> 
> Not sure what you mean by 'forking x86 swiotlb' ? As in have SWIOTLB
> work under ARM?

No, that's not what I meant.

>> selector that gets DMA mapping working for this test platform and doesn't
>> cause a bunch of breakage.
> 
> I think you might want to take a look at the IOMMU_DETECT macros
> and enable CMA there only if the certain devices are available.
> 
> That way the normal flow of detecting which IOMMU to use is still present
> and will turn of CMA if there is no device that would use it.
> 
>>
>> Which is different than if the plan is to ship production units for x86;
>> then a general purpose solution will be required.
>>
>> As to the good design of a general purpose solution for allocating and
>> mapping huge order pages, you are certainly more qualified to help Akinobu
>> than I am.

What Akinobu's patches intend to support is:

	phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);

which raises three issues:

1. Where do coherent blocks of this size come from?
2. How to prevent fragmentation of these reserved blocks over time by
   existing DMA users?
3. Is this support generically required across all iommu implementations on x86?

Questions 1 and 2 are non-trivial in the general case; otherwise the page
allocator would already do this.  Simply dropping in the contiguous memory
allocator doesn't work, because CMA does not have the same policy and
performance as the page allocator, and it is already causing performance
regressions even in the absence of huge page allocations.

So that's why I raised question 3: is making the necessary compromises to
support 64MB coherent DMA allocations across all x86 iommu implementations
actually required?

Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
designed to be limited to testing configurations, as the introductory
commit states:

commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
Author: Marek Szyprowski <m.szyprowski@samsung.com>
Date:   Thu Dec 29 13:09:51 2011 +0100

    X86: integrate CMA with DMA-mapping subsystem
    
    This patch adds support for CMA to dma-mapping subsystem for x86
    architecture that uses common pci-dma/pci-nommu implementation. This
    allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
    
    Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
    CC: Michal Nazarewicz <mina86@mina86.com>
    Acked-by: Arnd Bergmann <arnd@arndb.de>


Which brings me to my suggestion: if support for huge coherent DMA is
required only for a special test platform, then could this support not
be specific to a new iommu configuration, namely iommu=cma, which would
get initialized much the same way that iommu=calgary is now?

The code for such an iommu configuration would mostly duplicate
arch/x86/kernel/pci-swiotlb.c, and the CMA support would be removed from
the other x86 iommu implementations.
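
To make the shape of that concrete, a skeleton might look like this (a
sketch only - cma_dma_ops is hypothetical; the entry points are the ones
arch/x86/kernel/pci-swiotlb.c and lib/swiotlb.c already provide):

	static struct dma_map_ops cma_dma_ops = {
		.mapping_error	= swiotlb_dma_mapping_error,
		.alloc		= x86_swiotlb_alloc_coherent,	/* CMA-aware */
		.free		= x86_swiotlb_free_coherent,
		.map_page	= swiotlb_map_page,
		.unmap_page	= swiotlb_unmap_page,
		.map_sg		= swiotlb_map_sg_attrs,
		.unmap_sg	= swiotlb_unmap_sg_attrs,
		.sync_single_for_cpu	= swiotlb_sync_single_for_cpu,
		.sync_single_for_device	= swiotlb_sync_single_for_device,
	};

selected by iommu=cma at boot, while the default swiotlb_dma_ops would go
back to plain page-allocator-backed alloc/free.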

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-10-02 22:03                 ` Peter Hurley
@ 2014-10-02 23:08                   ` Akinobu Mita
  2014-10-03 13:40                     ` Konrad Rzeszutek Wilk
  2014-10-03 14:27                     ` Peter Hurley
  0 siblings, 2 replies; 27+ messages in thread
From: Akinobu Mita @ 2014-10-02 23:08 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Konrad Rzeszutek Wilk, Thomas Gleixner, LKML, Andrew Morton,
	Marek Szyprowski, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

2014-10-03 7:03 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:

>>> Which is different than if the plan is to ship production units for x86;
>>> then a general purpose solution will be required.
>>>
>>> As to the good design of a general purpose solution for allocating and
>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>> than I am.
>
> What Akinobu's patches intend to support is:
>
>         phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>
> which raises three issues:
>
> 1. Where do coherent blocks of this size come from?
> 2. How to prevent fragmentation of these reserved blocks over time by
>    existing DMA users?
> 3. Is this support generically required across all iommu implementations on x86?
>
> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
> allocator would already do this. Simply dropping in the contiguous memory
> allocator doesn't work because CMA does not have the same policy and performance
> as the page allocator, and is already causing performance regressions even
> in the absence of huge page allocations.

Could you take a look at the patches I sent?  Can they fix these issues?
https://lkml.org/lkml/2014/9/28/110

With these patches, normal alloc_pages() is used for allocation first
and dma_alloc_from_contiguous() is used as a fallback.
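
In rough pseudo-kernel-C, that allocation path is (a simplified sketch,
not the exact code from those patches):

	struct page *page;
	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
	unsigned int order = get_order(size);

	/* try the buddy allocator first ... */
	page = alloc_pages(gfp, order);
	if (!page)
		/* ... fall back to the CMA area only on failure */
		page = dma_alloc_from_contiguous(dev, count, order);

so CMA pages are only consumed when the normal page allocator cannot
satisfy the request.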

> So that's why I raised question 3; is making the necessary compromises to support
> 64MB coherent DMA allocations across all x86 iommu implementations actually
> required?
>
> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
> designed to be limited to testing configurations, as the introductory
> commit states:
>
> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
> Author: Marek Szyprowski <m.szyprowski@samsung.com>
> Date:   Thu Dec 29 13:09:51 2011 +0100
>
>     X86: integrate CMA with DMA-mapping subsystem
>
>     This patch adds support for CMA to dma-mapping subsystem for x86
>     architecture that uses common pci-dma/pci-nommu implementation. This
>     allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>
>     Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
>     Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>     CC: Michal Nazarewicz <mina86@mina86.com>
>     Acked-by: Arnd Bergmann <arnd@arndb.de>
>
>
> Which brings me to my suggestion: if support for huge coherent DMA is
> required only for a special test platform, then could not this support
> be specific to a new iommu configuration, namely iommu=cma, which would
> get initialized much the same way that iommu=calgary is now.
>
> The code for such a iommu configuration would mostly duplicate
> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
> the other x86 iommu implementations.

I'm not sure I'm reading this correctly, though.  Can the boot option
'cma=0' also help keep CMA out of the IOMMU implementations?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-10-02 23:08                   ` Akinobu Mita
@ 2014-10-03 13:40                     ` Konrad Rzeszutek Wilk
  2014-10-03 14:27                     ` Peter Hurley
  1 sibling, 0 replies; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-03 13:40 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: Peter Hurley, Thomas Gleixner, LKML, Andrew Morton,
	Marek Szyprowski, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

On Fri, Oct 03, 2014 at 08:08:33AM +0900, Akinobu Mita wrote:
> 2014-10-03 7:03 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
> > On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
> >> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
> >>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
> 
> >>> Which is different than if the plan is to ship production units for x86;
> >>> then a general purpose solution will be required.
> >>>
> >>> As to the good design of a general purpose solution for allocating and
> >>> mapping huge order pages, you are certainly more qualified to help Akinobu
> >>> than I am.
> >
> > What Akinobu's patches intend to support is:
> >
> >         phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
> >
> > which raises three issues:
> >
> > 1. Where do coherent blocks of this size come from?
> > 2. How to prevent fragmentation of these reserved blocks over time by
> >    existing DMA users?
> > 3. Is this support generically required across all iommu implementations on x86?
> >
> > Questions 1 and 2 are non-trivial, in the general case, otherwise the page
> > allocator would already do this. Simply dropping in the contiguous memory
> > allocator doesn't work because CMA does not have the same policy and performance
> > as the page allocator, and is already causing performance regressions even
> > in the absence of huge page allocations.
> 
> Could you take a look at the patches I sent?  Can they fix these issues?
> https://lkml.org/lkml/2014/9/28/110
> 
> With these patches, normal alloc_pages() is used for allocation first
> and dma_alloc_from_contiguous() is used as a fallback.
> 
> > So that's why I raised question 3; is making the necessary compromises to support
> > 64MB coherent DMA allocations across all x86 iommu implementations actually
> > required?
> >
> > Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
> > designed to be limited to testing configurations, as the introductory
> > commit states:
> >
> > commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
> > Author: Marek Szyprowski <m.szyprowski@samsung.com>
> > Date:   Thu Dec 29 13:09:51 2011 +0100
> >
> >     X86: integrate CMA with DMA-mapping subsystem
> >
> >     This patch adds support for CMA to dma-mapping subsystem for x86
> >     architecture that uses common pci-dma/pci-nommu implementation. This
> >     allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
> >
> >     Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
> >     Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> >     CC: Michal Nazarewicz <mina86@mina86.com>
> >     Acked-by: Arnd Bergmann <arnd@arndb.de>
> >
> >
> > Which brings me to my suggestion: if support for huge coherent DMA is
> > required only for a special test platform, then could not this support
> > be specific to a new iommu configuration, namely iommu=cma, which would
> > get initialized much the same way that iommu=calgary is now.
> >
> > The code for such a iommu configuration would mostly duplicate
> > arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
> > the other x86 iommu implementations.

Right. That sounds like a good plan ..
> 
> I'm not sure I read correctly, though.  Can boot option 'cma=0' also
> help avoiding CMA from IOMMU implementation?

.. it would be done automatically now, instead of having to pass 'cma=0'.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-10-02 23:08                   ` Akinobu Mita
  2014-10-03 13:40                     ` Konrad Rzeszutek Wilk
@ 2014-10-03 14:27                     ` Peter Hurley
  2014-10-03 16:06                       ` Akinobu Mita
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Hurley @ 2014-10-03 14:27 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: Konrad Rzeszutek Wilk, Thomas Gleixner, LKML, Andrew Morton,
	Marek Szyprowski, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

On 10/02/2014 07:08 PM, Akinobu Mita wrote:
> 2014-10-03 7:03 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
> 
>>>> Which is different than if the plan is to ship production units for x86;
>>>> then a general purpose solution will be required.
>>>>
>>>> As to the good design of a general purpose solution for allocating and
>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>> than I am.
>>
>> What Akinobu's patches intend to support is:
>>
>>         phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>
>> which raises three issues:
>>
>> 1. Where do coherent blocks of this size come from?
>> 2. How to prevent fragmentation of these reserved blocks over time by
>>    existing DMA users?
>> 3. Is this support generically required across all iommu implementations on x86?
>>
>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>> allocator would already do this. Simply dropping in the contiguous memory
>> allocator doesn't work because CMA does not have the same policy and performance
>> as the page allocator, and is already causing performance regressions even
>> in the absence of huge page allocations.
> 
> Could you take a look at the patches I sent?  Can they fix these issues?
> https://lkml.org/lkml/2014/9/28/110
> 
> With these patches, normal alloc_pages() is used for allocation first
> and dma_alloc_from_contiguous() is used as a fallback.

Sure, I can test these patches this weekend.
Where are the unit tests?

>> So that's why I raised question 3; is making the necessary compromises to support
>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>> required?
>>
>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>> designed to be limited to testing configurations, as the introductory
>> commit states:
>>
>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>> Author: Marek Szyprowski <m.szyprowski@samsung.com>
>> Date:   Thu Dec 29 13:09:51 2011 +0100
>>
>>     X86: integrate CMA with DMA-mapping subsystem
>>
>>     This patch adds support for CMA to dma-mapping subsystem for x86
>>     architecture that uses common pci-dma/pci-nommu implementation. This
>>     allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>
>>     Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>     Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>>     CC: Michal Nazarewicz <mina86@mina86.com>
>>     Acked-by: Arnd Bergmann <arnd@arndb.de>
>>
>>
>> Which brings me to my suggestion: if support for huge coherent DMA is
>> required only for a special test platform, then could not this support
>> be specific to a new iommu configuration, namely iommu=cma, which would
>> get initialized much the same way that iommu=calgary is now.
>>
>> The code for such a iommu configuration would mostly duplicate
>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>> the other x86 iommu implementations.
> 
> I'm not sure I read correctly, though.  Can boot option 'cma=0' also
> help avoiding CMA from IOMMU implementation?

Maybe, but that's not an appropriate solution for distro kernels.

Nor does this address configurations that want a really large CMA so
1GB huge pages can be allocated (not for DMA though).

Regards,
Peter Hurley
 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-10-03 14:27                     ` Peter Hurley
@ 2014-10-03 16:06                       ` Akinobu Mita
  2014-10-03 16:33                         ` konrad wilk
  2014-10-03 16:39                         ` Peter Hurley
  0 siblings, 2 replies; 27+ messages in thread
From: Akinobu Mita @ 2014-10-03 16:06 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Konrad Rzeszutek Wilk, Thomas Gleixner, LKML, Andrew Morton,
	Marek Szyprowski, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

2014-10-03 23:27 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
> On 10/02/2014 07:08 PM, Akinobu Mita wrote:
>> 2014-10-03 7:03 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>
>>>>> Which is different than if the plan is to ship production units for x86;
>>>>> then a general purpose solution will be required.
>>>>>
>>>>> As to the good design of a general purpose solution for allocating and
>>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>>> than I am.
>>>
>>> What Akinobu's patches intend to support is:
>>>
>>>         phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>>
>>> which raises three issues:
>>>
>>> 1. Where do coherent blocks of this size come from?
>>> 2. How to prevent fragmentation of these reserved blocks over time by
>>>    existing DMA users?
>>> 3. Is this support generically required across all iommu implementations on x86?
>>>
>>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>>> allocator would already do this. Simply dropping in the contiguous memory
>>> allocator doesn't work because CMA does not have the same policy and performance
>>> as the page allocator, and is already causing performance regressions even
>>> in the absence of huge page allocations.
>>
>> Could you take a look at the patches I sent?  Can they fix these issues?
>> https://lkml.org/lkml/2014/9/28/110
>>
>> With these patches, normal alloc_pages() is used for allocation first
>> and dma_alloc_from_contiguous() is used as a fallback.
>
> Sure, I can test these patches this weekend.
> Where are the unit tests?

Thanks a lot.  I would like to know whether the performance regression
you are seeing disappears with these patches, as it does when
CONFIG_DMA_CMA is disabled.

>>> So that's why I raised question 3; is making the necessary compromises to support
>>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>>> required?
>>>
>>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>>> designed to be limited to testing configurations, as the introductory
>>> commit states:
>>>
>>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>>> Author: Marek Szyprowski <m.szyprowski@samsung.com>
>>> Date:   Thu Dec 29 13:09:51 2011 +0100
>>>
>>>     X86: integrate CMA with DMA-mapping subsystem
>>>
>>>     This patch adds support for CMA to dma-mapping subsystem for x86
>>>     architecture that uses common pci-dma/pci-nommu implementation. This
>>>     allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>>
>>>     Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>     Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>>>     CC: Michal Nazarewicz <mina86@mina86.com>
>>>     Acked-by: Arnd Bergmann <arnd@arndb.de>
>>>
>>>
>>> Which brings me to my suggestion: if support for huge coherent DMA is
>>> required only for a special test platform, then could not this support
>>> be specific to a new iommu configuration, namely iommu=cma, which would
>>> get initialized much the same way that iommu=calgary is now.
>>>
>>> The code for such a iommu configuration would mostly duplicate
>>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>>> the other x86 iommu implementations.
>>
>> I'm not sure I read correctly, though.  Can boot option 'cma=0' also
>> help avoiding CMA from IOMMU implementation?
>
> Maybe, but that's not an appropriate solution for distro kernels.
>
> Nor does this address configurations that want a really large CMA so
> 1GB huge pages can be allocated (not for DMA though).

Now I see the point of the iommu=cma you suggested.  But what should we
do when CONFIG_SWIOTLB is disabled, especially on x86_32?
Should we just introduce yet another flag that says not to use DMA_CMA,
instead of adding a new swiotlb-like iommu implementation?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-10-03 16:06                       ` Akinobu Mita
@ 2014-10-03 16:33                         ` konrad wilk
  2014-10-03 16:39                         ` Peter Hurley
  1 sibling, 0 replies; 27+ messages in thread
From: konrad wilk @ 2014-10-03 16:33 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: Peter Hurley, Thomas Gleixner, LKML, Andrew Morton,
	Marek Szyprowski, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

On 10/3/2014 12:06 PM, Akinobu Mita wrote:
> 2014-10-03 23:27 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>> On 10/02/2014 07:08 PM, Akinobu Mita wrote:
>>> 2014-10-03 7:03 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>>>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>>
>>>>>> Which is different than if the plan is to ship production units for x86;
>>>>>> then a general purpose solution will be required.
>>>>>>
>>>>>> As to the good design of a general purpose solution for allocating and
>>>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>>>> than I am.
>>>>
>>>> What Akinobu's patches intend to support is:
>>>>
>>>>          phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>>>
>>>> which raises three issues:
>>>>
>>>> 1. Where do coherent blocks of this size come from?
>>>> 2. How to prevent fragmentation of these reserved blocks over time by
>>>>     existing DMA users?
>>>> 3. Is this support generically required across all iommu implementations on x86?
>>>>
>>>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>>>> allocator would already do this. Simply dropping in the contiguous memory
>>>> allocator doesn't work because CMA does not have the same policy and performance
>>>> as the page allocator, and is already causing performance regressions even
>>>> in the absence of huge page allocations.
>>>
>>> Could you take a look at the patches I sent?  Can they fix these issues?
>>> https://lkml.org/lkml/2014/9/28/110
>>>
>>> With these patches, normal alloc_pages() is used for allocation first
>>> and dma_alloc_from_contiguous() is used as a fallback.
>>
>> Sure, I can test these patches this weekend.
>> Where are the unit tests?
>
> Thanks a lot.  I would like to know whether the performance regression
> you see will disappear or not with these patches as if CONFIG_DMA_CMA is
> disabled.
>
>>>> So that's why I raised question 3; is making the necessary compromises to support
>>>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>>>> required?
>>>>
>>>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>>>> designed to be limited to testing configurations, as the introductory
>>>> commit states:
>>>>
>>>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>>>> Author: Marek Szyprowski <m.szyprowski@samsung.com>
>>>> Date:   Thu Dec 29 13:09:51 2011 +0100
>>>>
>>>>      X86: integrate CMA with DMA-mapping subsystem
>>>>
>>>>      This patch adds support for CMA to dma-mapping subsystem for x86
>>>>      architecture that uses common pci-dma/pci-nommu implementation. This
>>>>      allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>>>
>>>>      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>>>>      CC: Michal Nazarewicz <mina86@mina86.com>
>>>>      Acked-by: Arnd Bergmann <arnd@arndb.de>
>>>>
>>>>
>>>> Which brings me to my suggestion: if support for huge coherent DMA is
>>>> required only for a special test platform, then could not this support
>>>> be specific to a new iommu configuration, namely iommu=cma, which would
>>>> get initialized much the same way that iommu=calgary is now.
>>>>
>>>> The code for such a iommu configuration would mostly duplicate
>>>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>>>> the other x86 iommu implementations.
>>>
>>> I'm not sure I read correctly, though.  Can boot option 'cma=0' also
>>> help avoiding CMA from IOMMU implementation?
>>
>> Maybe, but that's not an appropriate solution for distro kernels.
>>
>> Nor does this address configurations that want a really large CMA so
>> 1GB huge pages can be allocated (not for DMA though).
>
> Now I see the point of iommu=cma you suggested.  But what should we do
> when CONFIG_SWIOTLB is disabled, especially for x86_32?
> Should we just introduce yet another flag to tell not using DMA_CMA
> instead of adding new swiotlb-like iommu implementation?
>

If you implement a DMA API producer - aka dma_ops (which is what Peter
is thinking, I believe) - it won't matter which IOMMUs / DMA producers
are selected, right?

Or are you saying that CMA needs SWIOTLB to handle certain types of
pages as a fallback mechanism - and hence there needs to be a tight
relationship?

In that case I would look at making SWIOTLB more library-like - Xen-SWIOTLB
already does that by using certain parts of the SWIOTLB code which
are exposed to the rest of the kernel.
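
For reference, these are the kind of library-style entry points
lib/swiotlb.c already exports and Xen-SWIOTLB builds on (signatures as
of this thread), which a CMA-backed dma_ops could reuse in the same way:

	phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
					   dma_addr_t tbl_dma_addr,
					   phys_addr_t phys, size_t size,
					   enum dma_data_direction dir);
	void swiotlb_tbl_unmap_single(struct device *hwdev,
				      phys_addr_t tlb_addr, size_t size,
				      enum dma_data_direction dir);
	void swiotlb_tbl_sync_single(struct device *hwdev,
				     phys_addr_t tlb_addr, size_t size,
				     enum dma_data_direction dir,
				     enum dma_sync_target target);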


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-10-03 16:06                       ` Akinobu Mita
  2014-10-03 16:33                         ` konrad wilk
@ 2014-10-03 16:39                         ` Peter Hurley
  2014-10-05  6:01                           ` Akinobu Mita
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Hurley @ 2014-10-03 16:39 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: Konrad Rzeszutek Wilk, Thomas Gleixner, LKML, Andrew Morton,
	Marek Szyprowski, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

On 10/03/2014 12:06 PM, Akinobu Mita wrote:
> 2014-10-03 23:27 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>> On 10/02/2014 07:08 PM, Akinobu Mita wrote:
>>> 2014-10-03 7:03 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>>>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>>
>>>>>> Which is different than if the plan is to ship production units for x86;
>>>>>> then a general purpose solution will be required.
>>>>>>
>>>>>> As to the good design of a general purpose solution for allocating and
>>>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>>>> than I am.
>>>>
>>>> What Akinobu's patches intend to support is:
>>>>
>>>>         phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>>>
>>>> which raises three issues:
>>>>
>>>> 1. Where do coherent blocks of this size come from?
>>>> 2. How to prevent fragmentation of these reserved blocks over time by
>>>>    existing DMA users?
>>>> 3. Is this support generically required across all iommu implementations on x86?
>>>>
>>>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>>>> allocator would already do this. Simply dropping in the contiguous memory
>>>> allocator doesn't work because CMA does not have the same policy and performance
>>>> as the page allocator, and is already causing performance regressions even
>>>> in the absence of huge page allocations.
>>>
>>> Could you take a look at the patches I sent?  Can they fix these issues?
>>> https://lkml.org/lkml/2014/9/28/110
>>>
>>> With these patches, normal alloc_pages() is used for allocation first
>>> and dma_alloc_from_contiguous() is used as a fallback.
>>
>> Sure, I can test these patches this weekend.
>> Where are the unit tests?
> 
> Thanks a lot.  I would like to know whether the performance regression
> you see will disappear or not with these patches as if CONFIG_DMA_CMA is
> disabled.

I think something may have gotten lost in translation.

My "test" consists of doing my daily work (email, emacs, kernel builds,
web breaks, etc).

I don't have a testsuite that validates a page allocator or records any
performance metrics (for TTM allocations under load, as an example).

Without a unit test and performance metrics, my "test" is not really a
positive affirmation of a correct implementation.


>>>> So that's why I raised question 3; is making the necessary compromises to support
>>>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>>>> required?
>>>>
>>>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>>>> designed to be limited to testing configurations, as the introductory
>>>> commit states:
>>>>
>>>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>>>> Author: Marek Szyprowski <m.szyprowski@samsung.com>
>>>> Date:   Thu Dec 29 13:09:51 2011 +0100
>>>>
>>>>     X86: integrate CMA with DMA-mapping subsystem
>>>>
>>>>     This patch adds support for CMA to dma-mapping subsystem for x86
>>>>     architecture that uses common pci-dma/pci-nommu implementation. This
>>>>     allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>>>
>>>>     Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>     Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>>>>     CC: Michal Nazarewicz <mina86@mina86.com>
>>>>     Acked-by: Arnd Bergmann <arnd@arndb.de>
>>>>
>>>>
>>>> Which brings me to my suggestion: if support for huge coherent DMA is
>>>> required only for a special test platform, then could not this support
>>>> be specific to a new iommu configuration, namely iommu=cma, which would
>>>> get initialized much the same way that iommu=calgary is now.
>>>>
>>>> The code for such a iommu configuration would mostly duplicate
>>>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>>>> the other x86 iommu implementations.
>>>
>>> I'm not sure I read correctly, though.  Can boot option 'cma=0' also
>>> help avoiding CMA from IOMMU implementation?
>>
>> Maybe, but that's not an appropriate solution for distro kernels.
>>
>> Nor does this address configurations that want a really large CMA so
>> 1GB huge pages can be allocated (not for DMA though).
> 
> Now I see the point of iommu=cma you suggested.  But what should we do
> when CONFIG_SWIOTLB is disabled, especially for x86_32?
> Should we just introduce yet another flag to tell not using DMA_CMA
> instead of adding new swiotlb-like iommu implementation?

Again, since I don't know what you're using this for and
there are no existing mainline users, I can't really design this for
you.

I'm just trying to do my best to come up with alternative solutions
that limit the impact on existing x86 configurations, while still
achieving your goals (without really knowing what those design
constraints are).

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 0/5] enhance DMA CMA on x86
  2014-10-03 16:39                         ` Peter Hurley
@ 2014-10-05  6:01                           ` Akinobu Mita
  0 siblings, 0 replies; 27+ messages in thread
From: Akinobu Mita @ 2014-10-05  6:01 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Konrad Rzeszutek Wilk, Thomas Gleixner, LKML, Andrew Morton,
	Marek Szyprowski, David Woodhouse, Don Dutile, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, x86, iommu, Greg KH

2014-10-04 1:39 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
> On 10/03/2014 12:06 PM, Akinobu Mita wrote:
>> 2014-10-03 23:27 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>>> On 10/02/2014 07:08 PM, Akinobu Mita wrote:
>>>> 2014-10-03 7:03 GMT+09:00 Peter Hurley <peter@hurleysoftware.com>:
>>>>> On 10/02/2014 12:41 PM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Tue, Sep 30, 2014 at 09:49:54PM -0400, Peter Hurley wrote:
>>>>>>> On 09/30/2014 07:45 PM, Thomas Gleixner wrote:
>>>>
>>>>>>> Which is different than if the plan is to ship production units for x86;
>>>>>>> then a general purpose solution will be required.
>>>>>>>
>>>>>>> As to the good design of a general purpose solution for allocating and
>>>>>>> mapping huge order pages, you are certainly more qualified to help Akinobu
>>>>>>> than I am.
>>>>>
>>>>> What Akinobu's patches intend to support is:
>>>>>
>>>>>         phys_addr = dma_alloc_coherent(dev, 64 * 1024 * 1024, &bus_addr, GFP_KERNEL);
>>>>>
>>>>> which raises three issues:
>>>>>
>>>>> 1. Where do coherent blocks of this size come from?
>>>>> 2. How to prevent fragmentation of these reserved blocks over time by
>>>>>    existing DMA users?
>>>>> 3. Is this support generically required across all iommu implementations on x86?
>>>>>
>>>>> Questions 1 and 2 are non-trivial, in the general case, otherwise the page
>>>>> allocator would already do this. Simply dropping in the contiguous memory
>>>>> allocator doesn't work because CMA does not have the same policy and performance
>>>>> as the page allocator, and is already causing performance regressions even
>>>>> in the absence of huge page allocations.
>>>>
>>>> Could you take a look at the patches I sent?  Can they fix these issues?
>>>> https://lkml.org/lkml/2014/9/28/110
>>>>
>>>> With these patches, normal alloc_pages() is used for allocation first
>>>> and dma_alloc_from_contiguous() is used as a fallback.
>>>
>>> Sure, I can test these patches this weekend.
>>> Where are the unit tests?
>>
>> Thanks a lot.  I would like to know whether the performance regression
>> you see will disappear or not with these patches as if CONFIG_DMA_CMA is
>> disabled.
>
> I think something may have gotten lost in translation.
>
> My "test" consists of doing my daily work (email, emacs, kernel builds,
> web breaks, etc).
>
> I don't have a testsuite that validates a page allocator or records any
> performance metrics (for TTM allocations under load, as an example).
>
> Without a unit test and performance metrics, my "test" is not really
> positive affirmation of a correct implementation.
>
>
>>>>> So that's why I raised question 3; is making the necessary compromises to support
>>>>> 64MB coherent DMA allocations across all x86 iommu implementations actually
>>>>> required?
>>>>>
>>>>> Prior to Akinobu's patches, the use of CMA by x86 iommu configurations was
>>>>> designed to be limited to testing configurations, as the introductory
>>>>> commit states:
>>>>>
>>>>> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
>>>>> Author: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>> Date:   Thu Dec 29 13:09:51 2011 +0100
>>>>>
>>>>>     X86: integrate CMA with DMA-mapping subsystem
>>>>>
>>>>>     This patch adds support for CMA to dma-mapping subsystem for x86
>>>>>     architecture that uses common pci-dma/pci-nommu implementation. This
>>>>>     allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>>>>>
>>>>>     Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>>     Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>>>>>     CC: Michal Nazarewicz <mina86@mina86.com>
>>>>>     Acked-by: Arnd Bergmann <arnd@arndb.de>
>>>>>
>>>>>
>>>>> Which brings me to my suggestion: if support for huge coherent DMA is
>>>>> required only for a special test platform, then could not this support
>>>>> be specific to a new iommu configuration, namely iommu=cma, which would
>>>>> get initialized much the same way that iommu=calgary is now.
>>>>>
>>>>> The code for such a iommu configuration would mostly duplicate
>>>>> arch/x86/kernel/pci-swiotlb.c and the CMA support would get removed from
>>>>> the other x86 iommu implementations.
>>>>
>>>> I'm not sure I read correctly, though.  Can boot option 'cma=0' also
>>>> help avoiding CMA from IOMMU implementation?
>>>
>>> Maybe, but that's not an appropriate solution for distro kernels.
>>>
>>> Nor does this address configurations that want a really large CMA so
>>> 1GB huge pages can be allocated (not for DMA though).

The kernel parameter 'cma=' is only available when CONFIG_DMA_CMA is
enabled, and cma=0 doesn't disable 1GB huge pages as far as I can see.
So I will prepare a patch which makes the default cma size zero on x86.
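
Roughly this direction (a sketch of the idea, not the actual patch;
drivers/base/dma-contiguous.c currently sizes the default area from
CONFIG_CMA_SIZE_MBYTES):

	/* default to no CMA area on x86 unless cma= is given explicitly */
	#ifdef CONFIG_X86
	static const phys_addr_t size_bytes = 0;
	#else
	static const phys_addr_t size_bytes = CMA_SIZE_MBYTES * SZ_1M;
	#endif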

>> Now I see the point of iommu=cma you suggested.  But what should we do
>> when CONFIG_SWIOTLB is disabled, especially for x86_32?
>> Should we just introduce yet another flag to tell not using DMA_CMA
>> instead of adding new swiotlb-like iommu implementation?
>
> Again, since I don't know what you're using this for and
> there are no existing mainline users, I can't really design this for
> you.
>
> I'm just trying to do my best to come up with alternative solutions
> that limit the impact to existing x86 configurations, while still
> achieving your goals (without really knowing what those design
> constraints are).

Thanks a lot for your advice.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2014-10-05  6:01 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-15 13:08 [PATCH v3 0/5] enhance DMA CMA on x86 Akinobu Mita
2014-04-15 13:08 ` [PATCH v3 1/5] x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled Akinobu Mita
2014-04-16 19:44   ` Andrew Morton
2014-04-17 15:40     ` Akinobu Mita
2014-04-15 13:08 ` [PATCH v3 2/5] x86: enable DMA CMA with swiotlb Akinobu Mita
2014-04-15 13:08 ` [PATCH v3 3/5] intel-iommu: integrate DMA CMA Akinobu Mita
2014-04-15 13:08 ` [PATCH v3 4/5] memblock: introduce memblock_alloc_range() Akinobu Mita
2014-04-15 13:08 ` [PATCH v3 5/5] cma: add placement specifier for "cma=" kernel parameter Akinobu Mita
2014-09-27 14:30 ` [PATCH v3 0/5] enhance DMA CMA on x86 Peter Hurley
2014-09-28  0:31   ` Akinobu Mita
2014-09-29 12:09     ` Peter Hurley
2014-09-29 14:32       ` Akinobu Mita
2014-09-30 14:34         ` Peter Hurley
2014-09-30 23:23           ` Akinobu Mita
2014-09-30 23:45           ` Thomas Gleixner
2014-09-30 23:49             ` Peter Hurley
2014-10-01  1:49             ` Peter Hurley
2014-10-01  9:05               ` Thomas Gleixner
2014-10-02 16:41               ` Konrad Rzeszutek Wilk
2014-10-02 22:03                 ` Peter Hurley
2014-10-02 23:08                   ` Akinobu Mita
2014-10-03 13:40                     ` Konrad Rzeszutek Wilk
2014-10-03 14:27                     ` Peter Hurley
2014-10-03 16:06                       ` Akinobu Mita
2014-10-03 16:33                         ` konrad wilk
2014-10-03 16:39                         ` Peter Hurley
2014-10-05  6:01                           ` Akinobu Mita

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).