[PATCH v2 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA
@ 2020-06-25  7:43 Barry Song
  2020-06-25  7:43 ` [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA Barry Song
  2020-06-25  7:43 ` [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after numa_init Barry Song
  0 siblings, 2 replies; 8+ messages in thread
From: Barry Song @ 2020-06-25  7:43 UTC (permalink / raw)
  To: hch, m.szyprowski, robin.murphy, will, ganapatrao.kulkarni,
	catalin.marinas
  Cc: iommu, linuxarm, linux-arm-kernel, linux-kernel, Barry Song

Ganapatrao Kulkarni has put some effort on making arm-smmu-v3 use local
memory to save command queues[1]. I also did similar job in patch
"iommu/arm-smmu-v3: allocate the memory of queues in local numa node"
[2] while not realizing Ganapatrao did that before.

But it seems it is much better to make dma_alloc_coherent() to be
inherently NUMA-aware on NUMA-capable systems.

Right now, smmu is using dma_alloc_coherent() to get memory to save queues
and tables. Typically, on ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3 etc.
Saving queues and tables remotely will increase the latency of ARM SMMU
significantly. For example, when SMMU is at node2 and the default global
CMA is at node0, after sending a CMD_SYNC in an empty command queue, we
have to wait more than 550ns for the completion of the command CMD_SYNC.
However, if we save them locally, we only need to wait for 240ns.

with per-numa CMA, smmu will get memory from local numa node to save command
queues and page tables. that means dma_unmap latency will be shrunk much.

Meanwhile, when iommu.passthrough is on, device drivers which call dma_
alloc_coherent() will also get local memory and avoid the travel between
numa nodes.

[1] https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.html
[2] https://www.spinics.net/lists/iommu/msg44767.html

-v2: fix some issues reported by kernel test robot;
     fallback to default cma to avoid regression while allocation fails in
     per-numa cma according to Jonathan Cameron's suggestion;
     free memory properly

Barry Song (2):
  dma-direct: provide the ability to reserve per-numa CMA
  arm64: mm: reserve per-numa CMA after numa_init

 arch/arm64/mm/init.c           |  2 +
 include/linux/dma-contiguous.h |  4 ++
 kernel/dma/Kconfig             | 10 ++++
 kernel/dma/contiguous.c        | 99 ++++++++++++++++++++++++++++++----
 4 files changed, 106 insertions(+), 9 deletions(-)

-- 
2.27.0

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA
  2020-06-25  7:43 [PATCH v2 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA Barry Song
@ 2020-06-25  7:43 ` Barry Song
  2020-06-25 11:10   ` Robin Murphy
  2020-06-28  8:58   ` kernel test robot
  2020-06-25  7:43 ` [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after numa_init Barry Song
  1 sibling, 2 replies; 8+ messages in thread
From: Barry Song @ 2020-06-25  7:43 UTC (permalink / raw)
  To: hch, m.szyprowski, robin.murphy, will, ganapatrao.kulkarni,
	catalin.marinas
  Cc: iommu, linuxarm, linux-arm-kernel, linux-kernel, Barry Song,
	Jonathan Cameron, Nicolas Saenz Julienne, Steve Capper,
	Andrew Morton, Mike Rapoport

This is useful for at least two scenarios:
1. ARM64 smmu will get memory from local numa node, it can save its
command queues and page tables locally. Tests show it can decrease
dma_unmap latency at lot. For example, without this patch, smmu on
node2 will get memory from node0 by calling dma_alloc_coherent(),
typically, it has to wait for more than 560ns for the completion of
CMD_SYNC in an empty command queue; with this patch, it needs 240ns
only.
2. when we set iommu passthrough, drivers will get memory from CMA,
local memory means much less latency.

Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 include/linux/dma-contiguous.h |  4 ++
 kernel/dma/Kconfig             | 10 ++++
 kernel/dma/contiguous.c        | 99 ++++++++++++++++++++++++++++++----
 3 files changed, 104 insertions(+), 9 deletions(-)

diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
index 03f8e98e3bcc..278a80a40456 100644
--- a/include/linux/dma-contiguous.h
+++ b/include/linux/dma-contiguous.h
@@ -79,6 +79,8 @@ static inline void dma_contiguous_set_default(struct cma *cma)
 
 void dma_contiguous_reserve(phys_addr_t addr_limit);
 
+void dma_pernuma_cma_reserve(void);
+
 int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
 				       phys_addr_t limit, struct cma **res_cma,
 				       bool fixed);
@@ -128,6 +130,8 @@ static inline void dma_contiguous_set_default(struct cma *cma) { }
 
 static inline void dma_contiguous_reserve(phys_addr_t limit) { }
 
+static inline void dma_pernuma_cma_reserve(void) { }
+
 static inline int dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
 				       phys_addr_t limit, struct cma **res_cma,
 				       bool fixed)
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index d006668c0027..aeb976b1d21c 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -104,6 +104,16 @@ config DMA_CMA
 if  DMA_CMA
 comment "Default contiguous memory area size:"
 
+config CMA_PERNUMA_SIZE_MBYTES
+	int "Size in Mega Bytes for per-numa CMA areas"
+	depends on NUMA
+	default 16 if ARM64
+	default 0
+	help
+	  Defines the size (in MiB) of the per-numa memory area for Contiguous
+	  Memory Allocator. Every numa node will get a separate CMA with this
+	  size. If the size of 0 is selected, per-numa CMA is disabled.
+
 config CMA_SIZE_MBYTES
 	int "Size in Mega Bytes"
 	depends on !CMA_SIZE_SEL_PERCENTAGE
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 15bc5026c485..bcbd53aead93 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -30,7 +30,14 @@
 #define CMA_SIZE_MBYTES 0
 #endif
 
+#ifdef CONFIG_CMA_PERNUMA_SIZE_MBYTES
+#define CMA_SIZE_PERNUMA_MBYTES CONFIG_CMA_PERNUMA_SIZE_MBYTES
+#else
+#define CMA_SIZE_PERNUMA_MBYTES 0
+#endif
+
 struct cma *dma_contiguous_default_area;
+static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
 
 /*
  * Default global CMA area size can be defined in kernel's .config.
@@ -44,6 +51,8 @@ struct cma *dma_contiguous_default_area;
  */
 static const phys_addr_t size_bytes __initconst =
 	(phys_addr_t)CMA_SIZE_MBYTES * SZ_1M;
+static const phys_addr_t pernuma_size_bytes __initconst =
+	(phys_addr_t)CMA_SIZE_PERNUMA_MBYTES * SZ_1M;
 static phys_addr_t  size_cmdline __initdata = -1;
 static phys_addr_t base_cmdline __initdata;
 static phys_addr_t limit_cmdline __initdata;
@@ -96,6 +105,33 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void)
 
 #endif
 
+void __init dma_pernuma_cma_reserve(void)
+{
+	int nid;
+
+	if (!pernuma_size_bytes || nr_online_nodes <= 1)
+		return;
+
+	for_each_node_state(nid, N_ONLINE) {
+		int ret;
+		char name[20];
+
+		snprintf(name, sizeof(name), "pernuma%d", nid);
+		ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
+						 0, false, name,
+						 &dma_contiguous_pernuma_area[nid],
+						 nid);
+		if (ret) {
+			pr_warn("%s: reservation failed: err %d, node %d", __func__,
+				ret, nid);
+			continue;
+		}
+
+		pr_debug("%s: reserved %llu MiB on node %d\n", __func__,
+			(unsigned long long)pernuma_size_bytes / SZ_1M, nid);
+	}
+}
+
 /**
  * dma_contiguous_reserve() - reserve area(s) for contiguous memory handling
  * @limit: End address of the reserved memory (optional, 0 for any).
@@ -222,22 +258,31 @@ bool dma_release_from_contiguous(struct device *dev, struct page *pages,
  * @gfp:   Allocation flags.
  *
  * This function allocates contiguous memory buffer for specified device. It
- * tries to use device specific contiguous memory area if available, or the
- * default global one.
+ * tries to use device specific contiguous memory area if available, or it
+ * tries to use per-numa cma, if the allocation fails, it will fallback to
+ * try default global one.
  *
- * Note that it byapss one-page size of allocations from the global area as
- * the addresses within one page are always contiguous, so there is no need
- * to waste CMA pages for that kind; it also helps reduce fragmentations.
+ * Note that it bypass one-page size of allocations from the per-numa and
+ * global area as the addresses within one page are always contiguous, so
+ * there is no need to waste CMA pages for that kind; it also helps reduce
+ * fragmentations.
  */
 struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
 {
 	size_t count = size >> PAGE_SHIFT;
 	struct page *page = NULL;
 	struct cma *cma = NULL;
+	int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;
+	bool alloc_from_pernuma = false;
 
 	if (dev && dev->cma_area)
 		cma = dev->cma_area;
-	else if (count > 1)
+	else if ((nid != NUMA_NO_NODE) && dma_contiguous_pernuma_area[nid]
+		&& !(gfp & (GFP_DMA | GFP_DMA32))
+		&& (count > 1)) {
+		cma = dma_contiguous_pernuma_area[nid];
+		alloc_from_pernuma = true;
+	} else if (count > 1)
 		cma = dma_contiguous_default_area;
 
 	/* CMA can be used only in the context which permits sleeping */
@@ -246,6 +291,11 @@ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
 		size_t cma_align = min_t(size_t, align, CONFIG_CMA_ALIGNMENT);
 
 		page = cma_alloc(cma, count, cma_align, gfp & __GFP_NOWARN);
+
+		/* fall back to default cma if failed in per-numa cma */
+		if (!page && alloc_from_pernuma)
+			page = cma_alloc(dma_contiguous_default_area, count,
+				cma_align, gfp & __GFP_NOWARN);
 	}
 
 	return page;
@@ -264,9 +314,40 @@ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
  */
 void dma_free_contiguous(struct device *dev, struct page *page, size_t size)
 {
-	if (!cma_release(dev_get_cma_area(dev), page,
-			 PAGE_ALIGN(size) >> PAGE_SHIFT))
-		__free_pages(page, get_order(size));
+	/* if dev has its own cma, free page from there */
+	if (dev && dev->cma_area) {
+		if (cma_release(dev->cma_area, page, PAGE_ALIGN(size) >> PAGE_SHIFT))
+			return;
+	} else {
+		/*
+		 * otherwise, page is from either per-numa cma or default cma
+		 */
+		int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;
+
+		if (nid != NUMA_NO_NODE) {
+			int i;
+
+			/*
+			 * Literally we only need to call cma_release() on pernuma cma of
+			 * node nid, howerver, a corner case is that users might write
+			 * /sys/devices/pci-x/numa_node to change node to workaround firmware
+			 * bug, so it might allocate memory from nodeA CMA, but free from nodeB
+			 * CMA.
+			 */
+			for (i = 0; i < MAX_NUMNODES; i++) {
+				if (cma_release(dma_contiguous_pernuma_area[i], page,
+							PAGE_ALIGN(size) >> PAGE_SHIFT))
+					return;
+			}
+		}
+
+		if (cma_release(dma_contiguous_default_area, page,
+					PAGE_ALIGN(size) >> PAGE_SHIFT))
+			return;
+	}
+
+	/* not in any cma, free from buddy */
+	__free_pages(page, get_order(size));
 }
 
 /*
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after numa_init
  2020-06-25  7:43 [PATCH v2 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA Barry Song
  2020-06-25  7:43 ` [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA Barry Song
@ 2020-06-25  7:43 ` Barry Song
  2020-06-25 11:15   ` Robin Murphy
  1 sibling, 1 reply; 8+ messages in thread
From: Barry Song @ 2020-06-25  7:43 UTC (permalink / raw)
  To: hch, m.szyprowski, robin.murphy, will, ganapatrao.kulkarni,
	catalin.marinas
  Cc: iommu, linuxarm, linux-arm-kernel, linux-kernel, Barry Song,
	Nicolas Saenz Julienne, Steve Capper, Andrew Morton,
	Mike Rapoport

Right now, smmu is using dma_alloc_coherent() to get memory to save queues
and tables. Typically, on ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3 etc.
with this patch, smmu will get memory from local numa node to save command
queues and page tables. that means dma_unmap latency will be shrunk much.
Meanwhile, when iommu.passthrough is on, device drivers which call dma_
alloc_coherent() will also get local memory and avoid the travel between
numa nodes.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/arm64/mm/init.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1e93cfc7c47a..07d4d1fe7983 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -420,6 +420,8 @@ void __init bootmem_init(void)
 
 	arm64_numa_init();
 
+	dma_pernuma_cma_reserve();
+
 	/*
 	 * must be done after arm64_numa_init() which calls numa_init() to
 	 * initialize node_online_map that gets used in hugetlb_cma_reserve()
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA
  2020-06-25  7:43 ` [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA Barry Song
@ 2020-06-25 11:10   ` Robin Murphy
  2020-06-26 12:01     ` Song Bao Hua (Barry Song)
  2020-06-28  8:58   ` kernel test robot
  1 sibling, 1 reply; 8+ messages in thread
From: Robin Murphy @ 2020-06-25 11:10 UTC (permalink / raw)
  To: Barry Song, hch, m.szyprowski, will, ganapatrao.kulkarni,
	catalin.marinas
  Cc: iommu, linuxarm, linux-arm-kernel, linux-kernel,
	Jonathan Cameron, Nicolas Saenz Julienne, Steve Capper,
	Andrew Morton, Mike Rapoport

On 2020-06-25 08:43, Barry Song wrote:
> This is useful for at least two scenarios:
> 1. ARM64 smmu will get memory from local numa node, it can save its
> command queues and page tables locally. Tests show it can decrease
> dma_unmap latency at lot. For example, without this patch, smmu on
> node2 will get memory from node0 by calling dma_alloc_coherent(),
> typically, it has to wait for more than 560ns for the completion of
> CMD_SYNC in an empty command queue; with this patch, it needs 240ns
> only.
> 2. when we set iommu passthrough, drivers will get memory from CMA,
> local memory means much less latency.
> 
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> Cc: Steve Capper <steve.capper@arm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
>   include/linux/dma-contiguous.h |  4 ++
>   kernel/dma/Kconfig             | 10 ++++
>   kernel/dma/contiguous.c        | 99 ++++++++++++++++++++++++++++++----
>   3 files changed, 104 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
> index 03f8e98e3bcc..278a80a40456 100644
> --- a/include/linux/dma-contiguous.h
> +++ b/include/linux/dma-contiguous.h
> @@ -79,6 +79,8 @@ static inline void dma_contiguous_set_default(struct cma *cma)
>   
>   void dma_contiguous_reserve(phys_addr_t addr_limit);
>   
> +void dma_pernuma_cma_reserve(void);
> +
>   int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
>   				       phys_addr_t limit, struct cma **res_cma,
>   				       bool fixed);
> @@ -128,6 +130,8 @@ static inline void dma_contiguous_set_default(struct cma *cma) { }
>   
>   static inline void dma_contiguous_reserve(phys_addr_t limit) { }
>   
> +static inline void dma_pernuma_cma_reserve(void) { }
> +
>   static inline int dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
>   				       phys_addr_t limit, struct cma **res_cma,
>   				       bool fixed)
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index d006668c0027..aeb976b1d21c 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -104,6 +104,16 @@ config DMA_CMA
>   if  DMA_CMA
>   comment "Default contiguous memory area size:"
>   
> +config CMA_PERNUMA_SIZE_MBYTES
> +	int "Size in Mega Bytes for per-numa CMA areas"
> +	depends on NUMA
> +	default 16 if ARM64
> +	default 0
> +	help
> +	  Defines the size (in MiB) of the per-numa memory area for Contiguous
> +	  Memory Allocator. Every numa node will get a separate CMA with this
> +	  size. If the size of 0 is selected, per-numa CMA is disabled.
> +

I think this needs to be cleverer than just a static config option. 
Pretty much everything else CMA-related is runtime-configurable to some 
degree, and doing any per-node setup when booting on a single-node 
system would be wasted effort.

Since this is conceptually very similar to the existing hugetlb_cma 
implementation I'm also wondering about inconsistency with respect to 
specifying per-node vs. total sizes.

Another thought, though, is that systems large enough to have multiple 
NUMA nodes tend not to be short on memory, so it might not be 
unreasonable to base this all on whatever size the default area is 
given, and simply have a binary on/off switch to control the per-node 
aspect.

>   config CMA_SIZE_MBYTES
>   	int "Size in Mega Bytes"
>   	depends on !CMA_SIZE_SEL_PERCENTAGE
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> index 15bc5026c485..bcbd53aead93 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -30,7 +30,14 @@
>   #define CMA_SIZE_MBYTES 0
>   #endif
>   
> +#ifdef CONFIG_CMA_PERNUMA_SIZE_MBYTES
> +#define CMA_SIZE_PERNUMA_MBYTES CONFIG_CMA_PERNUMA_SIZE_MBYTES
> +#else
> +#define CMA_SIZE_PERNUMA_MBYTES 0
> +#endif
> +
>   struct cma *dma_contiguous_default_area;
> +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
>   
>   /*
>    * Default global CMA area size can be defined in kernel's .config.
> @@ -44,6 +51,8 @@ struct cma *dma_contiguous_default_area;
>    */
>   static const phys_addr_t size_bytes __initconst =
>   	(phys_addr_t)CMA_SIZE_MBYTES * SZ_1M;
> +static const phys_addr_t pernuma_size_bytes __initconst =
> +	(phys_addr_t)CMA_SIZE_PERNUMA_MBYTES * SZ_1M;
>   static phys_addr_t  size_cmdline __initdata = -1;
>   static phys_addr_t base_cmdline __initdata;
>   static phys_addr_t limit_cmdline __initdata;
> @@ -96,6 +105,33 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void)
>   
>   #endif
>   
> +void __init dma_pernuma_cma_reserve(void)
> +{
> +	int nid;
> +
> +	if (!pernuma_size_bytes || nr_online_nodes <= 1)
> +		return;
> +
> +	for_each_node_state(nid, N_ONLINE) {

Do we need/want notifiers to handle currently-offline nodes coming 
online later (I'm not sure off-hand how NUMA interacts with stuff like 
"maxcpus=n")?

> +		int ret;
> +		char name[20];
> +
> +		snprintf(name, sizeof(name), "pernuma%d", nid);
> +		ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
> +						 0, false, name,
> +						 &dma_contiguous_pernuma_area[nid],
> +						 nid);
> +		if (ret) {
> +			pr_warn("%s: reservation failed: err %d, node %d", __func__,
> +				ret, nid);
> +			continue;
> +		}
> +
> +		pr_debug("%s: reserved %llu MiB on node %d\n", __func__,
> +			(unsigned long long)pernuma_size_bytes / SZ_1M, nid);
> +	}
> +}
> +
>   /**
>    * dma_contiguous_reserve() - reserve area(s) for contiguous memory handling
>    * @limit: End address of the reserved memory (optional, 0 for any).
> @@ -222,22 +258,31 @@ bool dma_release_from_contiguous(struct device *dev, struct page *pages,
>    * @gfp:   Allocation flags.
>    *
>    * This function allocates contiguous memory buffer for specified device. It
> - * tries to use device specific contiguous memory area if available, or the
> - * default global one.
> + * tries to use device specific contiguous memory area if available, or it
> + * tries to use per-numa cma, if the allocation fails, it will fallback to
> + * try default global one.
>    *
> - * Note that it byapss one-page size of allocations from the global area as
> - * the addresses within one page are always contiguous, so there is no need
> - * to waste CMA pages for that kind; it also helps reduce fragmentations.
> + * Note that it bypass one-page size of allocations from the per-numa and
> + * global area as the addresses within one page are always contiguous, so
> + * there is no need to waste CMA pages for that kind; it also helps reduce
> + * fragmentations.
>    */
>   struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
>   {
>   	size_t count = size >> PAGE_SHIFT;
>   	struct page *page = NULL;
>   	struct cma *cma = NULL;
> +	int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;

dev should never be NULL here (the existing check below could be cleaned 
up if we're refactoring anyway).

> +	bool alloc_from_pernuma = false;
>   
>   	if (dev && dev->cma_area)
>   		cma = dev->cma_area;
> -	else if (count > 1)
> +	else if ((nid != NUMA_NO_NODE) && dma_contiguous_pernuma_area[nid]
> +		&& !(gfp & (GFP_DMA | GFP_DMA32))
> +		&& (count > 1)) {
> +		cma = dma_contiguous_pernuma_area[nid];
> +		alloc_from_pernuma = true;
> +	} else if (count > 1)

Well this is a big ugly mess... I'd suggest restructuring the whole 
function to bail out immediately if (count == 1 && !dev->cma_area), then 
try the per-device, per-node and default areas in turn until something 
works.

>   		cma = dma_contiguous_default_area;
>   
>   	/* CMA can be used only in the context which permits sleeping */
> @@ -246,6 +291,11 @@ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
>   		size_t cma_align = min_t(size_t, align, CONFIG_CMA_ALIGNMENT);
>   
>   		page = cma_alloc(cma, count, cma_align, gfp & __GFP_NOWARN);
> +
> +		/* fall back to default cma if failed in per-numa cma */
> +		if (!page && alloc_from_pernuma)
> +			page = cma_alloc(dma_contiguous_default_area, count,
> +				cma_align, gfp & __GFP_NOWARN);
>   	}
>   
>   	return page;
> @@ -264,9 +314,40 @@ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
>    */
>   void dma_free_contiguous(struct device *dev, struct page *page, size_t size)
>   {
> -	if (!cma_release(dev_get_cma_area(dev), page,
> -			 PAGE_ALIGN(size) >> PAGE_SHIFT))
> -		__free_pages(page, get_order(size));
> +	/* if dev has its own cma, free page from there */
> +	if (dev && dev->cma_area) {

Again, no new redundant NULL checks please.

> +		if (cma_release(dev->cma_area, page, PAGE_ALIGN(size) >> PAGE_SHIFT))
> +			return;
> +	} else {
> +		/*
> +		 * otherwise, page is from either per-numa cma or default cma
> +		 */
> +		int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;
> +
> +		if (nid != NUMA_NO_NODE) {
> +			int i;
> +
> +			/*
> +			 * Literally we only need to call cma_release() on pernuma cma of
> +			 * node nid, howerver, a corner case is that users might write
> +			 * /sys/devices/pci-x/numa_node to change node to workaround firmware
> +			 * bug, so it might allocate memory from nodeA CMA, but free from nodeB
> +			 * CMA.
> +			 */

Why bother with this dance at all? You have the page, so you can't not 
know where it is. Just use page_to_nid() like hugetlb_cma does.

Robin.

> +			for (i = 0; i < MAX_NUMNODES; i++) {
> +				if (cma_release(dma_contiguous_pernuma_area[i], page,
> +							PAGE_ALIGN(size) >> PAGE_SHIFT))
> +					return;
> +			}
> +		}
> +
> +		if (cma_release(dma_contiguous_default_area, page,
> +					PAGE_ALIGN(size) >> PAGE_SHIFT))
> +			return;
> +	}
> +
> +	/* not in any cma, free from buddy */
> +	__free_pages(page, get_order(size));
>   }
>   
>   /*
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after numa_init
  2020-06-25  7:43 ` [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after numa_init Barry Song
@ 2020-06-25 11:15   ` Robin Murphy
  2020-06-26  3:44     ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 8+ messages in thread
From: Robin Murphy @ 2020-06-25 11:15 UTC (permalink / raw)
  To: Barry Song, hch, m.szyprowski, will, ganapatrao.kulkarni,
	catalin.marinas
  Cc: iommu, linuxarm, linux-arm-kernel, linux-kernel,
	Nicolas Saenz Julienne, Steve Capper, Andrew Morton,
	Mike Rapoport

On 2020-06-25 08:43, Barry Song wrote:
> Right now, smmu is using dma_alloc_coherent() to get memory to save queues
> and tables. Typically, on ARM64 server, there is a default CMA located at
> node0, which could be far away from node2, node3 etc.
> with this patch, smmu will get memory from local numa node to save command
> queues and page tables. that means dma_unmap latency will be shrunk much.
> Meanwhile, when iommu.passthrough is on, device drivers which call dma_
> alloc_coherent() will also get local memory and avoid the travel between
> numa nodes.
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> Cc: Steve Capper <steve.capper@arm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
>   arch/arm64/mm/init.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 1e93cfc7c47a..07d4d1fe7983 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -420,6 +420,8 @@ void __init bootmem_init(void)
>   
>   	arm64_numa_init();
>   
> +	dma_pernuma_cma_reserve();
> +

It might be worth putting this after the hugetlb_cma_reserve() call for 
clarity, since the comment below applies equally to this call too.

Robin.

>   	/*
>   	 * must be done after arm64_numa_init() which calls numa_init() to
>   	 * initialize node_online_map that gets used in hugetlb_cma_reserve()
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after numa_init
  2020-06-25 11:15   ` Robin Murphy
@ 2020-06-26  3:44     ` Song Bao Hua (Barry Song)
  0 siblings, 0 replies; 8+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-06-26  3:44 UTC (permalink / raw)
  To: Robin Murphy, hch, m.szyprowski, will, ganapatrao.kulkarni,
	catalin.marinas
  Cc: iommu, Linuxarm, linux-arm-kernel, linux-kernel,
	Nicolas Saenz Julienne, Steve Capper, Andrew Morton,
	Mike Rapoport



> -----Original Message-----
> From: Robin Murphy [mailto:robin.murphy@arm.com]
> Sent: Thursday, June 25, 2020 11:16 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; hch@lst.de;
> m.szyprowski@samsung.com; will@kernel.org;
> ganapatrao.kulkarni@cavium.com; catalin.marinas@arm.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm <linuxarm@huawei.com>;
> linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; Nicolas
> Saenz Julienne <nsaenzjulienne@suse.de>; Steve Capper
> <steve.capper@arm.com>; Andrew Morton <akpm@linux-foundation.org>;
> Mike Rapoport <rppt@linux.ibm.com>
> Subject: Re: [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after
> numa_init
> 
> On 2020-06-25 08:43, Barry Song wrote:
> > Right now, smmu is using dma_alloc_coherent() to get memory to save
> queues
> > and tables. Typically, on ARM64 server, there is a default CMA located at
> > node0, which could be far away from node2, node3 etc.
> > with this patch, smmu will get memory from local numa node to save
> command
> > queues and page tables. that means dma_unmap latency will be shrunk
> much.
> > Meanwhile, when iommu.passthrough is on, device drivers which call dma_
> > alloc_coherent() will also get local memory and avoid the travel between
> > numa nodes.
> >
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> > Cc: Steve Capper <steve.capper@arm.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Mike Rapoport <rppt@linux.ibm.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> >   arch/arm64/mm/init.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 1e93cfc7c47a..07d4d1fe7983 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -420,6 +420,8 @@ void __init bootmem_init(void)
> >
> >   	arm64_numa_init();
> >
> > +	dma_pernuma_cma_reserve();
> > +
> 
> It might be worth putting this after the hugetlb_cma_reserve() call for
> clarity, since the comment below applies equally to this call too.

Yep, it looks even better though dma_pernuma_cma_reserve() is self-documenting by name.

> 
> Robin.
> 
> >   	/*
> >   	 * must be done after arm64_numa_init() which calls numa_init() to
> >   	 * initialize node_online_map that gets used in hugetlb_cma_reserve()
> >
Thanks
Barry


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA
  2020-06-25 11:10   ` Robin Murphy
@ 2020-06-26 12:01     ` Song Bao Hua (Barry Song)
  0 siblings, 0 replies; 8+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-06-26 12:01 UTC (permalink / raw)
  To: Robin Murphy, hch, m.szyprowski, will, ganapatrao.kulkarni,
	catalin.marinas
  Cc: iommu, Linuxarm, linux-arm-kernel, linux-kernel,
	Jonathan Cameron, Nicolas Saenz Julienne, Steve Capper,
	Andrew Morton, Mike Rapoport



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Robin Murphy
> Sent: Thursday, June 25, 2020 11:11 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; hch@lst.de;
> m.szyprowski@samsung.com; will@kernel.org;
> ganapatrao.kulkarni@cavium.com; catalin.marinas@arm.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm <linuxarm@huawei.com>;
> linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; Jonathan
> Cameron <jonathan.cameron@huawei.com>; Nicolas Saenz Julienne
> <nsaenzjulienne@suse.de>; Steve Capper <steve.capper@arm.com>; Andrew
> Morton <akpm@linux-foundation.org>; Mike Rapoport <rppt@linux.ibm.com>
> Subject: Re: [PATCH v2 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 
> On 2020-06-25 08:43, Barry Song wrote:
> > This is useful for at least two scenarios:
> > 1. ARM64 smmu will get memory from local numa node, it can save its
> > command queues and page tables locally. Tests show it can decrease
> > dma_unmap latency at lot. For example, without this patch, smmu on
> > node2 will get memory from node0 by calling dma_alloc_coherent(),
> > typically, it has to wait for more than 560ns for the completion of
> > CMD_SYNC in an empty command queue; with this patch, it needs 240ns
> > only.
> > 2. when we set iommu passthrough, drivers will get memory from CMA,
> > local memory means much less latency.
> >
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> > Cc: Steve Capper <steve.capper@arm.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Mike Rapoport <rppt@linux.ibm.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> >   include/linux/dma-contiguous.h |  4 ++
> >   kernel/dma/Kconfig             | 10 ++++
> >   kernel/dma/contiguous.c        | 99
> ++++++++++++++++++++++++++++++----
> >   3 files changed, 104 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/dma-contiguous.h
> b/include/linux/dma-contiguous.h
> > index 03f8e98e3bcc..278a80a40456 100644
> > --- a/include/linux/dma-contiguous.h
> > +++ b/include/linux/dma-contiguous.h
> > @@ -79,6 +79,8 @@ static inline void dma_contiguous_set_default(struct
> cma *cma)
> >
> >   void dma_contiguous_reserve(phys_addr_t addr_limit);
> >
> > +void dma_pernuma_cma_reserve(void);
> > +
> >   int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t
> base,
> >   				       phys_addr_t limit, struct cma **res_cma,
> >   				       bool fixed);
> > @@ -128,6 +130,8 @@ static inline void dma_contiguous_set_default(struct
> cma *cma) { }
> >
> >   static inline void dma_contiguous_reserve(phys_addr_t limit) { }
> >
> > +static inline void dma_pernuma_cma_reserve(void) { }
> > +
> >   static inline int dma_contiguous_reserve_area(phys_addr_t size,
> phys_addr_t base,
> >   				       phys_addr_t limit, struct cma **res_cma,
> >   				       bool fixed)
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index d006668c0027..aeb976b1d21c 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -104,6 +104,16 @@ config DMA_CMA
> >   if  DMA_CMA
> >   comment "Default contiguous memory area size:"
> >
> > +config CMA_PERNUMA_SIZE_MBYTES
> > +	int "Size in Mega Bytes for per-numa CMA areas"
> > +	depends on NUMA
> > +	default 16 if ARM64
> > +	default 0
> > +	help
> > +	  Defines the size (in MiB) of the per-numa memory area for Contiguous
> > +	  Memory Allocator. Every numa node will get a separate CMA with this
> > +	  size. If the size of 0 is selected, per-numa CMA is disabled.
> > +
> 
> I think this needs to be cleverer than just a static config option.
> Pretty much everything else CMA-related is runtime-configurable to some
> degree, and doing any per-node setup when booting on a single-node
> system would be wasted effort.

I agree some dynamic configuration should be supported to set the size of cma.
It could be a kernel parameter bootargs, or leverage an existing parameter.

For a system with NUMA enabled, but with only one node or actually non-numa, 
the current dma_pernuma_cma_reserve() won't do anything:

void __init dma_pernuma_cma_reserve(void)
{
	int nid;

	if (!pernuma_size_bytes || nr_online_nodes <= 1)
		return;
}

> 
> Since this is conceptually very similar to the existing hugetlb_cma
> implementation I'm also wondering about inconsistency with respect to
> specifying per-node vs. total sizes.

For hugetlbcma, pernuma cma size is: total_size/the number of nodes
It is quite simple.

But for DMA, node0's CMA might support address limitation for old
devices which allocate memory by GFP_DMA(32).

> 
> Another thought, though, is that systems large enough to have multiple
> NUMA nodes tend not to be short on memory, so it might not be
> unreasonable to base this all on whatever size the default area is
> given, and simply have a binary on/off switch to control the per-node
> aspect.

I don't mind applying the default size to all NUMA nodes as generally a server
has larger memory than embedded system.
One issue is that the default global CMA is on node0, and if the code applies the default size to
every node, the default cma is better to also as node0's pernuma cma too. Otherwise, there
would be two CMA with same size on node0. 

So the code can bypass node0 while creating per-numa CMA.

On the other hand, only CMA on node0 support address limitation for old devices
as its memory is in low physical address.
That is probably why the cma of node0 still need to be handled separately.

So node0's CMA will be both default global CMA and per-NUMA CMA of node0.

> 
> >   config CMA_SIZE_MBYTES
> >   	int "Size in Mega Bytes"
> >   	depends on !CMA_SIZE_SEL_PERCENTAGE
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index 15bc5026c485..bcbd53aead93 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -30,7 +30,14 @@
> >   #define CMA_SIZE_MBYTES 0
> >   #endif
> >
> > +#ifdef CONFIG_CMA_PERNUMA_SIZE_MBYTES
> > +#define CMA_SIZE_PERNUMA_MBYTES
> CONFIG_CMA_PERNUMA_SIZE_MBYTES
> > +#else
> > +#define CMA_SIZE_PERNUMA_MBYTES 0
> > +#endif
> > +
> >   struct cma *dma_contiguous_default_area;
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> >
> >   /*
> >    * Default global CMA area size can be defined in kernel's .config.
> > @@ -44,6 +51,8 @@ struct cma *dma_contiguous_default_area;
> >    */
> >   static const phys_addr_t size_bytes __initconst =
> >   	(phys_addr_t)CMA_SIZE_MBYTES * SZ_1M;
> > +static const phys_addr_t pernuma_size_bytes __initconst =
> > +	(phys_addr_t)CMA_SIZE_PERNUMA_MBYTES * SZ_1M;
> >   static phys_addr_t  size_cmdline __initdata = -1;
> >   static phys_addr_t base_cmdline __initdata;
> >   static phys_addr_t limit_cmdline __initdata;
> > @@ -96,6 +105,33 @@ static inline __maybe_unused phys_addr_t
> cma_early_percent_memory(void)
> >
> >   #endif
> >
> > +void __init dma_pernuma_cma_reserve(void)
> > +{
> > +	int nid;
> > +
> > +	if (!pernuma_size_bytes || nr_online_nodes <= 1)
> > +		return;
> > +
> > +	for_each_node_state(nid, N_ONLINE) {
> 
> Do we need/want notifiers to handle currently-offline nodes coming
> online later (I'm not sure off-hand how NUMA interacts with stuff like
> "maxcpus=n")?

I had a try.

For example, if the hardware has 4 numa nodes and each one has 24 CPUs.

Setting maxcpus=20 will result in:

4 numa nodes are all online after arm64_numa_init() ;
CPU0-19 is online;
CPU20-95 is offline;

$cat /sys/devices/system/cpu/cpu0/online 
1

$cat /sys/devices/system/cpu/cpu24/online 
0

$cat /sys/devices/system/cpu/cpu50/online 
0

But the printk still print 4 here:

--- a/arch/arm64/mm/init.c                                                                                                                                 
+++ b/arch/arm64/mm/init.c                                                                                                                                 
@@ -420,6 +420,8 @@ void __init bootmem_init(void)

        arm64_numa_init();                                                                                                                                 

+       printk("%s nr_online_node :%d\n", __func__, nr_online_nodes);
+
        /*                                                                                                                                                 
         * must be done after arm64_numa_init() which calls numa_init() to                                                                                 
         * initialize node_online_map that gets used in hugetlb_cma_reserve()    

> 
> > +		int ret;
> > +		char name[20];
> > +
> > +		snprintf(name, sizeof(name), "pernuma%d", nid);
> > +		ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
> > +						 0, false, name,
> > +						 &dma_contiguous_pernuma_area[nid],
> > +						 nid);
> > +		if (ret) {
> > +			pr_warn("%s: reservation failed: err %d, node %d", __func__,
> > +				ret, nid);
> > +			continue;
> > +		}
> > +
> > +		pr_debug("%s: reserved %llu MiB on node %d\n", __func__,
> > +			(unsigned long long)pernuma_size_bytes / SZ_1M, nid);
> > +	}
> > +}
> > +
> >   /**
> >    * dma_contiguous_reserve() - reserve area(s) for contiguous memory
> handling
> >    * @limit: End address of the reserved memory (optional, 0 for any).
> > @@ -222,22 +258,31 @@ bool dma_release_from_contiguous(struct device
> *dev, struct page *pages,
> >    * @gfp:   Allocation flags.
> >    *
> >    * This function allocates contiguous memory buffer for specified device.
> It
> > - * tries to use device specific contiguous memory area if available, or the
> > - * default global one.
> > + * tries to use device specific contiguous memory area if available, or it
> > + * tries to use per-numa cma, if the allocation fails, it will fallback to
> > + * try default global one.
> >    *
> > - * Note that it byapss one-page size of allocations from the global area as
> > - * the addresses within one page are always contiguous, so there is no need
> > - * to waste CMA pages for that kind; it also helps reduce fragmentations.
> > + * Note that it bypass one-page size of allocations from the per-numa and
> > + * global area as the addresses within one page are always contiguous, so
> > + * there is no need to waste CMA pages for that kind; it also helps reduce
> > + * fragmentations.
> >    */
> >   struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t
> gfp)
> >   {
> >   	size_t count = size >> PAGE_SHIFT;
> >   	struct page *page = NULL;
> >   	struct cma *cma = NULL;
> > +	int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;
> 
> dev should never be NULL here (the existing check below could be cleaned
> up if we're refactoring anyway).

It was not done in v1. But kernel test robot reported issue like 

"smatch warnings:
kernel/dma/contiguous.c:274 dma_alloc_contiguous() warn: variable dereferenced before \
check 'dev' (see line 272)"

https://marc.info/?l=linux-arm-kernel&m=159127076208780&w=2

> 
> > +	bool alloc_from_pernuma = false;
> >
> >   	if (dev && dev->cma_area)
> >   		cma = dev->cma_area;
> > -	else if (count > 1)
> > +	else if ((nid != NUMA_NO_NODE) &&
> dma_contiguous_pernuma_area[nid]
> > +		&& !(gfp & (GFP_DMA | GFP_DMA32))
> > +		&& (count > 1)) {
> > +		cma = dma_contiguous_pernuma_area[nid];
> > +		alloc_from_pernuma = true;
> > +	} else if (count > 1)
> 
> Well this is a big ugly mess... I'd suggest restructuring the whole
> function to bail out immediately if (count == 1 && !dev->cma_area), then
> try the per-device, per-node and default areas in turn until something
> works.

Agreed

> 
> >   		cma = dma_contiguous_default_area;
> >
> >   	/* CMA can be used only in the context which permits sleeping */
> > @@ -246,6 +291,11 @@ struct page *dma_alloc_contiguous(struct device
> *dev, size_t size, gfp_t gfp)
> >   		size_t cma_align = min_t(size_t, align, CONFIG_CMA_ALIGNMENT);
> >
> >   		page = cma_alloc(cma, count, cma_align, gfp & __GFP_NOWARN);
> > +
> > +		/* fall back to default cma if failed in per-numa cma */
> > +		if (!page && alloc_from_pernuma)
> > +			page = cma_alloc(dma_contiguous_default_area, count,
> > +				cma_align, gfp & __GFP_NOWARN);
> >   	}
> >
> >   	return page;
> > @@ -264,9 +314,40 @@ struct page *dma_alloc_contiguous(struct device
> *dev, size_t size, gfp_t gfp)
> >    */
> >   void dma_free_contiguous(struct device *dev, struct page *page, size_t
> size)
> >   {
> > -	if (!cma_release(dev_get_cma_area(dev), page,
> > -			 PAGE_ALIGN(size) >> PAGE_SHIFT))
> > -		__free_pages(page, get_order(size));
> > +	/* if dev has its own cma, free page from there */
> > +	if (dev && dev->cma_area) {
> 
> Again, no new redundant NULL checks please.

Kernel test robot might report issue.

> 
> > +		if (cma_release(dev->cma_area, page, PAGE_ALIGN(size) >>
> PAGE_SHIFT))
> > +			return;
> > +	} else {
> > +		/*
> > +		 * otherwise, page is from either per-numa cma or default cma
> > +		 */
> > +		int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;
> > +
> > +		if (nid != NUMA_NO_NODE) {
> > +			int i;
> > +
> > +			/*
> > +			 * Literally we only need to call cma_release() on pernuma cma
> of
> > +			 * node nid, howerver, a corner case is that users might write
> > +			 * /sys/devices/pci-x/numa_node to change node to
> workaround firmware
> > +			 * bug, so it might allocate memory from nodeA CMA, but free
> from nodeB
> > +			 * CMA.
> > +			 */
> 
> Why bother with this dance at all? You have the page, so you can't not
> know where it is. Just use page_to_nid() like hugetlb_cma does.

Cool.

> 
> Robin.
> 
> > +			for (i = 0; i < MAX_NUMNODES; i++) {
> > +				if (cma_release(dma_contiguous_pernuma_area[i], page,
> > +							PAGE_ALIGN(size) >> PAGE_SHIFT))
> > +					return;
> > +			}
> > +		}
> > +
> > +		if (cma_release(dma_contiguous_default_area, page,
> > +					PAGE_ALIGN(size) >> PAGE_SHIFT))
> > +			return;
> > +	}
> > +
> > +	/* not in any cma, free from buddy */
> > +	__free_pages(page, get_order(size));
> >   }

Thanks
Barry


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA
  2020-06-25  7:43 ` [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA Barry Song
  2020-06-25 11:10   ` Robin Murphy
@ 2020-06-28  8:58   ` kernel test robot
  1 sibling, 0 replies; 8+ messages in thread
From: kernel test robot @ 2020-06-28  8:58 UTC (permalink / raw)
  To: Barry Song, hch, m.szyprowski, robin.murphy, will,
	ganapatrao.kulkarni, catalin.marinas
  Cc: kbuild-all, iommu, linuxarm, linux-arm-kernel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7448 bytes --]

Hi Barry,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.8-rc2 next-20200625]
[cannot apply to arm64/for-next/core hch-configfs/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Barry-Song/make-dma_alloc_coherent-NUMA-aware-by-per-NUMA-CMA/20200625-154656
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 8be3a53e18e0e1a98f288f6c7f5e9da3adbe9c49
config: x86_64-randconfig-s022-20200624 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 ARCH=x86_64 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

>> kernel/dma/contiguous.c:283:50: sparse: sparse: invalid access below 'dma_contiguous_pernuma_area' (-8 8)

# https://github.com/0day-ci/linux/commit/d6930169a3364418b985c2d19c31ecf1c4c3d4a9
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout d6930169a3364418b985c2d19c31ecf1c4c3d4a9
vim +/dma_contiguous_pernuma_area +283 kernel/dma/contiguous.c

de9e14eebf33a6 drivers/base/dma-contiguous.c Marek Szyprowski  2014-10-13  253  
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  254  /**
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  255   * dma_alloc_contiguous() - allocate contiguous pages
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  256   * @dev:   Pointer to device for which the allocation is performed.
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  257   * @size:  Requested allocation size.
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  258   * @gfp:   Allocation flags.
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  259   *
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  260   * This function allocates contiguous memory buffer for specified device. It
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  261   * tries to use device specific contiguous memory area if available, or it
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  262   * tries to use per-numa cma, if the allocation fails, it will fallback to
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  263   * try default global one.
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  264   *
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  265   * Note that it bypass one-page size of allocations from the per-numa and
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  266   * global area as the addresses within one page are always contiguous, so
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  267   * there is no need to waste CMA pages for that kind; it also helps reduce
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  268   * fragmentations.
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  269   */
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  270  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  271  {
90ae409f9eb3bc kernel/dma/contiguous.c       Christoph Hellwig 2019-08-20  272  	size_t count = size >> PAGE_SHIFT;
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  273  	struct page *page = NULL;
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  274  	struct cma *cma = NULL;
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  275  	int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  276  	bool alloc_from_pernuma = false;
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  277  
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  278  	if (dev && dev->cma_area)
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  279  		cma = dev->cma_area;
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  280  	else if ((nid != NUMA_NO_NODE) && dma_contiguous_pernuma_area[nid]
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  281  		&& !(gfp & (GFP_DMA | GFP_DMA32))
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  282  		&& (count > 1)) {
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25 @283  		cma = dma_contiguous_pernuma_area[nid];
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  284  		alloc_from_pernuma = true;
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  285  	} else if (count > 1)
bd2e75633c8012 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  286  		cma = dma_contiguous_default_area;
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  287  
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  288  	/* CMA can be used only in the context which permits sleeping */
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  289  	if (cma && gfpflags_allow_blocking(gfp)) {
90ae409f9eb3bc kernel/dma/contiguous.c       Christoph Hellwig 2019-08-20  290  		size_t align = get_order(size);
c6622a425acd1d kernel/dma/contiguous.c       Nicolin Chen      2019-07-26  291  		size_t cma_align = min_t(size_t, align, CONFIG_CMA_ALIGNMENT);
c6622a425acd1d kernel/dma/contiguous.c       Nicolin Chen      2019-07-26  292  
c6622a425acd1d kernel/dma/contiguous.c       Nicolin Chen      2019-07-26  293  		page = cma_alloc(cma, count, cma_align, gfp & __GFP_NOWARN);
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  294  
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  295  		/* fall back to default cma if failed in per-numa cma */
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  296  		if (!page && alloc_from_pernuma)
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  297  			page = cma_alloc(dma_contiguous_default_area, count,
d6930169a33644 kernel/dma/contiguous.c       Barry Song        2020-06-25  298  				cma_align, gfp & __GFP_NOWARN);
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  299  	}
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  300  
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  301  	return page;
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  302  }
b1d2dc009dece4 kernel/dma/contiguous.c       Nicolin Chen      2019-05-23  303  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30547 bytes --]

[-- Attachment #3: Type: text/plain, Size: 149 bytes --]

_______________________________________________
kbuild mailing list -- kbuild@lists.01.org
To unsubscribe send an email to kbuild-leave@lists.01.org

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-06-28  8:59 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-25  7:43 [PATCH v2 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA Barry Song
2020-06-25  7:43 ` [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA Barry Song
2020-06-25 11:10   ` Robin Murphy
2020-06-26 12:01     ` Song Bao Hua (Barry Song)
2020-06-28  8:58   ` kernel test robot
2020-06-25  7:43 ` [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after numa_init Barry Song
2020-06-25 11:15   ` Robin Murphy
2020-06-26  3:44     ` Song Bao Hua (Barry Song)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).