* [PATCH v2 0/4] Chunk Heap Support on DMA-HEAP
@ 2020-12-01 17:51 Minchan Kim
  2020-12-01 17:51 ` [PATCH v2 1/4] mm: introduce alloc_contig_mode Minchan Kim
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Minchan Kim @ 2020-12-01 17:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, hyesoo.yu, willy, david, iamjoonsoo.kim, vbabka,
	surenb, pullip.cho, joaodias, hridya, sumit.semwal, john.stultz,
	Brian.Starkey, linux-media, devicetree, robh, christian.koenig,
	linaro-mm-sig, Minchan Kim

This patchset introduces a new dma heap, the chunk heap, which makes it
easy to perform bulk allocation of high-order pages.
It was created to help optimize 4K/8K HDR video playback with secure
DRM HW that protects contents in memory. The HW needs physically
contiguous memory chunks of up to several hundred MB.

This patchset is against next-20201130.

The patchset includes the following:
 - cma_alloc_bulk API
 - export the dma-heap API so that modular dma heaps can register themselves.
 - add the chunk heap implementation.

* Since v1 - 
  https://lore.kernel.org/linux-mm/20201117181935.3613581-1-minchan@kernel.org/

  * introduce alloc_contig_mode - David
  * use default CMA instead of device tree - John
    
Hyesoo Yu (2):
  dma-buf: add export symbol for dma-heap
  dma-buf: heaps: add chunk heap to dmabuf heaps

Minchan Kim (2):
  mm: introduce alloc_contig_mode
  mm: introduce cma_alloc_bulk API

 drivers/dma-buf/dma-heap.c         |   2 +
 drivers/dma-buf/heaps/Kconfig      |  15 +
 drivers/dma-buf/heaps/Makefile     |   1 +
 drivers/dma-buf/heaps/chunk_heap.c | 429 +++++++++++++++++++++++++++++
 drivers/virtio/virtio_mem.c        |   2 +-
 include/linux/cma.h                |   5 +
 include/linux/gfp.h                |  10 +-
 kernel/dma/contiguous.c            |   1 +
 mm/cma.c                           | 134 ++++++++-
 mm/page_alloc.c                    |  25 +-
 10 files changed, 607 insertions(+), 17 deletions(-)
 create mode 100644 drivers/dma-buf/heaps/chunk_heap.c

-- 
2.29.2.454.gaff20da3a2-goog



* [PATCH v2 1/4] mm: introduce alloc_contig_mode
  2020-12-01 17:51 [PATCH v2 0/4] Chunk Heap Support on DMA-HEAP Minchan Kim
@ 2020-12-01 17:51 ` Minchan Kim
  2020-12-01 17:51 ` [PATCH v2 2/4] mm: introduce cma_alloc_bulk API Minchan Kim
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2020-12-01 17:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, hyesoo.yu, willy, david, iamjoonsoo.kim, vbabka,
	surenb, pullip.cho, joaodias, hridya, sumit.semwal, john.stultz,
	Brian.Starkey, linux-media, devicetree, robh, christian.koenig,
	linaro-mm-sig, Minchan Kim

There is a demand to control how hard alloc_contig_range tries to
increase the allocation success ratio. Abstract it by adding a new
enum mode parameter to alloc_contig_range. A new API in the next
patch will add another mode there to control it.

This patch shouldn't change any existing behavior.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/virtio/virtio_mem.c | 2 +-
 include/linux/gfp.h         | 8 +++++++-
 mm/cma.c                    | 3 ++-
 mm/page_alloc.c             | 6 ++++--
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 9fc9ec4a25f5..5585fc67b65e 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1148,7 +1148,7 @@ static int virtio_mem_fake_offline(unsigned long pfn, unsigned long nr_pages)
 	 */
 	for (retry_count = 0; retry_count < 5; retry_count++) {
 		rc = alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE,
-					GFP_KERNEL);
+					GFP_KERNEL, ALLOC_CONTIG_NORMAL);
 		if (rc == -ENOMEM)
 			/* whoops, out of memory */
 			return rc;
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c603237e006c..ad5872699692 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -624,9 +624,15 @@ static inline bool pm_suspended_storage(void)
 #endif /* CONFIG_PM_SLEEP */
 
 #ifdef CONFIG_CONTIG_ALLOC
+enum alloc_contig_mode {
+	/* try several ways to increase success ratio of memory allocation */
+	ALLOC_CONTIG_NORMAL,
+};
+
 /* The below functions must be run on a range from a single zone. */
 extern int alloc_contig_range(unsigned long start, unsigned long end,
-			      unsigned migratetype, gfp_t gfp_mask);
+			      unsigned migratetype, gfp_t gfp_mask,
+			      enum alloc_contig_mode mode);
 extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
 				       int nid, nodemask_t *nodemask);
 #endif
diff --git a/mm/cma.c b/mm/cma.c
index 3692a34e2353..8010c1ba04b0 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -454,7 +454,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 
 		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
 		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
-				     GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+				     GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0),
+				     ALLOC_CONTIG_NORMAL);
 
 		if (ret == 0) {
 			page = pfn_to_page(pfn);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f91df593bf71..adfbfd95fbc3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8533,6 +8533,7 @@ static void __alloc_contig_clear_range(unsigned long start_pfn,
  *			be either of the two.
  * @gfp_mask:	GFP mask to use during compaction. __GFP_ZERO clears allocated
  *		pages.
+ * @mode:	how hard it will try to increase allocation success ratio
  *
  * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
  * aligned.  The PFN range must belong to a single zone.
@@ -8546,7 +8547,8 @@ static void __alloc_contig_clear_range(unsigned long start_pfn,
  * need to be freed with free_contig_range().
  */
 int alloc_contig_range(unsigned long start, unsigned long end,
-		       unsigned migratetype, gfp_t gfp_mask)
+		       unsigned migratetype, gfp_t gfp_mask,
+		       enum alloc_contig_mode mode)
 {
 	unsigned long outer_start, outer_end;
 	unsigned int order;
@@ -8689,7 +8691,7 @@ static int __alloc_contig_pages(unsigned long start_pfn,
 	unsigned long end_pfn = start_pfn + nr_pages;
 
 	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
-				  gfp_mask);
+				  gfp_mask, ALLOC_CONTIG_NORMAL);
 }
 
 static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
-- 
2.29.2.454.gaff20da3a2-goog



* [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-01 17:51 [PATCH v2 0/4] Chunk Heap Support on DMA-HEAP Minchan Kim
  2020-12-01 17:51 ` [PATCH v2 1/4] mm: introduce alloc_contig_mode Minchan Kim
@ 2020-12-01 17:51 ` Minchan Kim
  2020-12-02  9:14   ` David Hildenbrand
  2020-12-01 17:51 ` [PATCH v2 3/4] dma-buf: add export symbol for dma-heap Minchan Kim
  2020-12-01 17:51 ` [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps Minchan Kim
  3 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2020-12-01 17:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, hyesoo.yu, willy, david, iamjoonsoo.kim, vbabka,
	surenb, pullip.cho, joaodias, hridya, sumit.semwal, john.stultz,
	Brian.Starkey, linux-media, devicetree, robh, christian.koenig,
	linaro-mm-sig, Minchan Kim

Some special HW requires bulk allocation of high-order pages. For
example, it needs 4800 order-4 pages at minimum, and sometimes more.

To meet the requirement, one option is to reserve a 300M CMA area and
request the whole 300M as contiguous memory. However, that doesn't
work if even one of the pages in the range is long-term pinned,
directly or indirectly. The other option is to repeatedly ask for a
higher-order size (e.g., 2M) than the requested order (64K) until the
driver gathers the necessary amount of memory. This approach makes the
allocation very slow because cma_alloc itself is slow, and it can get
stuck on one of the pageblocks if it encounters an unmigratable page.

To solve the issue, this patch introduces cma_alloc_bulk.

	int cma_alloc_bulk(struct cma *cma, unsigned int align,
		bool fast, unsigned int order, size_t nr_requests,
		struct page **page_array, size_t *nr_allocated);

Most parameters are the same as cma_alloc's, but it additionally takes
a page array to store the allocated memory. What's different from
cma_alloc is that it skips a pageblock without waiting/stopping if the
pageblock contains an unmovable page, so the API continues scanning
other pageblocks to find pages of the requested order.

cma_alloc_bulk is a best-effort approach in that, unlike cma_alloc, it
skips pageblocks that contain unmovable pages. It doesn't need to be
perfect from the beginning, at the cost of performance. Thus, the API
takes a "bool fast" parameter which is propagated into
alloc_contig_range to skip the costly operations used to increase the
CMA allocation success ratio (e.g., migration retries, PCP and LRU
draining per pageblock), at the cost of a lower success ratio. If the
caller couldn't allocate enough pages, they can call it again with
fast set to false if they are willing to pay the overhead for a
higher success ratio.
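
For illustration only (not part of this patch), a caller following the
fast-then-slow pattern described above might look roughly like the
sketch below; my_alloc_chunks, my_cma and MY_ORDER are made-up names:

#include <linux/cma.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#define MY_ORDER 4	/* made-up: the HW's chunk order */

static int my_alloc_chunks(struct cma *my_cma, struct page **pages,
			   size_t nr_requests)
{
	size_t done, nr;
	int ret;

	/* fast == true: skip draining/extra migration retries, may under-allocate */
	ret = cma_alloc_bulk(my_cma, MY_ORDER, true, MY_ORDER,
			     nr_requests, pages, &nr);
	done = nr;
	if (!ret)
		return 0;

	/* retry the remainder with fast == false to raise the success ratio */
	ret = cma_alloc_bulk(my_cma, MY_ORDER, false, MY_ORDER,
			     nr_requests - done, pages + done, &nr);
	done += nr;
	if (!ret)
		return 0;

	/* still short: release the partial allocation, one 2^MY_ORDER chunk per entry */
	while (done--)
		cma_release(my_cma, pages[done], 1 << MY_ORDER);
	return ret;
}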

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/cma.h |   5 ++
 include/linux/gfp.h |   2 +
 mm/cma.c            | 126 ++++++++++++++++++++++++++++++++++++++++++--
 mm/page_alloc.c     |  19 ++++---
 4 files changed, 140 insertions(+), 12 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 217999c8a762..7375d3131804 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -46,6 +46,11 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
 					struct cma **res_cma);
 extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 			      bool no_warn);
+
+extern int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
+			unsigned int order, size_t nr_requests,
+			struct page **page_array, size_t *nr_allocated);
+
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
 
 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ad5872699692..75bfb673d75b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -627,6 +627,8 @@ static inline bool pm_suspended_storage(void)
 enum alloc_contig_mode {
 	/* try several ways to increase success ratio of memory allocation */
 	ALLOC_CONTIG_NORMAL,
+	/* avoid costly functions to make the call fast */
+	ALLOC_CONTIG_FAST,
 };
 
 /* The below functions must be run on a range from a single zone. */
diff --git a/mm/cma.c b/mm/cma.c
index 8010c1ba04b0..4459045fa717 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -32,6 +32,7 @@
 #include <linux/highmem.h>
 #include <linux/io.h>
 #include <linux/kmemleak.h>
+#include <linux/swap.h>
 #include <trace/events/cma.h>
 
 #include "cma.h"
@@ -397,6 +398,14 @@ static void cma_debug_show_areas(struct cma *cma)
 static inline void cma_debug_show_areas(struct cma *cma) { }
 #endif
 
+static void reset_page_kasan_tag(struct page *page, int count)
+{
+	int i;
+
+	for (i = 0; i < count; i++)
+		page_kasan_tag_reset(page + i);
+}
+
 /**
  * cma_alloc() - allocate pages from contiguous area
  * @cma:   Contiguous memory region for which the allocation is performed.
@@ -414,7 +423,6 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 	unsigned long pfn = -1;
 	unsigned long start = 0;
 	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
-	size_t i;
 	struct page *page = NULL;
 	int ret = -ENOMEM;
 
@@ -479,10 +487,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 	 * blocks being marked with different tags. Reset the tags to ignore
 	 * those page blocks.
 	 */
-	if (page) {
-		for (i = 0; i < count; i++)
-			page_kasan_tag_reset(page + i);
-	}
+	if (page)
+		reset_page_kasan_tag(page, count);
 
 	if (ret && !no_warn) {
 		pr_err("%s: alloc failed, req-size: %zu pages, ret: %d\n",
@@ -494,6 +500,116 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 	return page;
 }
 
+/*
+ * cma_alloc_bulk() - allocate high order bulk pages from contiguous area with
+ * 		best effort. It will usually be used for private @cma
+ *
+ * @cma:	contiguous memory region for which the allocation is performed.
+ * @align:	requested alignment of pages (in PAGE_SIZE order).
+ * @fast:	will skip costly operations if it's true.
+ * @order:	requested page order
+ * @nr_requests: the number of 2^order pages requested to be allocated as input,
+ * @page_array:	page_array pointer to store allocated pages (must have space
+ *		for at least nr_requests)
+ * @nr_allocated: the number of 2^order pages allocated as output
+ *
+ * This function tries to allocate up to @nr_requests @order pages from the
+ * specific contiguous memory area. If @fast is true, it avoids the costly
+ * functions used to increase the allocation success ratio, so it is faster
+ * but might return fewer than the requested number of pages. The user could
+ * retry with @fast set to false if needed.
+ *
+ * Return: it will return 0 only if all pages requested by @nr_requests are
+ * allocated. Otherwise, it returns negative error code.
+ *
+ * Note: Regardless of success/failure, user should check @nr_allocated to see
+ * how many @order pages are allocated and free those pages when they are not
+ * needed.
+ */
+int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
+			unsigned int order, size_t nr_requests,
+			struct page **page_array, size_t *nr_allocated)
+{
+	int ret = 0;
+	size_t i = 0;
+	unsigned long nr_pages_needed = nr_requests * (1 << order);
+	unsigned long nr_chunk_pages, nr_pages;
+	unsigned long mask, offset;
+	unsigned long pfn = -1;
+	unsigned long start = 0;
+	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
+	struct page *page = NULL;
+	enum alloc_contig_mode mode = fast ? ALLOC_CONTIG_FAST :
+						ALLOC_CONTIG_NORMAL;
+	*nr_allocated = 0;
+	if (!cma || !cma->count || !cma->bitmap || !page_array)
+		return -EINVAL;
+
+	if (!nr_pages_needed)
+		return 0;
+
+	nr_chunk_pages = 1 << max_t(unsigned int, order, pageblock_order);
+
+	mask = cma_bitmap_aligned_mask(cma, align);
+	offset = cma_bitmap_aligned_offset(cma, align);
+	bitmap_maxno = cma_bitmap_maxno(cma);
+
+	lru_add_drain_all();
+	drain_all_pages(NULL);
+
+	while (nr_pages_needed) {
+		nr_pages = min(nr_chunk_pages, nr_pages_needed);
+
+		bitmap_count = cma_bitmap_pages_to_bits(cma, nr_pages);
+		mutex_lock(&cma->lock);
+		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
+				bitmap_maxno, start, bitmap_count, mask,
+				offset);
+		if (bitmap_no >= bitmap_maxno) {
+			mutex_unlock(&cma->lock);
+			break;
+		}
+		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
+		/*
+		 * It's safe to drop the lock here. If the migration fails
+		 * cma_clear_bitmap will take the lock again and unmark it.
+		 */
+		mutex_unlock(&cma->lock);
+
+		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
+		ret = alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_CMA,
+						GFP_KERNEL|__GFP_NOWARN, mode);
+		if (ret) {
+			cma_clear_bitmap(cma, pfn, nr_pages);
+			if (ret != -EBUSY)
+				break;
+
+			/* continue to search next block */
+			start = (pfn + nr_pages - cma->base_pfn) >>
+						cma->order_per_bit;
+			continue;
+		}
+
+		page = pfn_to_page(pfn);
+		while (nr_pages) {
+			page_array[i++] = page;
+			reset_page_kasan_tag(page, 1 << order);
+			page += 1 << order;
+			nr_pages -= 1 << order;
+			nr_pages_needed -= 1 << order;
+		}
+
+		start = bitmap_no + bitmap_count;
+	}
+
+	*nr_allocated = i;
+
+	if (!ret && nr_pages_needed)
+		ret = -EBUSY;
+
+	return ret;
+}
+
 /**
  * cma_release() - release allocated pages
  * @cma:   Contiguous memory region for which the allocation is performed.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index adfbfd95fbc3..2a1799ff14fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8463,7 +8463,8 @@ static unsigned long pfn_max_align_up(unsigned long pfn)
 
 /* [start, end) must belong to a single zone. */
 static int __alloc_contig_migrate_range(struct compact_control *cc,
-					unsigned long start, unsigned long end)
+					unsigned long start, unsigned long end,
+					unsigned int max_tries)
 {
 	/* This function is based on compact_zone() from compaction.c. */
 	unsigned int nr_reclaimed;
@@ -8491,7 +8492,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 				break;
 			}
 			tries = 0;
-		} else if (++tries == 5) {
+		} else if (++tries == max_tries) {
 			ret = ret < 0 ? ret : -EBUSY;
 			break;
 		}
@@ -8553,6 +8554,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	unsigned long outer_start, outer_end;
 	unsigned int order;
 	int ret = 0;
+	bool fast_mode = mode == ALLOC_CONTIG_FAST;
 
 	struct compact_control cc = {
 		.nr_migratepages = 0,
@@ -8595,7 +8597,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	if (ret)
 		return ret;
 
-	drain_all_pages(cc.zone);
+	if (!fast_mode)
+		drain_all_pages(cc.zone);
 
 	/*
 	 * In case of -EBUSY, we'd like to know which page causes problem.
@@ -8607,7 +8610,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * allocated.  So, if we fall through be sure to clear ret so that
 	 * -EBUSY is not accidentally used or returned to caller.
 	 */
-	ret = __alloc_contig_migrate_range(&cc, start, end);
+	ret = __alloc_contig_migrate_range(&cc, start, end, fast_mode ? 1 : 5);
 	if (ret && ret != -EBUSY)
 		goto done;
 	ret =0;
@@ -8629,7 +8632,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * isolated thus they won't get removed from buddy.
 	 */
 
-	lru_add_drain_all();
+	if (!fast_mode)
+		lru_add_drain_all();
 
 	order = 0;
 	outer_start = start;
@@ -8656,8 +8660,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 
 	/* Make sure the range is really isolated. */
 	if (test_pages_isolated(outer_start, end, 0)) {
-		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
-			__func__, outer_start, end);
+		if (!fast_mode)
+			pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
+				__func__, outer_start, end);
 		ret = -EBUSY;
 		goto done;
 	}
-- 
2.29.2.454.gaff20da3a2-goog



* [PATCH v2 3/4] dma-buf: add export symbol for dma-heap
  2020-12-01 17:51 [PATCH v2 0/4] Chunk Heap Support on DMA-HEAP Minchan Kim
  2020-12-01 17:51 ` [PATCH v2 1/4] mm: introduce alloc_contig_mode Minchan Kim
  2020-12-01 17:51 ` [PATCH v2 2/4] mm: introduce cma_alloc_bulk API Minchan Kim
@ 2020-12-01 17:51 ` Minchan Kim
  2020-12-02 13:51   ` Christoph Hellwig
  2020-12-01 17:51 ` [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps Minchan Kim
  3 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2020-12-01 17:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, hyesoo.yu, willy, david, iamjoonsoo.kim, vbabka,
	surenb, pullip.cho, joaodias, hridya, sumit.semwal, john.stultz,
	Brian.Starkey, linux-media, devicetree, robh, christian.koenig,
	linaro-mm-sig, Minchan Kim

From: Hyesoo Yu <hyesoo.yu@samsung.com>

Heaps can be built as modules, so some functions should be exported
so that modular dma-heaps can register themselves. A modular dma-heap
may also use a CMA area to allocate and free memory, but the
CMA-related functions are not exported today. Export them for the
next patches.
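
As a rough illustration (the real user is the chunk heap added in the
next patch), a modular heap built on these exports would do something
like the sketch below; my_heap_ops is a made-up dma_heap_ops instance:

#include <linux/cma.h>
#include <linux/dma-heap.h>
#include <linux/dma-map-ops.h>	/* dev_get_cma_area() */
#include <linux/err.h>
#include <linux/module.h>

/* made-up ops; .allocate would use the exported cma_alloc()/cma_release() */
static const struct dma_heap_ops my_heap_ops = {
	/* .allocate = my_heap_allocate, */
};

static int __init my_heap_init(void)
{
	/* needs the exported dma_contiguous_default_area */
	struct cma *cma = dev_get_cma_area(NULL);
	struct dma_heap_export_info exp_info = { };

	if (!cma)
		return 0;

	exp_info.name = cma_get_name(cma);	/* exported by this patch */
	exp_info.ops = &my_heap_ops;

	return PTR_ERR_OR_ZERO(dma_heap_add(&exp_info));
}
module_init(my_heap_init);
MODULE_LICENSE("GPL");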

Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/dma-buf/dma-heap.c | 2 ++
 kernel/dma/contiguous.c    | 1 +
 mm/cma.c                   | 5 +++++
 3 files changed, 8 insertions(+)

diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index afd22c9dbdcf..cc6339cbca09 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -189,6 +189,7 @@ void *dma_heap_get_drvdata(struct dma_heap *heap)
 {
 	return heap->priv;
 }
+EXPORT_SYMBOL_GPL(dma_heap_get_drvdata);
 
 struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
 {
@@ -272,6 +273,7 @@ struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
 	kfree(heap);
 	return err_ret;
 }
+EXPORT_SYMBOL_GPL(dma_heap_add);
 
 static char *dma_heap_devnode(struct device *dev, umode_t *mode)
 {
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 3d63d91cba5c..7e9777119b29 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -58,6 +58,7 @@
 #endif
 
 struct cma *dma_contiguous_default_area;
+EXPORT_SYMBOL_GPL(dma_contiguous_default_area);
 
 /*
  * Default global CMA area size can be defined in kernel's .config.
diff --git a/mm/cma.c b/mm/cma.c
index 4459045fa717..d39cb7066b9e 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -33,6 +33,7 @@
 #include <linux/io.h>
 #include <linux/kmemleak.h>
 #include <linux/swap.h>
+#include <linux/module.h>
 #include <trace/events/cma.h>
 
 #include "cma.h"
@@ -54,6 +55,7 @@ const char *cma_get_name(const struct cma *cma)
 {
 	return cma->name;
 }
+EXPORT_SYMBOL_GPL(cma_get_name);
 
 static unsigned long cma_bitmap_aligned_mask(const struct cma *cma,
 					     unsigned int align_order)
@@ -499,6 +501,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 	pr_debug("%s(): returned %p\n", __func__, page);
 	return page;
 }
+EXPORT_SYMBOL_GPL(cma_alloc);
 
 /*
  * cma_alloc_bulk() - allocate high order bulk pages from contiguous area with
@@ -609,6 +612,7 @@ int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cma_alloc_bulk);
 
 /**
  * cma_release() - release allocated pages
@@ -642,6 +646,7 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
 
 	return true;
 }
+EXPORT_SYMBOL_GPL(cma_release);
 
 int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data)
 {
-- 
2.29.2.454.gaff20da3a2-goog



* [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps
  2020-12-01 17:51 [PATCH v2 0/4] Chunk Heap Support on DMA-HEAP Minchan Kim
                   ` (2 preceding siblings ...)
  2020-12-01 17:51 ` [PATCH v2 3/4] dma-buf: add export symbol for dma-heap Minchan Kim
@ 2020-12-01 17:51 ` Minchan Kim
  2020-12-01 19:48   ` John Stultz
  2020-12-02 13:54   ` Christoph Hellwig
  3 siblings, 2 replies; 28+ messages in thread
From: Minchan Kim @ 2020-12-01 17:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, hyesoo.yu, willy, david, iamjoonsoo.kim, vbabka,
	surenb, pullip.cho, joaodias, hridya, sumit.semwal, john.stultz,
	Brian.Starkey, linux-media, devicetree, robh, christian.koenig,
	linaro-mm-sig, Minchan Kim

From: Hyesoo Yu <hyesoo.yu@samsung.com>

This patch adds a chunk heap that allocates buffers arranged as a list
of fixed-size chunks taken from CMA.

The chunk heap doesn't use the heap-helpers, even though they could
remove duplicated code, because the heap-helpers are in the process of
being deprecated.[1]

NOTE: This patch only uses the default CMA area to allocate chunk
pages. We will add other CMA memory regions to the dmabuf heaps
interface in a later patch (which requires a dt binding).

[1] https://lore.kernel.org/patchwork/patch/1336002
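
For reference (not part of this patch), the heap shows up under
/dev/dma_heap/ and can be allocated from with the standard dma-heap
ioctl; below is a minimal userspace sketch, assuming the default CMA
area is named "reserved":

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/dma-heap.h>

int main(void)
{
	struct dma_heap_allocation_data data = {
		.len = 32 << 20,		/* rounded up to a multiple of chunk_size */
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	/* node name follows the CMA name with this patch; adjust for your system */
	int heap_fd = open("/dev/dma_heap/reserved", O_RDONLY | O_CLOEXEC);

	if (heap_fd < 0 || ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data) < 0) {
		perror("chunk heap alloc");
		return 1;
	}
	printf("got dma-buf fd %d for %llu bytes\n", data.fd,
	       (unsigned long long)data.len);
	return 0;
}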

Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/dma-buf/heaps/Kconfig      |  15 +
 drivers/dma-buf/heaps/Makefile     |   1 +
 drivers/dma-buf/heaps/chunk_heap.c | 429 +++++++++++++++++++++++++++++
 3 files changed, 445 insertions(+)
 create mode 100644 drivers/dma-buf/heaps/chunk_heap.c

diff --git a/drivers/dma-buf/heaps/Kconfig b/drivers/dma-buf/heaps/Kconfig
index a5eef06c4226..9153f83afed7 100644
--- a/drivers/dma-buf/heaps/Kconfig
+++ b/drivers/dma-buf/heaps/Kconfig
@@ -12,3 +12,18 @@ config DMABUF_HEAPS_CMA
 	  Choose this option to enable dma-buf CMA heap. This heap is backed
 	  by the Contiguous Memory Allocator (CMA). If your system has these
 	  regions, you should say Y here.
+
+config DMABUF_HEAPS_CHUNK
+	tristate "DMA-BUF CHUNK Heap"
+	depends on DMABUF_HEAPS && DMA_CMA
+	help
+	  Choose this option to enable dma-buf CHUNK heap. This heap is backed
+	  by the Contiguous Memory Allocator (CMA) and allocates buffers that
+	  are arranged into a list of fixed-size chunks taken from CMA.
+
+config DMABUF_HEAPS_CHUNK_ORDER
+	int "Chunk page order for dmabuf chunk heap"
+	default 4
+	depends on DMABUF_HEAPS_CHUNK
+	help
+	  Set page order of fixed chunk size to allocate from CMA.
diff --git a/drivers/dma-buf/heaps/Makefile b/drivers/dma-buf/heaps/Makefile
index 974467791032..8faa6cfdc0c5 100644
--- a/drivers/dma-buf/heaps/Makefile
+++ b/drivers/dma-buf/heaps/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_DMABUF_HEAPS_SYSTEM)	+= system_heap.o
 obj-$(CONFIG_DMABUF_HEAPS_CMA)		+= cma_heap.o
+obj-$(CONFIG_DMABUF_HEAPS_CHUNK)	+= chunk_heap.o
diff --git a/drivers/dma-buf/heaps/chunk_heap.c b/drivers/dma-buf/heaps/chunk_heap.c
new file mode 100644
index 000000000000..0277707a93a9
--- /dev/null
+++ b/drivers/dma-buf/heaps/chunk_heap.c
@@ -0,0 +1,429 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DMA-BUF chunk heap exporter
+ *
+ * Copyright (c) 2020 Samsung Electronics Co., Ltd.
+ * Author: <hyesoo.yu@samsung.com> for Samsung Electronics.
+ */
+
+#include <linux/platform_device.h>
+#include <linux/cma.h>
+#include <linux/device.h>
+#include <linux/dma-buf.h>
+#include <linux/dma-heap.h>
+#include <linux/dma-map-ops.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/scatterlist.h>
+#include <linux/sched/signal.h>
+#include <linux/of_reserved_mem.h>
+#include <linux/of.h>
+
+struct chunk_heap {
+	struct dma_heap *heap;
+	unsigned int order;
+	struct cma *cma;
+};
+
+struct chunk_heap_buffer {
+	struct chunk_heap *heap;
+	struct list_head attachments;
+	struct mutex lock;
+	struct sg_table sg_table;
+	unsigned long len;
+	int vmap_cnt;
+	void *vaddr;
+};
+
+struct chunk_heap_attachment {
+	struct device *dev;
+	struct sg_table *table;
+	struct list_head list;
+	bool mapped;
+};
+
+static struct sg_table *dup_sg_table(struct sg_table *table)
+{
+	struct sg_table *new_table;
+	int ret, i;
+	struct scatterlist *sg, *new_sg;
+
+	new_table = kzalloc(sizeof(*new_table), GFP_KERNEL);
+	if (!new_table)
+		return ERR_PTR(-ENOMEM);
+
+	ret = sg_alloc_table(new_table, table->orig_nents, GFP_KERNEL);
+	if (ret) {
+		kfree(new_table);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	new_sg = new_table->sgl;
+	for_each_sgtable_sg(table, sg, i) {
+		sg_set_page(new_sg, sg_page(sg), sg->length, sg->offset);
+		new_sg = sg_next(new_sg);
+	}
+
+	return new_table;
+}
+
+static int chunk_heap_attach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment)
+{
+	struct chunk_heap_buffer *buffer = dmabuf->priv;
+	struct chunk_heap_attachment *a;
+	struct sg_table *table;
+
+	a = kzalloc(sizeof(*a), GFP_KERNEL);
+	if (!a)
+		return -ENOMEM;
+
+	table = dup_sg_table(&buffer->sg_table);
+	if (IS_ERR(table)) {
+		kfree(a);
+		return -ENOMEM;
+	}
+
+	a->table = table;
+	a->dev = attachment->dev;
+	INIT_LIST_HEAD(&a->list);
+	a->mapped = false;
+
+	attachment->priv = a;
+
+	mutex_lock(&buffer->lock);
+	list_add(&a->list, &buffer->attachments);
+	mutex_unlock(&buffer->lock);
+
+	return 0;
+}
+
+static void chunk_heap_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment)
+{
+	struct chunk_heap_buffer *buffer = dmabuf->priv;
+	struct chunk_heap_attachment *a = attachment->priv;
+
+	mutex_lock(&buffer->lock);
+	list_del(&a->list);
+	mutex_unlock(&buffer->lock);
+
+	sg_free_table(a->table);
+	kfree(a->table);
+	kfree(a);
+}
+
+static struct sg_table *chunk_heap_map_dma_buf(struct dma_buf_attachment *attachment,
+					       enum dma_data_direction direction)
+{
+	struct chunk_heap_attachment *a = attachment->priv;
+	struct sg_table *table = a->table;
+	int ret;
+
+	ret = dma_map_sgtable(attachment->dev, table, direction, 0);
+	if (ret)
+		return ERR_PTR(ret);
+
+	a->mapped = true;
+	return table;
+}
+
+static void chunk_heap_unmap_dma_buf(struct dma_buf_attachment *attachment,
+				     struct sg_table *table,
+				     enum dma_data_direction direction)
+{
+	struct chunk_heap_attachment *a = attachment->priv;
+
+	a->mapped = false;
+	dma_unmap_sgtable(attachment->dev, table, direction, 0);
+}
+
+static int chunk_heap_dma_buf_begin_cpu_access(struct dma_buf *dmabuf,
+						enum dma_data_direction direction)
+{
+	struct chunk_heap_buffer *buffer = dmabuf->priv;
+	struct chunk_heap_attachment *a;
+
+	mutex_lock(&buffer->lock);
+
+	if (buffer->vmap_cnt)
+		invalidate_kernel_vmap_range(buffer->vaddr, buffer->len);
+
+	list_for_each_entry(a, &buffer->attachments, list) {
+		if (!a->mapped)
+			continue;
+		dma_sync_sgtable_for_cpu(a->dev, a->table, direction);
+	}
+	mutex_unlock(&buffer->lock);
+
+	return 0;
+}
+
+static int chunk_heap_dma_buf_end_cpu_access(struct dma_buf *dmabuf,
+					      enum dma_data_direction direction)
+{
+	struct chunk_heap_buffer *buffer = dmabuf->priv;
+	struct chunk_heap_attachment *a;
+
+	mutex_lock(&buffer->lock);
+
+	if (buffer->vmap_cnt)
+		flush_kernel_vmap_range(buffer->vaddr, buffer->len);
+
+	list_for_each_entry(a, &buffer->attachments, list) {
+		if (!a->mapped)
+			continue;
+		dma_sync_sgtable_for_device(a->dev, a->table, direction);
+	}
+	mutex_unlock(&buffer->lock);
+
+	return 0;
+}
+
+static int chunk_heap_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
+{
+	struct chunk_heap_buffer *buffer = dmabuf->priv;
+	struct sg_table *table = &buffer->sg_table;
+	unsigned long addr = vma->vm_start;
+	struct sg_page_iter piter;
+	int ret;
+
+	for_each_sgtable_page(table, &piter, vma->vm_pgoff) {
+		struct page *page = sg_page_iter_page(&piter);
+
+		ret = remap_pfn_range(vma, addr, page_to_pfn(page), PAGE_SIZE,
+				      vma->vm_page_prot);
+		if (ret)
+			return ret;
+		addr += PAGE_SIZE;
+		if (addr >= vma->vm_end)
+			return 0;
+	}
+	return 0;
+}
+
+static void *chunk_heap_do_vmap(struct chunk_heap_buffer *buffer)
+{
+	struct sg_table *table = &buffer->sg_table;
+	int npages = PAGE_ALIGN(buffer->len) / PAGE_SIZE;
+	struct page **pages = vmalloc(sizeof(struct page *) * npages);
+	struct page **tmp = pages;
+	struct sg_page_iter piter;
+	void *vaddr;
+
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
+
+	for_each_sgtable_page(table, &piter, 0) {
+		WARN_ON(tmp - pages >= npages);
+		*tmp++ = sg_page_iter_page(&piter);
+	}
+
+	vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+	vfree(pages);
+
+	if (!vaddr)
+		return ERR_PTR(-ENOMEM);
+
+	return vaddr;
+}
+
+static int chunk_heap_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map)
+{
+	struct chunk_heap_buffer *buffer = dmabuf->priv;
+	int ret = 0;
+	void *vaddr;
+
+	mutex_lock(&buffer->lock);
+	if (buffer->vmap_cnt) {
+		vaddr = buffer->vaddr;
+		goto done;
+	}
+
+	vaddr = chunk_heap_do_vmap(buffer);
+	if (IS_ERR(vaddr)) {
+		ret = PTR_ERR(vaddr);
+		goto err;
+	}
+
+	buffer->vaddr = vaddr;
+done:
+	buffer->vmap_cnt++;
+	dma_buf_map_set_vaddr(map, vaddr);
+err:
+	mutex_unlock(&buffer->lock);
+
+	return ret;
+}
+
+static void chunk_heap_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map)
+{
+	struct chunk_heap_buffer *buffer = dmabuf->priv;
+
+	mutex_lock(&buffer->lock);
+	if (!--buffer->vmap_cnt) {
+		vunmap(buffer->vaddr);
+		buffer->vaddr = NULL;
+	}
+	mutex_unlock(&buffer->lock);
+}
+
+static void chunk_heap_dma_buf_release(struct dma_buf *dmabuf)
+{
+	struct chunk_heap_buffer *buffer = dmabuf->priv;
+	struct chunk_heap *chunk_heap = buffer->heap;
+	struct sg_table *table;
+	struct scatterlist *sg;
+	int i;
+
+	table = &buffer->sg_table;
+	for_each_sgtable_sg(table, sg, i)
+		cma_release(chunk_heap->cma, sg_page(sg), 1 << chunk_heap->order);
+	sg_free_table(table);
+	kfree(buffer);
+}
+
+static const struct dma_buf_ops chunk_heap_buf_ops = {
+	.attach = chunk_heap_attach,
+	.detach = chunk_heap_detach,
+	.map_dma_buf = chunk_heap_map_dma_buf,
+	.unmap_dma_buf = chunk_heap_unmap_dma_buf,
+	.begin_cpu_access = chunk_heap_dma_buf_begin_cpu_access,
+	.end_cpu_access = chunk_heap_dma_buf_end_cpu_access,
+	.mmap = chunk_heap_mmap,
+	.vmap = chunk_heap_vmap,
+	.vunmap = chunk_heap_vunmap,
+	.release = chunk_heap_dma_buf_release,
+};
+
+static int chunk_heap_allocate(struct dma_heap *heap, unsigned long len,
+			       unsigned long fd_flags, unsigned long heap_flags)
+{
+	struct chunk_heap *chunk_heap = dma_heap_get_drvdata(heap);
+	struct chunk_heap_buffer *buffer;
+	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+	struct dma_buf *dmabuf;
+	struct sg_table *table;
+	struct scatterlist *sg;
+	struct page **pages;
+	unsigned int chunk_size = PAGE_SIZE << chunk_heap->order;
+	unsigned int count, alloced = 0;
+	unsigned int num_retry = 5;
+	int ret = -ENOMEM;
+	pgoff_t pg;
+
+	buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
+	if (!buffer)
+		return ret;
+
+	INIT_LIST_HEAD(&buffer->attachments);
+	mutex_init(&buffer->lock);
+	buffer->heap = chunk_heap;
+	buffer->len = ALIGN(len, chunk_size);
+	count = buffer->len / chunk_size;
+
+	pages = kvmalloc_array(count, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		goto err_pages;
+
+	while (num_retry--) {
+		unsigned long nr_pages;
+
+		ret = cma_alloc_bulk(chunk_heap->cma, chunk_heap->order,
+				     num_retry ? true : false,
+				     chunk_heap->order, count - alloced,
+				     pages + alloced, &nr_pages);
+		alloced += nr_pages;
+		if (alloced == count)
+			break;
+		if (ret != -EBUSY)
+			break;
+
+	}
+	if (ret < 0)
+		goto err_alloc;
+
+	table = &buffer->sg_table;
+	if (sg_alloc_table(table, count, GFP_KERNEL))
+		goto err_alloc;
+
+	sg = table->sgl;
+	for (pg = 0; pg < count; pg++) {
+		sg_set_page(sg, pages[pg], chunk_size, 0);
+		sg = sg_next(sg);
+	}
+
+	exp_info.ops = &chunk_heap_buf_ops;
+	exp_info.size = buffer->len;
+	exp_info.flags = fd_flags;
+	exp_info.priv = buffer;
+	dmabuf = dma_buf_export(&exp_info);
+	if (IS_ERR(dmabuf)) {
+		ret = PTR_ERR(dmabuf);
+		goto err_export;
+	}
+	kvfree(pages);
+
+	ret = dma_buf_fd(dmabuf, fd_flags);
+	if (ret < 0) {
+		dma_buf_put(dmabuf);
+		return ret;
+	}
+
+	return 0;
+err_export:
+	sg_free_table(table);
+err_alloc:
+	for (pg = 0; pg < alloced; pg++)
+		cma_release(chunk_heap->cma, pages[pg], 1 << chunk_heap->order);
+	kvfree(pages);
+err_pages:
+	kfree(buffer);
+
+	return ret;
+}
+
+static const struct dma_heap_ops chunk_heap_ops = {
+	.allocate = chunk_heap_allocate,
+};
+
+#ifdef CONFIG_DMABUF_HEAPS_CHUNK_ORDER
+#define CHUNK_HEAP_ORDER (CONFIG_DMABUF_HEAPS_CHUNK_ORDER)
+#else
+#define CHUNK_HEAP_ORDER (0)
+#endif
+
+static int __init chunk_heap_init(void)
+{
+	struct cma *default_cma = dev_get_cma_area(NULL);
+	struct dma_heap_export_info exp_info;
+	struct chunk_heap *chunk_heap;
+
+	if (!default_cma)
+		return 0;
+
+	chunk_heap = kzalloc(sizeof(*chunk_heap), GFP_KERNEL);
+	if (!chunk_heap)
+		return -ENOMEM;
+
+	chunk_heap->order = CHUNK_HEAP_ORDER;
+	chunk_heap->cma = default_cma;
+
+	exp_info.name = cma_get_name(default_cma);
+	exp_info.ops = &chunk_heap_ops;
+	exp_info.priv = chunk_heap;
+
+	chunk_heap->heap = dma_heap_add(&exp_info);
+	if (IS_ERR(chunk_heap->heap)) {
+		int ret = PTR_ERR(chunk_heap->heap);
+
+		kfree(chunk_heap);
+		return ret;
+	}
+
+	return 0;
+}
+module_init(chunk_heap_init);
+MODULE_DESCRIPTION("DMA-BUF Chunk Heap");
+MODULE_LICENSE("GPL v2");
-- 
2.29.2.454.gaff20da3a2-goog



* Re: [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps
  2020-12-01 17:51 ` [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps Minchan Kim
@ 2020-12-01 19:48   ` John Stultz
  2020-12-01 22:55     ` Minchan Kim
  2020-12-02 13:54   ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread
From: John Stultz @ 2020-12-01 19:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Hyesoo Yu, Matthew Wilcox, david,
	iamjoonsoo.kim, vbabka, Suren Baghdasaryan, KyongHo Cho,
	John Dias, Hridya Valsaraju, Sumit Semwal, Brian Starkey,
	linux-media,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	Rob Herring, Christian Koenig,
	moderated list:DMA BUFFER SHARING FRAMEWORK

On Tue, Dec 1, 2020 at 9:51 AM Minchan Kim <minchan@kernel.org> wrote:

Thanks for reworking and resending this!

...
> +static int __init chunk_heap_init(void)
> +{
> +       struct cma *default_cma = dev_get_cma_area(NULL);
> +       struct dma_heap_export_info exp_info;
> +       struct chunk_heap *chunk_heap;
> +
> +       if (!default_cma)
> +               return 0;
> +
> +       chunk_heap = kzalloc(sizeof(*chunk_heap), GFP_KERNEL);
> +       if (!chunk_heap)
> +               return -ENOMEM;
> +
> +       chunk_heap->order = CHUNK_HEAP_ORDER;
> +       chunk_heap->cma = default_cma;
> +
> +       exp_info.name = cma_get_name(default_cma);

So, this would create a chunk heap name with the default CMA name,
which would be indistinguishable from the heap name used for the plain
CMA heap.

Probably a good idea to prefix it with "chunk-" so the heap device
names are unique?

thanks
-john


* Re: [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps
  2020-12-01 19:48   ` John Stultz
@ 2020-12-01 22:55     ` Minchan Kim
  2020-12-01 23:38       ` John Stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2020-12-01 22:55 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, LKML, linux-mm, Hyesoo Yu, Matthew Wilcox, david,
	iamjoonsoo.kim, vbabka, Suren Baghdasaryan, KyongHo Cho,
	John Dias, Hridya Valsaraju, Sumit Semwal, Brian Starkey,
	linux-media,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	Rob Herring, Christian Koenig,
	moderated list:DMA BUFFER SHARING FRAMEWORK

On Tue, Dec 01, 2020 at 11:48:15AM -0800, John Stultz wrote:
> On Tue, Dec 1, 2020 at 9:51 AM Minchan Kim <minchan@kernel.org> wrote:
> 
> Thanks for reworking and resending this!
> 
> ...
> > +static int __init chunk_heap_init(void)
> > +{
> > +       struct cma *default_cma = dev_get_cma_area(NULL);
> > +       struct dma_heap_export_info exp_info;
> > +       struct chunk_heap *chunk_heap;
> > +
> > +       if (!default_cma)
> > +               return 0;
> > +
> > +       chunk_heap = kzalloc(sizeof(*chunk_heap), GFP_KERNEL);
> > +       if (!chunk_heap)
> > +               return -ENOMEM;
> > +
> > +       chunk_heap->order = CHUNK_HEAP_ORDER;
> > +       chunk_heap->cma = default_cma;
> > +
> > +       exp_info.name = cma_get_name(default_cma);
> 
> So, this would create a chunk heap name with the default CMA name,
> which would be indistinguishable from the heap name used for the plain
> CMA heap.
> 
> Probably a good idea to prefix it with "chunk-" so the heap device
> names are unique?

That will give an impression to user that they are using different CMA
area but that's not true. IMHO, let's be honest at this moment.
When DT binding with CMA is landing down, it could provide unique name.
Thought?


* Re: [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps
  2020-12-01 22:55     ` Minchan Kim
@ 2020-12-01 23:38       ` John Stultz
  2020-12-02  0:13         ` Minchan Kim
  0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2020-12-01 23:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Hyesoo Yu, Matthew Wilcox, david,
	iamjoonsoo.kim, vbabka, Suren Baghdasaryan, KyongHo Cho,
	John Dias, Hridya Valsaraju, Sumit Semwal, Brian Starkey,
	linux-media,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	Rob Herring, Christian Koenig,
	moderated list:DMA BUFFER SHARING FRAMEWORK

On Tue, Dec 1, 2020 at 2:55 PM Minchan Kim <minchan@kernel.org> wrote:
> On Tue, Dec 01, 2020 at 11:48:15AM -0800, John Stultz wrote:
> > On Tue, Dec 1, 2020 at 9:51 AM Minchan Kim <minchan@kernel.org> wrote:
> >
> > Thanks for reworking and resending this!
> >
> > ...
> > > +static int __init chunk_heap_init(void)
> > > +{
> > > +       struct cma *default_cma = dev_get_cma_area(NULL);
> > > +       struct dma_heap_export_info exp_info;
> > > +       struct chunk_heap *chunk_heap;
> > > +
> > > +       if (!default_cma)
> > > +               return 0;
> > > +
> > > +       chunk_heap = kzalloc(sizeof(*chunk_heap), GFP_KERNEL);
> > > +       if (!chunk_heap)
> > > +               return -ENOMEM;
> > > +
> > > +       chunk_heap->order = CHUNK_HEAP_ORDER;
> > > +       chunk_heap->cma = default_cma;
> > > +
> > > +       exp_info.name = cma_get_name(default_cma);
> >
> > So, this would create a chunk heap name with the default CMA name,
> > which would be indistinguishable from the heap name used for the plain
> > CMA heap.
> >
> > Probably a good idea to prefix it with "chunk-" so the heap device
> > names are unique?
>
> That will give an impression to user that they are using different CMA
> area but that's not true. IMHO, let's be honest at this moment.

I disagree.  The dmabuf heaps provide an abstraction for allocating a
type of memory, and while your heap is pulling from CMA, you aren't
"just" allocating CMA as the existing CMA heap would suffice for that.

Since you need a slightly different method to allocate high order
pages in bulk, we really should have a unique way to name the
allocator interface. That's why I'd suggest the "chunk-" prefix to the
heap name.

thanks
-john


* Re: [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps
  2020-12-01 23:38       ` John Stultz
@ 2020-12-02  0:13         ` Minchan Kim
  2020-12-02  0:33           ` John Stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2020-12-02  0:13 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, LKML, linux-mm, Hyesoo Yu, Matthew Wilcox, david,
	iamjoonsoo.kim, vbabka, Suren Baghdasaryan, KyongHo Cho,
	John Dias, Hridya Valsaraju, Sumit Semwal, Brian Starkey,
	linux-media,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	Rob Herring, Christian Koenig,
	moderated list:DMA BUFFER SHARING FRAMEWORK

On Tue, Dec 01, 2020 at 03:38:14PM -0800, John Stultz wrote:
> On Tue, Dec 1, 2020 at 2:55 PM Minchan Kim <minchan@kernel.org> wrote:
> > On Tue, Dec 01, 2020 at 11:48:15AM -0800, John Stultz wrote:
> > > On Tue, Dec 1, 2020 at 9:51 AM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > Thanks for reworking and resending this!
> > >
> > > ...
> > > > +static int __init chunk_heap_init(void)
> > > > +{
> > > > +       struct cma *default_cma = dev_get_cma_area(NULL);
> > > > +       struct dma_heap_export_info exp_info;
> > > > +       struct chunk_heap *chunk_heap;
> > > > +
> > > > +       if (!default_cma)
> > > > +               return 0;
> > > > +
> > > > +       chunk_heap = kzalloc(sizeof(*chunk_heap), GFP_KERNEL);
> > > > +       if (!chunk_heap)
> > > > +               return -ENOMEM;
> > > > +
> > > > +       chunk_heap->order = CHUNK_HEAP_ORDER;
> > > > +       chunk_heap->cma = default_cma;
> > > > +
> > > > +       exp_info.name = cma_get_name(default_cma);
> > >
> > > So, this would create a chunk heap name with the default CMA name,
> > > which would be indistinguishable from the heap name used for the plain
> > > CMA heap.
> > >
> > > Probably a good idea to prefix it with "chunk-" so the heap device
> > > names are unique?
> >
> > That will give an impression to user that they are using different CMA
> > area but that's not true. IMHO, let's be honest at this moment.
> 
> I disagree.  The dmabuf heaps provide an abstraction for allocating a
> type of memory, and while your heap is pulling from CMA, you aren't
> "just" allocating CMA as the existing CMA heap would suffice for that.
> 
> Since you need a slightly different method to allocate high order
> pages in bulk, we really should have a unique way to name the
> allocator interface. That's why I'd suggest the "chunk-" prefix to the
> heap name.

Got it. How about this? 

diff --git a/drivers/dma-buf/heaps/chunk_heap.c b/drivers/dma-buf/heaps/chunk_heap.c
index 0277707a93a9..36e189d0b73d 100644
--- a/drivers/dma-buf/heaps/chunk_heap.c
+++ b/drivers/dma-buf/heaps/chunk_heap.c
@@ -410,7 +410,7 @@ static int __init chunk_heap_init(void)
        chunk_heap->order = CHUNK_HEAP_ORDER;
        chunk_heap->cma = default_cma;

-       exp_info.name = cma_get_name(default_cma);
+       exp_info.name = "cma-chunk-heap";
        exp_info.ops = &chunk_heap_ops;
        exp_info.priv = chunk_heap;


* Re: [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps
  2020-12-02  0:13         ` Minchan Kim
@ 2020-12-02  0:33           ` John Stultz
  2020-12-02  0:57             ` Minchan Kim
  0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2020-12-02  0:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Hyesoo Yu, Matthew Wilcox, david,
	iamjoonsoo.kim, vbabka, Suren Baghdasaryan, KyongHo Cho,
	John Dias, Hridya Valsaraju, Sumit Semwal, Brian Starkey,
	linux-media,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	Rob Herring, Christian Koenig,
	moderated list:DMA BUFFER SHARING FRAMEWORK

On Tue, Dec 1, 2020 at 4:13 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Tue, Dec 01, 2020 at 03:38:14PM -0800, John Stultz wrote:
> > On Tue, Dec 1, 2020 at 2:55 PM Minchan Kim <minchan@kernel.org> wrote:
> > > On Tue, Dec 01, 2020 at 11:48:15AM -0800, John Stultz wrote:
> > > > On Tue, Dec 1, 2020 at 9:51 AM Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > Thanks for reworking and resending this!
> > > >
> > > > ...
> > > > > +static int __init chunk_heap_init(void)
> > > > > +{
> > > > > +       struct cma *default_cma = dev_get_cma_area(NULL);
> > > > > +       struct dma_heap_export_info exp_info;
> > > > > +       struct chunk_heap *chunk_heap;
> > > > > +
> > > > > +       if (!default_cma)
> > > > > +               return 0;
> > > > > +
> > > > > +       chunk_heap = kzalloc(sizeof(*chunk_heap), GFP_KERNEL);
> > > > > +       if (!chunk_heap)
> > > > > +               return -ENOMEM;
> > > > > +
> > > > > +       chunk_heap->order = CHUNK_HEAP_ORDER;
> > > > > +       chunk_heap->cma = default_cma;
> > > > > +
> > > > > +       exp_info.name = cma_get_name(default_cma);
> > > >
> > > > So, this would create a chunk heap name with the default CMA name,
> > > > which would be indistinguishable from the heap name used for the plain
> > > > CMA heap.
> > > >
> > > > Probably a good idea to prefix it with "chunk-" so the heap device
> > > > names are unique?
> > >
> > > That will give an impression to user that they are using different CMA
> > > area but that's not true. IMHO, let's be honest at this moment.
> >
> > I disagree.  The dmabuf heaps provide an abstraction for allocating a
> > type of memory, and while your heap is pulling from CMA, you aren't
> > "just" allocating CMA as the existing CMA heap would suffice for that.
> >
> > Since you need a slightly different method to allocate high order
> > pages in bulk, we really should have a unique way to name the
> > allocator interface. That's why I'd suggest the "chunk-" prefix to the
> > heap name.
>
> Got it. How about this?
>
> diff --git a/drivers/dma-buf/heaps/chunk_heap.c b/drivers/dma-buf/heaps/chunk_heap.c
> index 0277707a93a9..36e189d0b73d 100644
> --- a/drivers/dma-buf/heaps/chunk_heap.c
> +++ b/drivers/dma-buf/heaps/chunk_heap.c
> @@ -410,7 +410,7 @@ static int __init chunk_heap_init(void)
>         chunk_heap->order = CHUNK_HEAP_ORDER;
>         chunk_heap->cma = default_cma;
>
> -       exp_info.name = cma_get_name(default_cma);
> +       exp_info.name = "cma-chunk-heap";

That's still a bit general for the default cma (which can be named
differently). I think including cma name is important, just adding the
chunk prefix might be best.

So something like
  sprintf(buf, "chunk-%s", cma_get_name(default_cma));
  exp_info.name = buf;

thanks
-john


* Re: [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps
  2020-12-02  0:33           ` John Stultz
@ 2020-12-02  0:57             ` Minchan Kim
  0 siblings, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2020-12-02  0:57 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, LKML, linux-mm, Hyesoo Yu, Matthew Wilcox, david,
	iamjoonsoo.kim, vbabka, Suren Baghdasaryan, KyongHo Cho,
	John Dias, Hridya Valsaraju, Sumit Semwal, Brian Starkey,
	linux-media,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	Rob Herring, Christian Koenig,
	moderated list:DMA BUFFER SHARING FRAMEWORK

On Tue, Dec 01, 2020 at 04:33:14PM -0800, John Stultz wrote:
> On Tue, Dec 1, 2020 at 4:13 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Tue, Dec 01, 2020 at 03:38:14PM -0800, John Stultz wrote:
> > > On Tue, Dec 1, 2020 at 2:55 PM Minchan Kim <minchan@kernel.org> wrote:
> > > > On Tue, Dec 01, 2020 at 11:48:15AM -0800, John Stultz wrote:
> > > > > On Tue, Dec 1, 2020 at 9:51 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > >
> > > > > Thanks for reworking and resending this!
> > > > >
> > > > > ...
> > > > > > +static int __init chunk_heap_init(void)
> > > > > > +{
> > > > > > +       struct cma *default_cma = dev_get_cma_area(NULL);
> > > > > > +       struct dma_heap_export_info exp_info;
> > > > > > +       struct chunk_heap *chunk_heap;
> > > > > > +
> > > > > > +       if (!default_cma)
> > > > > > +               return 0;
> > > > > > +
> > > > > > +       chunk_heap = kzalloc(sizeof(*chunk_heap), GFP_KERNEL);
> > > > > > +       if (!chunk_heap)
> > > > > > +               return -ENOMEM;
> > > > > > +
> > > > > > +       chunk_heap->order = CHUNK_HEAP_ORDER;
> > > > > > +       chunk_heap->cma = default_cma;
> > > > > > +
> > > > > > +       exp_info.name = cma_get_name(default_cma);
> > > > >
> > > > > So, this would create a chunk heap name with the default CMA name,
> > > > > which would be indistinguishable from the heap name used for the plain
> > > > > CMA heap.
> > > > >
> > > > > Probably a good idea to prefix it with "chunk-" so the heap device
> > > > > names are unique?
> > > >
> > > > That will give an impression to user that they are using different CMA
> > > > area but that's not true. IMHO, let's be honest at this moment.
> > >
> > > I disagree.  The dmabuf heaps provide an abstraction for allocating a
> > > type of memory, and while your heap is pulling from CMA, you aren't
> > > "just" allocating CMA as the existing CMA heap would suffice for that.
> > >
> > > Since you need a slightly different method to allocate high order
> > > pages in bulk, we really should have a unique way to name the
> > > allocator interface. That's why I'd suggest the "chunk-" prefix to the
> > > heap name.
> >
> > Got it. How about this?
> >
> > diff --git a/drivers/dma-buf/heaps/chunk_heap.c b/drivers/dma-buf/heaps/chunk_heap.c
> > index 0277707a93a9..36e189d0b73d 100644
> > --- a/drivers/dma-buf/heaps/chunk_heap.c
> > +++ b/drivers/dma-buf/heaps/chunk_heap.c
> > @@ -410,7 +410,7 @@ static int __init chunk_heap_init(void)
> >         chunk_heap->order = CHUNK_HEAP_ORDER;
> >         chunk_heap->cma = default_cma;
> >
> > -       exp_info.name = cma_get_name(default_cma);
> > +       exp_info.name = "cma-chunk-heap";
> 
> That's still a bit general for the default cma (which can be named
> differently). I think including cma name is important, just adding the
> chunk prefix might be best.
> 
> So something like
>   sprintf(buf, "chunk-%s", cma_get_name(default_cma));
>   exp_info.name = buf;

No problem. Will do that in the respin.
Other than that, could you give an Acked-by or Reviewed-by to save an
iteration?


* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-01 17:51 ` [PATCH v2 2/4] mm: introduce cma_alloc_bulk API Minchan Kim
@ 2020-12-02  9:14   ` David Hildenbrand
  2020-12-02 15:49     ` Michal Hocko
  2020-12-02 16:00     ` Minchan Kim
  0 siblings, 2 replies; 28+ messages in thread
From: David Hildenbrand @ 2020-12-02  9:14 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: LKML, linux-mm, hyesoo.yu, willy, iamjoonsoo.kim, vbabka, surenb,
	pullip.cho, joaodias, hridya, sumit.semwal, john.stultz,
	Brian.Starkey, linux-media, devicetree, robh, christian.koenig,
	linaro-mm-sig, Michal Hocko

On 01.12.20 18:51, Minchan Kim wrote:
> There is a need for special HW to require bulk allocation of
> high-order pages. For example, 4800 * order-4 pages, which
> would be minimum, sometimes, it requires more.
> 
> To meet the requirement, a option reserves 300M CMA area and
> requests the whole 300M contiguous memory. However, it doesn't
> work if even one of those pages in the range is long-term pinned
> directly or indirectly. The other option is to ask higher-order

My latest knowledge is that pages in the CMA area are never long term
pinned.

https://lore.kernel.org/lkml/20201123090129.GD27488@dhcp22.suse.cz/

"gup already tries to deal with long term pins on CMA regions and migrate
to a non CMA region. Have a look at __gup_longterm_locked."

We should rather identify ways how that is still possible and get rid of
them.


Now, short-term pinnings and PCP are other issues where
alloc_contig_range() could be improved (e.g., in contrast to a FAST
mode, a HARD mode which temporarily disables the PCP, ...).

> size (e.g., 2M) than requested order(64K) repeatedly until driver
> could gather necessary amount of memory. Basically, this approach
> makes the allocation very slow due to cma_alloc's function
> slowness and it could be stuck on one of the pageblocks if it
> encounters unmigratable page.
> 
> To solve the issue, this patch introduces cma_alloc_bulk.
> 
> 	int cma_alloc_bulk(struct cma *cma, unsigned int align,
> 		bool fast, unsigned int order, size_t nr_requests,
> 		struct page **page_array, size_t *nr_allocated);
> 
> Most parameters are same with cma_alloc but it additionally passes
> vector array to store allocated memory. What's different with cma_alloc
> is it will skip pageblocks without waiting/stopping if it has unmovable
> page so that API continues to scan other pageblocks to find requested
> order page.
> 
> cma_alloc_bulk is best effort approach in that it skips some pageblocks
> if they have unmovable pages unlike cma_alloc. It doesn't need to be
> perfect from the beginning at the cost of performance. Thus, the API
> takes "bool fast parameter" which is propagated into alloc_contig_range to
> avoid significat overhead functions to inrecase CMA allocation success
> ratio(e.g., migration retrial, PCP, LRU draining per pageblock)
> at the cost of less allocation success ratio. If the caller couldn't
> allocate enough, they could call it with "false" to increase success ratio
> if they are okay to expense the overhead for the success ratio.

Just so I understand what the idea is:

alloc_contig_range() sometimes fails on CMA regions when trying to
allocate big chunks (e.g., 300M). Instead of tackling that issue, you
rather allocate plenty of small chunks, and make these small allocations
fail faster/ make the allocations less reliable. Correct?

I don't really have a strong opinion on that. Giving up fast rather than
trying for longer sounds like a useful thing to have - but I wonder if
it's strictly necessary for the use case you describe.

I'd like to hear Michals opinion on that.

> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/cma.h |   5 ++
>  include/linux/gfp.h |   2 +
>  mm/cma.c            | 126 ++++++++++++++++++++++++++++++++++++++++++--
>  mm/page_alloc.c     |  19 ++++---
>  4 files changed, 140 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/cma.h b/include/linux/cma.h
> index 217999c8a762..7375d3131804 100644
> --- a/include/linux/cma.h
> +++ b/include/linux/cma.h
> @@ -46,6 +46,11 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
>  					struct cma **res_cma);
>  extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  			      bool no_warn);
> +
> +extern int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
> +			unsigned int order, size_t nr_requests,
> +			struct page **page_array, size_t *nr_allocated);
> +
>  extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
>  
>  extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index ad5872699692..75bfb673d75b 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -627,6 +627,8 @@ static inline bool pm_suspended_storage(void)
>  enum alloc_contig_mode {
>  	/* try several ways to increase success ratio of memory allocation */
>  	ALLOC_CONTIG_NORMAL,
> +	/* avoid costly functions to make the call fast */
> +	ALLOC_CONTIG_FAST,
>  };
>  
>  /* The below functions must be run on a range from a single zone. */
> diff --git a/mm/cma.c b/mm/cma.c
> index 8010c1ba04b0..4459045fa717 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -32,6 +32,7 @@
>  #include <linux/highmem.h>
>  #include <linux/io.h>
>  #include <linux/kmemleak.h>
> +#include <linux/swap.h>
>  #include <trace/events/cma.h>
>  
>  #include "cma.h"
> @@ -397,6 +398,14 @@ static void cma_debug_show_areas(struct cma *cma)
>  static inline void cma_debug_show_areas(struct cma *cma) { }
>  #endif
>  
> +static void reset_page_kasan_tag(struct page *page, int count)
> +{
> +	int i;
> +
> +	for (i = 0; i < count; i++)
> +		page_kasan_tag_reset(page + i);
> +}
> +
>  /**
>   * cma_alloc() - allocate pages from contiguous area
>   * @cma:   Contiguous memory region for which the allocation is performed.
> @@ -414,7 +423,6 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	unsigned long pfn = -1;
>  	unsigned long start = 0;
>  	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> -	size_t i;
>  	struct page *page = NULL;
>  	int ret = -ENOMEM;
>  
> @@ -479,10 +487,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	 * blocks being marked with different tags. Reset the tags to ignore
>  	 * those page blocks.
>  	 */
> -	if (page) {
> -		for (i = 0; i < count; i++)
> -			page_kasan_tag_reset(page + i);
> -	}
> +	if (page)
> +		reset_page_kasan_tag(page, count);
>  
>  	if (ret && !no_warn) {
>  		pr_err("%s: alloc failed, req-size: %zu pages, ret: %d\n",
> @@ -494,6 +500,116 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	return page;
>  }
>  
> +/*
> + * cma_alloc_bulk() - allocate high order bulk pages from contiguous area with
> + * 		best effort. It will usually be used for private @cma
> + *
> + * @cma:	contiguous memory region for which the allocation is performed.
> + * @align:	requested alignment of pages (in PAGE_SIZE order).
> + * @fast:	will skip costly opeartions if it's true.
> + * @order:	requested page order
> + * @nr_requests: the number of 2^order pages requested to be allocated as input,
> + * @page_array:	page_array pointer to store allocated pages (must have space
> + *		for at least nr_requests)
> + * @nr_allocated: the number of 2^order pages allocated as output
> + *
> + * This function tries to allocate up to @nr_requests @order pages on specific
> + * contiguous memory area. If @fast has true, it will avoid costly functions
> + * to increase allocation success ratio so it will be faster but might return
> + * less than requested number of pages. User could retry it with true if it is
> + * needed.
> + *
> + * Return: it will return 0 only if all pages requested by @nr_requestsed are
> + * allocated. Otherwise, it returns negative error code.
> + *
> + * Note: Regardless of success/failure, user should check @nr_allocated to see
> + * how many @order pages are allocated and free those pages when they are not
> + * needed.
> + */
> +int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
> +			unsigned int order, size_t nr_requests,
> +			struct page **page_array, size_t *nr_allocated)
> +{
> +	int ret = 0;
> +	size_t i = 0;
> +	unsigned long nr_pages_needed = nr_requests * (1 << order);
> +	unsigned long nr_chunk_pages, nr_pages;
> +	unsigned long mask, offset;
> +	unsigned long pfn = -1;
> +	unsigned long start = 0;
> +	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> +	struct page *page = NULL;
> +	enum alloc_contig_mode mode = fast ? ALLOC_CONTIG_FAST :
> +						ALLOC_CONTIG_NORMAL;
> +	*nr_allocated = 0;
> +	if (!cma || !cma->count || !cma->bitmap || !page_array)
> +		return -EINVAL;
> +
> +	if (!nr_pages_needed)
> +		return 0;
> +
> +	nr_chunk_pages = 1 << max_t(unsigned int, order, pageblock_order);
> +
> +	mask = cma_bitmap_aligned_mask(cma, align);
> +	offset = cma_bitmap_aligned_offset(cma, align);
> +	bitmap_maxno = cma_bitmap_maxno(cma);
> +
> +	lru_add_drain_all();
> +	drain_all_pages(NULL);
> +
> +	while (nr_pages_needed) {
> +		nr_pages = min(nr_chunk_pages, nr_pages_needed);
> +
> +		bitmap_count = cma_bitmap_pages_to_bits(cma, nr_pages);
> +		mutex_lock(&cma->lock);
> +		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
> +				bitmap_maxno, start, bitmap_count, mask,
> +				offset);
> +		if (bitmap_no >= bitmap_maxno) {
> +			mutex_unlock(&cma->lock);
> +			break;
> +		}
> +		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> +		/*
> +		 * It's safe to drop the lock here. If the migration fails
> +		 * cma_clear_bitmap will take the lock again and unmark it.
> +		 */
> +		mutex_unlock(&cma->lock);
> +
> +		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
> +		ret = alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_CMA,
> +						GFP_KERNEL|__GFP_NOWARN, mode);
> +		if (ret) {
> +			cma_clear_bitmap(cma, pfn, nr_pages);
> +			if (ret != -EBUSY)
> +				break;
> +
> +			/* continue to search next block */
> +			start = (pfn + nr_pages - cma->base_pfn) >>
> +						cma->order_per_bit;
> +			continue;
> +		}
> +
> +		page = pfn_to_page(pfn);
> +		while (nr_pages) {
> +			page_array[i++] = page;
> +			reset_page_kasan_tag(page, 1 << order);
> +			page += 1 << order;
> +			nr_pages -= 1 << order;
> +			nr_pages_needed -= 1 << order;
> +		}
> +
> +		start = bitmap_no + bitmap_count;
> +	}
> +
> +	*nr_allocated = i;
> +
> +	if (!ret && nr_pages_needed)
> +		ret = -EBUSY;
> +
> +	return ret;
> +}
> +
>  /**
>   * cma_release() - release allocated pages
>   * @cma:   Contiguous memory region for which the allocation is performed.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index adfbfd95fbc3..2a1799ff14fc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8463,7 +8463,8 @@ static unsigned long pfn_max_align_up(unsigned long pfn)
>  
>  /* [start, end) must belong to a single zone. */
>  static int __alloc_contig_migrate_range(struct compact_control *cc,
> -					unsigned long start, unsigned long end)
> +					unsigned long start, unsigned long end,
> +					unsigned int max_tries)
>  {
>  	/* This function is based on compact_zone() from compaction.c. */
>  	unsigned int nr_reclaimed;
> @@ -8491,7 +8492,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
>  				break;
>  			}
>  			tries = 0;
> -		} else if (++tries == 5) {
> +		} else if (++tries == max_tries) {
>  			ret = ret < 0 ? ret : -EBUSY;
>  			break;
>  		}
> @@ -8553,6 +8554,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	unsigned long outer_start, outer_end;
>  	unsigned int order;
>  	int ret = 0;
> +	bool fast_mode = mode == ALLOC_CONTIG_FAST;
>  
>  	struct compact_control cc = {
>  		.nr_migratepages = 0,
> @@ -8595,7 +8597,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	if (ret)
>  		return ret;
>  
> -	drain_all_pages(cc.zone);
> +	if (!fast_mode)
> +		drain_all_pages(cc.zone);
>  
>  	/*
>  	 * In case of -EBUSY, we'd like to know which page causes problem.
> @@ -8607,7 +8610,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	 * allocated.  So, if we fall through be sure to clear ret so that
>  	 * -EBUSY is not accidentally used or returned to caller.
>  	 */
> -	ret = __alloc_contig_migrate_range(&cc, start, end);
> +	ret = __alloc_contig_migrate_range(&cc, start, end, fast_mode ? 1 : 5);
>  	if (ret && ret != -EBUSY)
>  		goto done;
>  	ret =0;
> @@ -8629,7 +8632,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	 * isolated thus they won't get removed from buddy.
>  	 */
>  
> -	lru_add_drain_all();
> +	if (!fast_mode)
> +		lru_add_drain_all();
>  
>  	order = 0;
>  	outer_start = start;
> @@ -8656,8 +8660,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  
>  	/* Make sure the range is really isolated. */
>  	if (test_pages_isolated(outer_start, end, 0)) {
> -		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
> -			__func__, outer_start, end);
> +		if (!fast_mode)
> +			pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
> +				__func__, outer_start, end);
>  		ret = -EBUSY;
>  		goto done;
>  	}
> 


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/4] dma-buf: add export symbol for dma-heap
  2020-12-01 17:51 ` [PATCH v2 3/4] dma-buf: add export symbol for dma-heap Minchan Kim
@ 2020-12-02 13:51   ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2020-12-02 13:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, hyesoo.yu, willy, david,
	iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias, hridya,
	sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Tue, Dec 01, 2020 at 09:51:43AM -0800, Minchan Kim wrote:
> From: Hyesoo Yu <hyesoo.yu@samsung.com>
> 
> The heaps could be added as modules, so some functions should
> be exported to register dma-heaps. And a dma-heap module can use
> a CMA area to allocate and free memory. However, the CMA-related
> functions are not exported now. Let's export them for the next patches.
> 
> Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

NAK.  I really don't think any module has business gaining that
kind of low-level access to the various symbols.  I think in general
your heap can just be built-in and we should be fine.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps
  2020-12-01 17:51 ` [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps Minchan Kim
  2020-12-01 19:48   ` John Stultz
@ 2020-12-02 13:54   ` Christoph Hellwig
  1 sibling, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2020-12-02 13:54 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, hyesoo.yu, willy, david,
	iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias, hridya,
	sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Tue, Dec 01, 2020 at 09:51:44AM -0800, Minchan Kim wrote:
> From: Hyesoo Yu <hyesoo.yu@samsung.com>
> 
> This patch supports the chunk heap, which allocates buffers that are
> arranged into a list of fixed-size chunks taken from CMA.
> 
> The chunk heap doesn't use the heap-helpers, although they could remove
> duplicated code, since the heap-helpers are in the process of being
> deprecated. [1]
> 
> NOTE: This patch only adds the default CMA heap to allocate chunk
> pages. We will add other CMA memory regions to the dmabuf heaps
> interface with a later patch (which requires a dt binding).

This new heap seems to largely duplicate the existing cma_heap.c
file.  Why can't you reuse the code and allow creating different
heaps with different chunk sizes or max numbers of segments?

> +config DMABUF_HEAPS_CHUNK_ORDER
> +	int "Chunk page order for dmabuf chunk heap"
> +	default 4
> +	depends on DMABUF_HEAPS_CHUNK
> +	help
> +	  Set page order of fixed chunk size to allocate from CMA.

Using a config option for this is just broken.  It needs to be runtime
or at the very least boot time / DT controllable.
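
As a rough sketch of the boot-time alternative (this is not what the posted
patch does), the order could be a module parameter instead of a Kconfig
constant, e.g.:

#include <linux/module.h>
#include <linux/moduleparam.h>

/* sketch: runtime/boot-time control of the chunk size */
static unsigned int chunk_order = 4;
module_param(chunk_order, uint, 0444);
MODULE_PARM_DESC(chunk_order,
		 "Page order of the fixed chunk size allocated from CMA");

For built-in code that would still be settable on the kernel command line
(something like chunk_heap.chunk_order=<n>); a DT property would be the
other option mentioned above.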

> + * ION Memory Allocator chunk heap exporter

This comment seems wrong.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02  9:14   ` David Hildenbrand
@ 2020-12-02 15:49     ` Michal Hocko
  2020-12-02 16:00       ` David Hildenbrand
  2020-12-02 16:15       ` Minchan Kim
  2020-12-02 16:00     ` Minchan Kim
  1 sibling, 2 replies; 28+ messages in thread
From: Michal Hocko @ 2020-12-02 15:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, hyesoo.yu, willy,
	iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias, hridya,
	sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Wed 02-12-20 10:14:41, David Hildenbrand wrote:
> On 01.12.20 18:51, Minchan Kim wrote:
> > There is a need for special HW to require bulk allocation of
> > high-order pages. For example, 4800 * order-4 pages, which
> > would be minimum, sometimes, it requires more.
> > 
> > To meet the requirement, a option reserves 300M CMA area and
> > requests the whole 300M contiguous memory. However, it doesn't
> > work if even one of those pages in the range is long-term pinned
> > directly or indirectly. The other option is to ask higher-order
> 
> My latest knowledge is that pages in the CMA area are never long term
> pinned.
> 
> https://lore.kernel.org/lkml/20201123090129.GD27488@dhcp22.suse.cz/
> 
> "gup already tries to deal with long term pins on CMA regions and migrate
> to a non CMA region. Have a look at __gup_longterm_locked."
> 
> We should rather identify ways how that is still possible and get rid of
> them.
> 
> 
> Now, short-term pinnings and PCP are other issues where
> alloc_contig_range() could be improved (e.g., in contrast to a FAST
> mode, a HARD mode which temporarily disables the PCP, ...).

Agreed!

> > size (e.g., 2M) than requested order(64K) repeatedly until driver
> > could gather necessary amount of memory. Basically, this approach
> > makes the allocation very slow due to cma_alloc's function
> > slowness and it could be stuck on one of the pageblocks if it
> > encounters unmigratable page.
> > 
> > To solve the issue, this patch introduces cma_alloc_bulk.
> > 
> > 	int cma_alloc_bulk(struct cma *cma, unsigned int align,
> > 		bool fast, unsigned int order, size_t nr_requests,
> > 		struct page **page_array, size_t *nr_allocated);
> > 
> > Most parameters are same with cma_alloc but it additionally passes
> > vector array to store allocated memory. What's different with cma_alloc
> > is it will skip pageblocks without waiting/stopping if it has unmovable
> > page so that API continues to scan other pageblocks to find requested
> > order page.
> > 
> > cma_alloc_bulk is best effort approach in that it skips some pageblocks
> > if they have unmovable pages unlike cma_alloc. It doesn't need to be
> > perfect from the beginning at the cost of performance. Thus, the API
> > takes "bool fast parameter" which is propagated into alloc_contig_range to
> > avoid significat overhead functions to inrecase CMA allocation success
> > ratio(e.g., migration retrial, PCP, LRU draining per pageblock)
> > at the cost of less allocation success ratio. If the caller couldn't
> > allocate enough, they could call it with "false" to increase success ratio
> > if they are okay to expense the overhead for the success ratio.
> 
> Just so I understand what the idea is:
> 
> alloc_contig_range() sometimes fails on CMA regions when trying to
> allocate big chunks (e.g., 300M). Instead of tackling that issue, you
> rather allocate plenty of small chunks, and make these small allocations
> fail faster/ make the allocations less reliable. Correct?
> 
> I don't really have a strong opinion on that. Giving up fast rather than
> trying for longer sounds like a useful thing to have - but I wonder if
> it's strictly necessary for the use case you describe.
> 
> I'd like to hear Michals opinion on that.

Well, what I can see is that this new interface is an antipattern to our
allocation routines. We tend to control allocations by gfp mask, yet you
are introducing a bool parameter to make something faster... What that
really means is rather arbitrary. Would it make more sense to teach
cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp.
GFP_RETRY_MAYFAIL instead?
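
A minimal sketch of what that could look like (the helper name is made up
here; only the gfp bits are real):

#include <linux/gfp.h>

/* sketch: derive the retry policy from the gfp mask instead of a new mode */
static bool acr_should_try_hard(gfp_t gfp_mask)
{
	/*
	 * __GFP_NORETRY callers accept a quick failure, so the costly
	 * steps (drain_all_pages(), lru_add_drain_all(), extra
	 * migration retries) could simply be skipped for them.
	 */
	return !(gfp_mask & __GFP_NORETRY);
}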

I am not deeply familiar with the cma allocator, so sorry for a
potentially stupid question. Why does a bulk interface perform better
than repeated calls to cma_alloc? Is this because a failure would help
it move on to the next pfn range, while a repeated call would have to
deal with the same range?

> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/cma.h |   5 ++
> >  include/linux/gfp.h |   2 +
> >  mm/cma.c            | 126 ++++++++++++++++++++++++++++++++++++++++++--
> >  mm/page_alloc.c     |  19 ++++---
> >  4 files changed, 140 insertions(+), 12 deletions(-)
> > 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02  9:14   ` David Hildenbrand
  2020-12-02 15:49     ` Michal Hocko
@ 2020-12-02 16:00     ` Minchan Kim
  1 sibling, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2020-12-02 16:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, LKML, linux-mm, hyesoo.yu, willy, iamjoonsoo.kim,
	vbabka, surenb, pullip.cho, joaodias, hridya, sumit.semwal,
	john.stultz, Brian.Starkey, linux-media, devicetree, robh,
	christian.koenig, linaro-mm-sig, Michal Hocko

On Wed, Dec 02, 2020 at 10:14:41AM +0100, David Hildenbrand wrote:
> On 01.12.20 18:51, Minchan Kim wrote:
> > There is a need for special HW to require bulk allocation of
> > high-order pages. For example, 4800 * order-4 pages, which
> > would be minimum, sometimes, it requires more.
> > 
> > To meet the requirement, a option reserves 300M CMA area and
> > requests the whole 300M contiguous memory. However, it doesn't
> > work if even one of those pages in the range is long-term pinned
> > directly or indirectly. The other option is to ask higher-order
> 
> My latest knowledge is that pages in the CMA area are never long term
> pinned.
> 
> https://lore.kernel.org/lkml/20201123090129.GD27488@dhcp22.suse.cz/
> 
> "gup already tries to deal with long term pins on CMA regions and migrate
> to a non CMA region. Have a look at __gup_longterm_locked."
> 
> We should rather identify ways how that is still possible and get rid of
> them.
> 
> 
> Now, short-term pinnings and PCP are other issues where
> alloc_contig_range() could be improved (e.g., in contrast to a FAST
> mode, a HARD mode which temporarily disables the PCP, ...).

GUP is just one of the cases that makes CMA fail directly, but there are
still a bunch of cases that do so indirectly - a short-term pin turning into
a long-term one for several reasons (CPU scheduling, IO scheduling, memory
pressure, locking dependencies).

In the end, those pages would eventually be released, but that is much less
reliable and slower, so the idea is to just skip such a page and continue to
search the other pages in the CMA area, because what we need is just several
high-order pages, not that big 300M chunk.

> 
> > size (e.g., 2M) than requested order(64K) repeatedly until driver
> > could gather necessary amount of memory. Basically, this approach
> > makes the allocation very slow due to cma_alloc's function
> > slowness and it could be stuck on one of the pageblocks if it
> > encounters unmigratable page.
> > 
> > To solve the issue, this patch introduces cma_alloc_bulk.
> > 
> > 	int cma_alloc_bulk(struct cma *cma, unsigned int align,
> > 		bool fast, unsigned int order, size_t nr_requests,
> > 		struct page **page_array, size_t *nr_allocated);
> > 
> > Most parameters are same with cma_alloc but it additionally passes
> > vector array to store allocated memory. What's different with cma_alloc
> > is it will skip pageblocks without waiting/stopping if it has unmovable
> > page so that API continues to scan other pageblocks to find requested
> > order page.
> > 
> > cma_alloc_bulk is best effort approach in that it skips some pageblocks
> > if they have unmovable pages unlike cma_alloc. It doesn't need to be
> > perfect from the beginning at the cost of performance. Thus, the API
> > takes "bool fast parameter" which is propagated into alloc_contig_range to
> > avoid significat overhead functions to inrecase CMA allocation success
> > ratio(e.g., migration retrial, PCP, LRU draining per pageblock)
> > at the cost of less allocation success ratio. If the caller couldn't
> > allocate enough, they could call it with "false" to increase success ratio
> > if they are okay to expense the overhead for the success ratio.
> 
> Just so I understand what the idea is:
> 
> alloc_contig_range() sometimes fails on CMA regions when trying to
> allocate big chunks (e.g., 300M). Instead of tackling that issue, you
> rather allocate plenty of small chunks, and make these small allocations
> fail faster/ make the allocations less reliable. Correct?

Please see above.

> 
> I don't really have a strong opinion on that. Giving up fast rather than
> trying for longer sounds like a useful thing to have - but I wonder if
> it's strictly necessary for the use case you describe.
> 
> I'd like to hear Michals opinion on that.
> 
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/cma.h |   5 ++
> >  include/linux/gfp.h |   2 +
> >  mm/cma.c            | 126 ++++++++++++++++++++++++++++++++++++++++++--
> >  mm/page_alloc.c     |  19 ++++---
> >  4 files changed, 140 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/linux/cma.h b/include/linux/cma.h
> > index 217999c8a762..7375d3131804 100644
> > --- a/include/linux/cma.h
> > +++ b/include/linux/cma.h
> > @@ -46,6 +46,11 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> >  					struct cma **res_cma);
> >  extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >  			      bool no_warn);
> > +
> > +extern int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
> > +			unsigned int order, size_t nr_requests,
> > +			struct page **page_array, size_t *nr_allocated);
> > +
> >  extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
> >  
> >  extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index ad5872699692..75bfb673d75b 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -627,6 +627,8 @@ static inline bool pm_suspended_storage(void)
> >  enum alloc_contig_mode {
> >  	/* try several ways to increase success ratio of memory allocation */
> >  	ALLOC_CONTIG_NORMAL,
> > +	/* avoid costly functions to make the call fast */
> > +	ALLOC_CONTIG_FAST,
> >  };
> >  
> >  /* The below functions must be run on a range from a single zone. */
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 8010c1ba04b0..4459045fa717 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/highmem.h>
> >  #include <linux/io.h>
> >  #include <linux/kmemleak.h>
> > +#include <linux/swap.h>
> >  #include <trace/events/cma.h>
> >  
> >  #include "cma.h"
> > @@ -397,6 +398,14 @@ static void cma_debug_show_areas(struct cma *cma)
> >  static inline void cma_debug_show_areas(struct cma *cma) { }
> >  #endif
> >  
> > +static void reset_page_kasan_tag(struct page *page, int count)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < count; i++)
> > +		page_kasan_tag_reset(page + i);
> > +}
> > +
> >  /**
> >   * cma_alloc() - allocate pages from contiguous area
> >   * @cma:   Contiguous memory region for which the allocation is performed.
> > @@ -414,7 +423,6 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >  	unsigned long pfn = -1;
> >  	unsigned long start = 0;
> >  	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> > -	size_t i;
> >  	struct page *page = NULL;
> >  	int ret = -ENOMEM;
> >  
> > @@ -479,10 +487,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >  	 * blocks being marked with different tags. Reset the tags to ignore
> >  	 * those page blocks.
> >  	 */
> > -	if (page) {
> > -		for (i = 0; i < count; i++)
> > -			page_kasan_tag_reset(page + i);
> > -	}
> > +	if (page)
> > +		reset_page_kasan_tag(page, count);
> >  
> >  	if (ret && !no_warn) {
> >  		pr_err("%s: alloc failed, req-size: %zu pages, ret: %d\n",
> > @@ -494,6 +500,116 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >  	return page;
> >  }
> >  
> > +/*
> > + * cma_alloc_bulk() - allocate high order bulk pages from contiguous area with
> > + * 		best effort. It will usually be used for private @cma
> > + *
> > + * @cma:	contiguous memory region for which the allocation is performed.
> > + * @align:	requested alignment of pages (in PAGE_SIZE order).
> > + * @fast:	will skip costly opeartions if it's true.
> > + * @order:	requested page order
> > + * @nr_requests: the number of 2^order pages requested to be allocated as input,
> > + * @page_array:	page_array pointer to store allocated pages (must have space
> > + *		for at least nr_requests)
> > + * @nr_allocated: the number of 2^order pages allocated as output
> > + *
> > + * This function tries to allocate up to @nr_requests @order pages on specific
> > + * contiguous memory area. If @fast has true, it will avoid costly functions
> > + * to increase allocation success ratio so it will be faster but might return
> > + * less than requested number of pages. User could retry it with true if it is
> > + * needed.
> > + *
> > + * Return: it will return 0 only if all pages requested by @nr_requestsed are
> > + * allocated. Otherwise, it returns negative error code.
> > + *
> > + * Note: Regardless of success/failure, user should check @nr_allocated to see
> > + * how many @order pages are allocated and free those pages when they are not
> > + * needed.
> > + */
> > +int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
> > +			unsigned int order, size_t nr_requests,
> > +			struct page **page_array, size_t *nr_allocated)
> > +{
> > +	int ret = 0;
> > +	size_t i = 0;
> > +	unsigned long nr_pages_needed = nr_requests * (1 << order);
> > +	unsigned long nr_chunk_pages, nr_pages;
> > +	unsigned long mask, offset;
> > +	unsigned long pfn = -1;
> > +	unsigned long start = 0;
> > +	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> > +	struct page *page = NULL;
> > +	enum alloc_contig_mode mode = fast ? ALLOC_CONTIG_FAST :
> > +						ALLOC_CONTIG_NORMAL;
> > +	*nr_allocated = 0;
> > +	if (!cma || !cma->count || !cma->bitmap || !page_array)
> > +		return -EINVAL;
> > +
> > +	if (!nr_pages_needed)
> > +		return 0;
> > +
> > +	nr_chunk_pages = 1 << max_t(unsigned int, order, pageblock_order);
> > +
> > +	mask = cma_bitmap_aligned_mask(cma, align);
> > +	offset = cma_bitmap_aligned_offset(cma, align);
> > +	bitmap_maxno = cma_bitmap_maxno(cma);
> > +
> > +	lru_add_drain_all();
> > +	drain_all_pages(NULL);
> > +
> > +	while (nr_pages_needed) {
> > +		nr_pages = min(nr_chunk_pages, nr_pages_needed);
> > +
> > +		bitmap_count = cma_bitmap_pages_to_bits(cma, nr_pages);
> > +		mutex_lock(&cma->lock);
> > +		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
> > +				bitmap_maxno, start, bitmap_count, mask,
> > +				offset);
> > +		if (bitmap_no >= bitmap_maxno) {
> > +			mutex_unlock(&cma->lock);
> > +			break;
> > +		}
> > +		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> > +		/*
> > +		 * It's safe to drop the lock here. If the migration fails
> > +		 * cma_clear_bitmap will take the lock again and unmark it.
> > +		 */
> > +		mutex_unlock(&cma->lock);
> > +
> > +		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
> > +		ret = alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_CMA,
> > +						GFP_KERNEL|__GFP_NOWARN, mode);
> > +		if (ret) {
> > +			cma_clear_bitmap(cma, pfn, nr_pages);
> > +			if (ret != -EBUSY)
> > +				break;
> > +
> > +			/* continue to search next block */
> > +			start = (pfn + nr_pages - cma->base_pfn) >>
> > +						cma->order_per_bit;
> > +			continue;
> > +		}
> > +
> > +		page = pfn_to_page(pfn);
> > +		while (nr_pages) {
> > +			page_array[i++] = page;
> > +			reset_page_kasan_tag(page, 1 << order);
> > +			page += 1 << order;
> > +			nr_pages -= 1 << order;
> > +			nr_pages_needed -= 1 << order;
> > +		}
> > +
> > +		start = bitmap_no + bitmap_count;
> > +	}
> > +
> > +	*nr_allocated = i;
> > +
> > +	if (!ret && nr_pages_needed)
> > +		ret = -EBUSY;
> > +
> > +	return ret;
> > +}
> > +
> >  /**
> >   * cma_release() - release allocated pages
> >   * @cma:   Contiguous memory region for which the allocation is performed.
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index adfbfd95fbc3..2a1799ff14fc 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -8463,7 +8463,8 @@ static unsigned long pfn_max_align_up(unsigned long pfn)
> >  
> >  /* [start, end) must belong to a single zone. */
> >  static int __alloc_contig_migrate_range(struct compact_control *cc,
> > -					unsigned long start, unsigned long end)
> > +					unsigned long start, unsigned long end,
> > +					unsigned int max_tries)
> >  {
> >  	/* This function is based on compact_zone() from compaction.c. */
> >  	unsigned int nr_reclaimed;
> > @@ -8491,7 +8492,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
> >  				break;
> >  			}
> >  			tries = 0;
> > -		} else if (++tries == 5) {
> > +		} else if (++tries == max_tries) {
> >  			ret = ret < 0 ? ret : -EBUSY;
> >  			break;
> >  		}
> > @@ -8553,6 +8554,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> >  	unsigned long outer_start, outer_end;
> >  	unsigned int order;
> >  	int ret = 0;
> > +	bool fast_mode = mode == ALLOC_CONTIG_FAST;
> >  
> >  	struct compact_control cc = {
> >  		.nr_migratepages = 0,
> > @@ -8595,7 +8597,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> >  	if (ret)
> >  		return ret;
> >  
> > -	drain_all_pages(cc.zone);
> > +	if (!fast_mode)
> > +		drain_all_pages(cc.zone);
> >  
> >  	/*
> >  	 * In case of -EBUSY, we'd like to know which page causes problem.
> > @@ -8607,7 +8610,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> >  	 * allocated.  So, if we fall through be sure to clear ret so that
> >  	 * -EBUSY is not accidentally used or returned to caller.
> >  	 */
> > -	ret = __alloc_contig_migrate_range(&cc, start, end);
> > +	ret = __alloc_contig_migrate_range(&cc, start, end, fast_mode ? 1 : 5);
> >  	if (ret && ret != -EBUSY)
> >  		goto done;
> >  	ret =0;
> > @@ -8629,7 +8632,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> >  	 * isolated thus they won't get removed from buddy.
> >  	 */
> >  
> > -	lru_add_drain_all();
> > +	if (!fast_mode)
> > +		lru_add_drain_all();
> >  
> >  	order = 0;
> >  	outer_start = start;
> > @@ -8656,8 +8660,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> >  
> >  	/* Make sure the range is really isolated. */
> >  	if (test_pages_isolated(outer_start, end, 0)) {
> > -		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
> > -			__func__, outer_start, end);
> > +		if (!fast_mode)
> > +			pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
> > +				__func__, outer_start, end);
> >  		ret = -EBUSY;
> >  		goto done;
> >  	}
> > 
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02 15:49     ` Michal Hocko
@ 2020-12-02 16:00       ` David Hildenbrand
  2020-12-02 16:15       ` Minchan Kim
  1 sibling, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2020-12-02 16:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, hyesoo.yu, willy,
	iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias, hridya,
	sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On 02.12.20 16:49, Michal Hocko wrote:
> On Wed 02-12-20 10:14:41, David Hildenbrand wrote:
>> On 01.12.20 18:51, Minchan Kim wrote:
>>> There is a need for special HW to require bulk allocation of
>>> high-order pages. For example, 4800 * order-4 pages, which
>>> would be minimum, sometimes, it requires more.
>>>
>>> To meet the requirement, a option reserves 300M CMA area and
>>> requests the whole 300M contiguous memory. However, it doesn't
>>> work if even one of those pages in the range is long-term pinned
>>> directly or indirectly. The other option is to ask higher-order
>>
>> My latest knowledge is that pages in the CMA area are never long term
>> pinned.
>>
>> https://lore.kernel.org/lkml/20201123090129.GD27488@dhcp22.suse.cz/
>>
>> "gup already tries to deal with long term pins on CMA regions and migrate
>> to a non CMA region. Have a look at __gup_longterm_locked."
>>
>> We should rather identify ways how that is still possible and get rid of
>> them.
>>
>>
>> Now, short-term pinnings and PCP are other issues where
>> alloc_contig_range() could be improved (e.g., in contrast to a FAST
>> mode, a HARD mode which temporarily disables the PCP, ...).
> 
> Agreed!
> 
>>> size (e.g., 2M) than requested order(64K) repeatedly until driver
>>> could gather necessary amount of memory. Basically, this approach
>>> makes the allocation very slow due to cma_alloc's function
>>> slowness and it could be stuck on one of the pageblocks if it
>>> encounters unmigratable page.
>>>
>>> To solve the issue, this patch introduces cma_alloc_bulk.
>>>
>>> 	int cma_alloc_bulk(struct cma *cma, unsigned int align,
>>> 		bool fast, unsigned int order, size_t nr_requests,
>>> 		struct page **page_array, size_t *nr_allocated);
>>>
>>> Most parameters are same with cma_alloc but it additionally passes
>>> vector array to store allocated memory. What's different with cma_alloc
>>> is it will skip pageblocks without waiting/stopping if it has unmovable
>>> page so that API continues to scan other pageblocks to find requested
>>> order page.
>>>
>>> cma_alloc_bulk is best effort approach in that it skips some pageblocks
>>> if they have unmovable pages unlike cma_alloc. It doesn't need to be
>>> perfect from the beginning at the cost of performance. Thus, the API
>>> takes "bool fast parameter" which is propagated into alloc_contig_range to
>>> avoid significat overhead functions to inrecase CMA allocation success
>>> ratio(e.g., migration retrial, PCP, LRU draining per pageblock)
>>> at the cost of less allocation success ratio. If the caller couldn't
>>> allocate enough, they could call it with "false" to increase success ratio
>>> if they are okay to expense the overhead for the success ratio.
>>
>> Just so I understand what the idea is:
>>
>> alloc_contig_range() sometimes fails on CMA regions when trying to
>> allocate big chunks (e.g., 300M). Instead of tackling that issue, you
>> rather allocate plenty of small chunks, and make these small allocations
>> fail faster/ make the allocations less reliable. Correct?
>>
>> I don't really have a strong opinion on that. Giving up fast rather than
>> trying for longer sounds like a useful thing to have - but I wonder if
>> it's strictly necessary for the use case you describe.
>>
>> I'd like to hear Michals opinion on that.
> 
> Well, what I can see is that this new interface is an antipatern to our
> allocation routines. We tend to control allocations by gfp mask yet you
> are introducing a bool parameter to make something faster... What that
> really means is rather arbitrary. Would it make more sense to teach
> cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp.
> GFP_RETRY_MAYFAIL instead?

Minchan did that before, but I disliked gluing things like "don't drain
lru, don't drain pcp" to GFP_NORETRY and shifting responsibility to the
user.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02 15:49     ` Michal Hocko
  2020-12-02 16:00       ` David Hildenbrand
@ 2020-12-02 16:15       ` Minchan Kim
  2020-12-02 16:48         ` Michal Hocko
  1 sibling, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2020-12-02 16:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Andrew Morton, LKML, linux-mm, hyesoo.yu,
	willy, iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias,
	hridya, sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Wed, Dec 02, 2020 at 04:49:15PM +0100, Michal Hocko wrote:
> On Wed 02-12-20 10:14:41, David Hildenbrand wrote:
> > On 01.12.20 18:51, Minchan Kim wrote:
> > > There is a need for special HW to require bulk allocation of
> > > high-order pages. For example, 4800 * order-4 pages, which
> > > would be minimum, sometimes, it requires more.
> > > 
> > > To meet the requirement, a option reserves 300M CMA area and
> > > requests the whole 300M contiguous memory. However, it doesn't
> > > work if even one of those pages in the range is long-term pinned
> > > directly or indirectly. The other option is to ask higher-order
> > 
> > My latest knowledge is that pages in the CMA area are never long term
> > pinned.
> > 
> > https://lore.kernel.org/lkml/20201123090129.GD27488@dhcp22.suse.cz/
> > 
> > "gup already tries to deal with long term pins on CMA regions and migrate
> > to a non CMA region. Have a look at __gup_longterm_locked."
> > 
> > We should rather identify ways how that is still possible and get rid of
> > them.
> > 
> > 
> > Now, short-term pinnings and PCP are other issues where
> > alloc_contig_range() could be improved (e.g., in contrast to a FAST
> > mode, a HARD mode which temporarily disables the PCP, ...).
> 
> Agreed!
> 
> > > size (e.g., 2M) than requested order(64K) repeatedly until driver
> > > could gather necessary amount of memory. Basically, this approach
> > > makes the allocation very slow due to cma_alloc's function
> > > slowness and it could be stuck on one of the pageblocks if it
> > > encounters unmigratable page.
> > > 
> > > To solve the issue, this patch introduces cma_alloc_bulk.
> > > 
> > > 	int cma_alloc_bulk(struct cma *cma, unsigned int align,
> > > 		bool fast, unsigned int order, size_t nr_requests,
> > > 		struct page **page_array, size_t *nr_allocated);
> > > 
> > > Most parameters are same with cma_alloc but it additionally passes
> > > vector array to store allocated memory. What's different with cma_alloc
> > > is it will skip pageblocks without waiting/stopping if it has unmovable
> > > page so that API continues to scan other pageblocks to find requested
> > > order page.
> > > 
> > > cma_alloc_bulk is best effort approach in that it skips some pageblocks
> > > if they have unmovable pages unlike cma_alloc. It doesn't need to be
> > > perfect from the beginning at the cost of performance. Thus, the API
> > > takes "bool fast parameter" which is propagated into alloc_contig_range to
> > > avoid significat overhead functions to inrecase CMA allocation success
> > > ratio(e.g., migration retrial, PCP, LRU draining per pageblock)
> > > at the cost of less allocation success ratio. If the caller couldn't
> > > allocate enough, they could call it with "false" to increase success ratio
> > > if they are okay to expense the overhead for the success ratio.
> > 
> > Just so I understand what the idea is:
> > 
> > alloc_contig_range() sometimes fails on CMA regions when trying to
> > allocate big chunks (e.g., 300M). Instead of tackling that issue, you
> > rather allocate plenty of small chunks, and make these small allocations
> > fail faster/ make the allocations less reliable. Correct?
> > 
> > I don't really have a strong opinion on that. Giving up fast rather than
> > trying for longer sounds like a useful thing to have - but I wonder if
> > it's strictly necessary for the use case you describe.
> > 
> > I'd like to hear Michals opinion on that.
> 
> Well, what I can see is that this new interface is an antipatern to our
> allocation routines. We tend to control allocations by gfp mask yet you
> are introducing a bool parameter to make something faster... What that
> really means is rather arbitrary. Would it make more sense to teach
> cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp.
> GFP_RETRY_MAYFAIL instead?

If we use cma_alloc, that interface requires "allocate one big memory
chunk". IOW, the return value is just a struct page, and it is expected
that the page is one big contiguous chunk of memory. That means it couldn't
have a hole in the range.
However, the idea here is that what we ask for is much smaller chunks
rather than one big contiguous memory, so we could skip some pages if they
are randomly pinned (long-term/short-term, whatever) and search other pages
in the CMA area to avoid a long stall. Thus, it couldn't work with the
existing cma_alloc API with a simple gfp_mask.
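
To make that concrete, a driver-side sketch of the intended usage, assuming
the cma_alloc_bulk() signature posted in this patch (the function name and
the order-4 numbers are illustrative, taken from the cover letter example):

#include <linux/cma.h>
#include <linux/mm.h>

/* sketch: gather nr order-4 chunks for the HW, tolerating holes in CMA */
static int gather_chunks(struct cma *cma, struct page **pages, size_t nr)
{
	size_t got = 0, more = 0;
	int err;

	err = cma_alloc_bulk(cma, 4, true /* fast */, 4, nr, pages, &got);
	if (err && got < nr)
		/* retry the remainder without "fast" to raise the success ratio */
		err = cma_alloc_bulk(cma, 4, false, 4, nr - got,
				     pages + got, &more);
	/* in total, the first got + more entries of pages[] are valid */
	return err;
}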

> 
> I am not deeply familiar with the cma allocator so sorry for a
> potentially stupid question. Why does a bulk interface performs better
> than repeated calls to cma_alloc? Is this because a failure would help
> to move on to the next pfn range while a repeated call would have to
> deal with the same range?

Yub, true, along with other overheads (e.g., migration retries, waiting for
writeback, PCP/LRU draining IPIs).

> 
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  include/linux/cma.h |   5 ++
> > >  include/linux/gfp.h |   2 +
> > >  mm/cma.c            | 126 ++++++++++++++++++++++++++++++++++++++++++--
> > >  mm/page_alloc.c     |  19 ++++---
> > >  4 files changed, 140 insertions(+), 12 deletions(-)
> > > 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02 16:15       ` Minchan Kim
@ 2020-12-02 16:48         ` Michal Hocko
  2020-12-02 17:54           ` Minchan Kim
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2020-12-02 16:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Hildenbrand, Andrew Morton, LKML, linux-mm, hyesoo.yu,
	willy, iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias,
	hridya, sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Wed 02-12-20 08:15:49, Minchan Kim wrote:
> On Wed, Dec 02, 2020 at 04:49:15PM +0100, Michal Hocko wrote:
[...]
> > Well, what I can see is that this new interface is an antipatern to our
> > allocation routines. We tend to control allocations by gfp mask yet you
> > are introducing a bool parameter to make something faster... What that
> > really means is rather arbitrary. Would it make more sense to teach
> > cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp.
> > GFP_RETRY_MAYFAIL instead?
> 
> If we use cma_alloc, that interface requires "allocate one big memory
> chunk". IOW, return value is just struct page and expected that the page
> is a big contiguos memory. That means it couldn't have a hole in the
> range.
> However the idea here, what we asked is much smaller chunk rather
> than a big contiguous memory so we could skip some of pages if they are
> randomly pinned(long-term/short-term whatever) and search other pages
> in the CMA area to avoid long stall. Thus, it couldn't work with exising
> cma_alloc API with simple gfp_mak.

I really do not see that as something really alien to the cma_alloc
interface. All you should care about, really, is what size of object
you want and how hard the system should try. If you have a problem with
the internal implementation of CMA and how it chooses a range and deals
with pinned pages, then it should be addressed inside the CMA allocator.
I suspect that you are effectively trying to work around those problems
with a side implementation with a slightly different API. Or maybe I still
do not follow the actual problem.
 
> > I am not deeply familiar with the cma allocator so sorry for a
> > potentially stupid question. Why does a bulk interface performs better
> > than repeated calls to cma_alloc? Is this because a failure would help
> > to move on to the next pfn range while a repeated call would have to
> > deal with the same range?
> 
> Yub, true with other overheads(e.g., migration retrial, waiting writeback
> PCP/LRU draining IPI)

Why can't this be implemented in the cma_alloc layer? I mean, you can
cache failed cases and optimize the proper pfn range search.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02 16:48         ` Michal Hocko
@ 2020-12-02 17:54           ` Minchan Kim
  2020-12-02 18:51             ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2020-12-02 17:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Andrew Morton, LKML, linux-mm, hyesoo.yu,
	willy, iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias,
	hridya, sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Wed, Dec 02, 2020 at 05:48:34PM +0100, Michal Hocko wrote:
> On Wed 02-12-20 08:15:49, Minchan Kim wrote:
> > On Wed, Dec 02, 2020 at 04:49:15PM +0100, Michal Hocko wrote:
> [...]
> > > Well, what I can see is that this new interface is an antipatern to our
> > > allocation routines. We tend to control allocations by gfp mask yet you
> > > are introducing a bool parameter to make something faster... What that
> > > really means is rather arbitrary. Would it make more sense to teach
> > > cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp.
> > > GFP_RETRY_MAYFAIL instead?
> > 
> > If we use cma_alloc, that interface requires "allocate one big memory
> > chunk". IOW, return value is just struct page and expected that the page
> > is a big contiguos memory. That means it couldn't have a hole in the
> > range.
> > However the idea here, what we asked is much smaller chunk rather
> > than a big contiguous memory so we could skip some of pages if they are
> > randomly pinned(long-term/short-term whatever) and search other pages
> > in the CMA area to avoid long stall. Thus, it couldn't work with exising
> > cma_alloc API with simple gfp_mak.
> 
> I really do not see that as something really alient to the cma_alloc
> interface. All you should care about, really, is what size of the object
> you want and how hard the system should try. If you have a problem with
> an internal implementation of CMA and how it chooses a range and deal
> with pinned pages then it should be addressed inside the CMA allocator.
> I suspect that you are effectivelly trying to workaround those problems
> by a side implementation with a slightly different API. Or maybe I still
> do not follow the actual problem.
>  
> > > I am not deeply familiar with the cma allocator so sorry for a
> > > potentially stupid question. Why does a bulk interface performs better
> > > than repeated calls to cma_alloc? Is this because a failure would help
> > > to move on to the next pfn range while a repeated call would have to
> > > deal with the same range?
> > 
> > Yub, true with other overheads(e.g., migration retrial, waiting writeback
> > PCP/LRU draining IPI)
> 
> Why cannot this be implemented in the cma_alloc layer? I mean you can
> cache failed cases and optimize the proper pfn range search.

So do you suggest this?

enum cma_alloc_mode {
	CMA_ALLOC_NORMAL,
	CMA_ALLOC_FAIL_FAST,
};

struct page *cma_alloc(struct cma *cma, size_t count, unsigned int
	align, enum cma_alloc_mode mode);

From now on, cma_alloc will keep the last failed pfn and then start the
search from the next pfn, for both CMA_ALLOC_NORMAL and
CMA_ALLOC_FAIL_FAST, as long as the requested size still fits in the CMA
area starting from the cached pfn, wrapping around if it couldn't find the
right pages from the cached pfn. Otherwise, the cached pfn will be reset to
zero so that the search starts from 0 again. I like the idea since it's a
general improvement, I think.
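
A rough sketch of the wraparound part of that search (the "cached" position
stands for a hypothetical new field in struct cma, not something that exists
today):

#include <linux/bitmap.h>

/*
 * sketch: look for a free range starting from the position where the last
 * attempt failed, and fall back to a search from 0 if nothing is found.
 */
static unsigned long cma_find_free_area(unsigned long *bitmap,
					unsigned long bitmap_maxno,
					unsigned long cached_no,
					unsigned long bitmap_count,
					unsigned long mask,
					unsigned long offset)
{
	unsigned long no;

	no = bitmap_find_next_zero_area_off(bitmap, bitmap_maxno, cached_no,
					    bitmap_count, mask, offset);
	if (no >= bitmap_maxno && cached_no)
		/* wrap around once and retry from the beginning */
		no = bitmap_find_next_zero_area_off(bitmap, bitmap_maxno, 0,
						    bitmap_count, mask, offset);
	return no;
}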

Furthermore, with CMA_ALLOC_FAIL_FAST, it could avoid several overheads
at the cost of sacrificing the allocation success ratio, like GFP_NORETRY.

I think that would solve the issue while making the API more flexible.
Before diving into it, I'd like to confirm we are on the same page.
Please correct me if I misunderstood.

David, any objection?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02 17:54           ` Minchan Kim
@ 2020-12-02 18:51             ` Michal Hocko
  2020-12-02 19:26               ` Minchan Kim
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2020-12-02 18:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Hildenbrand, Andrew Morton, LKML, linux-mm, hyesoo.yu,
	willy, iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias,
	hridya, sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Wed 02-12-20 09:54:29, Minchan Kim wrote:
> On Wed, Dec 02, 2020 at 05:48:34PM +0100, Michal Hocko wrote:
> > On Wed 02-12-20 08:15:49, Minchan Kim wrote:
> > > On Wed, Dec 02, 2020 at 04:49:15PM +0100, Michal Hocko wrote:
> > [...]
> > > > Well, what I can see is that this new interface is an antipatern to our
> > > > allocation routines. We tend to control allocations by gfp mask yet you
> > > > are introducing a bool parameter to make something faster... What that
> > > > really means is rather arbitrary. Would it make more sense to teach
> > > > cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp.
> > > > GFP_RETRY_MAYFAIL instead?
> > > 
> > > If we use cma_alloc, that interface requires "allocate one big memory
> > > chunk". IOW, return value is just struct page and expected that the page
> > > is a big contiguos memory. That means it couldn't have a hole in the
> > > range.
> > > However the idea here, what we asked is much smaller chunk rather
> > > than a big contiguous memory so we could skip some of pages if they are
> > > randomly pinned(long-term/short-term whatever) and search other pages
> > > in the CMA area to avoid long stall. Thus, it couldn't work with exising
> > > cma_alloc API with simple gfp_mak.
> > 
> > I really do not see that as something really alient to the cma_alloc
> > interface. All you should care about, really, is what size of the object
> > you want and how hard the system should try. If you have a problem with
> > an internal implementation of CMA and how it chooses a range and deal
> > with pinned pages then it should be addressed inside the CMA allocator.
> > I suspect that you are effectivelly trying to workaround those problems
> > by a side implementation with a slightly different API. Or maybe I still
> > do not follow the actual problem.
> >  
> > > > I am not deeply familiar with the cma allocator so sorry for a
> > > > potentially stupid question. Why does a bulk interface performs better
> > > > than repeated calls to cma_alloc? Is this because a failure would help
> > > > to move on to the next pfn range while a repeated call would have to
> > > > deal with the same range?
> > > 
> > > Yub, true with other overheads(e.g., migration retrial, waiting writeback
> > > PCP/LRU draining IPI)
> > 
> > Why cannot this be implemented in the cma_alloc layer? I mean you can
> > cache failed cases and optimize the proper pfn range search.
> 
> So do you suggest this?
> 
> enum cma_alloc_mode {
> 	CMA_ALLOC_NORMAL,
> 	CMA_ALLOC_FAIL_FAST,
> };
> 
> struct page *cma_alloc(struct cma *cma, size_t count, unsigned int
> 	align, enum cma_alloc_mode mode);
> 
> From now on, cma_alloc will keep last failed pfn and then start to
> search from the next pfn for both CMA_ALLOC_NORMAL and
> CMA_ALLOC_FAIL_FAST if requested size from the cached pfn is okay
> within CMA area and then wraparound it couldn't find right pages
> from the cached pfn. Othewise, the cached pfn will reset to the zero
> so that it starts the search from the 0. I like the idea since it's
> general improvement, I think.

Yes, something like that. There are more options to be clever here - e.g.,
track ranges, etc. - but I am not sure this is worth the complexity.

> Furthemore, With CMA_ALLOC_FAIL_FAST, it could avoid several overheads
> at the cost of sacrificing allocation success ratio like GFP_NORETRY.

I am still not sure a specific flag is a good interface. Can this really
be gfp_mask instead?

> I think that would solve the issue with making the API more flexible.
> Before diving into it, I'd like to confirm we are on same page.
> Please correct me if I misunderstood.

I am not sure you are still thinking about a bulk interface.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02 18:51             ` Michal Hocko
@ 2020-12-02 19:26               ` Minchan Kim
  2020-12-02 20:22                 ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2020-12-02 19:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Andrew Morton, LKML, linux-mm, hyesoo.yu,
	willy, iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias,
	hridya, sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Wed, Dec 02, 2020 at 07:51:07PM +0100, Michal Hocko wrote:
> On Wed 02-12-20 09:54:29, Minchan Kim wrote:
> > On Wed, Dec 02, 2020 at 05:48:34PM +0100, Michal Hocko wrote:
> > > On Wed 02-12-20 08:15:49, Minchan Kim wrote:
> > > > On Wed, Dec 02, 2020 at 04:49:15PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > Well, what I can see is that this new interface is an antipatern to our
> > > > > allocation routines. We tend to control allocations by gfp mask yet you
> > > > > are introducing a bool parameter to make something faster... What that
> > > > > really means is rather arbitrary. Would it make more sense to teach
> > > > > cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp.
> > > > > GFP_RETRY_MAYFAIL instead?
> > > > 
> > > > If we use cma_alloc, that interface requires "allocate one big memory
> > > > chunk". IOW, return value is just struct page and expected that the page
> > > > is a big contiguos memory. That means it couldn't have a hole in the
> > > > range.
> > > > However the idea here, what we asked is much smaller chunk rather
> > > > than a big contiguous memory so we could skip some of pages if they are
> > > > randomly pinned(long-term/short-term whatever) and search other pages
> > > > in the CMA area to avoid long stall. Thus, it couldn't work with exising
> > > > cma_alloc API with simple gfp_mak.
> > > 
> > > I really do not see that as something really alient to the cma_alloc
> > > interface. All you should care about, really, is what size of the object
> > > you want and how hard the system should try. If you have a problem with
> > > an internal implementation of CMA and how it chooses a range and deal
> > > with pinned pages then it should be addressed inside the CMA allocator.
> > > I suspect that you are effectivelly trying to workaround those problems
> > > by a side implementation with a slightly different API. Or maybe I still
> > > do not follow the actual problem.
> > >  
> > > > > I am not deeply familiar with the cma allocator so sorry for a
> > > > > potentially stupid question. Why does a bulk interface performs better
> > > > > than repeated calls to cma_alloc? Is this because a failure would help
> > > > > to move on to the next pfn range while a repeated call would have to
> > > > > deal with the same range?
> > > > 
> > > > Yub, true with other overheads(e.g., migration retrial, waiting writeback
> > > > PCP/LRU draining IPI)
> > > 
> > > Why cannot this be implemented in the cma_alloc layer? I mean you can
> > > cache failed cases and optimize the proper pfn range search.
> > 
> > So do you suggest this?
> > 
> > enum cma_alloc_mode {
> > 	CMA_ALLOC_NORMAL,
> > 	CMA_ALLOC_FAIL_FAST,
> > };
> > 
> > struct page *cma_alloc(struct cma *cma, size_t count, unsigned int
> > 	align, enum cma_alloc_mode mode);
> > 
> > From now on, cma_alloc will keep last failed pfn and then start to
> > search from the next pfn for both CMA_ALLOC_NORMAL and
> > CMA_ALLOC_FAIL_FAST if requested size from the cached pfn is okay
> > within CMA area and then wraparound it couldn't find right pages
> > from the cached pfn. Othewise, the cached pfn will reset to the zero
> > so that it starts the search from the 0. I like the idea since it's
> > general improvement, I think.
> 
> Yes something like that. There are more options to be clever here - e.g.
> track ranges etc. but I am not sure this is worth the complexity.

Agreed. Just caching the last pfn would be good enough as a simple start.

> 
> > Furthemore, With CMA_ALLOC_FAIL_FAST, it could avoid several overheads
> > at the cost of sacrificing allocation success ratio like GFP_NORETRY.
> 
> I am still not sure a specific flag is a good interface. Really can this
> be gfp_mask instead?

I don't feel strongly about it (in fact, I originally did it with
GFP_NORETRY), but David wanted a special mode and I agreed when he
mentioned ALLOC_CONTIG_HARD as one of the future options (it would be
hard to express that mode with gfp flags).

> 
> > I think that would solve the issue with making the API more flexible.
> > Before diving into it, I'd like to confirm we are on same page.
> > Please correct me if I misunderstood.
> 
> I am not sure you are still thinking about a bulk interface.

No, I am thinking of just using the cma_alloc API with the cached pfn
as an internal improvement and adding a new fail-fast mode to the API
so drivers could call it repeatedly until they can get enough pages.
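
For illustration, a rough self-contained model of the cached-pfn search
order being discussed (scan from the cached position, wrap around, and
reset the cache only after a full wrap fails). All of the names here
(struct cma_area, cached_offset, range_is_free(), the fail_fast flag)
are invented for illustration; this is not the kernel's cma_alloc()
implementation:

#include <stdbool.h>
#include <stddef.h>

struct cma_area {
	unsigned long base_pfn;		/* first pfn of the CMA area */
	unsigned long nr_pages;		/* size of the area in pages */
	unsigned long cached_offset;	/* offset (in pages) to try next */
};

/* Stand-in for the real bitmap check + isolation + migration. */
static bool range_is_free(struct cma_area *cma, unsigned long pfn,
			  size_t count, bool fail_fast)
{
	(void)cma; (void)pfn; (void)count; (void)fail_fast;
	return true;	/* the model only demonstrates the search order */
}

/* Returns the first pfn of a free range of @count pages, or 0 on failure. */
static unsigned long cma_find_range(struct cma_area *cma, size_t count,
				    bool fail_fast)
{
	unsigned long start = cma->base_pfn + cma->cached_offset;
	unsigned long end = cma->base_pfn + cma->nr_pages;
	unsigned long pfn;

	/* First pass: from the cached position towards the end of the area. */
	for (pfn = start; pfn + count <= end; pfn += count)
		if (range_is_free(cma, pfn, count, fail_fast))
			goto found;

	/* Wrap around: retry from the beginning up to the cached position. */
	for (pfn = cma->base_pfn; pfn + count <= start; pfn += count)
		if (range_is_free(cma, pfn, count, fail_fast))
			goto found;

	cma->cached_offset = 0;	/* full wrap failed: start over next time */
	return 0;
found:
	cma->cached_offset = pfn + count - cma->base_pfn;
	return pfn;
}

A fail-fast caller would simply pass fail_fast=true so that
range_is_free() gives up instead of draining and retrying.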

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02 19:26               ` Minchan Kim
@ 2020-12-02 20:22                 ` David Hildenbrand
  2020-12-02 20:48                   ` Minchan Kim
  2020-12-03  8:28                   ` Michal Hocko
  0 siblings, 2 replies; 28+ messages in thread
From: David Hildenbrand @ 2020-12-02 20:22 UTC (permalink / raw)
  To: Minchan Kim, Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, hyesoo.yu, willy, iamjoonsoo.kim,
	vbabka, surenb, pullip.cho, joaodias, hridya, sumit.semwal,
	john.stultz, Brian.Starkey, linux-media, devicetree, robh,
	christian.koenig, linaro-mm-sig

On 02.12.20 20:26, Minchan Kim wrote:
> On Wed, Dec 02, 2020 at 07:51:07PM +0100, Michal Hocko wrote:
>> On Wed 02-12-20 09:54:29, Minchan Kim wrote:
>>> On Wed, Dec 02, 2020 at 05:48:34PM +0100, Michal Hocko wrote:
>>>> On Wed 02-12-20 08:15:49, Minchan Kim wrote:
>>>>> On Wed, Dec 02, 2020 at 04:49:15PM +0100, Michal Hocko wrote:
>>>> [...]
>>>>>> Well, what I can see is that this new interface is an antipatern to our
>>>>>> allocation routines. We tend to control allocations by gfp mask yet you
>>>>>> are introducing a bool parameter to make something faster... What that
>>>>>> really means is rather arbitrary. Would it make more sense to teach
>>>>>> cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp.
>>>>>> GFP_RETRY_MAYFAIL instead?
>>>>>
>>>>> If we use cma_alloc, that interface requires "allocate one big memory
>>>>> chunk". IOW, return value is just struct page and expected that the page
>>>>> is a big contiguos memory. That means it couldn't have a hole in the
>>>>> range.
>>>>> However the idea here, what we asked is much smaller chunk rather
>>>>> than a big contiguous memory so we could skip some of pages if they are
>>>>> randomly pinned(long-term/short-term whatever) and search other pages
>>>>> in the CMA area to avoid long stall. Thus, it couldn't work with exising
>>>>> cma_alloc API with simple gfp_mak.
>>>>
>>>> I really do not see that as something really alient to the cma_alloc
>>>> interface. All you should care about, really, is what size of the object
>>>> you want and how hard the system should try. If you have a problem with
>>>> an internal implementation of CMA and how it chooses a range and deal
>>>> with pinned pages then it should be addressed inside the CMA allocator.
>>>> I suspect that you are effectivelly trying to workaround those problems
>>>> by a side implementation with a slightly different API. Or maybe I still
>>>> do not follow the actual problem.
>>>>  
>>>>>> I am not deeply familiar with the cma allocator so sorry for a
>>>>>> potentially stupid question. Why does a bulk interface performs better
>>>>>> than repeated calls to cma_alloc? Is this because a failure would help
>>>>>> to move on to the next pfn range while a repeated call would have to
>>>>>> deal with the same range?
>>>>>
>>>>> Yub, true with other overheads(e.g., migration retrial, waiting writeback
>>>>> PCP/LRU draining IPI)
>>>>
>>>> Why cannot this be implemented in the cma_alloc layer? I mean you can
>>>> cache failed cases and optimize the proper pfn range search.
>>>
>>> So do you suggest this?
>>>
>>> enum cma_alloc_mode {
>>> 	CMA_ALLOC_NORMAL,
>>> 	CMA_ALLOC_FAIL_FAST,
>>> };
>>>
>>> struct page *cma_alloc(struct cma *cma, size_t count, unsigned int
>>> 	align, enum cma_alloc_mode mode);
>>>
>>> From now on, cma_alloc will keep last failed pfn and then start to
>>> search from the next pfn for both CMA_ALLOC_NORMAL and
>>> CMA_ALLOC_FAIL_FAST if requested size from the cached pfn is okay
>>> within CMA area and then wraparound it couldn't find right pages
>>> from the cached pfn. Othewise, the cached pfn will reset to the zero
>>> so that it starts the search from the 0. I like the idea since it's
>>> general improvement, I think.
>>
>> Yes something like that. There are more options to be clever here - e.g.
>> track ranges etc. but I am not sure this is worth the complexity.
> 
> Agree. Just last pfn caching would be good enough as simple start.
> 
>>
>>> Furthemore, With CMA_ALLOC_FAIL_FAST, it could avoid several overheads
>>> at the cost of sacrificing allocation success ratio like GFP_NORETRY.
>>
>> I am still not sure a specific flag is a good interface. Really can this
>> be gfp_mask instead?
> 
> I am not strong(even, I did it with GFP_NORETRY) but David wanted to
> have special mode and I agreed when he mentioned ALLOC_CONTIG_HARD as
> one of options in future(it would be hard to indicate that mode with
> gfp flags).

I can't tell regarding the CMA interface, but for the alloc_contig()
interface I think modes make sense. Yes, it's different from other
allocators, but the contig range allocator is different already. E.g.,
the CMA allocator mostly hides "which exact PFNs you try to allocate".

In the contig range allocator, gfp flags are currently used to express
how to allocate pages used as migration targets. I don't think mangling
in other gfp flags (or even overloading them) makes things a lot
clearer. E.g., GFP_NORETRY: don't retry to allocate migration targets?
don't retry to migrate pages? both?

As I said, other aspects might be harder to model (e.g., don't drain
LRU) and hiding them behind generic gfp flags (e.g., GFP_NORETRY) feels
wrong.

With the mode, we're expressing details for the necessary page
migration. Suggestions on how to model that are welcome.
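
To make the distinction concrete: in current code the gfp_mask passed to
alloc_contig_range() only shapes how the migration target pages are
allocated, while a mode argument would describe how the range itself is
attacked. The first prototype below matches mainline; the enum and the
*_mode variant are purely hypothetical names for the sake of argument:

#include <linux/gfp.h>	/* gfp_t, alloc_contig_range() declaration */

/*
 * Existing interface (mainline): gfp_mask is only used when allocating
 * the pages that migrated data is moved into.
 */
int alloc_contig_range(unsigned long start, unsigned long end,
		       unsigned int migratetype, gfp_t gfp_mask);

/*
 * Hypothetical mode describing the range side of the operation, i.e. how
 * hard to isolate and migrate what currently sits in [start, end).
 */
enum acr_mode {
	ACR_MODE_NORMAL,	/* drain LRU/PCP, retry migration a few times */
	ACR_MODE_FAIL_FAST,	/* no draining, no retries, fail quickly */
	ACR_MODE_HARD,		/* e.g. temporarily disable the PCP lists */
};

/* Not a real kernel API; shown only to illustrate the proposed split. */
int alloc_contig_range_mode(unsigned long start, unsigned long end,
			    unsigned int migratetype, gfp_t gfp_mask,
			    enum acr_mode mode);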

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02 20:22                 ` David Hildenbrand
@ 2020-12-02 20:48                   ` Minchan Kim
  2020-12-03  8:28                   ` Michal Hocko
  1 sibling, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2020-12-02 20:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, hyesoo.yu, willy,
	iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias, hridya,
	sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Wed, Dec 02, 2020 at 09:22:36PM +0100, David Hildenbrand wrote:
> On 02.12.20 20:26, Minchan Kim wrote:
> > On Wed, Dec 02, 2020 at 07:51:07PM +0100, Michal Hocko wrote:
> >> On Wed 02-12-20 09:54:29, Minchan Kim wrote:
> >>> On Wed, Dec 02, 2020 at 05:48:34PM +0100, Michal Hocko wrote:
> >>>> On Wed 02-12-20 08:15:49, Minchan Kim wrote:
> >>>>> On Wed, Dec 02, 2020 at 04:49:15PM +0100, Michal Hocko wrote:
> >>>> [...]
> >>>>>> Well, what I can see is that this new interface is an antipatern to our
> >>>>>> allocation routines. We tend to control allocations by gfp mask yet you
> >>>>>> are introducing a bool parameter to make something faster... What that
> >>>>>> really means is rather arbitrary. Would it make more sense to teach
> >>>>>> cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp.
> >>>>>> GFP_RETRY_MAYFAIL instead?
> >>>>>
> >>>>> If we use cma_alloc, that interface requires "allocate one big memory
> >>>>> chunk". IOW, return value is just struct page and expected that the page
> >>>>> is a big contiguos memory. That means it couldn't have a hole in the
> >>>>> range.
> >>>>> However the idea here, what we asked is much smaller chunk rather
> >>>>> than a big contiguous memory so we could skip some of pages if they are
> >>>>> randomly pinned(long-term/short-term whatever) and search other pages
> >>>>> in the CMA area to avoid long stall. Thus, it couldn't work with exising
> >>>>> cma_alloc API with simple gfp_mak.
> >>>>
> >>>> I really do not see that as something really alient to the cma_alloc
> >>>> interface. All you should care about, really, is what size of the object
> >>>> you want and how hard the system should try. If you have a problem with
> >>>> an internal implementation of CMA and how it chooses a range and deal
> >>>> with pinned pages then it should be addressed inside the CMA allocator.
> >>>> I suspect that you are effectivelly trying to workaround those problems
> >>>> by a side implementation with a slightly different API. Or maybe I still
> >>>> do not follow the actual problem.
> >>>>  
> >>>>>> I am not deeply familiar with the cma allocator so sorry for a
> >>>>>> potentially stupid question. Why does a bulk interface performs better
> >>>>>> than repeated calls to cma_alloc? Is this because a failure would help
> >>>>>> to move on to the next pfn range while a repeated call would have to
> >>>>>> deal with the same range?
> >>>>>
> >>>>> Yub, true with other overheads(e.g., migration retrial, waiting writeback
> >>>>> PCP/LRU draining IPI)
> >>>>
> >>>> Why cannot this be implemented in the cma_alloc layer? I mean you can
> >>>> cache failed cases and optimize the proper pfn range search.
> >>>
> >>> So do you suggest this?
> >>>
> >>> enum cma_alloc_mode {
> >>> 	CMA_ALLOC_NORMAL,
> >>> 	CMA_ALLOC_FAIL_FAST,
> >>> };
> >>>
> >>> struct page *cma_alloc(struct cma *cma, size_t count, unsigned int
> >>> 	align, enum cma_alloc_mode mode);
> >>>
> >>> From now on, cma_alloc will keep last failed pfn and then start to
> >>> search from the next pfn for both CMA_ALLOC_NORMAL and
> >>> CMA_ALLOC_FAIL_FAST if requested size from the cached pfn is okay
> >>> within CMA area and then wraparound it couldn't find right pages
> >>> from the cached pfn. Othewise, the cached pfn will reset to the zero
> >>> so that it starts the search from the 0. I like the idea since it's
> >>> general improvement, I think.
> >>
> >> Yes something like that. There are more options to be clever here - e.g.
> >> track ranges etc. but I am not sure this is worth the complexity.
> > 
> > Agree. Just last pfn caching would be good enough as simple start.
> > 
> >>
> >>> Furthemore, With CMA_ALLOC_FAIL_FAST, it could avoid several overheads
> >>> at the cost of sacrificing allocation success ratio like GFP_NORETRY.
> >>
> >> I am still not sure a specific flag is a good interface. Really can this
> >> be gfp_mask instead?
> > 
> > I am not strong(even, I did it with GFP_NORETRY) but David wanted to
> > have special mode and I agreed when he mentioned ALLOC_CONTIG_HARD as
> > one of options in future(it would be hard to indicate that mode with
> > gfp flags).
> 
> I can't tell regarding the CMA interface, but for the alloc_contig()
> interface I think modes make sense. Yes, it's different to other
> allocaters, but the contig range allocater is different already. E.g.,
> the CMA allocater mostly hides "which exact PFNs you try to allocate".
> 
> In the contig range allocater, gfp flags are currently used to express
> how to allocate pages used as migration targets. I don't think mangling
> in other gfp flags (or even overloading them) makes things a lot
> clearer. E.g., GFP_NORETRY: don't retry to allocate migration targets?
> don't retry to migrate pages? both?
> 
> As I said, other aspects might be harder to model (e.g., don't drain
> LRU) and hiding them behind generic gfp flags (e.g., GFP_NORETRY) feels
> wrong.

I also support a special flag/bool parameter for cma_alloc rather than
relying on mixing it into the original gfp_flags, since it would be
clearer and would prevent passing unhandled gfp_flags into cma_alloc.
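
For a concrete picture of the caller side, a hedged sketch of how a
chunk-heap style driver could gather many high-order chunks with such a
flag, trying the fast path first and falling back to the normal mode.
Note that in mainline the final bool of cma_alloc() is no_warn; the
fail-fast bool below is the API under discussion, not something that
exists today:

#include <linux/cma.h>		/* cma_alloc(), cma_release() */
#include <linux/errno.h>
#include <linux/mm_types.h>

static int fill_chunks(struct cma *cma, struct page **chunks,
		       unsigned int nr_chunks, unsigned int order)
{
	const size_t count = 1UL << order;	/* pages per chunk */
	unsigned int i;

	for (i = 0; i < nr_chunks; i++) {
		/* Fast attempt first: no draining, no retries, may fail. */
		chunks[i] = cma_alloc(cma, count, order,
				      true /* hypothetical fail-fast */);
		if (!chunks[i])
			/* Fall back to the normal, harder-trying mode. */
			chunks[i] = cma_alloc(cma, count, order, false);
		if (!chunks[i])
			goto err;
	}
	return 0;
err:
	while (i--)
		cma_release(cma, chunks[i], count);
	return -ENOMEM;
}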

> 
> With the mode, we're expressing details for the necessary page
> migration. Suggestions on how to model that are welcome.
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-02 20:22                 ` David Hildenbrand
  2020-12-02 20:48                   ` Minchan Kim
@ 2020-12-03  8:28                   ` Michal Hocko
  2020-12-03  9:47                     ` David Hildenbrand
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2020-12-03  8:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, hyesoo.yu, willy,
	iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias, hridya,
	sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Wed 02-12-20 21:22:36, David Hildenbrand wrote:
> On 02.12.20 20:26, Minchan Kim wrote:
> > On Wed, Dec 02, 2020 at 07:51:07PM +0100, Michal Hocko wrote:
[...]
> >> I am still not sure a specific flag is a good interface. Really can this
> >> be gfp_mask instead?
> > 
> > I am not strong(even, I did it with GFP_NORETRY) but David wanted to
> > have special mode and I agreed when he mentioned ALLOC_CONTIG_HARD as
> > one of options in future(it would be hard to indicate that mode with
> > gfp flags).
> 
> I can't tell regarding the CMA interface, but for the alloc_contig()
> interface I think modes make sense. Yes, it's different to other
> allocaters, but the contig range allocater is different already. E.g.,
> the CMA allocater mostly hides "which exact PFNs you try to allocate".

Yes, alloc_contig_range is a low-level API but it already has a gfp_mask
parameter. Adding yet another allocation mode sounds like API
convolution to me.

> In the contig range allocater, gfp flags are currently used to express
> how to allocate pages used as migration targets. I don't think mangling
> in other gfp flags (or even overloading them) makes things a lot
> clearer. E.g., GFP_NORETRY: don't retry to allocate migration targets?
> don't retry to migrate pages? both?
>
> As I said, other aspects might be harder to model (e.g., don't drain
> LRU) and hiding them behind generic gfp flags (e.g., GFP_NORETRY) feels
> wrong.
> 
> With the mode, we're expressing details for the necessary page
> migration. Suggestions on how to model that are welcome.

The question is whether the caller should really have such intimate
knowledge of, and control over, the alloc_contig_range implementation.
These are all implementation details. Should anybody really think about
how many times migration retries, or control LRU draining? Those can
change in the future and I do not think we really want to go over all
the users that have grown over time and try to deduce what the intention
behind them was.

I think we should aim at easy and very high-level behavior:
- GFP_NOWAIT - currently unsupported IIRC, but something that should be
  possible to implement. Isolation is non-blocking, migration could be
  skipped.
- GFP_KERNEL - default behavior, whatever that means.
- GFP_NORETRY - opportunistic allocation, as lightweight as we can get.
  Failures are to be expected, also for transient reasons.
- GFP_RETRY_MAYFAIL - try hard, but not so hard as to trigger disruption
  (e.g. via the oom killer).

- __GFP_THIS_NODE - stick to a node without fallback
- we can support zone modifiers although there is no existing user.
- __GFP_NOWARN - obvious

And that is it. Or maybe I am oversimplifying.
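
As a sketch of what this gfp-mask-only interface would look like at a
call site, using only flags that already exist. Whether
alloc_contig_range() actually gives __GFP_NORETRY or __GFP_RETRY_MAYFAIL
these range-wide semantics is exactly what is being debated, so the
mapping in the comments is the proposal, not current behavior:

#include <linux/gfp.h>		/* GFP_* flags, alloc_contig_range() */
#include <linux/mmzone.h>	/* MIGRATE_CMA */

static int grab_range(unsigned long start_pfn, unsigned long end_pfn,
		      bool opportunistic)
{
	/*
	 * Proposed mapping (not current behavior):
	 *   plain GFP_KERNEL     -> default "try reasonably hard"
	 *   __GFP_NORETRY        -> lightweight, fail fast
	 *   __GFP_RETRY_MAYFAIL  -> try hard, but never trigger the oom killer
	 */
	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;

	if (opportunistic)
		gfp |= __GFP_NORETRY;

	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_CMA, gfp);
}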
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-03  8:28                   ` Michal Hocko
@ 2020-12-03  9:47                     ` David Hildenbrand
  2020-12-03 11:47                       ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2020-12-03  9:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, hyesoo.yu, willy,
	iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias, hridya,
	sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On 03.12.20 09:28, Michal Hocko wrote:
> On Wed 02-12-20 21:22:36, David Hildenbrand wrote:
>> On 02.12.20 20:26, Minchan Kim wrote:
>>> On Wed, Dec 02, 2020 at 07:51:07PM +0100, Michal Hocko wrote:
> [...]
>>>> I am still not sure a specific flag is a good interface. Really can this
>>>> be gfp_mask instead?
>>>
>>> I am not strong(even, I did it with GFP_NORETRY) but David wanted to
>>> have special mode and I agreed when he mentioned ALLOC_CONTIG_HARD as
>>> one of options in future(it would be hard to indicate that mode with
>>> gfp flags).
>>
>> I can't tell regarding the CMA interface, but for the alloc_contig()
>> interface I think modes make sense. Yes, it's different to other
>> allocaters, but the contig range allocater is different already. E.g.,
>> the CMA allocater mostly hides "which exact PFNs you try to allocate".
> 
> Yes, alloc_contig_range is a low level API but it already has a gfp_mask
> parameter. Adding yet another allocation mode sounds like API
> convolution to me.

Well, if another parameter is a concern, we can introduce

alloc_contig_range_fast()

instead.

> 
>> In the contig range allocater, gfp flags are currently used to express
>> how to allocate pages used as migration targets. I don't think mangling
>> in other gfp flags (or even overloading them) makes things a lot
>> clearer. E.g., GFP_NORETRY: don't retry to allocate migration targets?
>> don't retry to migrate pages? both?
>>
>> As I said, other aspects might be harder to model (e.g., don't drain
>> LRU) and hiding them behind generic gfp flags (e.g., GFP_NORETRY) feels
>> wrong.
>>
>> With the mode, we're expressing details for the necessary page
>> migration. Suggestions on how to model that are welcome.
> 
> The question is whether the caller should really have such an intimate
> knowledge and control of the alloc_contig_range implementation. This all
> are implementation details. Should really anybody think about how many
> times migration retries or control LRU draining? Those can change in the

The question is not "how many times", but rather "if at all". I can
see the possible performance improvements of letting the caller handle
things like LRU draining and PCP draining itself when issuing
gazillions of alloc_contig_range() calls.

> future and I do not think we really want to go over all users grown over
> that time and try to deduce what was the intention behind.

That's why I think we need a clear mechanism to express the expected
behavior - something we can properly document and that users can
actually understand and optimize for - and we can fix users up when the
documented behavior changes. Mangling this into somewhat-fitting gfp
flags makes the interface harder to use and more error-prone IMHO.

> 
> I think we should aim at easy and very highlevel behavior:
> - GFP_NOWAIT - unsupported currently IIRC but something that something
>   that should be possible to implement. Isolation is non blocking,
>   migration could be skipped
> - GFP_KERNEL - default behavior whatever that means
> - GFP_NORETRY - opportunistic allocation as lightweight as we can get.
>   Failures to be expected also for transient reasons.
> - GFP_RETRY_MAYFAIL - try hard but not as hard as to trigger disruption
>   (e.g. via oom killer).

I think we currently see demand for 3 modes for alloc_contig_range()

a) normal

As is. Try, but don't try too hard. E.g., drain LRU, drain PCP, retry a
couple of times. Failures in some cases (short-term pinning, PCP races)
are still possible and acceptable.

GFP_RETRY_MAYFAIL ?

E.g., "Allocations with this flag may fail, but only when there is
genuinely little unused memory." - current description does not match at
all. When allocating ranges things behave completely different.


b) fast

Try, but fail fast. Leave optimizations that can improve the result to
the caller. E.g., don't drain LRU, don't drain PCP, don't retry.
Frequent failures are expected and acceptable.

__GFP_NORETRY ?

E.g., "The VM implementation will try only very lightweight memory
direct reclaim to get some memory under memory pressure" - again, I
think current description does not really match.


c) hard

Try hard, e.g., by temporarily disabling the PCP. Certainly not
__GFP_NOFAIL, that would be highly dangerous. So no flags / GFP_KERNEL?

> 
> - __GFP_THIS_NODE - stick to a node without fallback
> - we can support zone modifiers although there is no existing user.
> - __GFP_NOWARN - obvious
> 
> And that is it. Or maybe I am seeing that oversimplified.
> 

Again, I think most flags make sense for the migration target allocation
path and mainly deal with OOM situations and reclaim. For the migration
path - which is specific to the alloc_contig_range() allocator - they
don't really apply and create more confusion than they actually help - IMHO.
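
A hedged sketch of the caller-side optimization being described above:
pay the expensive, system-wide draining once, then sweep the area with
many cheap fail-fast attempts. lru_add_drain_all() and drain_all_pages()
are real kernel helpers; alloc_contig_range_mode() and
ACR_MODE_FAIL_FAST reuse the invented names from the earlier sketch and
do not exist in mainline:

#include <linux/gfp.h>		/* alloc_contig_range(), drain_all_pages() */
#include <linux/mmzone.h>	/* struct zone, MIGRATE_CMA */
#include <linux/swap.h>		/* lru_add_drain_all() */

/* Returns how many pages were successfully allocated out of the area. */
static unsigned long sweep_area(struct zone *zone, unsigned long start_pfn,
				unsigned long end_pfn, unsigned long chunk)
{
	unsigned long pfn, allocated = 0;

	lru_add_drain_all();	/* flush per-CPU LRU pagevecs once */
	drain_all_pages(zone);	/* flush per-CPU free (PCP) lists once */

	for (pfn = start_pfn; pfn + chunk <= end_pfn; pfn += chunk) {
		/* Hypothetical fail-fast mode: no draining, no retries. */
		if (alloc_contig_range_mode(pfn, pfn + chunk, MIGRATE_CMA,
					    GFP_KERNEL | __GFP_NOWARN,
					    ACR_MODE_FAIL_FAST))
			continue;	/* pinned or racy range: just skip it */
		allocated += chunk;
	}
	return allocated;
}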

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-03  9:47                     ` David Hildenbrand
@ 2020-12-03 11:47                       ` Michal Hocko
  2020-12-03 11:57                         ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2020-12-03 11:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, hyesoo.yu, willy,
	iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias, hridya,
	sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On Thu 03-12-20 10:47:02, David Hildenbrand wrote:
> On 03.12.20 09:28, Michal Hocko wrote:
[...]
> > I think we should aim at easy and very highlevel behavior:
> > - GFP_NOWAIT - unsupported currently IIRC but something that something
> >   that should be possible to implement. Isolation is non blocking,
> >   migration could be skipped
> > - GFP_KERNEL - default behavior whatever that means
> > - GFP_NORETRY - opportunistic allocation as lightweight as we can get.
> >   Failures to be expected also for transient reasons.
> > - GFP_RETRY_MAYFAIL - try hard but not as hard as to trigger disruption
> >   (e.g. via oom killer).
> 
> I think we currently see demand for 3 modes for alloc_contig_range()
> 
> a) normal
> 
> As is. Try, but don't try too hard. E.g., drain LRU, drain PCP, retry a
> couple of times. Failures in some cases (short-term pinning, PCP races)
> are still possible and acceptable.
> 
> GFP_RETRY_MAYFAIL ?

"Normal" shouldn't really require anybody to think hard about gfp flags.
To most people that really means GFP_KERNEL.

> E.g., "Allocations with this flag may fail, but only when there is
> genuinely little unused memory." - current description does not match at
> all. When allocating ranges things behave completely different.
> 
> 
> b) fast
> 
> Try, but fail fast. Leave optimizations that can improve the result to
> the caller. E.g., don't drain LRU, don't drain PCP, don't retry.
> Frequent failures are expected and acceptable.
> 
> __GFP_NORETRY ?
> 
> E.g., "The VM implementation will try only very lightweight memory
> direct reclaim to get some memory under memory pressure" - again, I
> think current description does not really match.

Agreed. As mentioned above this would be an opportunistic allocation
mode.

 
> c) hard
> 
> Try hard, E.g., temporarily disabling the PCP. Certainly not
> __GFP_NOFAIL, that would be highly dangerous. So no flags / GFP_KERNEL?

The NOFAIL semantic is out of the question. Should we have a mode that
tries harder than the default? I dunno. Do we have users? I think
RETRY_MAYFAIL is a middle ground between the default and NORETRY, which
is just too easy to fail. This is the case for the allocator as well.
And from what I have seen, people are already using MAYFAIL in order to
prevent the oom killer, so this is a generally recognized pattern.

> > - __GFP_THIS_NODE - stick to a node without fallback
> > - we can support zone modifiers although there is no existing user.
> > - __GFP_NOWARN - obvious
> > 
> > And that is it. Or maybe I am seeing that oversimplified.
> > 
> 
> Again, I think most flags make sense for the migration target allocation
>  path and mainly deal with OOM situations and reclaim. For the migration
> path - which is specific to the alloc_contig_range() allocater - they
> don't really apply and create more confusion than they actually help - IMHO.

Migration is really an implementation detail of this interface. You
shouldn't even be thinking about the fact that there is migration
underneath, let alone actually trying to control it. But well, we might
end up disagreeing here. What actually matters is the existing users of
the interface.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/4] mm: introduce cma_alloc_bulk API
  2020-12-03 11:47                       ` Michal Hocko
@ 2020-12-03 11:57                         ` David Hildenbrand
  0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2020-12-03 11:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, hyesoo.yu, willy,
	iamjoonsoo.kim, vbabka, surenb, pullip.cho, joaodias, hridya,
	sumit.semwal, john.stultz, Brian.Starkey, linux-media,
	devicetree, robh, christian.koenig, linaro-mm-sig

On 03.12.20 12:47, Michal Hocko wrote:
> On Thu 03-12-20 10:47:02, David Hildenbrand wrote:
>> On 03.12.20 09:28, Michal Hocko wrote:
> [...]
>>> I think we should aim at easy and very highlevel behavior:
>>> - GFP_NOWAIT - unsupported currently IIRC but something that something
>>>   that should be possible to implement. Isolation is non blocking,
>>>   migration could be skipped
>>> - GFP_KERNEL - default behavior whatever that means
>>> - GFP_NORETRY - opportunistic allocation as lightweight as we can get.
>>>   Failures to be expected also for transient reasons.
>>> - GFP_RETRY_MAYFAIL - try hard but not as hard as to trigger disruption
>>>   (e.g. via oom killer).
>>
>> I think we currently see demand for 3 modes for alloc_contig_range()
>>
>> a) normal
>>
>> As is. Try, but don't try too hard. E.g., drain LRU, drain PCP, retry a
>> couple of times. Failures in some cases (short-term pinning, PCP races)
>> are still possible and acceptable.
>>
>> GFP_RETRY_MAYFAIL ?
> 
> normal shouldn't really require anybody to think about gfp flags hard.
> That to most people really means GFP_KERNEL.
> 
>> E.g., "Allocations with this flag may fail, but only when there is
>> genuinely little unused memory." - current description does not match at
>> all. When allocating ranges things behave completely different.
>>
>>
>> b) fast
>>
>> Try, but fail fast. Leave optimizations that can improve the result to
>> the caller. E.g., don't drain LRU, don't drain PCP, don't retry.
>> Frequent failures are expected and acceptable.
>>
>> __GFP_NORETRY ?
>>
>> E.g., "The VM implementation will try only very lightweight memory
>> direct reclaim to get some memory under memory pressure" - again, I
>> think current description does not really match.
> 
> Agreed. As mentioned above this would be an opportunistic allocation
> mode.
> 
>  
>> c) hard
>>
>> Try hard, E.g., temporarily disabling the PCP. Certainly not
>> __GFP_NOFAIL, that would be highly dangerous. So no flags / GFP_KERNEL?
> 
> NOFAIL semantic is out of question. Should we have a mode to try harder
> than the default? I dunno. Do we have users? I think RETRY_MAYFAIL is a
> middle ground between the default and NORETRY which is just too easy to
> fail. This is the case for the allocator as well. And from what I have
> seen people are already using MAYFAIL in order to prevent oom killer so
> this is a generally recognized pattern.

virtio-mem might be one user. It might first try in normal mode to get
as much memory out as possible, but then switch to hard mode where that
makes sense.

> 
>>> - __GFP_THIS_NODE - stick to a node without fallback
>>> - we can support zone modifiers although there is no existing user.
>>> - __GFP_NOWARN - obvious
>>>
>>> And that is it. Or maybe I am seeing that oversimplified.
>>>
>>
>> Again, I think most flags make sense for the migration target allocation
>>  path and mainly deal with OOM situations and reclaim. For the migration
>> path - which is specific to the alloc_contig_range() allocater - they
>> don't really apply and create more confusion than they actually help - IMHO.
> 
> Migration is really an implementation detail of this interface. You
> shouldn't be even thinking that there is a migration underneath not even
> mention to actually trying to control it. 

CMA? I tend to agree.
alloc_contig_range? I disagree.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-12-03 11:59 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-01 17:51 [PATCH v2 0/4] Chunk Heap Support on DMA-HEAP Minchan Kim
2020-12-01 17:51 ` [PATCH v2 1/4] mm: introduce alloc_contig_mode Minchan Kim
2020-12-01 17:51 ` [PATCH v2 2/4] mm: introduce cma_alloc_bulk API Minchan Kim
2020-12-02  9:14   ` David Hildenbrand
2020-12-02 15:49     ` Michal Hocko
2020-12-02 16:00       ` David Hildenbrand
2020-12-02 16:15       ` Minchan Kim
2020-12-02 16:48         ` Michal Hocko
2020-12-02 17:54           ` Minchan Kim
2020-12-02 18:51             ` Michal Hocko
2020-12-02 19:26               ` Minchan Kim
2020-12-02 20:22                 ` David Hildenbrand
2020-12-02 20:48                   ` Minchan Kim
2020-12-03  8:28                   ` Michal Hocko
2020-12-03  9:47                     ` David Hildenbrand
2020-12-03 11:47                       ` Michal Hocko
2020-12-03 11:57                         ` David Hildenbrand
2020-12-02 16:00     ` Minchan Kim
2020-12-01 17:51 ` [PATCH v2 3/4] dma-buf: add export symbol for dma-heap Minchan Kim
2020-12-02 13:51   ` Christoph Hellwig
2020-12-01 17:51 ` [PATCH v2 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps Minchan Kim
2020-12-01 19:48   ` John Stultz
2020-12-01 22:55     ` Minchan Kim
2020-12-01 23:38       ` John Stultz
2020-12-02  0:13         ` Minchan Kim
2020-12-02  0:33           ` John Stultz
2020-12-02  0:57             ` Minchan Kim
2020-12-02 13:54   ` Christoph Hellwig
