* [PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc
@ 2020-09-28 20:30 Chris Goldsworthy
2020-09-28 20:30 ` Chris Goldsworthy
0 siblings, 1 reply; 4+ messages in thread
From: Chris Goldsworthy @ 2020-09-28 20:30 UTC (permalink / raw)
To: akpm, linux-mm, minchan, linux-arm-msm, linux-kernel, pratikp,
pdaly, sudaraja, iamjoonsoo.kim, david, vinmenon, minchan.kim
Cc: Chris Goldsworthy
V1: Introduces a retry loop that attempts a CMA allocation a finite
number of times before giving up:
https://lkml.org/lkml/2020/8/5/1097
https://lkml.org/lkml/2020/8/11/893
V2: Introduces an indefinite retry for CMA allocations. David Hildenbrand
raised a page pinning example which precludes doing this infite-retrying
for all CMA users:
https://lkml.org/lkml/2020/9/17/984
V3: Re-introduce a GFP mask argument for cma_alloc(), that can take in
__GFP_NOFAIL as an argument to indicate that a CMA allocation should be
retried indefinitely. This lets callers of cma_alloc() decide if they want
to perform indefinite retires. Also introduces a config option for
controlling the duration of the sleep between retries:
https://www.spinics.net/lists/linux-mm/msg228778.html
V4: In response to comments by Minchan Kim, we make the sleep interruptable
and remove a defconfig option controlling how long the sleep lasts for (it
is now a fixed 100 ms).
Chris Goldsworthy (1):
mm: cma: indefinitely retry allocations in cma_alloc
arch/powerpc/kvm/book3s_hv_builtin.c | 2 +-
drivers/dma-buf/heaps/cma_heap.c | 2 +-
drivers/s390/char/vmcp.c | 2 +-
drivers/staging/android/ion/ion_cma_heap.c | 2 +-
include/linux/cma.h | 2 +-
kernel/dma/contiguous.c | 4 ++--
mm/cma.c | 36 +++++++++++++++++++++++++-----
mm/cma_debug.c | 2 +-
mm/hugetlb.c | 4 ++--
9 files changed, 40 insertions(+), 16 deletions(-)
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc
2020-09-28 20:30 [PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc Chris Goldsworthy
@ 2020-09-28 20:30 ` Chris Goldsworthy
2020-09-29 5:59 ` Christoph Hellwig
0 siblings, 1 reply; 4+ messages in thread
From: Chris Goldsworthy @ 2020-09-28 20:30 UTC (permalink / raw)
To: akpm, linux-mm, minchan, linux-arm-msm, linux-kernel, pratikp,
pdaly, sudaraja, iamjoonsoo.kim, david, vinmenon, minchan.kim
Cc: Chris Goldsworthy
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages. This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0. This
implies that after try_to_unmap() has finished successfully for a page
where _recount > _mapcount, that _refcount will be greater than 0. Later
in the CMA allocation path in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.
If a process ends up causing _refcount > _mapcount for a page (by either
incrementing _recount or decrementing _mapcount), such that the process is
context switched out after modifying one refcount but before modifying the
other, the page will be temporarily pinned.
One example of where _refcount can be greater than _mapcount is inside of
zap_pte_range(), which is called for all the entries of a PMD when a
process is exiting, to unmap the process's memory. Inside of
zap_pte_range(), after unammping a page with page_remove_rmap(), we have
that _recount > _mapcount. _refcount can only be decremented after a TLB
flush is performed for the page - this doesn't occur until enough pages
have been batched together for flushing. The flush can either occur inside
of zap_pte_range() (during the same invocation or a later one), or if there
aren't enough pages collected by the time we unmap all of the pages in a
process, the flush will occur in tlb_finish_mmu() in exit_mmap(). After
the flush has occurred, tlb_batch_pages_flush() will decrement the
references on the flushed pages.
Another such example like the above is inside of copy_one_pte(), which is
called during a fork. For PTEs for which pte_present(pte) == true,
copy_one_pte() will increment the _refcount field followed by the
_mapcount field of a page.
So, inside of cma_alloc(), add the option of letting users pass in
__GFP_NOFAIL to indicate that we should retry CMA allocations indefinitely,
in the event that alloc_contig_range() returns -EBUSY after having scanned
a whole CMA-region bitmap.
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Co-developed-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
---
arch/powerpc/kvm/book3s_hv_builtin.c | 2 +-
drivers/dma-buf/heaps/cma_heap.c | 2 +-
drivers/s390/char/vmcp.c | 2 +-
drivers/staging/android/ion/ion_cma_heap.c | 2 +-
include/linux/cma.h | 2 +-
kernel/dma/contiguous.c | 4 ++--
mm/cma.c | 36 +++++++++++++++++++++++++-----
mm/cma_debug.c | 2 +-
mm/hugetlb.c | 4 ++--
9 files changed, 40 insertions(+), 16 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index 073617c..21c3f6a 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -74,7 +74,7 @@ struct page *kvm_alloc_hpt_cma(unsigned long nr_pages)
VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);
return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES),
- false);
+ 0);
}
EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma);
diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
index 626cf7f..7657359 100644
--- a/drivers/dma-buf/heaps/cma_heap.c
+++ b/drivers/dma-buf/heaps/cma_heap.c
@@ -66,7 +66,7 @@ static int cma_heap_allocate(struct dma_heap *heap,
helper_buffer->heap = heap;
helper_buffer->size = len;
- cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, false);
+ cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, 0);
if (!cma_pages)
goto free_buf;
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 9e06628..11c4e3b 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -70,7 +70,7 @@ static void vmcp_response_alloc(struct vmcp_session *session)
* anymore the system won't work anyway.
*/
if (order > 2)
- page = cma_alloc(vmcp_cma, nr_pages, 0, false);
+ page = cma_alloc(vmcp_cma, nr_pages, 0, 0);
if (page) {
session->response = (char *)page_to_phys(page);
session->cma_alloc = 1;
diff --git a/drivers/staging/android/ion/ion_cma_heap.c b/drivers/staging/android/ion/ion_cma_heap.c
index bf65e67..128d3a5 100644
--- a/drivers/staging/android/ion/ion_cma_heap.c
+++ b/drivers/staging/android/ion/ion_cma_heap.c
@@ -39,7 +39,7 @@ static int ion_cma_allocate(struct ion_heap *heap, struct ion_buffer *buffer,
if (align > CONFIG_CMA_ALIGNMENT)
align = CONFIG_CMA_ALIGNMENT;
- pages = cma_alloc(cma_heap->cma, nr_pages, align, false);
+ pages = cma_alloc(cma_heap->cma, nr_pages, align, 0);
if (!pages)
return -ENOMEM;
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 6ff79fe..2bd8544 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -43,7 +43,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
const char *name,
struct cma **res_cma);
extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
- bool no_warn);
+ gfp_t gfp_mask);
extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index cff7e60..55c62b2 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -196,7 +196,7 @@ struct page *dma_alloc_from_contiguous(struct device *dev, size_t count,
if (align > CONFIG_CMA_ALIGNMENT)
align = CONFIG_CMA_ALIGNMENT;
- return cma_alloc(dev_get_cma_area(dev), count, align, no_warn);
+ return cma_alloc(dev_get_cma_area(dev), count, align, no_warn ? __GFP_NOWARN : 0);
}
/**
@@ -219,7 +219,7 @@ static struct page *cma_alloc_aligned(struct cma *cma, size_t size, gfp_t gfp)
{
unsigned int align = min(get_order(size), CONFIG_CMA_ALIGNMENT);
- return cma_alloc(cma, size >> PAGE_SHIFT, align, gfp & __GFP_NOWARN);
+ return cma_alloc(cma, size >> PAGE_SHIFT, align, gfp);
}
/**
diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..5d63331 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -32,6 +32,8 @@
#include <linux/highmem.h>
#include <linux/io.h>
#include <linux/kmemleak.h>
+#include <linux/sched.h>
+#include <linux/jiffies.h>
#include <trace/events/cma.h>
#include "cma.h"
@@ -403,13 +405,15 @@ static inline void cma_debug_show_areas(struct cma *cma) { }
* @cma: Contiguous memory region for which the allocation is performed.
* @count: Requested number of pages.
* @align: Requested alignment of pages (in PAGE_SIZE order).
- * @no_warn: Avoid printing message about failed allocation
+ * @gfp_mask: If __GFP_NOWARN is passed, suppress messages about failed
+ * allocations. If __GFP_NOFAIL is passed, try doing the CMA
+ * allocation indefinitely until the allocation succeeds.
*
* This function allocates part of contiguous memory on specific
* contiguous memory area.
*/
struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
- bool no_warn)
+ gfp_t gfp_mask)
{
unsigned long mask, offset;
unsigned long pfn = -1;
@@ -442,8 +446,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
- mutex_unlock(&cma->lock);
- break;
+ if (ret == -EBUSY && gfp_mask & __GFP_NOFAIL) {
+ mutex_unlock(&cma->lock);
+
+ /*
+ * Page may be momentarily pinned by some other
+ * process which has been scheduled out, e.g.
+ * in exit path, during unmap call, or process
+ * fork and so cannot be freed there. Sleep
+ * for 100 ms and retry the allocation.
+ */
+ start = 0;
+ ret = -ENOMEM;
+ schedule_timeout_killable(msecs_to_jiffies(100));
+ continue;
+ } else {
+ /*
+ * ret == -ENOMEM - all bits in cma->bitmap are
+ * set, so we break accordingly.
+ */
+ mutex_unlock(&cma->lock);
+ break;
+ }
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
/*
@@ -456,7 +480,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
mutex_lock(&cma_mutex);
ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
- GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+ GFP_KERNEL | (gfp_mask & __GFP_NOWARN));
mutex_unlock(&cma_mutex);
if (ret == 0) {
page = pfn_to_page(pfn);
@@ -485,7 +509,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
page_kasan_tag_reset(page + i);
}
- if (ret && !no_warn) {
+ if (ret && !(gfp_mask & __GFP_NOWARN)) {
pr_err("%s: alloc failed, req-size: %zu pages, ret: %d\n",
__func__, count, ret);
cma_debug_show_areas(cma);
diff --git a/mm/cma_debug.c b/mm/cma_debug.c
index d5bf8aa..76aea84 100644
--- a/mm/cma_debug.c
+++ b/mm/cma_debug.c
@@ -137,7 +137,7 @@ static int cma_alloc_mem(struct cma *cma, int count)
if (!mem)
return -ENOMEM;
- p = cma_alloc(cma, count, 0, false);
+ p = cma_alloc(cma, count, 0, 0);
if (!p) {
kfree(mem);
return -ENOMEM;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67fc6383..97bdba9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1260,7 +1260,7 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
if (hugetlb_cma[nid]) {
page = cma_alloc(hugetlb_cma[nid], nr_pages,
- huge_page_order(h), true);
+ huge_page_order(h), __GFP_NOWARN);
if (page)
return page;
}
@@ -1271,7 +1271,7 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
continue;
page = cma_alloc(hugetlb_cma[node], nr_pages,
- huge_page_order(h), true);
+ huge_page_order(h), __GFP_NOWARN);
if (page)
return page;
}
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc
2020-09-28 20:30 ` Chris Goldsworthy
@ 2020-09-29 5:59 ` Christoph Hellwig
2020-10-09 20:26 ` Chris Goldsworthy
0 siblings, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2020-09-29 5:59 UTC (permalink / raw)
To: Chris Goldsworthy
Cc: akpm, linux-mm, minchan, linux-arm-msm, linux-kernel, pratikp,
pdaly, sudaraja, iamjoonsoo.kim, david, vinmenon, minchan.kim
On Mon, Sep 28, 2020 at 01:30:27PM -0700, Chris Goldsworthy wrote:
> CMA allocations will fail if 'pinned' pages are in a CMA area, since we
> cannot migrate pinned pages. The _refcount of a struct page being greater
> than _mapcount for that page can cause pinning for anonymous pages. This
> is because try_to_unmap(), which (1) is called in the CMA allocation path,
> and (2) decrements both _refcount and _mapcount for a page, will stop
> unmapping a page from VMAs once the _mapcount for a page reaches 0. This
> implies that after try_to_unmap() has finished successfully for a page
> where _recount > _mapcount, that _refcount will be greater than 0. Later
> in the CMA allocation path in migrate_page_move_mapping(), we will have one
> more reference count than intended for anonymous pages, meaning the
> allocation will fail for that page.
>
> If a process ends up causing _refcount > _mapcount for a page (by either
> incrementing _recount or decrementing _mapcount), such that the process is
> context switched out after modifying one refcount but before modifying the
> other, the page will be temporarily pinned.
>
> One example of where _refcount can be greater than _mapcount is inside of
> zap_pte_range(), which is called for all the entries of a PMD when a
> process is exiting, to unmap the process's memory. Inside of
> zap_pte_range(), after unammping a page with page_remove_rmap(), we have
> that _recount > _mapcount. _refcount can only be decremented after a TLB
> flush is performed for the page - this doesn't occur until enough pages
> have been batched together for flushing. The flush can either occur inside
> of zap_pte_range() (during the same invocation or a later one), or if there
> aren't enough pages collected by the time we unmap all of the pages in a
> process, the flush will occur in tlb_finish_mmu() in exit_mmap(). After
> the flush has occurred, tlb_batch_pages_flush() will decrement the
> references on the flushed pages.
>
> Another such example like the above is inside of copy_one_pte(), which is
> called during a fork. For PTEs for which pte_present(pte) == true,
> copy_one_pte() will increment the _refcount field followed by the
> _mapcount field of a page.
>
> So, inside of cma_alloc(), add the option of letting users pass in
> __GFP_NOFAIL to indicate that we should retry CMA allocations indefinitely,
> in the event that alloc_contig_range() returns -EBUSY after having scanned
> a whole CMA-region bitmap.
And who is going to use this? AS-is this just seems to add code that
isn't actually used and thus actually tested. (In addition to beeing
a relly bad idea as discussed before)
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -196,7 +196,7 @@ struct page *dma_alloc_from_contiguous(struct device *dev, size_t count,
> if (align > CONFIG_CMA_ALIGNMENT)
> align = CONFIG_CMA_ALIGNMENT;
>
> - return cma_alloc(dev_get_cma_area(dev), count, align, no_warn);
> + return cma_alloc(dev_get_cma_area(dev), count, align, no_warn ? __GFP_NOWARN : 0);
Also don't add pointlessly overlong lines.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc
2020-09-29 5:59 ` Christoph Hellwig
@ 2020-10-09 20:26 ` Chris Goldsworthy
0 siblings, 0 replies; 4+ messages in thread
From: Chris Goldsworthy @ 2020-10-09 20:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: akpm, linux-mm, minchan, linux-arm-msm, linux-kernel, pratikp,
pdaly, sudaraja, iamjoonsoo.kim, david, vinmenon, minchan.kim
On 2020-09-28 22:59, Christoph Hellwig wrote:
> On Mon, Sep 28, 2020 at 01:30:27PM -0700, Chris Goldsworthy wrote:
>> CMA allocations will fail if 'pinned' pages are in a CMA area, since
>> we
>> cannot migrate pinned pages. The _refcount of a struct page being
>> greater
>> than _mapcount for that page can cause pinning for anonymous pages.
>> This
>> is because try_to_unmap(), which (1) is called in the CMA allocation
>> path,
>> and (2) decrements both _refcount and _mapcount for a page, will stop
>> unmapping a page from VMAs once the _mapcount for a page reaches 0.
>> This
>> implies that after try_to_unmap() has finished successfully for a page
>> where _recount > _mapcount, that _refcount will be greater than 0.
>> Later
>> in the CMA allocation path in migrate_page_move_mapping(), we will
>> have one
>> more reference count than intended for anonymous pages, meaning the
>> allocation will fail for that page.
>>
>> If a process ends up causing _refcount > _mapcount for a page (by
>> either
>> incrementing _recount or decrementing _mapcount), such that the
>> process is
>> context switched out after modifying one refcount but before modifying
>> the
>> other, the page will be temporarily pinned.
>>
>> One example of where _refcount can be greater than _mapcount is inside
>> of
>> zap_pte_range(), which is called for all the entries of a PMD when a
>> process is exiting, to unmap the process's memory. Inside of
>> zap_pte_range(), after unammping a page with page_remove_rmap(), we
>> have
>> that _recount > _mapcount. _refcount can only be decremented after a
>> TLB
>> flush is performed for the page - this doesn't occur until enough
>> pages
>> have been batched together for flushing. The flush can either occur
>> inside
>> of zap_pte_range() (during the same invocation or a later one), or if
>> there
>> aren't enough pages collected by the time we unmap all of the pages in
>> a
>> process, the flush will occur in tlb_finish_mmu() in exit_mmap().
>> After
>> the flush has occurred, tlb_batch_pages_flush() will decrement the
>> references on the flushed pages.
>>
>> Another such example like the above is inside of copy_one_pte(), which
>> is
>> called during a fork. For PTEs for which pte_present(pte) == true,
>> copy_one_pte() will increment the _refcount field followed by the
>> _mapcount field of a page.
>>
>> So, inside of cma_alloc(), add the option of letting users pass in
>> __GFP_NOFAIL to indicate that we should retry CMA allocations
>> indefinitely,
>> in the event that alloc_contig_range() returns -EBUSY after having
>> scanned
>> a whole CMA-region bitmap.
>
> And who is going to use this? AS-is this just seems to add code that
> isn't actually used and thus actually tested. (In addition to beeing
> a relly bad idea as discussed before)
Hi Christoph,
That had slipped my mind - what we would have submitted would have been
a modified /drivers/dma-heap/heaps/cma_heap.c, which would have created
a "linux,cma-nofail" heap, that when allocated from, passes GFP_NOFAIL
to cma_alloc(). But, since this retry approach (finite and infinite)
has effectively been nacked, I've gone back to the drawing board to find
either (1) a lock based approach to solving this (as posed by Andrew
Morton here: https://lkml.org/lkml/2020/8/21/1490), or (2) using
preempt_disable() calls.
Thanks,
Chris.
>> --- a/kernel/dma/contiguous.c
>> +++ b/kernel/dma/contiguous.c
>> @@ -196,7 +196,7 @@ struct page *dma_alloc_from_contiguous(struct
>> device *dev, size_t count,
>> if (align > CONFIG_CMA_ALIGNMENT)
>> align = CONFIG_CMA_ALIGNMENT;
>>
>> - return cma_alloc(dev_get_cma_area(dev), count, align, no_warn);
>> + return cma_alloc(dev_get_cma_area(dev), count, align, no_warn ?
>> __GFP_NOWARN : 0);
>
> Also don't add pointlessly overlong lines.
--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum,
a Linux Foundation Collaborative Project
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2020-10-09 20:27 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-28 20:30 [PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc Chris Goldsworthy
2020-09-28 20:30 ` Chris Goldsworthy
2020-09-29 5:59 ` Christoph Hellwig
2020-10-09 20:26 ` Chris Goldsworthy
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.