Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

From: David Hildenbrand <david@redhat.com>
To: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-arm-msm@vger.kernel.org, linux-kernel@vger.kernel.org,
	pratikp@codeaurora.org, pdaly@codeaurora.org,
	sudraja@codeaurora.org, iamjoonsoo.kim@lge.com,
	linux-arm-msm-owner@vger.kernel.org,
	Vinayak Menon <vinmenon@codeaurora.org>,
	linux-kernel-owner@vger.kernel.org
Subject: Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc
Date: Tue, 15 Sep 2020 09:53:30 +0200
Message-ID: <a3d62a77-4c4f-e86c-de6d-5222c2a747e0@redhat.com>
In-Reply-To: <72ae0f361df527cf70946992e4ab1eb3@codeaurora.org>

On 14.09.20 20:33, Chris Goldsworthy wrote:
> On 2020-09-14 02:31, David Hildenbrand wrote:
>> On 11.09.20 21:17, Chris Goldsworthy wrote:
>>>
>>> So, inside of cma_alloc(), instead of giving up when alloc_contig_range()
>>> returns -EBUSY after having scanned a whole CMA-region bitmap, perform
>>> retries indefinitely, with sleeps, to give the system an opportunity to
>>> unpin any pinned pages.
>>>
>>> Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
>>> Co-developed-by: Vinayak Menon <vinmenon@codeaurora.org>
>>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>>> ---
>>>  mm/cma.c | 25 +++++++++++++++++++++++--
>>>  1 file changed, 23 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/cma.c b/mm/cma.c
>>> index 7f415d7..90bb505 100644
>>> --- a/mm/cma.c
>>> +++ b/mm/cma.c
>>> @@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>>>  				bitmap_maxno, start, bitmap_count, mask,
>>>  				offset);
>>>  		if (bitmap_no >= bitmap_maxno) {
>>> -			mutex_unlock(&cma->lock);
>>> -			break;
>>> +			if (ret == -EBUSY) {
>>> +				mutex_unlock(&cma->lock);
>>> +
>>> +				/*
>>> +				 * Page may be momentarily pinned by some other
>>> +				 * process which has been scheduled out, e.g.
>>> +				 * in exit path, during unmap call, or process
>>> +				 * fork and so cannot be freed there. Sleep
>>> +				 * for 100ms and retry the allocation.
>>> +				 */
>>> +				start = 0;
>>> +				ret = -ENOMEM;
>>> +				msleep(100);
>>> +				continue;
>>> +			} else {
>>> +				/*
>>> +				 * ret == -ENOMEM - all bits in cma->bitmap are
>>> +				 * set, so we break accordingly.
>>> +				 */
>>> +				mutex_unlock(&cma->lock);
>>> +				break;
>>> +			}
>>>  		}
>>>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
>>>  		/*
>>>
>>
>> What about long-term pinnings? IIRC, that can happen easily e.g., with
>> vfio (and I remember there is a way via vmsplice).
>>
>> Not convinced trying forever is a sane approach in the general case ...
> 
> Hi David,
> 
> I've botched the threading, so the discussions about the previous 
> patch-set are missing from this thread. I will summarize them 
> below:
> 
> V1:
> [1] https://lkml.org/lkml/2020/8/5/1097
> [2] https://lkml.org/lkml/2020/8/6/1040
> [3] https://lkml.org/lkml/2020/8/11/893
> [4] https://lkml.org/lkml/2020/8/21/1490
> [5] https://lkml.org/lkml/2020/9/11/1072
> 
> The version of the patch in [1] featured a finite number of retries, 
> which has been stable for our kernels. In [2], Andrew questioned whether 
> we could actually find a way of solving the problem on the grounds that 
> doing a finite number of retries doesn't actually fix the problem (more 
> importantly, in [4] Andrew indicated that he would prefer not to merge 
> the patch as it doesn't solve the issue).  In [3], I suggest one actual 
> fix for this, which is to use preempt_disable/enable() to prevent 
> context switches from occurring during the periods in copy_one_pte() and 
> exit_mmap() (I forgot to mention this case in the commit text) in which 
> _refcount > _mapcount for a page - you would also need to prevent 
> interrupts from occurring if we were to fully prevent the issue from 
> occurring.  I think this would be acceptable for the copy_one_pte() 
> case, since there _refcount > _mapcount only for a short time.  For the 
> exit_mmap() case, however, _refcount is greater than _mapcount whilst 
> the page-tables are being torn down for a process - that could be too 
> long for disabling preemption / interrupts.
> 
> So, in [4], Andrew asks about two alternatives to see if they're viable: 
> (1) acquiring locks on the exit_mmap path and migration paths, (2) 
> retrying indefinitely.  In [5], I discuss how using locks could increase 
> the time it takes to perform a CMA allocation, such that a retry 
> approach would avoid increased CMA allocation times. I'm also uncertain 
> about how the locking scheme could be implemented effectively without 
> introducing a new per-page lock that will be used specifically to solve 
> this issue, and I'm not sure this would be accepted.

Thanks for the nice summary!

> 
> We're fine with doing indefinite retries, on the grounds that if there 
> is some long-term pinning that occurs when alloc_contig_range returns 
> -EBUSY, it should be debugged and fixed.  Would it be possible to 
> make this infinite-retrying something that could be enabled or disabled 
> by a defconfig option?

Two thoughts:

1. Most (all?) alloc_contig_range() users are interested in handling
short-term pinnings in a nice way (IOW, make the allocation succeed).
I'd much rather see this handled in a nice fashion inside
alloc_contig_range() than having to encode endless loops in the caller.
This means I strongly prefer something like [3] if feasible. But I can
understand that the locking approach ([5]) is complicated. I have to
admit that I am not an expert on the short-term pinning you describe,
or on how to eventually fix it.

2. The issue that I am having is that long-term pinnings are
(unfortunately) a real thing. It's not something to debug and fix as you
suggest. Like, run a VM with VFIO (e.g., PCI passthrough). While that VM
is running, all VM memory will be pinned. If that memory falls into a
CMA region, your cma_alloc() will be stuck in an endless loop (meaning
until the VM ends). I am not sure all CMA users are fine with that -
especially now that CMA is being used for gigantic pages.

Assume you want to start a new VM while the other one is running and use
some (new) gigantic pages for it. Suddenly you're trapped in an endless
loop in the kernel. That's nasty.

We do have a similar endless loop on the memory hotunplug/offlining path
(offline_pages()). However, if triggered by a user (echo 0 >
/sys/devices/system/memory/memoryX/online), a user can stop trying by
sending a signal.
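
A minimal user-space sketch of such an interruptible retry loop follows.
It is not the offline_pages() code: struct fake_range, try_isolate() and
retry_until_signal() are hypothetical names, and the stop_requested flag
only stands in for a kernel-side pending-signal check. A negative
pins_remaining models a long-term pin that never goes away.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>

/* Set from the signal handler; stands in for a pending-signal check. */
static volatile sig_atomic_t stop_requested;

static void on_signal(int signo)
{
	(void)signo;
	stop_requested = 1;
}

/* Hypothetical model of a memory range with pinned pages. */
struct fake_range {
	int pins_remaining;	/* > 0: short-term pin, < 0: pinned forever */
};

/* One attempt: -EBUSY while the range is still pinned, 0 once it is free. */
static int try_isolate(struct fake_range *r)
{
	if (r->pins_remaining < 0)
		return -EBUSY;
	if (r->pins_remaining > 0) {
		r->pins_remaining--;
		return -EBUSY;
	}
	return 0;
}

/* Retry forever, but let the caller bail out by sending a signal. */
static int retry_until_signal(struct fake_range *r, long *attempts)
{
	assert(r && attempts);
	*attempts = 0;
	for (;;) {
		if (stop_requested)
			return -EINTR;
		(*attempts)++;
		if (try_isolate(r) != -EBUSY)
			return 0;
		/* Real code would sleep or reschedule here before retrying. */
	}
}
```

With a short-term pin the loop simply succeeds after a few attempts. With
a long-term pin it would spin until a handler such as on_signal() runs,
at which point the caller gets -EINTR back instead of being stuck.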

If we want to stick to retrying forever, can't we use flags like
__GFP_NOFAIL to explicitly enable this new behavior for selected
cma_alloc() users that really can't fail/retry manually again?
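
As a rough user-space sketch of what such an opt-in could look like
(assumptions throughout: SKETCH_NOFAIL is a made-up stand-in and not the
kernel's __GFP_NOFAIL, and struct fake_cma, scan_region() and
cma_alloc_sketch() are hypothetical, not the cma_alloc() interface):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical opt-in flag; a stand-in, not the kernel's __GFP_NOFAIL. */
#define SKETCH_NOFAIL 0x1u

/* Hypothetical model of a CMA region. */
struct fake_cma {
	int busy_scans;	/* full bitmap scans that still hit a pinned page */
	int scans_done;	/* full bitmap scans performed so far */
};

/* One full scan of the region: -EBUSY while pages are pinned, else 0. */
static int scan_region(struct fake_cma *cma)
{
	cma->scans_done++;
	if (cma->busy_scans > 0) {
		cma->busy_scans--;
		return -EBUSY;
	}
	return 0;
}

/*
 * Default callers keep the current behavior and see -EBUSY after one
 * full scan. Only callers that pass SKETCH_NOFAIL get the retry loop.
 */
static int cma_alloc_sketch(struct fake_cma *cma, unsigned int flags)
{
	assert(cma);
	for (;;) {
		int ret = scan_region(cma);

		if (ret != -EBUSY)
			return ret;
		if (!(flags & SKETCH_NOFAIL))
			return ret;
		/* The patch under discussion sleeps for 100ms at this point. */
	}
}
```

A caller that can handle failure passes no flag and gets -EBUSY back
right away. A caller that really cannot fail passes the flag and accepts
that it may loop for as long as the pages stay pinned.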

-- 
Thanks,

David / dhildenb