* Re: cma_alloc(), add sleep-and-retry for temporary page pinning
@ 2020-08-11 22:20 cgoldswo
  2020-08-21  4:17 ` cgoldswo
  2020-08-21 22:01 ` Andrew Morton
  0 siblings, 2 replies; 6+ messages in thread
From: cgoldswo @ 2020-08-11 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-arm-msm, linux-kernel, pratikp, pdaly, sudraja,
	iamjoonsoo.kim, linux-arm-msm-owner

On 2020-08-06 18:31, Andrew Morton wrote:
> On Wed,  5 Aug 2020 19:56:21 -0700 Chris Goldsworthy
> <cgoldswo@codeaurora.org> wrote:
> 
>> On mobile devices, failure to allocate from a CMA area constitutes a
>> functional failure.  Sometimes during CMA allocations, we have observed
>> that pages in a CMA area allocated through alloc_pages(), that we're
>> trying to migrate away to make room for a CMA allocation, are
>> temporarily pinned.  This temporary pinning can occur when a process
>> that owns the pinned page is being forked (the example is explained
>> further in the commit text).  This patch addresses this issue by adding
>> a sleep-and-retry loop in cma_alloc().  There's another example we know
>> of similar to the above that occurs during exit_mmap() (in
>> zap_pte_range() specifically), but I need to determine if this is still
>> relevant today.
> 

> Sounds fairly serious but boy, we're late for 5.9.
> 
> I can queue it for 5.10 with a cc:stable so that it gets backported
> into earlier kernels a couple of months from now, if we think the
> seriousness justifies backporting(?).
> 

Queuing this seems like the best way to proceed, if we pick up this
patch.  I think we can forgo backporting it, as it will only be needed
once vendors such as ourselves start using Google's Generic Kernel
Image (we've carried this patch in our tree for over four years).

> 
> And...  it really is a sad little patch, isn't it?  Instead of fixing
> the problem, it reduces the problem's probability by 5x.  Can't we do
> better than this?

I have one alternative in mind.  I have been able to review the
exit_mmap() case, so before proceeding, let's break the problem down:
the pinning issue we're trying to address here is one of
(1) incrementing _refcount and getting context-switched out before
incrementing _mapcount (applies to forking a process / copy_one_pte()),
or
(2) decrementing _mapcount and getting context-switched out before
decrementing _refcount (applies to tearing down a process /
exit_mmap()).
One alternative would be to insert preempt_disable/enable() calls at
the affected sites.  For the copy_one_pte() pinning case, we could do
the following inside of copy_one_pte():

         if (page) {
+               preempt_disable();
                 get_page(page);
                 page_dup_rmap(page, false);
+               preempt_enable();
                 rss[mm_counter(page)]++;
         }
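
To see why this window matters: migration treats a page as pinned when
its reference count exceeds what its mapping state accounts for.  A
rough sketch of that invariant - the helper name is hypothetical, and
the "+ 1" glosses over extra references such as the swap cache:

         /*
          * Hypothetical illustration (assumes the page_count() /
          * page_mapcount() helpers from linux/mm.h).  For a plain
          * anonymous page, migration roughly expects
          * page_count(page) == page_mapcount(page) + 1.  Between
          * get_page() and page_dup_rmap() above, page_count() runs one
          * higher than that, so alloc_contig_range() sees an
          * unexpected pin and fails with -EBUSY.
          */
         static inline bool looks_transiently_pinned(struct page *page)
         {
                 return page_count(page) > page_mapcount(page) + 1;
         }

If the task is context-switched out between the two operations, the page
stays in this state for an entire scheduling window - that is the window
the preempt_disable/enable() pair above closes.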

I'm not sure if this approach would be acceptable for the exit_mmap()
pinning case (applicable when CONFIG_MMU_GATHER_NO_GATHER=y).  For the
purposes of this discussion, we can look at two function calls inside
of exit_mmap(), in the order they're called, to show how the pinning
occurs:

     1. Calling unmap_vmas(): this unmaps the pages in each VMA for an
     exiting task, using zap_pte_range() - zap_pte_range() reduces the
     _mapcount for each page in a VMA, using page_remove_rmap().  After
     calling page_remove_rmap(), the page is placed into a list in
     __tlb_remove_page().  This list of pages will be used when flushing
     TLB entries later on during the process teardown.

     2. Calling tlb_finish_mmu(): this will flush the TLB entries
     associated with the pages, before calling put_page() on them, using
     the pages previously collected by __tlb_remove_page() - the call
     flow is tlb_finish_mmu() > tlb_flush_mmu() > tlb_flush_mmu_free()
     > tlb_batch_pages_flush() > free_pages_and_swap_cache() >
     release_pages(), where release_pages() is described as a "batched
     put_page()".

The preempt_disable/enable() approach would entail doing the following
inside of exit_mmap():

+       preempt_disable();
         unmap_vmas(&tlb, vma, 0, -1);
         free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
         tlb_finish_mmu(&tlb, 0, -1);
+       preempt_enable();

I'm not sure doing this is feasible, given how long it could take to do
the process teardown.

The good thing about this patch is that it has been stable in our kernel
for four years (though for some SoCs we increased the retry counts).  One
thing to stress is that there are other instances of CMA page pinning
that this patch isn't attempting to address.  Please let me know if
you're okay with queuing this for the 5.10 merge window - if you are, I
can add an option to configure the number of retries, and will resend
the patch once the 5.9 merge window closes.

Thanks,

Chris.

-- 
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


* Re: cma_alloc(), add sleep-and-retry for temporary page pinning
  2020-08-11 22:20 cma_alloc(), add sleep-and-retry for temporary page pinning cgoldswo
@ 2020-08-21  4:17 ` cgoldswo
  2020-08-21 22:01 ` Andrew Morton
  1 sibling, 0 replies; 6+ messages in thread
From: cgoldswo @ 2020-08-21  4:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-arm-msm, linux-kernel, pratikp, pdaly, sudraja,
	iamjoonsoo.kim, linux-arm-msm-owner

On 2020-08-11 15:20, cgoldswo@codeaurora.org wrote:
> On 2020-08-06 18:31, Andrew Morton wrote:
>> On Wed,  5 Aug 2020 19:56:21 -0700 Chris Goldsworthy
>> <cgoldswo@codeaurora.org> wrote:
>> 
>>> On mobile devices, failure to allocate from a CMA area constitutes a
>>> functional failure.  Sometimes during CMA allocations, we have observed
>>> that pages in a CMA area allocated through alloc_pages(), that we're
>>> trying to migrate away to make room for a CMA allocation, are
>>> temporarily pinned.  This temporary pinning can occur when a process
>>> that owns the pinned page is being forked (the example is explained
>>> further in the commit text).  This patch addresses this issue by adding
>>> a sleep-and-retry loop in cma_alloc().  There's another example we know
>>> of similar to the above that occurs during exit_mmap() (in
>>> zap_pte_range() specifically), but I need to determine if this is still
>>> relevant today.
>> 
> 
>> Sounds fairly serious but boy, we're late for 5.9.
>> 
>> I can queue it for 5.10 with a cc:stable so that it gets backported
>> into earlier kernels a couple of months from now, if we think the
>> seriousness justifies backporting(?).
>> 
> 
> Queuing this seems like the best way to proceed, if we pick up this
> patch.  I think we can forgo backporting it, as it will only be needed
> once vendors such as ourselves start using Google's Generic Kernel
> Image (we've carried this patch in our tree for over four years).
> 
>> 
>> And...  it really is a sad little patch, isn't it?  Instead of fixing
>> the problem, it reduces the problem's probability by 5x.  Can't we do
>> better than this?
> 
> I have one alternative in mind.  I have been able to review the
> exit_mmap() case, so before proceeding, let's break the problem down:
> the pinning issue we're trying to address here is one of
> (1) incrementing _refcount and getting context-switched out before
> incrementing _mapcount (applies to forking a process / copy_one_pte()),
> or
> (2) decrementing _mapcount and getting context-switched out before
> decrementing _refcount (applies to tearing down a process /
> exit_mmap()).
> One alternative would be to insert preempt_disable/enable() calls at
> the affected sites.  For the copy_one_pte() pinning case, we could do
> the following inside of copy_one_pte():
> 
>         if (page) {
> +               preempt_disable();
>                 get_page(page);
>                 page_dup_rmap(page, false);
> +               preempt_enable();
>                 rss[mm_counter(page)]++;
>         }
> 
> I'm not sure if this approach would be acceptable for the exit_mmap()
> pinning case (applicable when CONFIG_MMU_GATHER_NO_GATHER=y).  For the
> purposes of this discussion, we can look at two function calls inside
> of exit_mmap(), in the order they're called, to show how the pinning
> occurs:
> 
>     1. Calling unmap_vmas(): this unmaps the pages in each VMA for an
>     exiting task, using zap_pte_range() - zap_pte_range() reduces the
>     _mapcount for each page in a VMA, using page_remove_rmap().  After
>     calling page_remove_rmap(), the page is placed into a list in
>     __tlb_remove_page().  This list of pages will be used when flushing
>     TLB entries later on during the process teardown.
> 
>     2. Calling tlb_finish_mmu(): this will flush the TLB entries
>     associated with the pages, before calling put_page() on them, using
>     the pages previously collected by __tlb_remove_page() - the call
>     flow is tlb_finish_mmu() > tlb_flush_mmu() > tlb_flush_mmu_free()
>     > tlb_batch_pages_flush() > free_pages_and_swap_cache() >
>     release_pages(), where release_pages() is described as a "batched
>     put_page()".
> 
> The preempt_disable/enable() approach would entail doing the following
> inside of exit_mmap():
> 
> +       preempt_disable();
>         unmap_vmas(&tlb, vma, 0, -1);
>         free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>         tlb_finish_mmu(&tlb, 0, -1);
> +       preempt_enable();
> 
> I'm not sure doing this is feasible, given how long it could take to do
> the process teardown.
> 
> The good thing about this patch is that it has been stable in our
> kernel for four years (though for some SoCs we increased the retry
> counts).  One thing to stress is that there are other instances of CMA
> page pinning that this patch isn't attempting to address.  Please let
> me know if you're okay with queuing this for the 5.10 merge window -
> if you are, I can add an option to configure the number of retries,
> and will resend the patch once the 5.9 merge window closes.
> 
> Thanks,
> 
> Chris.

Hi Andrew,

Have you been able to give the patch any further consideration?

Thanks,

Chris.

-- 
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


* Re: cma_alloc(), add sleep-and-retry for temporary page pinning
  2020-08-11 22:20 cma_alloc(), add sleep-and-retry for temporary page pinning cgoldswo
  2020-08-21  4:17 ` cgoldswo
@ 2020-08-21 22:01 ` Andrew Morton
  2020-09-11 18:39   ` Chris Goldsworthy
  1 sibling, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2020-08-21 22:01 UTC (permalink / raw)
  To: cgoldswo
  Cc: linux-mm, linux-arm-msm, linux-kernel, pratikp, pdaly, sudraja,
	iamjoonsoo.kim, linux-arm-msm-owner

On Tue, 11 Aug 2020 15:20:47 -0700 cgoldswo@codeaurora.org wrote:

> On 2020-08-06 18:31, Andrew Morton wrote:
> > On Wed,  5 Aug 2020 19:56:21 -0700 Chris Goldsworthy
> > <cgoldswo@codeaurora.org> wrote:
> > 
> >> On mobile devices, failure to allocate from a CMA area constitutes a
> >> functional failure.  Sometimes during CMA allocations, we have observed
> >> that pages in a CMA area allocated through alloc_pages(), that we're
> >> trying to migrate away to make room for a CMA allocation, are
> >> temporarily pinned.  This temporary pinning can occur when a process
> >> that owns the pinned page is being forked (the example is explained
> >> further in the commit text).  This patch addresses this issue by adding
> >> a sleep-and-retry loop in cma_alloc().  There's another example we know
> >> of similar to the above that occurs during exit_mmap() (in
> >> zap_pte_range() specifically), but I need to determine if this is still
> >> relevant today.
> > 
> 
> > Sounds fairly serious but boy, we're late for 5.9.
> > 
> > I can queue it for 5.10 with a cc:stable so that it gets backported
> > into earlier kernels a couple of months from now, if we think the
> > seriousness justifies backporting(?).
> > 
> 
> Queuing this seems like the best way to proceed

I'd really prefer not.  It's very hacky and it isn't a fix - it's a
likelihood-reducer.

> > 
> > And...  it really is a sad little patch, isn't it?  Instead of fixing
> > the problem, it reduces the problem's probability by 5x.  Can't we do
> > better than this?
> 
> I have one alternative in mind.  I have been able to review the
> exit_mmap() case, so before proceeding, let's break the problem down:
> the pinning issue we're trying to address here is one of
> (1) incrementing _refcount and getting context-switched out before
> incrementing _mapcount (applies to forking a process / copy_one_pte()),
> or
> (2) decrementing _mapcount and getting context-switched out before
> decrementing _refcount (applies to tearing down a process /
> exit_mmap()).
> One alternative would be to insert preempt_disable/enable() calls at
> the affected sites.  For the copy_one_pte() pinning case, we could do
> the following inside of copy_one_pte():
> 
>          if (page) {
> +               preempt_disable();
>                  get_page(page);
>                  page_dup_rmap(page, false);
> +               preempt_enable();
>                  rss[mm_counter(page)]++;
>          }

This would make the retry loop much more reliable, and
preempt_disable() is fast.  Such additions should be clearly commented
(a bare preempt_disable() explains nothing), perhaps by wrapping the
preempt_disable() in a suitably-named wrapper function.

But it's still really unpleasant.
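
For concreteness, a minimal sketch of such a wrapper - the names here
are hypothetical, not an existing kernel API:

         /*
          * Prevent this task from being scheduled out between paired
          * _refcount/_mapcount updates, so that a page never sits in
          * the "transiently pinned" state for longer than a few
          * instructions.
          */
         static inline void page_pin_window_begin(void)
         {
                 preempt_disable();
         }

         static inline void page_pin_window_end(void)
         {
                 preempt_enable();
         }

copy_one_pte() would then bracket get_page() and page_dup_rmap() with
these, documenting the intent at every affected site.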

> 
> The good thing about this patch is that it has been stable in our kernel
> for four years (though for some SoCs we increased the retry counts).  

That's discouraging!

> One thing to stress is that there are other instances of CMA page
> pinning that this patch isn't attempting to address.

Oh.  How severe are these?

> Please let me know if you're okay with queuing this for the 5.10
> merge window - if you are, I can add an option to configure the
> number of retries, and will resend the patch once the 5.9 merge
> window closes.

Well.  Why not wait infinitely?  Because there are other sources of CMA
page pinning, I guess.


Could we take a sleeping lock on the exit_mmap() path and on the
migration path?  So that the migration path automatically waits for
the exact amount of time?
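
A purely illustrative sketch of the shape of that idea - the lock and
its placement are assumptions, and a single global mutex would clearly
be too coarse in practice:

         static DEFINE_MUTEX(pin_window_lock);   /* hypothetical */

         /* exit_mmap() path: hold the lock across the teardown window */
         mutex_lock(&pin_window_lock);
         unmap_vmas(&tlb, vma, 0, -1);
         free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
         tlb_finish_mmu(&tlb, 0, -1);
         mutex_unlock(&pin_window_lock);

         /* migration path: sleep exactly as long as any in-flight
          * teardown, then proceed */
         mutex_lock(&pin_window_lock);
         mutex_unlock(&pin_window_lock);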


* Re: cma_alloc(), add sleep-and-retry for temporary page pinning
  2020-08-21 22:01 ` Andrew Morton
@ 2020-09-11 18:39   ` Chris Goldsworthy
  0 siblings, 0 replies; 6+ messages in thread
From: Chris Goldsworthy @ 2020-09-11 18:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-arm-msm, linux-kernel, pratikp, pdaly, sudraja,
	iamjoonsoo.kim, linux-arm-msm-owner

On 2020-08-21 15:01, Andrew Morton wrote:
> On Tue, 11 Aug 2020 15:20:47 -0700 cgoldswo@codeaurora.org wrote:
> 
>> One
>> thing to stress is that there are other instances of CMA page pinning,
>> that
>> this patch isn't attempting to address.
> 
> Oh.  How severe are these?

Hey Andrew,

   - get_user_pages() will pin pages (I think that's a 100% guarantee,
but I'd need to double-check that).  There isn't a workaround for this
as far as I know.
   - One issue we've encountered is when the pages for buffer heads are
stuck on an LRU list (see the call to buffer_busy() in drop_buffers():
https://elixir.bootlin.com/linux/v5.8.8/source/fs/buffer.c#L3225).  We
deal with this by allowing the buffers to be dropped if the only reason
buffer_busy() returns true is that the page is on an LRU list.

> Well.  Why not wait infinitely?  Because there are other sources of CMA
> page pinning, I guess.
> 
> Could we take a sleeping lock on the exit_mmap() path and on the
> migration path?  So that the migration path automatically waits for
> the exact amount of time?

Retrying indefinitely whilst alloc_contig_range() returns -EBUSY is
actually a viable approach, and in terms of how long a CMA allocation
takes, it is preferable to the lock-based method.  With our current
method, we attempt allocations across an entire CMA region before
sleeping and retrying.  With the lock-based method, we'd be sleeping on
a per-page basis - this could lead to a situation where we spend a
great deal of time waiting for a particular page A to be freed that
lies in the subset of the CMA region we're allocating from.  We could
instead have given up on allocating that subset of the CMA region
(because page A is pinned) and moved on to a different subset of the
region, allocating it successfully whilst page A is still pinned.
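
As a sketch of that shape (illustrative only -
find_next_free_cma_range() is a hypothetical stand-in for the existing
bitmap scan in cma_alloc()):

         for (;;) {
                 /* hypothetical helper standing in for the existing
                  * bitmap scan over the CMA area */
                 pfn = find_next_free_cma_range(cma, &start, count);
                 if (!pfn) {
                         /* whole area scanned without success: the pins
                          * are temporary, so start over instead of
                          * failing the allocation */
                         start = 0;
                         continue;
                 }
                 ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
                                          GFP_KERNEL);
                 if (ret != -EBUSY)
                         break;  /* success, or a hard failure */
                 /* -EBUSY: a pinned page defeated migration of this
                  * range; move on to the next free range rather than
                  * waiting here */
         }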

I have a patch ready that does this indefinite retrying; it will be
sent in reply to this e-mail.

Regards,

Chris.

--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


* Re: cma_alloc(), add sleep-and-retry for temporary page pinning
  2020-08-06  2:56 Chris Goldsworthy
@ 2020-08-07  1:31 ` Andrew Morton
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2020-08-07  1:31 UTC (permalink / raw)
  To: Chris Goldsworthy
  Cc: linux-mm, linux-arm-msm, linux-kernel, pratikp, pdaly, sudraja,
	iamjoonsoo.kim

On Wed,  5 Aug 2020 19:56:21 -0700 Chris Goldsworthy <cgoldswo@codeaurora.org> wrote:

> On mobile devices, failure to allocate from a CMA area constitutes a
> functional failure.  Sometimes during CMA allocations, we have observed
> that pages in a CMA area allocated through alloc_pages(), that we're trying
> to migrate away to make room for a CMA allocation, are temporarily pinned.
> This temporary pinning can occur when a process that owns the pinned page
> is being forked (the example is explained further in the commit text).
> This patch addresses this issue by adding a sleep-and-retry loop in
> cma_alloc().  There's another example we know of similar to the above that
> occurs during exit_mmap() (in zap_pte_range() specifically), but I need to
> determine if this is still relevant today.

Sounds fairly serious but boy, we're late for 5.9.

I can queue it for 5.10 with a cc:stable so that it gets backported
into earlier kernels a couple of months from now, if we think the
seriousness justifies backporting(?). 

Or I can do something else - thoughts?

And...  it really is a sad little patch, isn't it?  Instead of fixing
the problem, it reduces the problem's probability by 5x.  Can't we do
better than this?




* cma_alloc(), add sleep-and-retry for temporary page pinning
@ 2020-08-06  2:56 Chris Goldsworthy
  2020-08-07  1:31 ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Chris Goldsworthy @ 2020-08-06  2:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-arm-msm, linux-kernel, pratikp, pdaly, sudraja,
	iamjoonsoo.kim

On mobile devices, failure to allocate from a CMA area constitutes a
functional failure.  Sometimes during CMA allocations, we have observed
that pages in a CMA area allocated through alloc_pages(), that we're trying
to migrate away to make room for a CMA allocation, are temporarily pinned.
This temporary pinning can occur when a process that owns the pinned page
is being forked (the example is explained further in the commit text).
This patch addresses this issue by adding a sleep-and-retry loop in
cma_alloc().  There's another example we know of similar to the above that
occurs during exit_mmap() (in zap_pte_range() specifically), but I need to
determine if this is still relevant today.
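
For reference, a minimal sketch of the shape of this change - the retry
count and sleep interval below are placeholders, not the patch's actual
values:

         /* mm/cma.c: inside cma_alloc(), illustrative only
          * (msleep() is from linux/delay.h) */
         int ret, retries = 0;

         for (;;) {
                 ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
                                          GFP_KERNEL);
                 if (ret != -EBUSY || ++retries > 5)  /* placeholder count */
                         break;
                 /*
                  * -EBUSY commonly means a page in this range was
                  * transiently pinned (e.g. a fork raising _refcount
                  * before _mapcount); back off briefly and retry.
                  */
                 msleep(100);  /* placeholder interval */
         }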


