All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm: fix page leak with multiple threads mapping the same page
@ 2022-07-05 20:00 Josef Bacik
  2022-07-06 22:46 ` Kirill A. Shutemov
  0 siblings, 1 reply; 5+ messages in thread
From: Josef Bacik @ 2022-07-05 20:00 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Kirill A . Shutemov, Matthew Wilcox, Rik van Riel, Chris Mason

We have an application with a lot of threads that use a shared mmap
backed by tmpfs mounted with -o huge=within_size.  This application
started leaking loads of huge pages when we upgraded to a recent kernel.

Using the page ref tracepoints and a BPF program written by Tejun Heo we
were able to determine that these pages would have multiple refcounts
from the page fault path, but when it came to unmap time we wouldn't
drop the number of refs we had added from the faults.

I wrote a reproducer that mmap'ed a file backed by tmpfs with -o
huge=always, and then spawned 20 threads all looping faulting random
offsets in this map, while using madvise(MADV_DONTNEED) randomly for
huge page aligned ranges.  This very quickly reproduced the problem.

The problem here is that we check for the case that we have multiple
threads faulting in a range that was previously unmapped.  One thread
maps the PMD, the other thread loses the race and then returns 0.
However at this point we already have the page, and we are no longer
putting this page into the processes address space, and so we leak the
page.  We actually did the correct thing prior to f9ce0be71d1f, however
it looks like Kirill copied what we do in the anonymous page case.  In
the anonymous page case we don't yet have a page, so we don't have to
drop a reference on anything.  Previously we did the correct thing for
file based faults by returning VM_FAULT_NOPAGE so we correctly drop the
reference on the page we faulted in.

Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable()
case, this makes us drop the ref on the page properly, and now my
reproducer no longer leaks the huge pages.

Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Chris Mason <clm@fb.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7a089145cad4..f10724d7dca3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4371,7 +4371,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 
 	/* See comment in handle_pte_fault() */
 	if (pmd_devmap_trans_unstable(vmf->pmd))
-		return 0;
+		return VM_FAULT_NOPAGE;
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 				      vmf->address, &vmf->ptl);
-- 
2.36.1



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: fix page leak with multiple threads mapping the same page
  2022-07-05 20:00 [PATCH] mm: fix page leak with multiple threads mapping the same page Josef Bacik
@ 2022-07-06 22:46 ` Kirill A. Shutemov
  2022-07-07  0:42   ` Rik van Riel
  0 siblings, 1 reply; 5+ messages in thread
From: Kirill A. Shutemov @ 2022-07-06 22:46 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-mm, akpm, Matthew Wilcox, Rik van Riel, Chris Mason

On Tue, Jul 05, 2022 at 04:00:36PM -0400, Josef Bacik wrote:
> We have an application with a lot of threads that use a shared mmap
> backed by tmpfs mounted with -o huge=within_size.  This application
> started leaking loads of huge pages when we upgraded to a recent kernel.
> 
> Using the page ref tracepoints and a BPF program written by Tejun Heo we
> were able to determine that these pages would have multiple refcounts
> from the page fault path, but when it came to unmap time we wouldn't
> drop the number of refs we had added from the faults.
> 
> I wrote a reproducer that mmap'ed a file backed by tmpfs with -o
> huge=always, and then spawned 20 threads all looping faulting random
> offsets in this map, while using madvise(MADV_DONTNEED) randomly for
> huge page aligned ranges.  This very quickly reproduced the problem.
> 
> The problem here is that we check for the case that we have multiple
> threads faulting in a range that was previously unmapped.  One thread
> maps the PMD, the other thread loses the race and then returns 0.
> However at this point we already have the page, and we are no longer
> putting this page into the processes address space, and so we leak the
> page.  We actually did the correct thing prior to f9ce0be71d1f, however
> it looks like Kirill copied what we do in the anonymous page case.  In
> the anonymous page case we don't yet have a page, so we don't have to
> drop a reference on anything.  Previously we did the correct thing for
> file based faults by returning VM_FAULT_NOPAGE so we correctly drop the
> reference on the page we faulted in.
> 
> Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable()
> case, this makes us drop the ref on the page properly, and now my
> reproducer no longer leaks the huge pages.
> 
> Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Chris Mason <clm@fb.com>

Cc: stable@ ?

> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a089145cad4..f10724d7dca3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4371,7 +4371,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  
>  	/* See comment in handle_pte_fault() */
>  	if (pmd_devmap_trans_unstable(vmf->pmd))
> -		return 0;
> +		return VM_FAULT_NOPAGE;

Comment update would be nice.

Other instances of pmd_devmap_trans_unstable() return 0 in the fault path.
Explanation would be helpful.

Otherwise,

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: fix page leak with multiple threads mapping the same page
  2022-07-06 22:46 ` Kirill A. Shutemov
@ 2022-07-07  0:42   ` Rik van Riel
  2022-07-08  0:58     ` Andrew Morton
  0 siblings, 1 reply; 5+ messages in thread
From: Rik van Riel @ 2022-07-07  0:42 UTC (permalink / raw)
  To: Kirill A. Shutemov, Josef Bacik
  Cc: linux-mm, akpm, Matthew Wilcox, Chris Mason

[-- Attachment #1: Type: text/plain, Size: 1602 bytes --]

On Thu, 2022-07-07 at 01:46 +0300, Kirill A. Shutemov wrote:
> On Tue, Jul 05, 2022 at 04:00:36PM -0400, Josef Bacik wrote:
> > 
> > Fix this by returning VM_FAULT_NOPAGE in the
> > pmd_devmap_trans_unstable()
> > case, this makes us drop the ref on the page properly, and now my
> > reproducer no longer leaks the huge pages.
> > 
> > Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
> > codepaths")
> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > Signed-off-by: Chris Mason <clm@fb.com>
> 
> Cc: stable@ 

Yes, it should.  How do we send a patch to stable@
after the start of the thread?


> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4371,7 +4371,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> >  
> >         /* See comment in handle_pte_fault() */
> >         if (pmd_devmap_trans_unstable(vmf->pmd))
> > -               return 0;
> > +               return VM_FAULT_NOPAGE;
> 
> Comment update would be nice.
> 
> Other instances of pmd_devmap_trans_unstable() return 0 in the fault
> path.
> Explanation would be helpful.
> 
The explanation is that by the time we get to
finish_fault, we already have a reference on a
page, and we need to ensure that reference
gets released by the caller.

VM_FAULT_NOPAGE is one of the ways to indicate
that the page should be freed.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: fix page leak with multiple threads mapping the same page
  2022-07-07  0:42   ` Rik van Riel
@ 2022-07-08  0:58     ` Andrew Morton
  2022-07-15 15:21       ` Rik van Riel
  0 siblings, 1 reply; 5+ messages in thread
From: Andrew Morton @ 2022-07-08  0:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kirill A. Shutemov, Josef Bacik, linux-mm, Matthew Wilcox, Chris Mason

On Wed, 06 Jul 2022 20:42:56 -0400 Rik van Riel <riel@surriel.com> wrote:

> On Thu, 2022-07-07 at 01:46 +0300, Kirill A. Shutemov wrote:
> > On Tue, Jul 05, 2022 at 04:00:36PM -0400, Josef Bacik wrote:
> > > 
> > > Fix this by returning VM_FAULT_NOPAGE in the
> > > pmd_devmap_trans_unstable()
> > > case, this makes us drop the ref on the page properly, and now my
> > > reproducer no longer leaks the huge pages.
> > > 
> > > Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
> > > codepaths")
> > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > > Signed-off-by: Chris Mason <clm@fb.com>
> > 
> > Cc: stable@ 
> 
> Yes, it should.  How do we send a patch to stable@
> after the start of the thread?

cc'ing the stable list doesn't have much affect, afaik.

What Kirill means is that we should add cc:stable to this patch's
changelog.  I did that yesterday.

> 
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4371,7 +4371,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > >  
> > >         /* See comment in handle_pte_fault() */
> > >         if (pmd_devmap_trans_unstable(vmf->pmd))
> > > -               return 0;
> > > +               return VM_FAULT_NOPAGE;
> > 
> > Comment update would be nice.
> > 
> > Other instances of pmd_devmap_trans_unstable() return 0 in the fault
> > path.
> > Explanation would be helpful.
> > 
> The explanation is that by the time we get to
> finish_fault, we already have a reference on a
> page, and we need to ensure that reference
> gets released by the caller.
> 
> VM_FAULT_NOPAGE is one of the ways to indicate
> that the page should be freed.

I think Kirill meant this should be added to the code comment!


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: fix page leak with multiple threads mapping the same page
  2022-07-08  0:58     ` Andrew Morton
@ 2022-07-15 15:21       ` Rik van Riel
  0 siblings, 0 replies; 5+ messages in thread
From: Rik van Riel @ 2022-07-15 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Josef Bacik, linux-mm, Matthew Wilcox, Chris Mason

[-- Attachment #1: Type: text/plain, Size: 1378 bytes --]

On Thu, 2022-07-07 at 17:58 -0700, Andrew Morton wrote:
> On Wed, 06 Jul 2022 20:42:56 -0400 Rik van Riel <riel@surriel.com>
> wrote:
> > > 
> > The explanation is that by the time we get to
> > finish_fault, we already have a reference on a
> > page, and we need to ensure that reference
> > gets released by the caller.
> > 
> > VM_FAULT_NOPAGE is one of the ways to indicate
> > that the page should be freed.
> 
> I think Kirill meant this should be added to the code comment!

OK, I finally got around to looking at that today, and
it seems like the comment at the top of finish_fault()
already has everything I wanted to write down. Is there
anything else we should add here?

/**
 * finish_fault - finish page fault once we have prepared the page to
fault
 *
 * @vmf: structure describing the fault
 *
 * This function handles all that is needed to finish a page fault once
the
 * page to fault in is prepared. It handles locking of PTEs, inserts
PTE for
 * given page, adds reverse page mapping, handles memcg charges and LRU
 * addition.
 *
 * The function expects the page to be locked and on success it
consumes a
 * reference of a page being mapped (for the PTE which maps it).
 *
 * Return: %0 on success, %VM_FAULT_ code in case of error.
 */
vm_fault_t finish_fault(struct vm_fault *vmf)
{



-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-07-15 15:21 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-05 20:00 [PATCH] mm: fix page leak with multiple threads mapping the same page Josef Bacik
2022-07-06 22:46 ` Kirill A. Shutemov
2022-07-07  0:42   ` Rik van Riel
2022-07-08  0:58     ` Andrew Morton
2022-07-15 15:21       ` Rik van Riel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.