linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
       [not found] ` <20170824060957.GA29811@dhcp22.suse.cz>
@ 2017-08-25 22:02   ` Nadav Amit
  2017-08-25 22:31     ` Mike Kravetz
  0 siblings, 1 reply; 14+ messages in thread
From: Nadav Amit @ 2017-08-25 22:02 UTC (permalink / raw)
  To: mike.kravetz, ebiggers
  Cc: Andrew Morton, Andrea Arcangeli, Dmitry Vyukov, Hugh Dickins,
	Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

Michal Hocko <mhocko@kernel.org> wrote:

> Hmm, I do not see this neither in linux-mm nor LKML. Strange
> 
> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>> From: Eric Biggers <ebiggers@google.com>
>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>> 
>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>> put_page() before unlock_page().  This was wrong because put_page() can
>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>> removed it from the memory mapping.  put_page() then rightfully complained
>> about freeing a locked page.
>> 
>> Fix this by moving the unlock_page() before put_page().

Quick grep shows that a similar flow (put_page() followed by an
unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
well?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-25 22:02   ` + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree Nadav Amit
@ 2017-08-25 22:31     ` Mike Kravetz
  2017-08-25 22:51       ` Nadav Amit
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Kravetz @ 2017-08-25 22:31 UTC (permalink / raw)
  To: Nadav Amit, ebiggers
  Cc: Andrew Morton, Andrea Arcangeli, Dmitry Vyukov, Hugh Dickins,
	Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

On 08/25/2017 03:02 PM, Nadav Amit wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> 
>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>
>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>> From: Eric Biggers <ebiggers@google.com>
>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>
>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>> put_page() before unlock_page().  This was wrong because put_page() can
>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>>> removed it from the memory mapping.  put_page() then rightfully complained
>>> about freeing a locked page.
>>>
>>> Fix this by moving the unlock_page() before put_page().
> 
> Quick grep shows that a similar flow (put_page() followed by an
> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
> well?

I assume you are asking about this block of code?

                /*
                 * page_put due to reference from alloc_huge_page()
                 * unlock_page because locked by add_to_page_cache()
                 */
                put_page(page);
                unlock_page(page);

Well, there is a typo (page_put) in the comment. :(

However, in this case we have just added the huge page to a hugetlbfs
file.  The put_page() is there just to drop the reference count on the
page (taken when allocated).  It will still be non-zero as we have
successfully added it to the page cache.  So, we are not freeing the
page here, just dropping the reference count.

This should not cause a problem like that seen in madvise.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-25 22:31     ` Mike Kravetz
@ 2017-08-25 22:51       ` Nadav Amit
  2017-08-25 23:41         ` Mike Kravetz
  0 siblings, 1 reply; 14+ messages in thread
From: Nadav Amit @ 2017-08-25 22:51 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: ebiggers, Andrew Morton, Andrea Arcangeli, Dmitry Vyukov,
	Hugh Dickins, Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

Mike Kravetz <mike.kravetz@oracle.com> wrote:

> On 08/25/2017 03:02 PM, Nadav Amit wrote:
>> Michal Hocko <mhocko@kernel.org> wrote:
>> 
>>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>> 
>>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>>> From: Eric Biggers <ebiggers@google.com>
>>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>> 
>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>>> put_page() before unlock_page().  This was wrong because put_page() can
>>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>>>> removed it from the memory mapping.  put_page() then rightfully complained
>>>> about freeing a locked page.
>>>> 
>>>> Fix this by moving the unlock_page() before put_page().
>> 
>> Quick grep shows that a similar flow (put_page() followed by an
>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
>> well?
> 
> I assume you are asking about this block of code?

Yes.

> 
>                /*
>                 * page_put due to reference from alloc_huge_page()
>                 * unlock_page because locked by add_to_page_cache()
>                 */
>                put_page(page);
>                unlock_page(page);
> 
> Well, there is a typo (page_put) in the comment. :(
> 
> However, in this case we have just added the huge page to a hugetlbfs
> file.  The put_page() is there just to drop the reference count on the
> page (taken when allocated).  It will still be non-zero as we have
> successfully added it to the page cache.  So, we are not freeing the
> page here, just dropping the reference count.
> 
> This should not cause a problem like that seen in madvise.

Thanks for the quick response.

I am not too familiar with this piece of code, so just for the matter of
understanding: what prevents the page from being removed from the page cache
shortly after it is added (even if it is highly unlikely)? The page lock? The
inode lock?

Thanks again,
Nadav

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-25 22:51       ` Nadav Amit
@ 2017-08-25 23:41         ` Mike Kravetz
  2017-08-26 21:09           ` Eric Biggers
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Kravetz @ 2017-08-25 23:41 UTC (permalink / raw)
  To: Nadav Amit
  Cc: ebiggers, Andrew Morton, Andrea Arcangeli, Dmitry Vyukov,
	Hugh Dickins, Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

On 08/25/2017 03:51 PM, Nadav Amit wrote:
> Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> On 08/25/2017 03:02 PM, Nadav Amit wrote:
>>> Michal Hocko <mhocko@kernel.org> wrote:
>>>
>>>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>>>
>>>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>>>> From: Eric Biggers <ebiggers@google.com>
>>>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>>>
>>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>>>> put_page() before unlock_page().  This was wrong because put_page() can
>>>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>>>>> removed it from the memory mapping.  put_page() then rightfully complained
>>>>> about freeing a locked page.
>>>>>
>>>>> Fix this by moving the unlock_page() before put_page().
>>>
>>> Quick grep shows that a similar flow (put_page() followed by an
>>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
>>> well?
>>
>> I assume you are asking about this block of code?
> 
> Yes.
> 
>>
>>                /*
>>                 * page_put due to reference from alloc_huge_page()
>>                 * unlock_page because locked by add_to_page_cache()
>>                 */
>>                put_page(page);
>>                unlock_page(page);
>>
>> Well, there is a typo (page_put) in the comment. :(
>>
>> However, in this case we have just added the huge page to a hugetlbfs
>> file.  The put_page() is there just to drop the reference count on the
>> page (taken when allocated).  It will still be non-zero as we have
>> successfully added it to the page cache.  So, we are not freeing the
>> page here, just dropping the reference count.
>>
>> This should not cause a problem like that seen in madvise.
> 
> Thanks for the quick response.
> 
> I am not too familiar with this piece of code, so just for the matter of
> understanding: what prevents the page from being removed from the page cache
> shortly after it is added (even if it is highly unlikely)? The page lock? The
> inode lock?

Someone would need to acquire the inode lock to remove the page.  This
is held until we exit the routine.  Also note that put_page for this
type of huge page almost always results in the page being put back
on a free list within the hugetlb(fs) subsystem.  It is not returned
to the 'normal' memory allocators for general use.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-26 21:09           ` Eric Biggers
@ 2017-08-26 19:11             ` Nadav Amit
  2017-08-27 17:15               ` Mike Kravetz
                                 ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Nadav Amit @ 2017-08-26 19:11 UTC (permalink / raw)
  To: Nadia Yvette Chambers
  Cc: linux-kernel, Nadav Amit, Nadav Amit, Eric Biggers, Mike Kravetz

hugetlfs_fallocate() currently performs put_page() before unlock_page().
This scenario opens a small time window, from the time the page is added
to the page cache, until it is unlocked, in which the page might be
removed from the page-cache by another core. If the page is removed
during this time windows, it might cause a memory corruption, as the
wrong page will be unlocked.

It is arguable whether this scenario can happen in a real system, and
there are several mitigating factors. The issue was found by code
inspection (actually grep), and not by actually triggering the flow.
Yet, since putting the page before unlocking is incorrect it should be
fixed, if only to prevent future breakage or someone copy-pasting this
code.

Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")

cc: Eric Biggers <ebiggers3@gmail.com>
cc: Mike Kravetz <mike.kravetz@oracle.com>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/hugetlbfs/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 28d2753be094..9475fee79cee 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
 		/*
-		 * page_put due to reference from alloc_huge_page()
 		 * unlock_page because locked by add_to_page_cache()
+		 * page_put due to reference from alloc_huge_page()
 		 */
-		put_page(page);
 		unlock_page(page);
+		put_page(page);
 	}
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-25 23:41         ` Mike Kravetz
@ 2017-08-26 21:09           ` Eric Biggers
  2017-08-26 19:11             ` [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate() Nadav Amit
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Biggers @ 2017-08-26 21:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Nadav Amit, ebiggers, Andrew Morton, Andrea Arcangeli,
	Dmitry Vyukov, Hugh Dickins, Minchan Kim, rientjes, stable,
	mm-commits, open list:MEMORY MANAGEMENT,
	Linux Kernel Mailing List, Michal Hocko, nyc

On Fri, Aug 25, 2017 at 04:41:36PM -0700, Mike Kravetz wrote:
> >>>>>
> >>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
> >>>>> put_page() before unlock_page().  This was wrong because put_page() can
> >>>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
> >>>>> removed it from the memory mapping.  put_page() then rightfully complained
> >>>>> about freeing a locked page.
> >>>>>
> >>>>> Fix this by moving the unlock_page() before put_page().
> >>>
> >>> Quick grep shows that a similar flow (put_page() followed by an
> >>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
> >>> well?
> >>
> >> I assume you are asking about this block of code?
> > 
> > Yes.
> > 
> >>
> >>                /*
> >>                 * page_put due to reference from alloc_huge_page()
> >>                 * unlock_page because locked by add_to_page_cache()
> >>                 */
> >>                put_page(page);
> >>                unlock_page(page);
> >>
> >> Well, there is a typo (page_put) in the comment. :(
> >>
> >> However, in this case we have just added the huge page to a hugetlbfs
> >> file.  The put_page() is there just to drop the reference count on the
> >> page (taken when allocated).  It will still be non-zero as we have
> >> successfully added it to the page cache.  So, we are not freeing the
> >> page here, just dropping the reference count.
> >>
> >> This should not cause a problem like that seen in madvise.
> > 
> > Thanks for the quick response.
> > 
> > I am not too familiar with this piece of code, so just for the matter of
> > understanding: what prevents the page from being removed from the page cache
> > shortly after it is added (even if it is highly unlikely)? The page lock? The
> > inode lock?
> 
> Someone would need to acquire the inode lock to remove the page.  This
> is held until we exit the routine.  Also note that put_page for this
> type of huge page almost always results in the page being put back
> on a free list within the hugetlb(fs) subsystem.  It is not returned
> to the 'normal' memory allocators for general use.
> 

I'm not sure about that.  What about sys_fadvise64(..., POSIX_FADV_DONTNEED)?
That removes pages from the page cache without taking the inode lock.  It won't
remove locked pages though, so presumably it is only the page lock that prevents
the race with hugetlbfs_fallocate().

But in any case, why not do the "obviously correct" thing --- unlock before put?

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-26 19:11             ` [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate() Nadav Amit
@ 2017-08-27 17:15               ` Mike Kravetz
  2017-08-27 20:08                 ` Nadav Amit
  2017-08-28 13:46               ` Michal Hocko
  2017-11-29  2:37               ` Eric Biggers
  2 siblings, 1 reply; 14+ messages in thread
From: Mike Kravetz @ 2017-08-27 17:15 UTC (permalink / raw)
  To: Nadav Amit, Nadia Yvette Chambers; +Cc: linux-kernel, Nadav Amit, Eric Biggers

On 08/26/2017 12:11 PM, Nadav Amit wrote:
> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> This scenario opens a small time window, from the time the page is added
> to the page cache, until it is unlocked, in which the page might be
> removed from the page-cache by another core. If the page is removed
> during this time windows, it might cause a memory corruption, as the
> wrong page will be unlocked.
> 
> It is arguable whether this scenario can happen in a real system, and
> there are several mitigating factors. The issue was found by code
> inspection (actually grep), and not by actually triggering the flow.
> Yet, since putting the page before unlocking is incorrect it should be
> fixed, if only to prevent future breakage or someone copy-pasting this
> code.
> 
> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> 
> cc: Eric Biggers <ebiggers3@gmail.com>
> cc: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>

Thank you Nadav.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Since hugetlbfs is an in memory filesystem, the only way one 'should' be
able to remove a page (file content) is through an inode operation such as
truncate, hole punch, or unlink.  That was the basis for my response that
the inode lock would be required for page freeing.

Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
I was expecting to see a check for hugetlbfs pages and exit (without
modification) if encountered.  A quick review of the code did not find
any such checks.

I'll take a closer look to determine exactly how hugetlbfs files are
handled.  IMO, there should be something similar to the DAX check where
the routine quickly exits.
-- 
Mike Kravetz

> ---
>  fs/hugetlbfs/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 28d2753be094..9475fee79cee 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  
>  		/*
> -		 * page_put due to reference from alloc_huge_page()
>  		 * unlock_page because locked by add_to_page_cache()
> +		 * page_put due to reference from alloc_huge_page()
>  		 */
> -		put_page(page);
>  		unlock_page(page);
> +		put_page(page);
>  	}
>  
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-27 17:15               ` Mike Kravetz
@ 2017-08-27 20:08                 ` Nadav Amit
  2017-08-28 17:45                   ` Mike Kravetz
  0 siblings, 1 reply; 14+ messages in thread
From: Nadav Amit @ 2017-08-27 20:08 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Nadia Yvette Chambers, Linux Kernel Mailing List, Eric Biggers

Mike Kravetz <mike.kravetz@oracle.com> wrote:

> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>> This scenario opens a small time window, from the time the page is added
>> to the page cache, until it is unlocked, in which the page might be
>> removed from the page-cache by another core. If the page is removed
>> during this time windows, it might cause a memory corruption, as the
>> wrong page will be unlocked.
>> 
>> It is arguable whether this scenario can happen in a real system, and
>> there are several mitigating factors. The issue was found by code
>> inspection (actually grep), and not by actually triggering the flow.
>> Yet, since putting the page before unlocking is incorrect it should be
>> fixed, if only to prevent future breakage or someone copy-pasting this
>> code.
>> 
>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>> 
>> cc: Eric Biggers <ebiggers3@gmail.com>
>> cc: Mike Kravetz <mike.kravetz@oracle.com>
>> 
>> Signed-off-by: Nadav Amit <namit@vmware.com>
> 
> Thank you Nadav.

No problem.

> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
> able to remove a page (file content) is through an inode operation such as
> truncate, hole punch, or unlink.  That was the basis for my response that
> the inode lock would be required for page freeing.
> 
> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
> I was expecting to see a check for hugetlbfs pages and exit (without
> modification) if encountered.  A quick review of the code did not find
> any such checks.
> 
> I'll take a closer look to determine exactly how hugetlbfs files are
> handled.  IMO, there should be something similar to the DAX check where
> the routine quickly exits.

I did not cc stable when submitting the patch, based on your previous
response. Let me know if you want me to send v2 which does so.

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-26 19:11             ` [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate() Nadav Amit
  2017-08-27 17:15               ` Mike Kravetz
@ 2017-08-28 13:46               ` Michal Hocko
  2017-11-29  2:37               ` Eric Biggers
  2 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2017-08-28 13:46 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nadia Yvette Chambers, linux-kernel, Nadav Amit, Eric Biggers,
	Mike Kravetz, Andrew Morton

[CC Andrew]

On Sat 26-08-17 12:11:24, Nadav Amit wrote:
> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> This scenario opens a small time window, from the time the page is added
> to the page cache, until it is unlocked, in which the page might be
> removed from the page-cache by another core. If the page is removed
> during this time windows, it might cause a memory corruption, as the
> wrong page will be unlocked.
> 
> It is arguable whether this scenario can happen in a real system, and
> there are several mitigating factors. The issue was found by code
> inspection (actually grep), and not by actually triggering the flow.
> Yet, since putting the page before unlocking is incorrect it should be
> fixed, if only to prevent future breakage or someone copy-pasting this
> code.

Even if this a theoretical problem it is definitely worth fixing because
it is a anti-pattern of how the reference counted object should be treated.
 
> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> 
> cc: Eric Biggers <ebiggers3@gmail.com>
> cc: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  fs/hugetlbfs/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 28d2753be094..9475fee79cee 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  
>  		/*
> -		 * page_put due to reference from alloc_huge_page()
>  		 * unlock_page because locked by add_to_page_cache()
> +		 * page_put due to reference from alloc_huge_page()
>  		 */
> -		put_page(page);
>  		unlock_page(page);
> +		put_page(page);
>  	}
>  
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-27 20:08                 ` Nadav Amit
@ 2017-08-28 17:45                   ` Mike Kravetz
  2017-08-28 18:09                     ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Kravetz @ 2017-08-28 17:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nadia Yvette Chambers, Linux Kernel Mailing List, Eric Biggers,
	Andrew Morton, Michal Hocko

Adding Andrew, Michal on CC

On 08/27/2017 01:08 PM, Nadav Amit wrote:
> Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>>> This scenario opens a small time window, from the time the page is added
>>> to the page cache, until it is unlocked, in which the page might be
>>> removed from the page-cache by another core. If the page is removed
>>> during this time windows, it might cause a memory corruption, as the
>>> wrong page will be unlocked.
>>>
>>> It is arguable whether this scenario can happen in a real system, and
>>> there are several mitigating factors. The issue was found by code
>>> inspection (actually grep), and not by actually triggering the flow.
>>> Yet, since putting the page before unlocking is incorrect it should be
>>> fixed, if only to prevent future breakage or someone copy-pasting this
>>> code.
>>>
>>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>>
>>> cc: Eric Biggers <ebiggers3@gmail.com>
>>> cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>
>> Thank you Nadav.
> 
> No problem.
> 
>>
>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
>> able to remove a page (file content) is through an inode operation such as
>> truncate, hole punch, or unlink.  That was the basis for my response that
>> the inode lock would be required for page freeing.
>>
>> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
>> I was expecting to see a check for hugetlbfs pages and exit (without
>> modification) if encountered.  A quick review of the code did not find
>> any such checks.
>>
>> I'll take a closer look to determine exactly how hugetlbfs files are
>> handled.  IMO, there should be something similar to the DAX check where
>> the routine quickly exits.
> 
> I did not cc stable when submitting the patch, based on your previous
> response. Let me know if you want me to send v2 which does so.

I still do not believe there is a need to change this in stable.  Your patch
should be sufficient to ensure we do the right thing going forward.

Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
hugetlbfs does indeed show a more general problem.  One can use
sys_fadvise64() to remove a huge page from a hugetlbfs file. :(  This does
not go through the special hugetlbfs page handling code, but rather the
normal mm paths.  As a result hugetlbfs accounting (like reserve counts)
gets out of sync and the hugetlbfs filesystem may become unusable.  Sigh!!!

I will address this issue in a separate patch.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-28 17:45                   ` Mike Kravetz
@ 2017-08-28 18:09                     ` Michal Hocko
  2017-08-28 18:51                       ` Mike Kravetz
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2017-08-28 18:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Nadav Amit, Nadia Yvette Chambers, Linux Kernel Mailing List,
	Eric Biggers, Andrew Morton

On Mon 28-08-17 10:45:58, Mike Kravetz wrote:
> Adding Andrew, Michal on CC
> 
> On 08/27/2017 01:08 PM, Nadav Amit wrote:
> > Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > 
> >> On 08/26/2017 12:11 PM, Nadav Amit wrote:
> >>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> >>> This scenario opens a small time window, from the time the page is added
> >>> to the page cache, until it is unlocked, in which the page might be
> >>> removed from the page-cache by another core. If the page is removed
> >>> during this time windows, it might cause a memory corruption, as the
> >>> wrong page will be unlocked.
> >>>
> >>> It is arguable whether this scenario can happen in a real system, and
> >>> there are several mitigating factors. The issue was found by code
> >>> inspection (actually grep), and not by actually triggering the flow.
> >>> Yet, since putting the page before unlocking is incorrect it should be
> >>> fixed, if only to prevent future breakage or someone copy-pasting this
> >>> code.
> >>>
> >>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> >>>
> >>> cc: Eric Biggers <ebiggers3@gmail.com>
> >>> cc: Mike Kravetz <mike.kravetz@oracle.com>
> >>>
> >>> Signed-off-by: Nadav Amit <namit@vmware.com>
> >>
> >> Thank you Nadav.
> > 
> > No problem.
> > 
> >>
> >> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>
> >> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
> >> able to remove a page (file content) is through an inode operation such as
> >> truncate, hole punch, or unlink.  That was the basis for my response that
> >> the inode lock would be required for page freeing.
> >>
> >> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
> >> I was expecting to see a check for hugetlbfs pages and exit (without
> >> modification) if encountered.  A quick review of the code did not find
> >> any such checks.
> >>
> >> I'll take a closer look to determine exactly how hugetlbfs files are
> >> handled.  IMO, there should be something similar to the DAX check where
> >> the routine quickly exits.
> > 
> > I did not cc stable when submitting the patch, based on your previous
> > response. Let me know if you want me to send v2 which does so.
> 
> I still do not believe there is a need to change this in stable.  Your patch
> should be sufficient to ensure we do the right thing going forward.
>
> Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
> hugetlbfs does indeed show a more general problem.  One can use
> sys_fadvise64() to remove a huge page from a hugetlbfs file. :(  This does
> not go through the special hugetlbfs page handling code, but rather the
> normal mm paths.  As a result hugetlbfs accounting (like reserve counts)
> gets out of sync and the hugetlbfs filesystem may become unusable.  Sigh!!!
> 
> I will address this issue in a separate patch.

I didn't check very carefully but it seems that
http://ozlabs.org/~akpm/mmotm/broken-out/mm-fadvise-avoid-fadvise-for-fs-without-backing-device.patch
should help here, right?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-28 18:09                     ` Michal Hocko
@ 2017-08-28 18:51                       ` Mike Kravetz
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2017-08-28 18:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Nadav Amit, Nadia Yvette Chambers, Linux Kernel Mailing List,
	Eric Biggers, Andrew Morton

On 08/28/2017 11:09 AM, Michal Hocko wrote:
> On Mon 28-08-17 10:45:58, Mike Kravetz wrote:
>> Adding Andrew, Michal on CC
>>
>> On 08/27/2017 01:08 PM, Nadav Amit wrote:
>>> Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>>> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>>>>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>>>>> This scenario opens a small time window, from the time the page is added
>>>>> to the page cache, until it is unlocked, in which the page might be
>>>>> removed from the page-cache by another core. If the page is removed
>>>>> during this time windows, it might cause a memory corruption, as the
>>>>> wrong page will be unlocked.
>>>>>
>>>>> It is arguable whether this scenario can happen in a real system, and
>>>>> there are several mitigating factors. The issue was found by code
>>>>> inspection (actually grep), and not by actually triggering the flow.
>>>>> Yet, since putting the page before unlocking is incorrect it should be
>>>>> fixed, if only to prevent future breakage or someone copy-pasting this
>>>>> code.
>>>>>
>>>>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>>>>
>>>>> cc: Eric Biggers <ebiggers3@gmail.com>
>>>>> cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>
>>>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>>>
>>>> Thank you Nadav.
>>>
>>> No problem.
>>>
>>>>
>>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>
>>>> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
>>>> able to remove a page (file content) is through an inode operation such as
>>>> truncate, hole punch, or unlink.  That was the basis for my response that
>>>> the inode lock would be required for page freeing.
>>>>
>>>> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
>>>> I was expecting to see a check for hugetlbfs pages and exit (without
>>>> modification) if encountered.  A quick review of the code did not find
>>>> any such checks.
>>>>
>>>> I'll take a closer look to determine exactly how hugetlbfs files are
>>>> handled.  IMO, there should be something similar to the DAX check where
>>>> the routine quickly exits.
>>>
>>> I did not cc stable when submitting the patch, based on your previous
>>> response. Let me know if you want me to send v2 which does so.
>>
>> I still do not believe there is a need to change this in stable.  Your patch
>> should be sufficient to ensure we do the right thing going forward.
>>
>> Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
>> hugetlbfs does indeed show a more general problem.  One can use
>> sys_fadvise64() to remove a huge page from a hugetlbfs file. :(  This does
>> not go through the special hugetlbfs page handling code, but rather the
>> normal mm paths.  As a result hugetlbfs accounting (like reserve counts)
>> gets out of sync and the hugetlbfs filesystem may become unusable.  Sigh!!!
>>
>> I will address this issue in a separate patch.
> 
> I didn't check very carefully but it seems that
> http://ozlabs.org/~akpm/mmotm/broken-out/mm-fadvise-avoid-fadvise-for-fs-without-backing-device.patch
> should help here, right?

Thanks Michal.

Yes, that patch addresses the above issue with hugetlbfs.  I was also
wondering if there were similar issues with other in memory filesystems.
Looks like there are.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-26 19:11             ` [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate() Nadav Amit
  2017-08-27 17:15               ` Mike Kravetz
  2017-08-28 13:46               ` Michal Hocko
@ 2017-11-29  2:37               ` Eric Biggers
  2017-11-29  3:22                 ` Mike Kravetz
  2 siblings, 1 reply; 14+ messages in thread
From: Eric Biggers @ 2017-11-29  2:37 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Nadia Yvette Chambers, linux-kernel, Nadav Amit, Mike Kravetz

On Sat, Aug 26, 2017 at 12:11:24PM -0700, Nadav Amit wrote:
> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> This scenario opens a small time window, from the time the page is added
> to the page cache, until it is unlocked, in which the page might be
> removed from the page-cache by another core. If the page is removed
> during this time windows, it might cause a memory corruption, as the
> wrong page will be unlocked.
> 
> It is arguable whether this scenario can happen in a real system, and
> there are several mitigating factors. The issue was found by code
> inspection (actually grep), and not by actually triggering the flow.
> Yet, since putting the page before unlocking is incorrect it should be
> fixed, if only to prevent future breakage or someone copy-pasting this
> code.
> 
> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> 
> cc: Eric Biggers <ebiggers3@gmail.com>
> cc: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  fs/hugetlbfs/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 28d2753be094..9475fee79cee 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  
>  		/*
> -		 * page_put due to reference from alloc_huge_page()
>  		 * unlock_page because locked by add_to_page_cache()
> +		 * page_put due to reference from alloc_huge_page()
>  		 */
> -		put_page(page);
>  		unlock_page(page);
> +		put_page(page);
>  	}
>  
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> -- 

This patch wasn't ever applied.  Nadia, do you take patches for hugetlbfs, or
does this need to go through Andrew Morton?

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-11-29  2:37               ` Eric Biggers
@ 2017-11-29  3:22                 ` Mike Kravetz
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2017-11-29  3:22 UTC (permalink / raw)
  To: Eric Biggers, Nadav Amit
  Cc: Nadia Yvette Chambers, linux-kernel, Nadav Amit, Michal Hocko,
	Andrew Morton

[CC Andrew, Michal]

On 11/28/2017 06:37 PM, Eric Biggers wrote:
> On Sat, Aug 26, 2017 at 12:11:24PM -0700, Nadav Amit wrote:
>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>> This scenario opens a small time window, from the time the page is added
>> to the page cache, until it is unlocked, in which the page might be
>> removed from the page-cache by another core. If the page is removed
>> during this time windows, it might cause a memory corruption, as the
>> wrong page will be unlocked.
>>
>> It is arguable whether this scenario can happen in a real system, and
>> there are several mitigating factors. The issue was found by code
>> inspection (actually grep), and not by actually triggering the flow.
>> Yet, since putting the page before unlocking is incorrect it should be
>> fixed, if only to prevent future breakage or someone copy-pasting this
>> code.
>>
>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>
>> cc: Eric Biggers <ebiggers3@gmail.com>
>> cc: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>>  fs/hugetlbfs/inode.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 28d2753be094..9475fee79cee 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>>  
>>  		/*
>> -		 * page_put due to reference from alloc_huge_page()
>>  		 * unlock_page because locked by add_to_page_cache()
>> +		 * page_put due to reference from alloc_huge_page()
>>  		 */
>> -		put_page(page);
>>  		unlock_page(page);
>> +		put_page(page);
>>  	}
>>  
>>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
>> -- 
> 
> This patch wasn't ever applied.  Nadia, do you take patches for hugetlbfs, or
> does this need to go through Andrew Morton?
> 
> Eric

Nadia has not been active for some time on hugetlbfs, so best to go through
Andrew.  Added Andrew and Michal on CC.

This patch has a:
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>

I am still of the opinion that this does not need to be sent to stable.
Although the ordering is current code is incorrect, there is no way for
this to be a problem with current locking.   In addition, I verified
that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED)
for hugetlbfs and other filesystems is addressed in commit 3a77d214807c.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-11-29  3:22 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <599df681.NreP1dR3/HGSfpCe%akpm@linux-foundation.org>
     [not found] ` <20170824060957.GA29811@dhcp22.suse.cz>
2017-08-25 22:02   ` + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree Nadav Amit
2017-08-25 22:31     ` Mike Kravetz
2017-08-25 22:51       ` Nadav Amit
2017-08-25 23:41         ` Mike Kravetz
2017-08-26 21:09           ` Eric Biggers
2017-08-26 19:11             ` [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate() Nadav Amit
2017-08-27 17:15               ` Mike Kravetz
2017-08-27 20:08                 ` Nadav Amit
2017-08-28 17:45                   ` Mike Kravetz
2017-08-28 18:09                     ` Michal Hocko
2017-08-28 18:51                       ` Mike Kravetz
2017-08-28 13:46               ` Michal Hocko
2017-11-29  2:37               ` Eric Biggers
2017-11-29  3:22                 ` Mike Kravetz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).