All of lore.kernel.org
 help / color / mirror / Atom feed
* + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
@ 2017-08-23 21:41 akpm
  2017-08-24  6:09 ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: akpm @ 2017-08-23 21:41 UTC (permalink / raw)
  To: ebiggers, aarcange, dvyukov, hughd, mhocko, minchan, rientjes,
	stable, mm-commits


The patch titled
     Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
has been added to the -mm tree.  Its filename is
     mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Eric Biggers <ebiggers@google.com>
Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE

If madvise(..., MADV_FREE) split a transparent hugepage, it called
put_page() before unlock_page().  This was wrong because put_page() can
free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
removed it from the memory mapping.  put_page() then rightfully complained
about freeing a locked page.

Fix this by moving the unlock_page() before put_page().

This bug was found by syzkaller, which encountered the following splat:

    BUG: Bad page state in process syzkaller412798  pfn:1bd800
    page:ffffea0006f60000 count:0 mapcount:0 mapping:          (null) index:0x20a00
    flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
    raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
    raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x1(locked)
    Modules linked in:
    CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:16 [inline]
     dump_stack+0x194/0x257 lib/dump_stack.c:52
     bad_page+0x230/0x2b0 mm/page_alloc.c:565
     free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
     free_pages_check mm/page_alloc.c:952 [inline]
     free_pages_prepare mm/page_alloc.c:1043 [inline]
     free_pcp_prepare mm/page_alloc.c:1068 [inline]
     free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
     __put_single_page mm/swap.c:79 [inline]
     __put_page+0xfb/0x160 mm/swap.c:113
     put_page include/linux/mm.h:814 [inline]
     madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
     walk_pmd_range mm/pagewalk.c:50 [inline]
     walk_pud_range mm/pagewalk.c:108 [inline]
     walk_p4d_range mm/pagewalk.c:134 [inline]
     walk_pgd_range mm/pagewalk.c:160 [inline]
     __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
     walk_page_range+0x200/0x470 mm/pagewalk.c:326
     madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
     madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
     madvise_dontneed_free mm/madvise.c:555 [inline]
     madvise_vma mm/madvise.c:664 [inline]
     SYSC_madvise mm/madvise.c:832 [inline]
     SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
     entry_SYSCALL_64_fastpath+0x1f/0xbe

Here is a C reproducer:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MADV_FREE	8
    #define PAGE_SIZE	4096

    static void *mapping;
    static const size_t mapping_size = 0x1000000;

    static void *madvise_thrproc(void *arg)
    {
        madvise(mapping, mapping_size, (long)arg);
    }

    int main(void)
    {
        pthread_t t[2];

        for (;;) {
            mapping = mmap(NULL, mapping_size, PROT_WRITE,
                           MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

            munmap(mapping + mapping_size / 2, PAGE_SIZE);

            pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED);
            pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE);
            pthread_join(t[0], NULL);
            pthread_join(t[1], NULL);
            munmap(mapping, mapping_size);
        }
    }

Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
CONFIG_DEBUG_VM=y are needed.

Google Bug Id: 64696096

Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>	[v4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/madvise.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN mm/madvise.c~mm-madvise-fix-freeing-of-locked-page-with-madv_free mm/madvise.c
--- a/mm/madvise.c~mm-madvise-fix-freeing-of-locked-page-with-madv_free
+++ a/mm/madvise.c
@@ -368,8 +368,8 @@ static int madvise_free_pte_range(pmd_t
 				pte_offset_map_lock(mm, pmd, addr, &ptl);
 				goto out;
 			}
-			put_page(page);
 			unlock_page(page);
+			put_page(page);
 			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 			pte--;
 			addr -= PAGE_SIZE;
_

Patches currently in -mm which might be from ebiggers@google.com are

mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-23 21:41 + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree akpm
@ 2017-08-24  6:09 ` Michal Hocko
  2017-08-24  6:53   ` Michal Hocko
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Michal Hocko @ 2017-08-24  6:09 UTC (permalink / raw)
  To: akpm
  Cc: ebiggers, aarcange, dvyukov, hughd, minchan, rientjes, stable,
	mm-commits, linux-mm

Hmm, I do not see this neither in linux-mm nor LKML. Strange

On Wed 23-08-17 14:41:21, Andrew Morton wrote:
> From: Eric Biggers <ebiggers@google.com>
> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
> 
> If madvise(..., MADV_FREE) split a transparent hugepage, it called
> put_page() before unlock_page().  This was wrong because put_page() can
> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
> removed it from the memory mapping.  put_page() then rightfully complained
> about freeing a locked page.
> 
> Fix this by moving the unlock_page() before put_page().
> 
> This bug was found by syzkaller, which encountered the following splat:
> 
>     BUG: Bad page state in process syzkaller412798  pfn:1bd800
>     page:ffffea0006f60000 count:0 mapcount:0 mapping:          (null) index:0x20a00
>     flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
>     raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
>     raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
>     page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>     bad because of flags: 0x1(locked)
>     Modules linked in:
>     CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
>     Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>     Call Trace:
>      __dump_stack lib/dump_stack.c:16 [inline]
>      dump_stack+0x194/0x257 lib/dump_stack.c:52
>      bad_page+0x230/0x2b0 mm/page_alloc.c:565
>      free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
>      free_pages_check mm/page_alloc.c:952 [inline]
>      free_pages_prepare mm/page_alloc.c:1043 [inline]
>      free_pcp_prepare mm/page_alloc.c:1068 [inline]
>      free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
>      __put_single_page mm/swap.c:79 [inline]
>      __put_page+0xfb/0x160 mm/swap.c:113
>      put_page include/linux/mm.h:814 [inline]
>      madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
>      walk_pmd_range mm/pagewalk.c:50 [inline]
>      walk_pud_range mm/pagewalk.c:108 [inline]
>      walk_p4d_range mm/pagewalk.c:134 [inline]
>      walk_pgd_range mm/pagewalk.c:160 [inline]
>      __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
>      walk_page_range+0x200/0x470 mm/pagewalk.c:326
>      madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
>      madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
>      madvise_dontneed_free mm/madvise.c:555 [inline]
>      madvise_vma mm/madvise.c:664 [inline]
>      SYSC_madvise mm/madvise.c:832 [inline]
>      SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
>      entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> Here is a C reproducer:
> 
>     #define _GNU_SOURCE
>     #include <pthread.h>
>     #include <sys/mman.h>
>     #include <unistd.h>
> 
>     #define MADV_FREE	8
>     #define PAGE_SIZE	4096
> 
>     static void *mapping;
>     static const size_t mapping_size = 0x1000000;
> 
>     static void *madvise_thrproc(void *arg)
>     {
>         madvise(mapping, mapping_size, (long)arg);
>     }
> 
>     int main(void)
>     {
>         pthread_t t[2];
> 
>         for (;;) {
>             mapping = mmap(NULL, mapping_size, PROT_WRITE,
>                            MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> 
>             munmap(mapping + mapping_size / 2, PAGE_SIZE);
> 
>             pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED);
>             pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE);
>             pthread_join(t[0], NULL);
>             pthread_join(t[1], NULL);
>             munmap(mapping, mapping_size);
>         }
>     }
> 
> Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
> CONFIG_DEBUG_VM=y are needed.
> 
> Google Bug Id: 64696096

Is this necessary in the changelog?

> Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
> Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)")
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: <stable@vger.kernel.org>	[v4.5+]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> 
>  mm/madvise.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff -puN mm/madvise.c~mm-madvise-fix-freeing-of-locked-page-with-madv_free mm/madvise.c
> --- a/mm/madvise.c~mm-madvise-fix-freeing-of-locked-page-with-madv_free
> +++ a/mm/madvise.c
> @@ -368,8 +368,8 @@ static int madvise_free_pte_range(pmd_t
>  				pte_offset_map_lock(mm, pmd, addr, &ptl);
>  				goto out;
>  			}
> -			put_page(page);
>  			unlock_page(page);
> +			put_page(page);
>  			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  			pte--;
>  			addr -= PAGE_SIZE;
> _
> 
> Patches currently in -mm which might be from ebiggers@google.com are
> 
> mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-24  6:09 ` Michal Hocko
@ 2017-08-24  6:53   ` Michal Hocko
  2017-08-25 21:36     ` Andrew Morton
  2017-08-25 22:02   ` Nadav Amit
  2 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2017-08-24  6:53 UTC (permalink / raw)
  To: akpm
  Cc: ebiggers, aarcange, dvyukov, hughd, minchan, rientjes, stable,
	mm-commits, linux-mm

On Thu 24-08-17 08:09:57, Michal Hocko wrote:
> Hmm, I do not see this neither in linux-mm nor LKML. Strange

Not strange, just my filters fooled me. Sorry about the confusion.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-24  6:09 ` Michal Hocko
@ 2017-08-25 21:36     ` Andrew Morton
  2017-08-25 21:36     ` Andrew Morton
  2017-08-25 22:02   ` Nadav Amit
  2 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2017-08-25 21:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: ebiggers, aarcange, dvyukov, hughd, minchan, rientjes, stable,
	mm-commits, linux-mm

On Thu, 24 Aug 2017 08:09:57 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > Google Bug Id: 64696096
> 
> Is this necessary in the changelog?

I tend to keep this sort of a thing as a courtesy to the sender.

But I did change it from "Google-Bug-Id:". 
Documentation/process/submitting-patches.rst lists "Reported-by:,
Tested-by:, Reviewed-by:, Suggested-by: and Fixes:" as the recognized
patch tags and I don't think it's a good idea to introduce new ones -
that just creates more work for the people who maintain parsers for
this stuff.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
@ 2017-08-25 21:36     ` Andrew Morton
  0 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2017-08-25 21:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: ebiggers, aarcange, dvyukov, hughd, minchan, rientjes, stable,
	mm-commits, linux-mm

On Thu, 24 Aug 2017 08:09:57 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > Google Bug Id: 64696096
> 
> Is this necessary in the changelog?

I tend to keep this sort of a thing as a courtesy to the sender.

But I did change it from "Google-Bug-Id:". 
Documentation/process/submitting-patches.rst lists "Reported-by:,
Tested-by:, Reviewed-by:, Suggested-by: and Fixes:" as the recognized
patch tags and I don't think it's a good idea to introduce new ones -
that just creates more work for the people who maintain parsers for
this stuff.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-24  6:09 ` Michal Hocko
  2017-08-24  6:53   ` Michal Hocko
  2017-08-25 21:36     ` Andrew Morton
@ 2017-08-25 22:02   ` Nadav Amit
  2017-08-25 22:31       ` Mike Kravetz
  2 siblings, 1 reply; 25+ messages in thread
From: Nadav Amit @ 2017-08-25 22:02 UTC (permalink / raw)
  To: mike.kravetz, ebiggers
  Cc: Andrew Morton, Andrea Arcangeli, Dmitry Vyukov, Hugh Dickins,
	Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

Michal Hocko <mhocko@kernel.org> wrote:

> Hmm, I do not see this neither in linux-mm nor LKML. Strange
> 
> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>> From: Eric Biggers <ebiggers@google.com>
>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>> 
>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>> put_page() before unlock_page().  This was wrong because put_page() can
>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>> removed it from the memory mapping.  put_page() then rightfully complained
>> about freeing a locked page.
>> 
>> Fix this by moving the unlock_page() before put_page().

Quick grep shows that a similar flow (put_page() followed by an
unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
well?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-25 22:02   ` Nadav Amit
  2017-08-25 22:31       ` Mike Kravetz
@ 2017-08-25 22:31       ` Mike Kravetz
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Kravetz @ 2017-08-25 22:31 UTC (permalink / raw)
  To: Nadav Amit, ebiggers
  Cc: Andrew Morton, Andrea Arcangeli, Dmitry Vyukov, Hugh Dickins,
	Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

On 08/25/2017 03:02 PM, Nadav Amit wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> 
>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>
>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>> From: Eric Biggers <ebiggers@google.com>
>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>
>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>> put_page() before unlock_page().  This was wrong because put_page() can
>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>>> removed it from the memory mapping.  put_page() then rightfully complained
>>> about freeing a locked page.
>>>
>>> Fix this by moving the unlock_page() before put_page().
> 
> Quick grep shows that a similar flow (put_page() followed by an
> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
> well?

I assume you are asking about this block of code?

                /*
                 * page_put due to reference from alloc_huge_page()
                 * unlock_page because locked by add_to_page_cache()
                 */
                put_page(page);
                unlock_page(page);

Well, there is a typo (page_put) in the comment. :(

However, in this case we have just added the huge page to a hugetlbfs
file.  The put_page() is there just to drop the reference count on the
page (taken when allocated).  It will still be non-zero as we have
successfully added it to the page cache.  So, we are not freeing the
page here, just dropping the reference count.

This should not cause a problem like that seen in madvise.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
@ 2017-08-25 22:31       ` Mike Kravetz
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Kravetz @ 2017-08-25 22:31 UTC (permalink / raw)
  To: Nadav Amit, ebiggers
  Cc: Andrew Morton, Andrea Arcangeli, Dmitry Vyukov, Hugh Dickins,
	Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

On 08/25/2017 03:02 PM, Nadav Amit wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> 
>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>
>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>> From: Eric Biggers <ebiggers@google.com>
>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>
>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>> put_page() before unlock_page().  This was wrong because put_page() can
>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>>> removed it from the memory mapping.  put_page() then rightfully complained
>>> about freeing a locked page.
>>>
>>> Fix this by moving the unlock_page() before put_page().
> 
> Quick grep shows that a similar flow (put_page() followed by an
> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
> well?

I assume you are asking about this block of code?

                /*
                 * page_put due to reference from alloc_huge_page()
                 * unlock_page because locked by add_to_page_cache()
                 */
                put_page(page);
                unlock_page(page);

Well, there is a typo (page_put) in the comment. :(

However, in this case we have just added the huge page to a hugetlbfs
file.  The put_page() is there just to drop the reference count on the
page (taken when allocated).  It will still be non-zero as we have
successfully added it to the page cache.  So, we are not freeing the
page here, just dropping the reference count.

This should not cause a problem like that seen in madvise.
-- 
Mike Kravetz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
@ 2017-08-25 22:31       ` Mike Kravetz
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Kravetz @ 2017-08-25 22:31 UTC (permalink / raw)
  To: Nadav Amit, ebiggers
  Cc: Andrew Morton, Andrea Arcangeli, Dmitry Vyukov, Hugh Dickins,
	Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

On 08/25/2017 03:02 PM, Nadav Amit wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> 
>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>
>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>> From: Eric Biggers <ebiggers@google.com>
>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>
>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>> put_page() before unlock_page().  This was wrong because put_page() can
>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>>> removed it from the memory mapping.  put_page() then rightfully complained
>>> about freeing a locked page.
>>>
>>> Fix this by moving the unlock_page() before put_page().
> 
> Quick grep shows that a similar flow (put_page() followed by an
> unlock_page() ) also happens in hugetlbfs_fallocate(). Isna??t it a problem as
> well?

I assume you are asking about this block of code?

                /*
                 * page_put due to reference from alloc_huge_page()
                 * unlock_page because locked by add_to_page_cache()
                 */
                put_page(page);
                unlock_page(page);

Well, there is a typo (page_put) in the comment. :(

However, in this case we have just added the huge page to a hugetlbfs
file.  The put_page() is there just to drop the reference count on the
page (taken when allocated).  It will still be non-zero as we have
successfully added it to the page cache.  So, we are not freeing the
page here, just dropping the reference count.

This should not cause a problem like that seen in madvise.
-- 
Mike Kravetz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-25 22:31       ` Mike Kravetz
  (?)
  (?)
@ 2017-08-25 22:51       ` Nadav Amit
  2017-08-25 23:41           ` Mike Kravetz
  -1 siblings, 1 reply; 25+ messages in thread
From: Nadav Amit @ 2017-08-25 22:51 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: ebiggers, Andrew Morton, Andrea Arcangeli, Dmitry Vyukov,
	Hugh Dickins, Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

Mike Kravetz <mike.kravetz@oracle.com> wrote:

> On 08/25/2017 03:02 PM, Nadav Amit wrote:
>> Michal Hocko <mhocko@kernel.org> wrote:
>> 
>>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>> 
>>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>>> From: Eric Biggers <ebiggers@google.com>
>>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>> 
>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>>> put_page() before unlock_page().  This was wrong because put_page() can
>>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>>>> removed it from the memory mapping.  put_page() then rightfully complained
>>>> about freeing a locked page.
>>>> 
>>>> Fix this by moving the unlock_page() before put_page().
>> 
>> Quick grep shows that a similar flow (put_page() followed by an
>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
>> well?
> 
> I assume you are asking about this block of code?

Yes.

> 
>                /*
>                 * page_put due to reference from alloc_huge_page()
>                 * unlock_page because locked by add_to_page_cache()
>                 */
>                put_page(page);
>                unlock_page(page);
> 
> Well, there is a typo (page_put) in the comment. :(
> 
> However, in this case we have just added the huge page to a hugetlbfs
> file.  The put_page() is there just to drop the reference count on the
> page (taken when allocated).  It will still be non-zero as we have
> successfully added it to the page cache.  So, we are not freeing the
> page here, just dropping the reference count.
> 
> This should not cause a problem like that seen in madvise.

Thanks for the quick response.

I am not too familiar with this piece of code, so just for the matter of
understanding: what prevents the page from being removed from the page cache
shortly after it is added (even if it is highly unlikely)? The page lock? The
inode lock?

Thanks again,
Nadav

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-25 22:51       ` Nadav Amit
  2017-08-25 23:41           ` Mike Kravetz
@ 2017-08-25 23:41           ` Mike Kravetz
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Kravetz @ 2017-08-25 23:41 UTC (permalink / raw)
  To: Nadav Amit
  Cc: ebiggers, Andrew Morton, Andrea Arcangeli, Dmitry Vyukov,
	Hugh Dickins, Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

On 08/25/2017 03:51 PM, Nadav Amit wrote:
> Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> On 08/25/2017 03:02 PM, Nadav Amit wrote:
>>> Michal Hocko <mhocko@kernel.org> wrote:
>>>
>>>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>>>
>>>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>>>> From: Eric Biggers <ebiggers@google.com>
>>>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>>>
>>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>>>> put_page() before unlock_page().  This was wrong because put_page() can
>>>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>>>>> removed it from the memory mapping.  put_page() then rightfully complained
>>>>> about freeing a locked page.
>>>>>
>>>>> Fix this by moving the unlock_page() before put_page().
>>>
>>> Quick grep shows that a similar flow (put_page() followed by an
>>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
>>> well?
>>
>> I assume you are asking about this block of code?
> 
> Yes.
> 
>>
>>                /*
>>                 * page_put due to reference from alloc_huge_page()
>>                 * unlock_page because locked by add_to_page_cache()
>>                 */
>>                put_page(page);
>>                unlock_page(page);
>>
>> Well, there is a typo (page_put) in the comment. :(
>>
>> However, in this case we have just added the huge page to a hugetlbfs
>> file.  The put_page() is there just to drop the reference count on the
>> page (taken when allocated).  It will still be non-zero as we have
>> successfully added it to the page cache.  So, we are not freeing the
>> page here, just dropping the reference count.
>>
>> This should not cause a problem like that seen in madvise.
> 
> Thanks for the quick response.
> 
> I am not too familiar with this piece of code, so just for the matter of
> understanding: what prevents the page from being removed from the page cache
> shortly after it is added (even if it is highly unlikely)? The page lock? The
> inode lock?

Someone would need to acquire the inode lock to remove the page.  This
is held until we exit the routine.  Also note that put_page for this
type of huge page almost always results in the page being put back
on a free list within the hugetlb(fs) subsystem.  It is not returned
to the 'normal' memory allocators for general use.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
@ 2017-08-25 23:41           ` Mike Kravetz
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Kravetz @ 2017-08-25 23:41 UTC (permalink / raw)
  To: Nadav Amit
  Cc: ebiggers, Andrew Morton, Andrea Arcangeli, Dmitry Vyukov,
	Hugh Dickins, Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

On 08/25/2017 03:51 PM, Nadav Amit wrote:
> Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> On 08/25/2017 03:02 PM, Nadav Amit wrote:
>>> Michal Hocko <mhocko@kernel.org> wrote:
>>>
>>>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>>>
>>>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>>>> From: Eric Biggers <ebiggers@google.com>
>>>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>>>
>>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>>>> put_page() before unlock_page().  This was wrong because put_page() can
>>>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>>>>> removed it from the memory mapping.  put_page() then rightfully complained
>>>>> about freeing a locked page.
>>>>>
>>>>> Fix this by moving the unlock_page() before put_page().
>>>
>>> Quick grep shows that a similar flow (put_page() followed by an
>>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
>>> well?
>>
>> I assume you are asking about this block of code?
> 
> Yes.
> 
>>
>>                /*
>>                 * page_put due to reference from alloc_huge_page()
>>                 * unlock_page because locked by add_to_page_cache()
>>                 */
>>                put_page(page);
>>                unlock_page(page);
>>
>> Well, there is a typo (page_put) in the comment. :(
>>
>> However, in this case we have just added the huge page to a hugetlbfs
>> file.  The put_page() is there just to drop the reference count on the
>> page (taken when allocated).  It will still be non-zero as we have
>> successfully added it to the page cache.  So, we are not freeing the
>> page here, just dropping the reference count.
>>
>> This should not cause a problem like that seen in madvise.
> 
> Thanks for the quick response.
> 
> I am not too familiar with this piece of code, so just for the matter of
> understanding: what prevents the page from being removed from the page cache
> shortly after it is added (even if it is highly unlikely)? The page lock? The
> inode lock?

Someone would need to acquire the inode lock to remove the page.  This
is held until we exit the routine.  Also note that put_page for this
type of huge page almost always results in the page being put back
on a free list within the hugetlb(fs) subsystem.  It is not returned
to the 'normal' memory allocators for general use.

-- 
Mike Kravetz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
@ 2017-08-25 23:41           ` Mike Kravetz
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Kravetz @ 2017-08-25 23:41 UTC (permalink / raw)
  To: Nadav Amit
  Cc: ebiggers, Andrew Morton, Andrea Arcangeli, Dmitry Vyukov,
	Hugh Dickins, Minchan Kim, rientjes, stable, mm-commits,
	open list:MEMORY MANAGEMENT, Linux Kernel Mailing List,
	Michal Hocko, nyc

On 08/25/2017 03:51 PM, Nadav Amit wrote:
> Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> On 08/25/2017 03:02 PM, Nadav Amit wrote:
>>> Michal Hocko <mhocko@kernel.org> wrote:
>>>
>>>> Hmm, I do not see this neither in linux-mm nor LKML. Strange
>>>>
>>>> On Wed 23-08-17 14:41:21, Andrew Morton wrote:
>>>>> From: Eric Biggers <ebiggers@google.com>
>>>>> Subject: mm/madvise.c: fix freeing of locked page with MADV_FREE
>>>>>
>>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
>>>>> put_page() before unlock_page().  This was wrong because put_page() can
>>>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
>>>>> removed it from the memory mapping.  put_page() then rightfully complained
>>>>> about freeing a locked page.
>>>>>
>>>>> Fix this by moving the unlock_page() before put_page().
>>>
>>> Quick grep shows that a similar flow (put_page() followed by an
>>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isna??t it a problem as
>>> well?
>>
>> I assume you are asking about this block of code?
> 
> Yes.
> 
>>
>>                /*
>>                 * page_put due to reference from alloc_huge_page()
>>                 * unlock_page because locked by add_to_page_cache()
>>                 */
>>                put_page(page);
>>                unlock_page(page);
>>
>> Well, there is a typo (page_put) in the comment. :(
>>
>> However, in this case we have just added the huge page to a hugetlbfs
>> file.  The put_page() is there just to drop the reference count on the
>> page (taken when allocated).  It will still be non-zero as we have
>> successfully added it to the page cache.  So, we are not freeing the
>> page here, just dropping the reference count.
>>
>> This should not cause a problem like that seen in madvise.
> 
> Thanks for the quick response.
> 
> I am not too familiar with this piece of code, so just for the matter of
> understanding: what prevents the page from being removed from the page cache
> shortly after it is added (even if it is highly unlikely)? The page lock? The
> inode lock?

Someone would need to acquire the inode lock to remove the page.  This
is held until we exit the routine.  Also note that put_page for this
type of huge page almost always results in the page being put back
on a free list within the hugetlb(fs) subsystem.  It is not returned
to the 'normal' memory allocators for general use.

-- 
Mike Kravetz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-26 21:09             ` Eric Biggers
  (?)
  (?)
@ 2017-08-26 19:11             ` Nadav Amit
  2017-08-27 17:15               ` Mike Kravetz
                                 ` (2 more replies)
  -1 siblings, 3 replies; 25+ messages in thread
From: Nadav Amit @ 2017-08-26 19:11 UTC (permalink / raw)
  To: Nadia Yvette Chambers
  Cc: linux-kernel, Nadav Amit, Nadav Amit, Eric Biggers, Mike Kravetz

hugetlfs_fallocate() currently performs put_page() before unlock_page().
This scenario opens a small time window, from the time the page is added
to the page cache, until it is unlocked, in which the page might be
removed from the page-cache by another core. If the page is removed
during this time windows, it might cause a memory corruption, as the
wrong page will be unlocked.

It is arguable whether this scenario can happen in a real system, and
there are several mitigating factors. The issue was found by code
inspection (actually grep), and not by actually triggering the flow.
Yet, since putting the page before unlocking is incorrect it should be
fixed, if only to prevent future breakage or someone copy-pasting this
code.

Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")

cc: Eric Biggers <ebiggers3@gmail.com>
cc: Mike Kravetz <mike.kravetz@oracle.com>

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/hugetlbfs/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 28d2753be094..9475fee79cee 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
 		/*
-		 * page_put due to reference from alloc_huge_page()
 		 * unlock_page because locked by add_to_page_cache()
+		 * page_put due to reference from alloc_huge_page()
 		 */
-		put_page(page);
 		unlock_page(page);
+		put_page(page);
 	}
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
  2017-08-25 23:41           ` Mike Kravetz
  (?)
@ 2017-08-26 21:09             ` Eric Biggers
  -1 siblings, 0 replies; 25+ messages in thread
From: Eric Biggers @ 2017-08-26 21:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Nadav Amit, ebiggers, Andrew Morton, Andrea Arcangeli,
	Dmitry Vyukov, Hugh Dickins, Minchan Kim, rientjes, stable,
	mm-commits, open list:MEMORY MANAGEMENT,
	Linux Kernel Mailing List, Michal Hocko, nyc

On Fri, Aug 25, 2017 at 04:41:36PM -0700, Mike Kravetz wrote:
> >>>>>
> >>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
> >>>>> put_page() before unlock_page().  This was wrong because put_page() can
> >>>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
> >>>>> removed it from the memory mapping.  put_page() then rightfully complained
> >>>>> about freeing a locked page.
> >>>>>
> >>>>> Fix this by moving the unlock_page() before put_page().
> >>>
> >>> Quick grep shows that a similar flow (put_page() followed by an
> >>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
> >>> well?
> >>
> >> I assume you are asking about this block of code?
> > 
> > Yes.
> > 
> >>
> >>                /*
> >>                 * page_put due to reference from alloc_huge_page()
> >>                 * unlock_page because locked by add_to_page_cache()
> >>                 */
> >>                put_page(page);
> >>                unlock_page(page);
> >>
> >> Well, there is a typo (page_put) in the comment. :(
> >>
> >> However, in this case we have just added the huge page to a hugetlbfs
> >> file.  The put_page() is there just to drop the reference count on the
> >> page (taken when allocated).  It will still be non-zero as we have
> >> successfully added it to the page cache.  So, we are not freeing the
> >> page here, just dropping the reference count.
> >>
> >> This should not cause a problem like that seen in madvise.
> > 
> > Thanks for the quick response.
> > 
> > I am not too familiar with this piece of code, so just for the matter of
> > understanding: what prevents the page from being removed from the page cache
> > shortly after it is added (even if it is highly unlikely)? The page lock? The
> > inode lock?
> 
> Someone would need to acquire the inode lock to remove the page.  This
> is held until we exit the routine.  Also note that put_page for this
> type of huge page almost always results in the page being put back
> on a free list within the hugetlb(fs) subsystem.  It is not returned
> to the 'normal' memory allocators for general use.
> 

I'm not sure about that.  What about sys_fadvise64(..., POSIX_FADV_DONTNEED)?
That removes pages from the page cache without taking the inode lock.  It won't
remove locked pages though, so presumably it is only the page lock that prevents
the race with hugetlbfs_fallocate().

But in any case, why not do the "obviously correct" thing --- unlock before put?

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
@ 2017-08-26 21:09             ` Eric Biggers
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Biggers @ 2017-08-26 21:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Nadav Amit, ebiggers, Andrew Morton, Andrea Arcangeli,
	Dmitry Vyukov, Hugh Dickins, Minchan Kim, rientjes, stable,
	mm-commits, open list:MEMORY MANAGEMENT,
	Linux Kernel Mailing List, Michal Hocko, nyc

On Fri, Aug 25, 2017 at 04:41:36PM -0700, Mike Kravetz wrote:
> >>>>>
> >>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
> >>>>> put_page() before unlock_page().  This was wrong because put_page() can
> >>>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
> >>>>> removed it from the memory mapping.  put_page() then rightfully complained
> >>>>> about freeing a locked page.
> >>>>>
> >>>>> Fix this by moving the unlock_page() before put_page().
> >>>
> >>> Quick grep shows that a similar flow (put_page() followed by an
> >>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isn’t it a problem as
> >>> well?
> >>
> >> I assume you are asking about this block of code?
> > 
> > Yes.
> > 
> >>
> >>                /*
> >>                 * page_put due to reference from alloc_huge_page()
> >>                 * unlock_page because locked by add_to_page_cache()
> >>                 */
> >>                put_page(page);
> >>                unlock_page(page);
> >>
> >> Well, there is a typo (page_put) in the comment. :(
> >>
> >> However, in this case we have just added the huge page to a hugetlbfs
> >> file.  The put_page() is there just to drop the reference count on the
> >> page (taken when allocated).  It will still be non-zero as we have
> >> successfully added it to the page cache.  So, we are not freeing the
> >> page here, just dropping the reference count.
> >>
> >> This should not cause a problem like that seen in madvise.
> > 
> > Thanks for the quick response.
> > 
> > I am not too familiar with this piece of code, so just for the matter of
> > understanding: what prevents the page from being removed from the page cache
> > shortly after it is added (even if it is highly unlikely)? The page lock? The
> > inode lock?
> 
> Someone would need to acquire the inode lock to remove the page.  This
> is held until we exit the routine.  Also note that put_page for this
> type of huge page almost always results in the page being put back
> on a free list within the hugetlb(fs) subsystem.  It is not returned
> to the 'normal' memory allocators for general use.
> 

I'm not sure about that.  What about sys_fadvise64(..., POSIX_FADV_DONTNEED)?
That removes pages from the page cache without taking the inode lock.  It won't
remove locked pages though, so presumably it is only the page lock that prevents
the race with hugetlbfs_fallocate().

But in any case, why not do the "obviously correct" thing --- unlock before put?

Eric

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree
@ 2017-08-26 21:09             ` Eric Biggers
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Biggers @ 2017-08-26 21:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Nadav Amit, ebiggers, Andrew Morton, Andrea Arcangeli,
	Dmitry Vyukov, Hugh Dickins, Minchan Kim, rientjes, stable,
	mm-commits, open list:MEMORY MANAGEMENT,
	Linux Kernel Mailing List, Michal Hocko, nyc

On Fri, Aug 25, 2017 at 04:41:36PM -0700, Mike Kravetz wrote:
> >>>>>
> >>>>> If madvise(..., MADV_FREE) split a transparent hugepage, it called
> >>>>> put_page() before unlock_page().  This was wrong because put_page() can
> >>>>> free the page, e.g.  if a concurrent madvise(..., MADV_DONTNEED) has
> >>>>> removed it from the memory mapping.  put_page() then rightfully complained
> >>>>> about freeing a locked page.
> >>>>>
> >>>>> Fix this by moving the unlock_page() before put_page().
> >>>
> >>> Quick grep shows that a similar flow (put_page() followed by an
> >>> unlock_page() ) also happens in hugetlbfs_fallocate(). Isna??t it a problem as
> >>> well?
> >>
> >> I assume you are asking about this block of code?
> > 
> > Yes.
> > 
> >>
> >>                /*
> >>                 * page_put due to reference from alloc_huge_page()
> >>                 * unlock_page because locked by add_to_page_cache()
> >>                 */
> >>                put_page(page);
> >>                unlock_page(page);
> >>
> >> Well, there is a typo (page_put) in the comment. :(
> >>
> >> However, in this case we have just added the huge page to a hugetlbfs
> >> file.  The put_page() is there just to drop the reference count on the
> >> page (taken when allocated).  It will still be non-zero as we have
> >> successfully added it to the page cache.  So, we are not freeing the
> >> page here, just dropping the reference count.
> >>
> >> This should not cause a problem like that seen in madvise.
> > 
> > Thanks for the quick response.
> > 
> > I am not too familiar with this piece of code, so just for the matter of
> > understanding: what prevents the page from being removed from the page cache
> > shortly after it is added (even if it is highly unlikely)? The page lock? The
> > inode lock?
> 
> Someone would need to acquire the inode lock to remove the page.  This
> is held until we exit the routine.  Also note that put_page for this
> type of huge page almost always results in the page being put back
> on a free list within the hugetlb(fs) subsystem.  It is not returned
> to the 'normal' memory allocators for general use.
> 

I'm not sure about that.  What about sys_fadvise64(..., POSIX_FADV_DONTNEED)?
That removes pages from the page cache without taking the inode lock.  It won't
remove locked pages though, so presumably it is only the page lock that prevents
the race with hugetlbfs_fallocate().

But in any case, why not do the "obviously correct" thing --- unlock before put?

Eric

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-26 19:11             ` [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate() Nadav Amit
@ 2017-08-27 17:15               ` Mike Kravetz
  2017-08-27 20:08                 ` Nadav Amit
  2017-08-28 13:46               ` Michal Hocko
  2017-11-29  2:37               ` Eric Biggers
  2 siblings, 1 reply; 25+ messages in thread
From: Mike Kravetz @ 2017-08-27 17:15 UTC (permalink / raw)
  To: Nadav Amit, Nadia Yvette Chambers; +Cc: linux-kernel, Nadav Amit, Eric Biggers

On 08/26/2017 12:11 PM, Nadav Amit wrote:
> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> This scenario opens a small time window, from the time the page is added
> to the page cache, until it is unlocked, in which the page might be
> removed from the page-cache by another core. If the page is removed
> during this time windows, it might cause a memory corruption, as the
> wrong page will be unlocked.
> 
> It is arguable whether this scenario can happen in a real system, and
> there are several mitigating factors. The issue was found by code
> inspection (actually grep), and not by actually triggering the flow.
> Yet, since putting the page before unlocking is incorrect it should be
> fixed, if only to prevent future breakage or someone copy-pasting this
> code.
> 
> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> 
> cc: Eric Biggers <ebiggers3@gmail.com>
> cc: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>

Thank you Nadav.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Since hugetlbfs is an in memory filesystem, the only way one 'should' be
able to remove a page (file content) is through an inode operation such as
truncate, hole punch, or unlink.  That was the basis for my response that
the inode lock would be required for page freeing.

Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
I was expecting to see a check for hugetlbfs pages and exit (without
modification) if encountered.  A quick review of the code did not find
any such checks.

I'll take a closer look to determine exactly how hugetlbfs files are
handled.  IMO, there should be something similar to the DAX check where
the routine quickly exits.
-- 
Mike Kravetz

> ---
>  fs/hugetlbfs/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 28d2753be094..9475fee79cee 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  
>  		/*
> -		 * page_put due to reference from alloc_huge_page()
>  		 * unlock_page because locked by add_to_page_cache()
> +		 * page_put due to reference from alloc_huge_page()
>  		 */
> -		put_page(page);
>  		unlock_page(page);
> +		put_page(page);
>  	}
>  
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-27 17:15               ` Mike Kravetz
@ 2017-08-27 20:08                 ` Nadav Amit
  2017-08-28 17:45                   ` Mike Kravetz
  0 siblings, 1 reply; 25+ messages in thread
From: Nadav Amit @ 2017-08-27 20:08 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Nadia Yvette Chambers, Linux Kernel Mailing List, Eric Biggers

Mike Kravetz <mike.kravetz@oracle.com> wrote:

> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>> This scenario opens a small time window, from the time the page is added
>> to the page cache, until it is unlocked, in which the page might be
>> removed from the page-cache by another core. If the page is removed
>> during this time windows, it might cause a memory corruption, as the
>> wrong page will be unlocked.
>> 
>> It is arguable whether this scenario can happen in a real system, and
>> there are several mitigating factors. The issue was found by code
>> inspection (actually grep), and not by actually triggering the flow.
>> Yet, since putting the page before unlocking is incorrect it should be
>> fixed, if only to prevent future breakage or someone copy-pasting this
>> code.
>> 
>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>> 
>> cc: Eric Biggers <ebiggers3@gmail.com>
>> cc: Mike Kravetz <mike.kravetz@oracle.com>
>> 
>> Signed-off-by: Nadav Amit <namit@vmware.com>
> 
> Thank you Nadav.

No problem.

> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
> able to remove a page (file content) is through an inode operation such as
> truncate, hole punch, or unlink.  That was the basis for my response that
> the inode lock would be required for page freeing.
> 
> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
> I was expecting to see a check for hugetlbfs pages and exit (without
> modification) if encountered.  A quick review of the code did not find
> any such checks.
> 
> I'll take a closer look to determine exactly how hugetlbfs files are
> handled.  IMO, there should be something similar to the DAX check where
> the routine quickly exits.

I did not cc stable when submitting the patch, based on your previous
response. Let me know if you want me to send v2 which does so.

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-26 19:11             ` [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate() Nadav Amit
  2017-08-27 17:15               ` Mike Kravetz
@ 2017-08-28 13:46               ` Michal Hocko
  2017-11-29  2:37               ` Eric Biggers
  2 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2017-08-28 13:46 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nadia Yvette Chambers, linux-kernel, Nadav Amit, Eric Biggers,
	Mike Kravetz, Andrew Morton

[CC Andrew]

On Sat 26-08-17 12:11:24, Nadav Amit wrote:
> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> This scenario opens a small time window, from the time the page is added
> to the page cache, until it is unlocked, in which the page might be
> removed from the page-cache by another core. If the page is removed
> during this time windows, it might cause a memory corruption, as the
> wrong page will be unlocked.
> 
> It is arguable whether this scenario can happen in a real system, and
> there are several mitigating factors. The issue was found by code
> inspection (actually grep), and not by actually triggering the flow.
> Yet, since putting the page before unlocking is incorrect it should be
> fixed, if only to prevent future breakage or someone copy-pasting this
> code.

Even if this a theoretical problem it is definitely worth fixing because
it is a anti-pattern of how the reference counted object should be treated.
 
> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> 
> cc: Eric Biggers <ebiggers3@gmail.com>
> cc: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  fs/hugetlbfs/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 28d2753be094..9475fee79cee 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  
>  		/*
> -		 * page_put due to reference from alloc_huge_page()
>  		 * unlock_page because locked by add_to_page_cache()
> +		 * page_put due to reference from alloc_huge_page()
>  		 */
> -		put_page(page);
>  		unlock_page(page);
> +		put_page(page);
>  	}
>  
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-27 20:08                 ` Nadav Amit
@ 2017-08-28 17:45                   ` Mike Kravetz
  2017-08-28 18:09                     ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Kravetz @ 2017-08-28 17:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nadia Yvette Chambers, Linux Kernel Mailing List, Eric Biggers,
	Andrew Morton, Michal Hocko

Adding Andrew, Michal on CC

On 08/27/2017 01:08 PM, Nadav Amit wrote:
> Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>>> This scenario opens a small time window, from the time the page is added
>>> to the page cache, until it is unlocked, in which the page might be
>>> removed from the page-cache by another core. If the page is removed
>>> during this time windows, it might cause a memory corruption, as the
>>> wrong page will be unlocked.
>>>
>>> It is arguable whether this scenario can happen in a real system, and
>>> there are several mitigating factors. The issue was found by code
>>> inspection (actually grep), and not by actually triggering the flow.
>>> Yet, since putting the page before unlocking is incorrect it should be
>>> fixed, if only to prevent future breakage or someone copy-pasting this
>>> code.
>>>
>>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>>
>>> cc: Eric Biggers <ebiggers3@gmail.com>
>>> cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>
>> Thank you Nadav.
> 
> No problem.
> 
>>
>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
>> able to remove a page (file content) is through an inode operation such as
>> truncate, hole punch, or unlink.  That was the basis for my response that
>> the inode lock would be required for page freeing.
>>
>> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
>> I was expecting to see a check for hugetlbfs pages and exit (without
>> modification) if encountered.  A quick review of the code did not find
>> any such checks.
>>
>> I'll take a closer look to determine exactly how hugetlbfs files are
>> handled.  IMO, there should be something similar to the DAX check where
>> the routine quickly exits.
> 
> I did not cc stable when submitting the patch, based on your previous
> response. Let me know if you want me to send v2 which does so.

I still do not believe there is a need to change this in stable.  Your patch
should be sufficient to ensure we do the right thing going forward.

Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
hugetlbfs does indeed show a more general problem.  One can use
sys_fadvise64() to remove a huge page from a hugetlbfs file. :(  This does
not go through the special hugetlbfs page handling code, but rather the
normal mm paths.  As a result hugetlbfs accounting (like reserve counts)
gets out of sync and the hugetlbfs filesystem may become unusable.  Sigh!!!

I will address this issue in a separate patch.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-28 17:45                   ` Mike Kravetz
@ 2017-08-28 18:09                     ` Michal Hocko
  2017-08-28 18:51                       ` Mike Kravetz
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2017-08-28 18:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Nadav Amit, Nadia Yvette Chambers, Linux Kernel Mailing List,
	Eric Biggers, Andrew Morton

On Mon 28-08-17 10:45:58, Mike Kravetz wrote:
> Adding Andrew, Michal on CC
> 
> On 08/27/2017 01:08 PM, Nadav Amit wrote:
> > Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > 
> >> On 08/26/2017 12:11 PM, Nadav Amit wrote:
> >>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> >>> This scenario opens a small time window, from the time the page is added
> >>> to the page cache, until it is unlocked, in which the page might be
> >>> removed from the page-cache by another core. If the page is removed
> >>> during this time windows, it might cause a memory corruption, as the
> >>> wrong page will be unlocked.
> >>>
> >>> It is arguable whether this scenario can happen in a real system, and
> >>> there are several mitigating factors. The issue was found by code
> >>> inspection (actually grep), and not by actually triggering the flow.
> >>> Yet, since putting the page before unlocking is incorrect it should be
> >>> fixed, if only to prevent future breakage or someone copy-pasting this
> >>> code.
> >>>
> >>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> >>>
> >>> cc: Eric Biggers <ebiggers3@gmail.com>
> >>> cc: Mike Kravetz <mike.kravetz@oracle.com>
> >>>
> >>> Signed-off-by: Nadav Amit <namit@vmware.com>
> >>
> >> Thank you Nadav.
> > 
> > No problem.
> > 
> >>
> >> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>
> >> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
> >> able to remove a page (file content) is through an inode operation such as
> >> truncate, hole punch, or unlink.  That was the basis for my response that
> >> the inode lock would be required for page freeing.
> >>
> >> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
> >> I was expecting to see a check for hugetlbfs pages and exit (without
> >> modification) if encountered.  A quick review of the code did not find
> >> any such checks.
> >>
> >> I'll take a closer look to determine exactly how hugetlbfs files are
> >> handled.  IMO, there should be something similar to the DAX check where
> >> the routine quickly exits.
> > 
> > I did not cc stable when submitting the patch, based on your previous
> > response. Let me know if you want me to send v2 which does so.
> 
> I still do not believe there is a need to change this in stable.  Your patch
> should be sufficient to ensure we do the right thing going forward.
>
> Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
> hugetlbfs does indeed show a more general problem.  One can use
> sys_fadvise64() to remove a huge page from a hugetlbfs file. :(  This does
> not go through the special hugetlbfs page handling code, but rather the
> normal mm paths.  As a result hugetlbfs accounting (like reserve counts)
> gets out of sync and the hugetlbfs filesystem may become unusable.  Sigh!!!
> 
> I will address this issue in a separate patch.

I didn't check very carefully but it seems that
http://ozlabs.org/~akpm/mmotm/broken-out/mm-fadvise-avoid-fadvise-for-fs-without-backing-device.patch
should help here, right?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-28 18:09                     ` Michal Hocko
@ 2017-08-28 18:51                       ` Mike Kravetz
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Kravetz @ 2017-08-28 18:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Nadav Amit, Nadia Yvette Chambers, Linux Kernel Mailing List,
	Eric Biggers, Andrew Morton

On 08/28/2017 11:09 AM, Michal Hocko wrote:
> On Mon 28-08-17 10:45:58, Mike Kravetz wrote:
>> Adding Andrew, Michal on CC
>>
>> On 08/27/2017 01:08 PM, Nadav Amit wrote:
>>> Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>>> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>>>>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>>>>> This scenario opens a small time window, from the time the page is added
>>>>> to the page cache, until it is unlocked, in which the page might be
>>>>> removed from the page-cache by another core. If the page is removed
>>>>> during this time windows, it might cause a memory corruption, as the
>>>>> wrong page will be unlocked.
>>>>>
>>>>> It is arguable whether this scenario can happen in a real system, and
>>>>> there are several mitigating factors. The issue was found by code
>>>>> inspection (actually grep), and not by actually triggering the flow.
>>>>> Yet, since putting the page before unlocking is incorrect it should be
>>>>> fixed, if only to prevent future breakage or someone copy-pasting this
>>>>> code.
>>>>>
>>>>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>>>>
>>>>> cc: Eric Biggers <ebiggers3@gmail.com>
>>>>> cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>
>>>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>>>
>>>> Thank you Nadav.
>>>
>>> No problem.
>>>
>>>>
>>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>
>>>> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
>>>> able to remove a page (file content) is through an inode operation such as
>>>> truncate, hole punch, or unlink.  That was the basis for my response that
>>>> the inode lock would be required for page freeing.
>>>>
>>>> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
>>>> I was expecting to see a check for hugetlbfs pages and exit (without
>>>> modification) if encountered.  A quick review of the code did not find
>>>> any such checks.
>>>>
>>>> I'll take a closer look to determine exactly how hugetlbfs files are
>>>> handled.  IMO, there should be something similar to the DAX check where
>>>> the routine quickly exits.
>>>
>>> I did not cc stable when submitting the patch, based on your previous
>>> response. Let me know if you want me to send v2 which does so.
>>
>> I still do not believe there is a need to change this in stable.  Your patch
>> should be sufficient to ensure we do the right thing going forward.
>>
>> Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
>> hugetlbfs does indeed show a more general problem.  One can use
>> sys_fadvise64() to remove a huge page from a hugetlbfs file. :(  This does
>> not go through the special hugetlbfs page handling code, but rather the
>> normal mm paths.  As a result hugetlbfs accounting (like reserve counts)
>> gets out of sync and the hugetlbfs filesystem may become unusable.  Sigh!!!
>>
>> I will address this issue in a separate patch.
> 
> I didn't check very carefully but it seems that
> http://ozlabs.org/~akpm/mmotm/broken-out/mm-fadvise-avoid-fadvise-for-fs-without-backing-device.patch
> should help here, right?

Thanks Michal.

Yes, that patch addresses the above issue with hugetlbfs.  I was also
wondering if there were similar issues with other in memory filesystems.
Looks like there are.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-08-26 19:11             ` [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate() Nadav Amit
  2017-08-27 17:15               ` Mike Kravetz
  2017-08-28 13:46               ` Michal Hocko
@ 2017-11-29  2:37               ` Eric Biggers
  2017-11-29  3:22                 ` Mike Kravetz
  2 siblings, 1 reply; 25+ messages in thread
From: Eric Biggers @ 2017-11-29  2:37 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Nadia Yvette Chambers, linux-kernel, Nadav Amit, Mike Kravetz

On Sat, Aug 26, 2017 at 12:11:24PM -0700, Nadav Amit wrote:
> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> This scenario opens a small time window, from the time the page is added
> to the page cache, until it is unlocked, in which the page might be
> removed from the page-cache by another core. If the page is removed
> during this time windows, it might cause a memory corruption, as the
> wrong page will be unlocked.
> 
> It is arguable whether this scenario can happen in a real system, and
> there are several mitigating factors. The issue was found by code
> inspection (actually grep), and not by actually triggering the flow.
> Yet, since putting the page before unlocking is incorrect it should be
> fixed, if only to prevent future breakage or someone copy-pasting this
> code.
> 
> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> 
> cc: Eric Biggers <ebiggers3@gmail.com>
> cc: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  fs/hugetlbfs/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 28d2753be094..9475fee79cee 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  
>  		/*
> -		 * page_put due to reference from alloc_huge_page()
>  		 * unlock_page because locked by add_to_page_cache()
> +		 * page_put due to reference from alloc_huge_page()
>  		 */
> -		put_page(page);
>  		unlock_page(page);
> +		put_page(page);
>  	}
>  
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> -- 

This patch wasn't ever applied.  Nadia, do you take patches for hugetlbfs, or
does this need to go through Andrew Morton?

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
  2017-11-29  2:37               ` Eric Biggers
@ 2017-11-29  3:22                 ` Mike Kravetz
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Kravetz @ 2017-11-29  3:22 UTC (permalink / raw)
  To: Eric Biggers, Nadav Amit
  Cc: Nadia Yvette Chambers, linux-kernel, Nadav Amit, Michal Hocko,
	Andrew Morton

[CC Andrew, Michal]

On 11/28/2017 06:37 PM, Eric Biggers wrote:
> On Sat, Aug 26, 2017 at 12:11:24PM -0700, Nadav Amit wrote:
>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
>> This scenario opens a small time window, from the time the page is added
>> to the page cache, until it is unlocked, in which the page might be
>> removed from the page-cache by another core. If the page is removed
>> during this time windows, it might cause a memory corruption, as the
>> wrong page will be unlocked.
>>
>> It is arguable whether this scenario can happen in a real system, and
>> there are several mitigating factors. The issue was found by code
>> inspection (actually grep), and not by actually triggering the flow.
>> Yet, since putting the page before unlocking is incorrect it should be
>> fixed, if only to prevent future breakage or someone copy-pasting this
>> code.
>>
>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>
>> cc: Eric Biggers <ebiggers3@gmail.com>
>> cc: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>>  fs/hugetlbfs/inode.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 28d2753be094..9475fee79cee 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -655,11 +655,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>>  
>>  		/*
>> -		 * page_put due to reference from alloc_huge_page()
>>  		 * unlock_page because locked by add_to_page_cache()
>> +		 * page_put due to reference from alloc_huge_page()
>>  		 */
>> -		put_page(page);
>>  		unlock_page(page);
>> +		put_page(page);
>>  	}
>>  
>>  	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
>> -- 
> 
> This patch wasn't ever applied.  Nadia, do you take patches for hugetlbfs, or
> does this need to go through Andrew Morton?
> 
> Eric

Nadia has not been active for some time on hugetlbfs, so best to go through
Andrew.  Added Andrew and Michal on CC.

This patch has a:
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>

I am still of the opinion that this does not need to be sent to stable.
Although the ordering is current code is incorrect, there is no way for
this to be a problem with current locking.   In addition, I verified
that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED)
for hugetlbfs and other filesystems is addressed in commit 3a77d214807c.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-11-29  3:22 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-23 21:41 + mm-madvise-fix-freeing-of-locked-page-with-madv_free.patch added to -mm tree akpm
2017-08-24  6:09 ` Michal Hocko
2017-08-24  6:53   ` Michal Hocko
2017-08-25 21:36   ` Andrew Morton
2017-08-25 21:36     ` Andrew Morton
2017-08-25 22:02   ` Nadav Amit
2017-08-25 22:31     ` Mike Kravetz
2017-08-25 22:31       ` Mike Kravetz
2017-08-25 22:31       ` Mike Kravetz
2017-08-25 22:51       ` Nadav Amit
2017-08-25 23:41         ` Mike Kravetz
2017-08-25 23:41           ` Mike Kravetz
2017-08-25 23:41           ` Mike Kravetz
2017-08-26 21:09           ` Eric Biggers
2017-08-26 21:09             ` Eric Biggers
2017-08-26 21:09             ` Eric Biggers
2017-08-26 19:11             ` [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate() Nadav Amit
2017-08-27 17:15               ` Mike Kravetz
2017-08-27 20:08                 ` Nadav Amit
2017-08-28 17:45                   ` Mike Kravetz
2017-08-28 18:09                     ` Michal Hocko
2017-08-28 18:51                       ` Mike Kravetz
2017-08-28 13:46               ` Michal Hocko
2017-11-29  2:37               ` Eric Biggers
2017-11-29  3:22                 ` Mike Kravetz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.