* page refcount race between prep_compound_gigantic_page() and __page_cache_add_speculative()?
@ 2021-06-15 11:03 ` Jann Horn
  0 siblings, 0 replies; 4+ messages in thread
From: Jann Horn @ 2021-06-15 11:03 UTC (permalink / raw)
  To: Linux-MM
  Cc: kernel list, Youquan Song, Andrea Arcangeli, Jan Kara,
	Mike Kravetz, John Hubbard, Kirill A. Shutemov

== short summary ==
sysfs/sysctl writes can invoke prep_compound_gigantic_page(), which
forcibly zeroes the refcount of a page with refcount >= 1. In the
extremely rare case where the refcount is > 1 because of a temporary
reference from a concurrent __page_cache_add_speculative(), stuff will
probably blow up.


Because of John Hubbard's question on the "[PATCH v2] mm/gup: fix
try_grab_compound_head() race with split_huge_page()" thread
(https://lore.kernel.org/linux-mm/50d828d1-2ce6-21b4-0e27-fb15daa77561@nvidia.com/),
I was looking around in related code, and stumbled over this old
commit, whose changes are still present in the current kernel, and
which looks wrong to me.

I'm not currently planning to try to fix this (because I'm not
familiar with the compaction code and its interaction with the page
allocator); so if someone who is more familiar with this stuff wants
to pick this up, feel free to do so.


commit 58a84aa92723d1ac3e1cc4e3b0ff49291663f7e1
Author: Youquan Song <youquan.song@intel.com>
Date:   Thu Dec 8 14:34:18 2011 -0800

    thp: set compound tail page _count to zero

    Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
    page_tail->_count zero at all times.  But the current kernel does not
    set page_tail->_count to zero if a 1GB page is utilized.  So when an
    IOMMU 1GB page is used by KVM, it will result in a kernel oops because a
    tail page's _count does not equal zero.

      kernel BUG at include/linux/mm.h:386!
      invalid opcode: 0000 [#1] SMP
      Call Trace:
        gup_pud_range+0xb8/0x19d
        get_user_pages_fast+0xcb/0x192
        ? trace_hardirqs_off+0xd/0xf
        hva_to_pfn+0x119/0x2f2
        gfn_to_pfn_memslot+0x2c/0x2e
        kvm_iommu_map_pages+0xfd/0x1c1
        kvm_iommu_map_memslots+0x7c/0xbd
        kvm_iommu_map_guest+0xaa/0xbf
        kvm_vm_ioctl_assigned_device+0x2ef/0xa47
        kvm_vm_ioctl+0x36c/0x3a2
        do_vfs_ioctl+0x49e/0x4e4
        sys_ioctl+0x5a/0x7c
        system_call_fastpath+0x16/0x1b
      RIP  gup_huge_pud+0xf2/0x159

    Signed-off-by: Youquan Song <youquan.song@intel.com>
    Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb28a5f9db8d..73f17c0293c0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -576,6 +576,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order)
        __SetPageHead(page);
        for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
                __SetPageTail(p);
+               set_page_count(p, 0);
                p->first_page = page;
        }
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d89d8b..850009a7101e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -356,8 +356,8 @@ void prep_compound_page(struct page *page, unsigned long order)
        __SetPageHead(page);
        for (i = 1; i < nr_pages; i++) {
                struct page *p = page + i;
-
                __SetPageTail(p);
+               set_page_count(p, 0);
                p->first_page = page;
        }
 }



__page_cache_add_speculative() can run on pages that have already been
allocated, and the only thing that can stop it from temporarily
lifting the page refcount is the page refcount being zero. So if these
set_page_count() calls have any effect (outside __init code), and the
refcount is not zero when they occur, then that means we can have a
race where a refcount is forcibly zeroed while
__page_cache_add_speculative() is holding temporary references; and
then we can end up with a use-after-free of struct page.
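To make the race concrete, here is a minimal user-space model of the two operations involved (names like fake_page, speculative_get and forced_set_count are hypothetical, not kernel APIs; the inc-not-zero loop is modeled on what __page_cache_add_speculative() does, and the blind store on what set_page_count() does):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page's _refcount. */
struct fake_page { atomic_int refcount; };

/* Models the inc-not-zero logic of __page_cache_add_speculative():
 * it only refuses to take a reference when the count is already zero. */
static bool speculative_get(struct fake_page *p)
{
	int old = atomic_load(&p->refcount);
	do {
		if (old == 0)
			return false;	/* page looks free; back off */
	} while (!atomic_compare_exchange_weak(&p->refcount, &old, old + 1));
	return true;
}

/* Models set_page_count(): an unconditional store that is blind to
 * any temporary references other CPUs may hold. */
static void forced_set_count(struct fake_page *p, int v)
{
	atomic_store(&p->refcount, v);
}
```

If the speculative reference is taken between allocation and the forced store, the store silently discards it, which is exactly the lost-reference scenario described above.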

As far as I can tell, on the normal compound page allocation path
(prep_compound_page()), the whole compound page is coming fresh off
the allocator freelist (except for some __init logic) and only the
refcount of the head page has been initialized in post_alloc_hook();
and so all its tail pages are guaranteed to have a zero refcount. So
on that path the proper fix is probably to just replace the
set_page_count() call with a VM_BUG_ON_PAGE().
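Applied to the prep_compound_page() hunk quoted above, that idea would look roughly like the sketch below (not compile-tested; VM_BUG_ON_PAGE() is the modern debug macro, so on the 2011 code one would have used plain VM_BUG_ON(page_count(p) != 0) instead):

```c
        __SetPageHead(page);
        for (i = 1; i < nr_pages; i++) {
                struct page *p = page + i;
                __SetPageTail(p);
                /* Tail pages come straight off the freelist here, so
                 * their refcount must already be zero; assert that
                 * instead of papering over a nonzero count. */
                VM_BUG_ON_PAGE(page_count(p) != 0, p);
                p->first_page = page;
        }
```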

The messier path, as the original commit describes, is "gigantic" page
allocation. In that case, we'll go through the following path (if we
ignore CMA):

  alloc_fresh_huge_page():
    alloc_gigantic_page()
      alloc_contig_pages()
        __alloc_contig_pages()
          alloc_contig_range()
            isolate_freepages_range()
              split_map_pages()
                post_alloc_hook() [FOR EVERY PAGE]
                  set_page_refcounted()
                    set_page_count(page, 1)
    prep_compound_gigantic_page()
      set_page_count(p, 0) [FOR EVERY TAIL PAGE]

so all the tail pages are initially allocated with refcount 1 by the
page allocator, and then we overwrite those refcounts with zeroes.
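A race-free way to take those refcounts from 1 to 0 would be a freeze-style compare-and-exchange from the expected value, which fails (instead of corrupting state) if a speculative reference has slipped in. Again as a hypothetical user-space model (fake_page and ref_freeze are illustrative names, not kernel APIs):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page's _refcount. */
struct fake_page { atomic_int refcount; };

/* Freeze-style transition to zero: succeeds only if the count is
 * exactly `expected`, i.e. nobody holds a temporary reference. */
static bool ref_freeze(struct fake_page *p, int expected)
{
	int old = expected;
	return atomic_compare_exchange_strong(&p->refcount, &old, 0);
}
```

With this shape, the gigantic-page prep code could retry or bail out when the freeze fails, rather than silently zeroing a count that a concurrent lookup has raised.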


Luckily, the only non-__init codepath that can get here is
__nr_hugepages_store_common(), which is only invoked from privileged
writes to sysfs/sysctls.


* Re: page refcount race between prep_compound_gigantic_page() and __page_cache_add_speculative()?
  2021-06-15 11:03 ` Jann Horn
@ 2021-06-15 12:40 ` Matthew Wilcox
  2021-06-15 18:27   ` Mike Kravetz
  -1 siblings, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2021-06-15 12:40 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linux-MM, kernel list, Youquan Song, Andrea Arcangeli, Jan Kara,
	Mike Kravetz, John Hubbard, Kirill A. Shutemov

On Tue, Jun 15, 2021 at 01:03:53PM +0200, Jann Horn wrote:
> The messier path, as the original commit describes, is "gigantic" page
> allocation. In that case, we'll go through the following path (if we
> ignore CMA):
> 
>   alloc_fresh_huge_page():
>     alloc_gigantic_page()
>       alloc_contig_pages()
>         __alloc_contig_pages()
>           alloc_contig_range()
>             isolate_freepages_range()
>               split_map_pages()
>                 post_alloc_hook() [FOR EVERY PAGE]
>                   set_page_refcounted()
>                     set_page_count(page, 1)
>     prep_compound_gigantic_page()
>       set_page_count(p, 0) [FOR EVERY TAIL PAGE]
> 
> so all the tail pages are initially allocated with refcount 1 by the
> page allocator, and then we overwrite those refcounts with zeroes.
> 
> 
> Luckily, the only non-__init codepath that can get here is
> __nr_hugepages_store_common(), which is only invoked from privileged
> writes to sysfs/sysctls.

Argh.  What if we passed __GFP_COMP into alloc_contig_pages()?
The current callers of alloc_contig_range() do not pass __GFP_COMP,
so it's no behaviour change for them, and __GFP_COMP implies this
kind of behaviour.  I think that would imply _not_ calling
split_map_pages(), which implies not calling post_alloc_hook(),
which means we probably need to do a lot of the parts of
post_alloc_hook() in alloc_gigantic_page().  Yuck.



* Re: page refcount race between prep_compound_gigantic_page() and __page_cache_add_speculative()?
  2021-06-15 12:40 ` Matthew Wilcox
@ 2021-06-15 18:27   ` Mike Kravetz
  0 siblings, 0 replies; 4+ messages in thread
From: Mike Kravetz @ 2021-06-15 18:27 UTC (permalink / raw)
  To: Matthew Wilcox, Jann Horn
  Cc: Linux-MM, kernel list, Youquan Song, Andrea Arcangeli, Jan Kara,
	John Hubbard, Kirill A. Shutemov

On 6/15/21 5:40 AM, Matthew Wilcox wrote:
> On Tue, Jun 15, 2021 at 01:03:53PM +0200, Jann Horn wrote:
>> The messier path, as the original commit describes, is "gigantic" page
>> allocation. In that case, we'll go through the following path (if we
>> ignore CMA):
>>
>>   alloc_fresh_huge_page():
>>     alloc_gigantic_page()
>>       alloc_contig_pages()
>>         __alloc_contig_pages()
>>           alloc_contig_range()
>>             isolate_freepages_range()
>>               split_map_pages()
>>                 post_alloc_hook() [FOR EVERY PAGE]
>>                   set_page_refcounted()
>>                     set_page_count(page, 1)
>>     prep_compound_gigantic_page()
>>       set_page_count(p, 0) [FOR EVERY TAIL PAGE]
>>
>> so all the tail pages are initially allocated with refcount 1 by the
>> page allocator, and then we overwrite those refcounts with zeroes.
>>
>>
>> Luckily, the only non-__init codepath that can get here is
>> __nr_hugepages_store_common(), which is only invoked from privileged
>> writes to sysfs/sysctls.

Thanks for spotting this Jann!

> Argh.  What if we passed __GFP_COMP into alloc_contig_pages()?
> The current callers of alloc_contig_range() do not pass __GFP_COMP,
> so it's no behaviour change for them, and __GFP_COMP implies this
> kind of behaviour.  I think that would imply _not_ calling
> split_map_pages(), which implies not calling post_alloc_hook(),
> which means we probably need to do a lot of the parts of
> post_alloc_hook() in alloc_gigantic_page().  Yuck.

That might work.  We would need to do something 'like' split_map_pages
to split the compound free pages in the allocated range.  Then, stitch
them together into one big compound page.  We 'should' be able to call
post_alloc_hook on the resulting big compound page.  Of course, that is
all theory without digging into the details.

Note that in the general case alloc_contig_range/alloc_contig_pages can
be called to request a non-power-of-two number of pages.  In such cases
__GFP_COMP would make little sense.
-- 
Mike Kravetz

