linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
@ 2016-12-20  9:32 Dashi DS1 Cao
  2016-12-21 14:43 ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Dashi DS1 Cao @ 2016-12-20  9:32 UTC (permalink / raw)
  To: linux-mm, linux-kernel

I've collected four crash dumps with similar backtrace. 

PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
 #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
 #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
 #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
 #3 [ffff881fcfad7ad0] die at ffffffff8101859b
 #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
 #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
    RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 91550b2b32f5a3e8
    RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
    R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
    R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
 #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
 #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
 #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
#10 [ffff881fcfad7e20] kswapd at ffffffff81180813
#11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
#12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98

I suspect my customer hits into a small window of a race condition in mm/rmap.c: page_lock_anon_vma_read.
struct anon_vma *page_lock_anon_vma_read(struct page *page)
{
        struct anon_vma *anon_vma = NULL;
        struct anon_vma *root_anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long)READ_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))
                goto out;

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
        root_anon_vma = READ_ONCE(anon_vma->root);
        if (down_read_trylock(&root_anon_vma->rwsem)) {
                /*
                 * If the page is still mapped, then this anon_vma is still
                 * its anon_vma, and holding the mutex ensures that it will
                 * not go away, see anon_vma_free().
                 */
                if (!page_mapped(page)) {
                        up_read(&root_anon_vma->rwsem);
                        anon_vma = NULL;
                }
                goto out;
        }
...
}

Between the time the two "page_mapped(page)" are checked, the address (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However it seems that anon_vma->root could still be read in but the value is wild. So the kernel crashes in down_read_trylock. But it's weird that all the "struct page" has its member "_mapcount" still with value 0, not -1, in the four crashes.

Thanks,
Dashi Cao

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-20  9:32 A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read Dashi DS1 Cao
@ 2016-12-21 14:43 ` Michal Hocko
  2016-12-22 11:53   ` Dashi DS1 Cao
  2016-12-22 13:51   ` Peter Zijlstra
  0 siblings, 2 replies; 13+ messages in thread
From: Michal Hocko @ 2016-12-21 14:43 UTC (permalink / raw)
  To: Dashi DS1 Cao; +Cc: linux-mm, linux-kernel, Peter Zijlstra

anon_vma locking is clever^Wsubtle as hell. CC Peter...

On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
> I've collected four crash dumps with similar backtrace. 
> 
> PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
>  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
>  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
>  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
>  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
>  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
>  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
>     [exception RIP: down_read_trylock+9]
>     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
>     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
>     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 91550b2b32f5a3e8

rdi is obviously a mess - smells like a string. So either sombody has
overwritten root_anon_vma or this is really a use after free...

>     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
>     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
>     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
>  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
>  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
>  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
> #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
> #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
> #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
> 
> I suspect my customer hits into a small window of a race condition in mm/rmap.c: page_lock_anon_vma_read.
> struct anon_vma *page_lock_anon_vma_read(struct page *page)
> {
>         struct anon_vma *anon_vma = NULL;
>         struct anon_vma *root_anon_vma;
>         unsigned long anon_mapping;
> 
>         rcu_read_lock();
>         anon_mapping = (unsigned long)READ_ONCE(page->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 goto out;
>         if (!page_mapped(page))
>                 goto out;
> 
>         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>         root_anon_vma = READ_ONCE(anon_vma->root);

Could you dump the anon_vma and struct page as well?

>         if (down_read_trylock(&root_anon_vma->rwsem)) {
>                 /*
>                  * If the page is still mapped, then this anon_vma is still
>                  * its anon_vma, and holding the mutex ensures that it will
>                  * not go away, see anon_vma_free().
>                  */
>                 if (!page_mapped(page)) {
>                         up_read(&root_anon_vma->rwsem);
>                         anon_vma = NULL;
>                 }
>                 goto out;
>         }
> ...
> }
> 
> Between the time the two "page_mapped(page)" are checked, the address
> (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However it seems
> that anon_vma->root could still be read in but the value is wild. So
> the kernel crashes in down_read_trylock. But it's weird that all the
> "struct page" has its member "_mapcount" still with value 0, not -1,
> in the four crashes.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-21 14:43 ` Michal Hocko
@ 2016-12-22 11:53   ` Dashi DS1 Cao
  2016-12-22 12:03     ` Michal Hocko
  2016-12-22 12:43     ` Dashi DS1 Cao
  2016-12-22 13:51   ` Peter Zijlstra
  1 sibling, 2 replies; 13+ messages in thread
From: Dashi DS1 Cao @ 2016-12-22 11:53 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, Peter Zijlstra

I've used another dump with similar backtrace.

PID: 246    TASK: ffff881fd27df300  CPU: 0   COMMAND: "kswapd0"
 #0 [ffff881fcfb23748] machine_kexec at ffffffff81051e9b
 #1 [ffff881fcfb237a8] crash_kexec at ffffffff810f27e2
 #2 [ffff881fcfb23878] oops_end at ffffffff8163f448
 #3 [ffff881fcfb238a0] no_context at ffffffff8162f561
 #4 [ffff881fcfb238f0] __bad_area_nosemaphore at ffffffff8162f5f7
 #5 [ffff881fcfb23938] bad_area_nosemaphore at ffffffff8162f761
 #6 [ffff881fcfb23948] __do_page_fault at ffffffff816421ce
 #7 [ffff881fcfb239a8] do_page_fault at ffffffff81642363
 #8 [ffff881fcfb239d0] page_fault at ffffffff8163e648
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff810aa9f9  RSP: ffff881fcfb23a88  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff8820833ed940  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000008
    RBP: ffff881fcfb23a88   R8: ffffea00779b3a60   R9: ffff883fd0d7b070
    R10: 000000000000000e  R11: ffffea00049e9580  R12: ffff8820833ed941
    R13: ffffea00779b3a40  R14: 0000000000000008  R15: ffffea00779b3a40
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff881fcfb23a90] page_lock_anon_vma_read at ffffffff811a3365
#10 [ffff881fcfb23ac0] page_referenced at ffffffff811a35e7
#11 [ffff881fcfb23b38] shrink_active_list at ffffffff8117e8cc
#12 [ffff881fcfb23bf0] shrink_lruvec at ffffffff8117ef8d
#13 [ffff881fcfb23cf0] shrink_zone at ffffffff8117f2a6
#14 [ffff881fcfb23d48] balance_pgdat at ffffffff8118054c
#15 [ffff881fcfb23e20] kswapd at ffffffff81180813
#16 [ffff881fcfb23ec8] kthread at ffffffff810a5b8f
#17 [ffff881fcfb23f50] ret_from_fork at ffffffff81646a98
crash> print *((struct page *)0xffffea00779b3a40j)
$1 = {
  flags = 13510794587668552, 
  mapping = 0xffff8820833ed941, 
  {
    {
      index = 34194823743, 
      freelist = 0x7f62b9a3f, 
      pfmemalloc = 63, 
      pmd_huge_pte = 0x7f62b9a3f
    }, 
    {
      counters = 8589934592, 
      {
        {
          _mapcount = {
            counter = 0
          }, 
          {
            inuse = 0, 
            objects = 0, 
            frozen = 0
          }, 
          units = 0
        }, 
        _count = {
          counter = 2
         }
      }
    }
  }, 
  {
    lru = {
      next = 0xdead000000100100, 
      prev = 0xdead000000200200
    }, 
    {
      next = 0xdead000000100100, 
      pages = 2097664, 
      pobjects = -559087616
    }, 
    list = {
      next = 0xdead000000100100, 
      prev = 0xdead000000200200
    }, 
    slab_page = 0xdead000000100100
  }, 
  {
    private = 0, 
    ptl = {
      {
        rlock = {
          raw_lock = {
             {
              head_tail = 0, 
              tickets = {
                head = 0, 
                tail = 0
              }
            }
          }
        }
      }
    }, 
    slab_cache = 0x0, 
    first_page = 0x0
  }
}
crash>  disassemble page_lock_anon_vma_read
Dump of assembler code for function page_lock_anon_vma_read:
   0xffffffff811a3310 <+0>:     nopl   0x0(%rax,%rax,1)
   0xffffffff811a3315 <+5>:     push   %rbp
   0xffffffff811a3316 <+6>:     mov    %rsp,%rbp
   0xffffffff811a3319 <+9>:     push   %r14
   0xffffffff811a331b <+11>:    push   %r13
   0xffffffff811a331d <+13>:    mov    %rdi,%r13
   0xffffffff811a3320 <+16>:    push   %r12
   0xffffffff811a3322 <+18>:    push   %rbx
   0xffffffff811a3323 <+19>:    mov    0x8(%rdi),%r12
   0xffffffff811a3327 <+23>:    mov    %r12,%rax
   0xffffffff811a332a <+26>:    and    $0x3,%eax
   0xffffffff811a332d <+29>:    cmp    $0x1,%rax
   0xffffffff811a3331 <+33>:    je     0xffffffff811a3348 <page_lock_anon_vma_read+56>
   0xffffffff811a3333 <+35>:    xor    %ebx,%ebx
   0xffffffff811a3335 <+37>:    mov    %rbx,%rax
   0xffffffff811a3338 <+40>:    pop    %rbx
   0xffffffff811a3339 <+41>:    pop    %r12
   0xffffffff811a333b <+43>:    pop    %r13
   0xffffffff811a333d <+45>:    pop    %r14
   0xffffffff811a333f <+47>:    pop    %rbp
   0xffffffff811a3340 <+48>:    retq   
   0xffffffff811a3341 <+49>:    nopl   0x0(%rax)
   0xffffffff811a3348 <+56>:    mov    0x18(%rdi),%eax
   0xffffffff811a334b <+59>:    test   %eax,%eax
   0xffffffff811a334d <+61>:    js     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a334f <+63>:    mov    -0x1(%r12),%r14
   0xffffffff811a3354 <+68>:    lea    -0x1(%r12),%rbx
   0xffffffff811a3359 <+73>:    add    $0x8,%r14
   0xffffffff811a335d <+77>:    mov    %r14,%rdi
   0xffffffff811a3360 <+80>:    callq  0xffffffff810aa9f0 <down_read_trylock>
   0xffffffff811a3365 <+85>:    test   %eax,%eax
   0xffffffff811a3367 <+87>:    je     0xffffffff811a3380 <page_lock_anon_vma_read+112>
   0xffffffff811a3369 <+89>:    mov    0x18(%r13),%eax
   0xffffffff811a336d <+93>:    test   %eax,%eax
   0xffffffff811a336f <+95>:    jns    0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a3371 <+97>:    mov    %r14,%rdi
   0xffffffff811a3374 <+100>:   xor    %ebx,%ebx
   0xffffffff811a3376 <+102>:   callq  0xffffffff810aaa50 <up_read>
   0xffffffff811a337b <+107>:   jmp    0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a337d <+109>:   nopl   (%rax)
   0xffffffff811a3380 <+112>:   mov    0x28(%rbx),%edx
   0xffffffff811a3383 <+115>:   test   %edx,%edx
   0xffffffff811a3385 <+117>:   je     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a3387 <+119>:   lea    0x1(%rdx),%ecx
   0xffffffff811a338a <+122>:   lea    0x27(%r12),%rsi
   0xffffffff811a338f <+127>:   mov    %edx,%eax
   0xffffffff811a3391 <+129>:   lock cmpxchg %ecx,0x27(%r12)
   0xffffffff811a3398 <+136>:   cmp    %edx,%eax
   0xffffffff811a339a <+138>:   mov    %eax,%ecx
   0xffffffff811a339c <+140>:   jne    0xffffffff811a3402 <page_lock_anon_vma_read+242>
   0xffffffff811a339e <+142>:   mov    0x18(%r13),%eax
   0xffffffff811a33a2 <+146>:   test   %eax,%eax
   0xffffffff811a33a4 <+148>:   js     0xffffffff811a33e2 <page_lock_anon_vma_read+210>
   0xffffffff811a33a6 <+150>:   mov    -0x1(%r12),%rax
   0xffffffff811a33ab <+155>:   lea    0x8(%rax),%rdi
   0xffffffff811a33af <+159>:   callq  0xffffffff8163ad30 <down_read>
   0xffffffff811a33b4 <+164>:   lock decl 0x27(%r12)
   0xffffffff811a33ba <+170>:   sete   %al
   0xffffffff811a33bd <+173>:   test   %al,%al
   0xffffffff811a33bf <+175>:   je     0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a33c5 <+181>:   mov    -0x1(%r12),%rdi
   0xffffffff811a33ca <+186>:   add    $0x8,%rdi
   0xffffffff811a33ce <+190>:   callq  0xffffffff810aaa50 <up_read>
   0xffffffff811a33d3 <+195>:   mov    %rbx,%rdi
   0xffffffff811a33d6 <+198>:   xor    %ebx,%ebx
   0xffffffff811a33d8 <+200>:   callq  0xffffffff811a2dd0 <__put_anon_vma>
   0xffffffff811a33dd <+205>:   jmpq   0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a33e2 <+210>:   lock decl 0x27(%r12)
   0xffffffff811a33e8 <+216>:   sete   %al
   0xffffffff811a33eb <+219>:   test   %al,%al
   0xffffffff811a33ed <+221>:   je     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a33f3 <+227>:   mov    %rbx,%rdi
   0xffffffff811a33f6 <+230>:   xor    %ebx,%ebx
   0xffffffff811a33f8 <+232>:   callq  0xffffffff811a2dd0 <__put_anon_vma>
   0xffffffff811a33fd <+237>:   jmpq   0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a3402 <+242>:   test   %ecx,%ecx
   0xffffffff811a3404 <+244>:   je     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a340a <+250>:   lea    0x1(%rcx),%edx
   0xffffffff811a340d <+253>:   mov    %ecx,%eax
   0xffffffff811a340f <+255>:   lock cmpxchg %edx,(%rsi)
   0xffffffff811a3413 <+259>:   cmp    %eax,%ecx
   0xffffffff811a3415 <+261>:   je     0xffffffff811a339e <page_lock_anon_vma_read+142>
   0xffffffff811a3417 <+263>:   mov    %eax,%ecx
   0xffffffff811a3419 <+265>:   jmp    0xffffffff811a3402 <page_lock_anon_vma_read+242>
End of assembler dump.
crash>  

Dashi Cao
-----Original Message-----
From: Michal Hocko [mailto:mhocko@kernel.org] 
Sent: Wednesday, December 21, 2016 10:44 PM
To: Dashi DS1 Cao <caods1@lenovo.com>
Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Peter Zijlstra <peterz@infradead.org>
Subject: Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read

anon_vma locking is clever^Wsubtle as hell. CC Peter...

On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
> I've collected four crash dumps with similar backtrace. 
> 
> PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
>  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
>  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
>  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
>  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
>  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
>  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
>     [exception RIP: down_read_trylock+9]
>     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
>     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
>     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 
> 91550b2b32f5a3e8

rdi is obviously a mess - smells like a string. So either sombody has overwritten root_anon_vma or this is really a use after free...

>     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
>     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
>     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
>  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
>  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
>  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
> #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
> #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
> #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
> 
> I suspect my customer hits into a small window of a race condition in mm/rmap.c: page_lock_anon_vma_read.
> struct anon_vma *page_lock_anon_vma_read(struct page *page) {
>         struct anon_vma *anon_vma = NULL;
>         struct anon_vma *root_anon_vma;
>         unsigned long anon_mapping;
> 
>         rcu_read_lock();
>         anon_mapping = (unsigned long)READ_ONCE(page->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 goto out;
>         if (!page_mapped(page))
>                 goto out;
> 
>         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>         root_anon_vma = READ_ONCE(anon_vma->root);

Could you dump the anon_vma and struct page as well?

>         if (down_read_trylock(&root_anon_vma->rwsem)) {
>                 /*
>                  * If the page is still mapped, then this anon_vma is still
>                  * its anon_vma, and holding the mutex ensures that it will
>                  * not go away, see anon_vma_free().
>                  */
>                 if (!page_mapped(page)) {
>                         up_read(&root_anon_vma->rwsem);
>                         anon_vma = NULL;
>                 }
>                 goto out;
>         }
> ...
> }
> 
> Between the time the two "page_mapped(page)" are checked, the address 
> (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However it seems that 
> anon_vma->root could still be read in but the value is wild. So the 
> kernel crashes in down_read_trylock. But it's weird that all the 
> "struct page" has its member "_mapcount" still with value 0, not -1, 
> in the four crashes.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-22 11:53   ` Dashi DS1 Cao
@ 2016-12-22 12:03     ` Michal Hocko
  2016-12-22 12:43     ` Dashi DS1 Cao
  1 sibling, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2016-12-22 12:03 UTC (permalink / raw)
  To: Dashi DS1 Cao; +Cc: linux-mm, linux-kernel, Peter Zijlstra

On Thu 22-12-16 11:53:26, Dashi DS1 Cao wrote:
> I've used another dump with similar backtrace.

Please also dump the anon_vma of the page as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-22 11:53   ` Dashi DS1 Cao
  2016-12-22 12:03     ` Michal Hocko
@ 2016-12-22 12:43     ` Dashi DS1 Cao
  1 sibling, 0 replies; 13+ messages in thread
From: Dashi DS1 Cao @ 2016-12-22 12:43 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, Peter Zijlstra

Value of anon_vma:

print *((struct anon_vma *)0xffff8820833ed940)
$2 = {
  root = 0x0, 
  rwsem = {
    count = 0, 
    wait_lock = {
      raw_lock = {
        {
          head_tail = 0, 
          tickets = {
            head = 0, 
            tail = 0
          }
        }
      }
    }, 
    wait_list = {
      next = 0x0, 
      prev = 0x0
    }
  }, 
  refcount = {
    counter = 0
  }, 
  rb_root = {
    rb_node = 0x0
  }
}
crash>

-----Original Message-----
From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Dashi DS1 Cao
Sent: Thursday, December 22, 2016 7:53 PM
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Peter Zijlstra <peterz@infradead.org>
Subject: RE: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read

I've used another dump with similar backtrace.

PID: 246    TASK: ffff881fd27df300  CPU: 0   COMMAND: "kswapd0"
 #0 [ffff881fcfb23748] machine_kexec at ffffffff81051e9b
 #1 [ffff881fcfb237a8] crash_kexec at ffffffff810f27e2
 #2 [ffff881fcfb23878] oops_end at ffffffff8163f448
 #3 [ffff881fcfb238a0] no_context at ffffffff8162f561
 #4 [ffff881fcfb238f0] __bad_area_nosemaphore at ffffffff8162f5f7
 #5 [ffff881fcfb23938] bad_area_nosemaphore at ffffffff8162f761
 #6 [ffff881fcfb23948] __do_page_fault at ffffffff816421ce
 #7 [ffff881fcfb239a8] do_page_fault at ffffffff81642363
 #8 [ffff881fcfb239d0] page_fault at ffffffff8163e648
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff810aa9f9  RSP: ffff881fcfb23a88  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff8820833ed940  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000008
    RBP: ffff881fcfb23a88   R8: ffffea00779b3a60   R9: ffff883fd0d7b070
    R10: 000000000000000e  R11: ffffea00049e9580  R12: ffff8820833ed941
    R13: ffffea00779b3a40  R14: 0000000000000008  R15: ffffea00779b3a40
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff881fcfb23a90] page_lock_anon_vma_read at ffffffff811a3365
#10 [ffff881fcfb23ac0] page_referenced at ffffffff811a35e7
#11 [ffff881fcfb23b38] shrink_active_list at ffffffff8117e8cc
#12 [ffff881fcfb23bf0] shrink_lruvec at ffffffff8117ef8d
#13 [ffff881fcfb23cf0] shrink_zone at ffffffff8117f2a6
#14 [ffff881fcfb23d48] balance_pgdat at ffffffff8118054c
#15 [ffff881fcfb23e20] kswapd at ffffffff81180813
#16 [ffff881fcfb23ec8] kthread at ffffffff810a5b8f
#17 [ffff881fcfb23f50] ret_from_fork at ffffffff81646a98
crash> print *((struct page *)0xffffea00779b3a40j)
$1 = {
  flags = 13510794587668552,
  mapping = 0xffff8820833ed941,
  {
    {
      index = 34194823743, 
      freelist = 0x7f62b9a3f, 
      pfmemalloc = 63, 
      pmd_huge_pte = 0x7f62b9a3f
    }, 
    {
      counters = 8589934592, 
      {
        {
          _mapcount = {
            counter = 0
          }, 
          {
            inuse = 0, 
            objects = 0, 
            frozen = 0
          }, 
          units = 0
        }, 
        _count = {
          counter = 2
         }
      }
    }
  },
  {
    lru = {
      next = 0xdead000000100100, 
      prev = 0xdead000000200200
    }, 
    {
      next = 0xdead000000100100, 
      pages = 2097664, 
      pobjects = -559087616
    }, 
    list = {
      next = 0xdead000000100100, 
      prev = 0xdead000000200200
    }, 
    slab_page = 0xdead000000100100
  },
  {
    private = 0, 
    ptl = {
      {
        rlock = {
          raw_lock = {
             {
              head_tail = 0, 
              tickets = {
                head = 0, 
                tail = 0
              }
            }
          }
        }
      }
    }, 
    slab_cache = 0x0, 
    first_page = 0x0
  }
}
crash>  disassemble page_lock_anon_vma_read
Dump of assembler code for function page_lock_anon_vma_read:
   0xffffffff811a3310 <+0>:     nopl   0x0(%rax,%rax,1)
   0xffffffff811a3315 <+5>:     push   %rbp
   0xffffffff811a3316 <+6>:     mov    %rsp,%rbp
   0xffffffff811a3319 <+9>:     push   %r14
   0xffffffff811a331b <+11>:    push   %r13
   0xffffffff811a331d <+13>:    mov    %rdi,%r13
   0xffffffff811a3320 <+16>:    push   %r12
   0xffffffff811a3322 <+18>:    push   %rbx
   0xffffffff811a3323 <+19>:    mov    0x8(%rdi),%r12
   0xffffffff811a3327 <+23>:    mov    %r12,%rax
   0xffffffff811a332a <+26>:    and    $0x3,%eax
   0xffffffff811a332d <+29>:    cmp    $0x1,%rax
   0xffffffff811a3331 <+33>:    je     0xffffffff811a3348 <page_lock_anon_vma_read+56>
   0xffffffff811a3333 <+35>:    xor    %ebx,%ebx
   0xffffffff811a3335 <+37>:    mov    %rbx,%rax
   0xffffffff811a3338 <+40>:    pop    %rbx
   0xffffffff811a3339 <+41>:    pop    %r12
   0xffffffff811a333b <+43>:    pop    %r13
   0xffffffff811a333d <+45>:    pop    %r14
   0xffffffff811a333f <+47>:    pop    %rbp
   0xffffffff811a3340 <+48>:    retq   
   0xffffffff811a3341 <+49>:    nopl   0x0(%rax)
   0xffffffff811a3348 <+56>:    mov    0x18(%rdi),%eax
   0xffffffff811a334b <+59>:    test   %eax,%eax
   0xffffffff811a334d <+61>:    js     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a334f <+63>:    mov    -0x1(%r12),%r14
   0xffffffff811a3354 <+68>:    lea    -0x1(%r12),%rbx
   0xffffffff811a3359 <+73>:    add    $0x8,%r14
   0xffffffff811a335d <+77>:    mov    %r14,%rdi
   0xffffffff811a3360 <+80>:    callq  0xffffffff810aa9f0 <down_read_trylock>
   0xffffffff811a3365 <+85>:    test   %eax,%eax
   0xffffffff811a3367 <+87>:    je     0xffffffff811a3380 <page_lock_anon_vma_read+112>
   0xffffffff811a3369 <+89>:    mov    0x18(%r13),%eax
   0xffffffff811a336d <+93>:    test   %eax,%eax
   0xffffffff811a336f <+95>:    jns    0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a3371 <+97>:    mov    %r14,%rdi
   0xffffffff811a3374 <+100>:   xor    %ebx,%ebx
   0xffffffff811a3376 <+102>:   callq  0xffffffff810aaa50 <up_read>
   0xffffffff811a337b <+107>:   jmp    0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a337d <+109>:   nopl   (%rax)
   0xffffffff811a3380 <+112>:   mov    0x28(%rbx),%edx
   0xffffffff811a3383 <+115>:   test   %edx,%edx
   0xffffffff811a3385 <+117>:   je     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a3387 <+119>:   lea    0x1(%rdx),%ecx
   0xffffffff811a338a <+122>:   lea    0x27(%r12),%rsi
   0xffffffff811a338f <+127>:   mov    %edx,%eax
   0xffffffff811a3391 <+129>:   lock cmpxchg %ecx,0x27(%r12)
   0xffffffff811a3398 <+136>:   cmp    %edx,%eax
   0xffffffff811a339a <+138>:   mov    %eax,%ecx
   0xffffffff811a339c <+140>:   jne    0xffffffff811a3402 <page_lock_anon_vma_read+242>
   0xffffffff811a339e <+142>:   mov    0x18(%r13),%eax
   0xffffffff811a33a2 <+146>:   test   %eax,%eax
   0xffffffff811a33a4 <+148>:   js     0xffffffff811a33e2 <page_lock_anon_vma_read+210>
   0xffffffff811a33a6 <+150>:   mov    -0x1(%r12),%rax
   0xffffffff811a33ab <+155>:   lea    0x8(%rax),%rdi
   0xffffffff811a33af <+159>:   callq  0xffffffff8163ad30 <down_read>
   0xffffffff811a33b4 <+164>:   lock decl 0x27(%r12)
   0xffffffff811a33ba <+170>:   sete   %al
   0xffffffff811a33bd <+173>:   test   %al,%al
   0xffffffff811a33bf <+175>:   je     0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a33c5 <+181>:   mov    -0x1(%r12),%rdi
   0xffffffff811a33ca <+186>:   add    $0x8,%rdi
   0xffffffff811a33ce <+190>:   callq  0xffffffff810aaa50 <up_read>
   0xffffffff811a33d3 <+195>:   mov    %rbx,%rdi
   0xffffffff811a33d6 <+198>:   xor    %ebx,%ebx
   0xffffffff811a33d8 <+200>:   callq  0xffffffff811a2dd0 <__put_anon_vma>
   0xffffffff811a33dd <+205>:   jmpq   0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a33e2 <+210>:   lock decl 0x27(%r12)
   0xffffffff811a33e8 <+216>:   sete   %al
   0xffffffff811a33eb <+219>:   test   %al,%al
   0xffffffff811a33ed <+221>:   je     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a33f3 <+227>:   mov    %rbx,%rdi
   0xffffffff811a33f6 <+230>:   xor    %ebx,%ebx
   0xffffffff811a33f8 <+232>:   callq  0xffffffff811a2dd0 <__put_anon_vma>
   0xffffffff811a33fd <+237>:   jmpq   0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a3402 <+242>:   test   %ecx,%ecx
   0xffffffff811a3404 <+244>:   je     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a340a <+250>:   lea    0x1(%rcx),%edx
   0xffffffff811a340d <+253>:   mov    %ecx,%eax
   0xffffffff811a340f <+255>:   lock cmpxchg %edx,(%rsi)
   0xffffffff811a3413 <+259>:   cmp    %eax,%ecx
   0xffffffff811a3415 <+261>:   je     0xffffffff811a339e <page_lock_anon_vma_read+142>
   0xffffffff811a3417 <+263>:   mov    %eax,%ecx
   0xffffffff811a3419 <+265>:   jmp    0xffffffff811a3402 <page_lock_anon_vma_read+242>
End of assembler dump.
crash>  

Dashi Cao
-----Original Message-----
From: Michal Hocko [mailto:mhocko@kernel.org]
Sent: Wednesday, December 21, 2016 10:44 PM
To: Dashi DS1 Cao <caods1@lenovo.com>
Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Peter Zijlstra <peterz@infradead.org>
Subject: Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read

anon_vma locking is clever^Wsubtle as hell. CC Peter...

On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
> I've collected four crash dumps with similar backtrace. 
> 
> PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
>  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
>  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
>  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
>  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
>  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
>  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
>     [exception RIP: down_read_trylock+9]
>     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
>     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
>     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 
> 91550b2b32f5a3e8

rdi is obviously a mess - smells like a string. So either sombody has overwritten root_anon_vma or this is really a use after free...

>     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
>     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
>     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
>  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
>  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
>  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
> #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
> #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
> #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
> 
> I suspect my customer hits into a small window of a race condition in mm/rmap.c: page_lock_anon_vma_read.
> struct anon_vma *page_lock_anon_vma_read(struct page *page) {
>         struct anon_vma *anon_vma = NULL;
>         struct anon_vma *root_anon_vma;
>         unsigned long anon_mapping;
> 
>         rcu_read_lock();
>         anon_mapping = (unsigned long)READ_ONCE(page->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 goto out;
>         if (!page_mapped(page))
>                 goto out;
> 
>         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>         root_anon_vma = READ_ONCE(anon_vma->root);

Could you dump the anon_vma and struct page as well?

>         if (down_read_trylock(&root_anon_vma->rwsem)) {
>                 /*
>                  * If the page is still mapped, then this anon_vma is still
>                  * its anon_vma, and holding the mutex ensures that it will
>                  * not go away, see anon_vma_free().
>                  */
>                 if (!page_mapped(page)) {
>                         up_read(&root_anon_vma->rwsem);
>                         anon_vma = NULL;
>                 }
>                 goto out;
>         }
> ...
> }
> 
> Between the time the two "page_mapped(page)" are checked, the address 
> (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However it seems that 
> anon_vma->root could still be read in but the value is wild. So the 
> kernel crashes in down_read_trylock. But it's weird that all the 
> "struct page" has its member "_mapcount" still with value 0, not -1, 
> in the four crashes.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-21 14:43 ` Michal Hocko
  2016-12-22 11:53   ` Dashi DS1 Cao
@ 2016-12-22 13:51   ` Peter Zijlstra
  2016-12-22 22:26     ` Hugh Dickins
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2016-12-22 13:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Dashi DS1 Cao, linux-mm, linux-kernel, Hugh Dickins

On Wed, Dec 21, 2016 at 03:43:43PM +0100, Michal Hocko wrote:
> anon_vma locking is clever^Wsubtle as hell. CC Peter...
> 
> On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
> > I've collected four crash dumps with similar backtrace. 
> > 
> > PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
> >  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
> >  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
> >  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
> >  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
> >  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
> >  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
> >     [exception RIP: down_read_trylock+9]
> >     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
> >     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
> >     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 91550b2b32f5a3e8
> 
> rdi is obviously a mess - smells like a string. So either sombody has
> overwritten root_anon_vma or this is really a use after free...

e8 - �
a3 - �
f5 - �
32 - 2
2b - +
 b - 

55 - U
91 - �

Not a string..

> >     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
> >     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
> >     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
> >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
> >  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
> >  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
> >  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
> >  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
> > #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
> > #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
> > #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
> > 
> > I suspect my customer hits into a small window of a race condition in mm/rmap.c: page_lock_anon_vma_read.
> > struct anon_vma *page_lock_anon_vma_read(struct page *page)
> > {
> >         struct anon_vma *anon_vma = NULL;
> >         struct anon_vma *root_anon_vma;
> >         unsigned long anon_mapping;
> > 
> >         rcu_read_lock();
> >         anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> >         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> >                 goto out;
> >         if (!page_mapped(page))
> >                 goto out;
> > 
> >         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> >         root_anon_vma = READ_ONCE(anon_vma->root);
> 
> Could you dump the anon_vma and struct page as well?
> 
> >         if (down_read_trylock(&root_anon_vma->rwsem)) {
> >                 /*
> >                  * If the page is still mapped, then this anon_vma is still
> >                  * its anon_vma, and holding the mutex ensures that it will
> >                  * not go away, see anon_vma_free().
> >                  */
> >                 if (!page_mapped(page)) {
> >                         up_read(&root_anon_vma->rwsem);
> >                         anon_vma = NULL;
> >                 }
> >                 goto out;
> >         }
> > ...
> > }
> > 
> > Between the time the two "page_mapped(page)" are checked, the address
> > (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However it seems
> > that anon_vma->root could still be read in but the value is wild. So
> > the kernel crashes in down_read_trylock. But it's weird that all the
> > "struct page" has its member "_mapcount" still with value 0, not -1,
> > in the four crashes.

So the point is that while we hold rcu_read_lock() the actual memory
backing the anon_vmas cannot be freed. It can be reused, but only for
another anon_vma.

Now, anon_vma_alloc() sets ->root to self, while anon_vma_free() leaves
->root set to whatever. And any other ->root assignment is to a valid
anon_vma.

Therefore, the same rules that ensure anon_vma stays valid, should also
ensure anon_vma->root stays valid.

Now, one thing that might go wobbly is that ->root assignments are not
done using WRITE_ONCE(), this means a naughty compiler can miscompile
those stores and introduce store-tearing, if our READ_ONCE() would
observe such a tear, we'd be up some creek without a paddle.


Now, its been a long time since I looked at any of this code, and I see
that Hugh has fixed at least two wobblies in my original code.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-22 13:51   ` Peter Zijlstra
@ 2016-12-22 22:26     ` Hugh Dickins
  2016-12-23  2:02       ` Dashi DS1 Cao
  2016-12-23  2:38       ` Dashi DS1 Cao
  0 siblings, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2016-12-22 22:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Dashi DS1 Cao, linux-mm, linux-kernel, Hugh Dickins

On Thu, 22 Dec 2016, Peter Zijlstra wrote:
> On Wed, Dec 21, 2016 at 03:43:43PM +0100, Michal Hocko wrote:
> > anon_vma locking is clever^Wsubtle as hell. CC Peter...
> > 
> > On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
> > > I've collected four crash dumps with similar backtrace. 
> > > 
> > > PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
> > >  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
> > >  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
> > >  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
> > >  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
> > >  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
> > >  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
> > >     [exception RIP: down_read_trylock+9]
> > >     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
> > >     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
> > >     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 91550b2b32f5a3e8
> > 
> > rdi is obviously a mess - smells like a string. So either sombody has
> > overwritten root_anon_vma or this is really a use after free...
> 
> e8 - ???
> a3 - ???
> f5 - ???
> 32 - 2
> 2b - +
>  b - 
> 
> 55 - U
> 91 - ???
> 
> Not a string..
> 
> > >     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
> > >     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
> > >     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
> > >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
> > >  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
> > >  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
> > >  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
> > >  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
> > > #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
> > > #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
> > > #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
> > > 
> > > I suspect my customer hits into a small window of a race condition in mm/rmap.c: page_lock_anon_vma_read.
> > > struct anon_vma *page_lock_anon_vma_read(struct page *page)
> > > {
> > >         struct anon_vma *anon_vma = NULL;
> > >         struct anon_vma *root_anon_vma;
> > >         unsigned long anon_mapping;
> > > 
> > >         rcu_read_lock();
> > >         anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> > >         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > >                 goto out;
> > >         if (!page_mapped(page))
> > >                 goto out;
> > > 
> > >         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > >         root_anon_vma = READ_ONCE(anon_vma->root);
> > 
> > Could you dump the anon_vma and struct page as well?
> > 
> > >         if (down_read_trylock(&root_anon_vma->rwsem)) {
> > >                 /*
> > >                  * If the page is still mapped, then this anon_vma is still
> > >                  * its anon_vma, and holding the mutex ensures that it will
> > >                  * not go away, see anon_vma_free().
> > >                  */
> > >                 if (!page_mapped(page)) {
> > >                         up_read(&root_anon_vma->rwsem);
> > >                         anon_vma = NULL;
> > >                 }
> > >                 goto out;
> > >         }
> > > ...
> > > }
> > > 
> > > Between the time the two "page_mapped(page)" are checked, the address
> > > (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However it seems
> > > that anon_vma->root could still be read in but the value is wild. So
> > > the kernel crashes in down_read_trylock. But it's weird that all the
> > > "struct page" has its member "_mapcount" still with value 0, not -1,
> > > in the four crashes.
> 
> So the point is that while we hold rcu_read_lock() the actual memory
> backing the anon_vmas cannot be freed. It can be reused, but only for
> another anon_vma.
> 
> Now, anon_vma_alloc() sets ->root to self, while anon_vma_free() leaves
> ->root set to whatever. And any other ->root assignment is to a valid
> anon_vma.
> 
> Therefore, the same rules that ensure anon_vma stays valid, should also
> ensure anon_vma->root stays valid.
> 
> Now, one thing that might go wobbly is that ->root assignments are not
> done using WRITE_ONCE(), this means a naughty compiler can miscompile
> those stores and introduce store-tearing, if our READ_ONCE() would
> observe such a tear, we'd be up some creek without a paddle.

We would indeed.  And this being the season of goodwill, I'm biting
my tongue not to say what I think of the prospect of store tearing.
But that zeroed anon_vma implies tearing not the problem here anyway.

> 
> Now, its been a long time since I looked at any of this code, and I see
> that Hugh has fixed at least two wobblies in my original code.

Nothing much, and this (admittedly subtle) technique has been working
well for years, so I'm sceptical about "a small window for a race
condition".

But Dashi's right to point out that the struct page has _mapcount 0
(not -1 for logical 0) in these cases: it looks as if something is
freeing (or corrupting) the anon_vma despite it still having pages
mapped, or something is misaccounting (or corrupting) the _mapcount.

But I've no idea what, and we have not heard such reports elsewhere.
We don't even know what kernel this is - something special, perhaps?

Hugh

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-22 22:26     ` Hugh Dickins
@ 2016-12-23  2:02       ` Dashi DS1 Cao
  2016-12-23  3:34         ` Hugh Dickins
  2016-12-23 14:19         ` Peter Zijlstra
  2016-12-23  2:38       ` Dashi DS1 Cao
  1 sibling, 2 replies; 13+ messages in thread
From: Dashi DS1 Cao @ 2016-12-23  2:02 UTC (permalink / raw)
  To: Hugh Dickins, Peter Zijlstra; +Cc: Michal Hocko, linux-mm, linux-kernel

The kernel version is "RELEASE: 3.10.0-327.36.3.el7.x86_64". It was the latest kernel release of CentOS 7.2 at that time, or maybe still now.
I've tried to print the value of anon_vma from other three dumps, but the content is not available in the dump. 
"gdb: page excluded: kernel virtual address: ffff882b47ddadc0"
I guess it is not copied out because it has already been put into some unused list.

Dashi Cao

-----Original Message-----
From: Hugh Dickins [mailto:hughd@google.com] 
Sent: Friday, December 23, 2016 6:27 AM
To: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>; Dashi DS1 Cao <caods1@lenovo.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org; Hugh Dickins <hughd@google.com>
Subject: Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read

On Thu, 22 Dec 2016, Peter Zijlstra wrote:
> On Wed, Dec 21, 2016 at 03:43:43PM +0100, Michal Hocko wrote:
> > anon_vma locking is clever^Wsubtle as hell. CC Peter...
> > 
> > On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
> > > I've collected four crash dumps with similar backtrace. 
> > > 
> > > PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
> > >  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
> > >  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
> > >  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
> > >  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
> > >  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
> > >  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
> > >     [exception RIP: down_read_trylock+9]
> > >     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
> > >     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
> > >     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 
> > > 91550b2b32f5a3e8
> > 
> > rdi is obviously a mess - smells like a string. So either sombody 
> > has overwritten root_anon_vma or this is really a use after free...
> 
> e8 - ???
> a3 - ???
> f5 - ???
> 32 - 2
> 2b - +
>  b -
> 
> 55 - U
> 91 - ???
> 
> Not a string..
> 
> > >     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
> > >     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
> > >     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
> > >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
> > >  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
> > >  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
> > >  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
> > >  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
> > > #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
> > > #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
> > > #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
> > > 
> > > I suspect my customer hits into a small window of a race condition in mm/rmap.c: page_lock_anon_vma_read.
> > > struct anon_vma *page_lock_anon_vma_read(struct page *page) {
> > >         struct anon_vma *anon_vma = NULL;
> > >         struct anon_vma *root_anon_vma;
> > >         unsigned long anon_mapping;
> > > 
> > >         rcu_read_lock();
> > >         anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> > >         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > >                 goto out;
> > >         if (!page_mapped(page))
> > >                 goto out;
> > > 
> > >         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > >         root_anon_vma = READ_ONCE(anon_vma->root);
> > 
> > Could you dump the anon_vma and struct page as well?
> > 
> > >         if (down_read_trylock(&root_anon_vma->rwsem)) {
> > >                 /*
> > >                  * If the page is still mapped, then this anon_vma is still
> > >                  * its anon_vma, and holding the mutex ensures that it will
> > >                  * not go away, see anon_vma_free().
> > >                  */
> > >                 if (!page_mapped(page)) {
> > >                         up_read(&root_anon_vma->rwsem);
> > >                         anon_vma = NULL;
> > >                 }
> > >                 goto out;
> > >         }
> > > ...
> > > }
> > > 
> > > Between the time the two "page_mapped(page)" are checked, the 
> > > address (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However it 
> > > seems that anon_vma->root could still be read in but the value is 
> > > wild. So the kernel crashes in down_read_trylock. But it's weird 
> > > that all the "struct page" has its member "_mapcount" still with 
> > > value 0, not -1, in the four crashes.
> 
> So the point is that while we hold rcu_read_lock() the actual memory 
> backing the anon_vmas cannot be freed. It can be reused, but only for 
> another anon_vma.
> 
> Now, anon_vma_alloc() sets ->root to self, while anon_vma_free() 
> leaves
> ->root set to whatever. And any other ->root assignment is to a valid
> anon_vma.
> 
> Therefore, the same rules that ensure anon_vma stays valid, should 
> also ensure anon_vma->root stays valid.
> 
> Now, one thing that might go wobbly is that ->root assignments are not 
> done using WRITE_ONCE(), this means a naughty compiler can miscompile 
> those stores and introduce store-tearing, if our READ_ONCE() would 
> observe such a tear, we'd be up some creek without a paddle.

We would indeed.  And this being the season of goodwill, I'm biting my tongue not to say what I think of the prospect of store tearing.
But that zeroed anon_vma implies tearing not the problem here anyway.

> 
> Now, its been a long time since I looked at any of this code, and I 
> see that Hugh has fixed at least two wobblies in my original code.

Nothing much, and this (admittedly subtle) technique has been working well for years, so I'm sceptical about "a small window for a race condition".

But Dashi's right to point out that the struct page has _mapcount 0 (not -1 for logical 0) in these cases: it looks as if something is freeing (or corrupting) the anon_vma despite it still having pages mapped, or something is misaccounting (or corrupting) the _mapcount.

But I've no idea what, and we have not heard such reports elsewhere.
We don't even know what kernel this is - something special, perhaps?

Hugh

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-22 22:26     ` Hugh Dickins
  2016-12-23  2:02       ` Dashi DS1 Cao
@ 2016-12-23  2:38       ` Dashi DS1 Cao
  2017-04-22 12:07         ` zhong jiang
  1 sibling, 1 reply; 13+ messages in thread
From: Dashi DS1 Cao @ 2016-12-23  2:38 UTC (permalink / raw)
  To: Hugh Dickins, Peter Zijlstra; +Cc: Michal Hocko, linux-mm, linux-kernel

I'd expected that one or more tasks doing the free were the current task of other cpu cores, but only one of the four dumps has two swapd task that are concurrently at execution on different cpu.
This is the task leading to the crash:
PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
 #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
 #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
 #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
 #3 [ffff881fcfad7ad0] die at ffffffff8101859b
 #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
 #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
    RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 91550b2b32f5a3e8
    RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
    R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
    R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
 #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
 #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
 #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
#10 [ffff881fcfad7e20] kswapd at ffffffff81180813
#11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
#12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98

And this is the one at the same time:
PID: 246    TASK: ffff881fd27af300  CPU: 20  COMMAND: "kswapd0"
 #0 [ffff881fffd05e70] crash_nmi_callback at ffffffff81045982
 #1 [ffff881fffd05e80] nmi_handle at ffffffff8163f5d9
 #2 [ffff881fffd05ec8] do_nmi at ffffffff8163f6f0
 #3 [ffff881fffd05ef0] end_repeat_nmi at ffffffff8163ea13
    [exception RIP: free_pcppages_bulk+529]
    RIP: ffffffff81171ae1  RSP: ffff881fcfad38f0  RFLAGS: 00000087
    RAX: 002fffff0000002c  RBX: ffffea007606ae40  RCX: 0000000000000000
    RDX: ffffea007606ae00  RSI: 00000000000002b9  RDI: 0000000000000000
    RBP: ffff881fcfad3978   R8: 0000000000000000   R9: 0000000000000001
    R10: ffff88207ffda000  R11: 0000000000000002  R12: ffffea007606ae40
    R13: 0000000000000002  R14: ffff88207ffda000  R15: 00000000000002b8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff881fcfad38f0] free_pcppages_bulk at ffffffff81171ae1
 #5 [ffff881fcfad3980] free_hot_cold_page at ffffffff81171f08
 #6 [ffff881fcfad39b8] free_hot_cold_page_list at ffffffff81171f76
 #7 [ffff881fcfad39f0] shrink_page_list at ffffffff8117d71e
 #8 [ffff881fcfad3b28] shrink_inactive_list at ffffffff8117e37a
 #9 [ffff881fcfad3bf0] shrink_lruvec at ffffffff8117ee45
#10 [ffff881fcfad3cf0] shrink_zone at ffffffff8117f2a6
#11 [ffff881fcfad3d48] balance_pgdat at ffffffff8118054c
#12 [ffff881fcfad3e20] kswapd at ffffffff81180813
#13 [ffff881fcfad3ec8] kthread at ffffffff810a5b8f
#14 [ffff881fcfad3f50] ret_from_fork at ffffffff81646a98

I hope the information would be useful.
Dashi Cao

-----Original Message-----
From: Hugh Dickins [mailto:hughd@google.com] 
Sent: Friday, December 23, 2016 6:27 AM
To: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>; Dashi DS1 Cao <caods1@lenovo.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org; Hugh Dickins <hughd@google.com>
Subject: Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read

On Thu, 22 Dec 2016, Peter Zijlstra wrote:
> On Wed, Dec 21, 2016 at 03:43:43PM +0100, Michal Hocko wrote:
> > anon_vma locking is clever^Wsubtle as hell. CC Peter...
> > 
> > On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
> > > I've collected four crash dumps with similar backtrace. 
> > > 
> > > PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
> > >  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
> > >  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
> > >  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
> > >  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
> > >  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
> > >  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
> > >     [exception RIP: down_read_trylock+9]
> > >     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
> > >     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
> > >     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 
> > > 91550b2b32f5a3e8
> > 
> > rdi is obviously a mess - smells like a string. So either sombody 
> > has overwritten root_anon_vma or this is really a use after free...
> 
> e8 - ???
> a3 - ???
> f5 - ???
> 32 - 2
> 2b - +
>  b -
> 
> 55 - U
> 91 - ???
> 
> Not a string..
> 
> > >     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
> > >     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
> > >     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
> > >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
> > >  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
> > >  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
> > >  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
> > >  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
> > > #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
> > > #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
> > > #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
> > > 
> > > I suspect my customer hits into a small window of a race condition in mm/rmap.c: page_lock_anon_vma_read.
> > > struct anon_vma *page_lock_anon_vma_read(struct page *page) {
> > >         struct anon_vma *anon_vma = NULL;
> > >         struct anon_vma *root_anon_vma;
> > >         unsigned long anon_mapping;
> > > 
> > >         rcu_read_lock();
> > >         anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> > >         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > >                 goto out;
> > >         if (!page_mapped(page))
> > >                 goto out;
> > > 
> > >         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > >         root_anon_vma = READ_ONCE(anon_vma->root);
> > 
> > Could you dump the anon_vma and struct page as well?
> > 
> > >         if (down_read_trylock(&root_anon_vma->rwsem)) {
> > >                 /*
> > >                  * If the page is still mapped, then this anon_vma is still
> > >                  * its anon_vma, and holding the mutex ensures that it will
> > >                  * not go away, see anon_vma_free().
> > >                  */
> > >                 if (!page_mapped(page)) {
> > >                         up_read(&root_anon_vma->rwsem);
> > >                         anon_vma = NULL;
> > >                 }
> > >                 goto out;
> > >         }
> > > ...
> > > }
> > > 
> > > Between the time the two "page_mapped(page)" are checked, the 
> > > address (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However it 
> > > seems that anon_vma->root could still be read in but the value is 
> > > wild. So the kernel crashes in down_read_trylock. But it's weird 
> > > that all the "struct page" has its member "_mapcount" still with 
> > > value 0, not -1, in the four crashes.
> 
> So the point is that while we hold rcu_read_lock() the actual memory 
> backing the anon_vmas cannot be freed. It can be reused, but only for 
> another anon_vma.
> 
> Now, anon_vma_alloc() sets ->root to self, while anon_vma_free() 
> leaves
> ->root set to whatever. And any other ->root assignment is to a valid
> anon_vma.
> 
> Therefore, the same rules that ensure anon_vma stays valid, should 
> also ensure anon_vma->root stays valid.
> 
> Now, one thing that might go wobbly is that ->root assignments are not 
> done using WRITE_ONCE(), this means a naughty compiler can miscompile 
> those stores and introduce store-tearing, if our READ_ONCE() would 
> observe such a tear, we'd be up some creek without a paddle.

We would indeed.  And this being the season of goodwill, I'm biting my tongue not to say what I think of the prospect of store tearing.
But that zeroed anon_vma implies tearing not the problem here anyway.

> 
> Now, its been a long time since I looked at any of this code, and I 
> see that Hugh has fixed at least two wobblies in my original code.

Nothing much, and this (admittedly subtle) technique has been working well for years, so I'm sceptical about "a small window for a race condition".

But Dashi's right to point out that the struct page has _mapcount 0 (not -1 for logical 0) in these cases: it looks as if something is freeing (or corrupting) the anon_vma despite it still having pages mapped, or something is misaccounting (or corrupting) the _mapcount.

But I've no idea what, and we have not heard such reports elsewhere.
We don't even know what kernel this is - something special, perhaps?

Hugh

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-23  2:02       ` Dashi DS1 Cao
@ 2016-12-23  3:34         ` Hugh Dickins
  2016-12-23 14:19         ` Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2016-12-23  3:34 UTC (permalink / raw)
  To: Dashi DS1 Cao
  Cc: Hugh Dickins, Peter Zijlstra, Michal Hocko, Sasha Levin,
	linux-mm, linux-kernel

On Fri, 23 Dec 2016, Dashi DS1 Cao wrote:

> The kernel version is "RELEASE: 3.10.0-327.36.3.el7.x86_64". It was the latest kernel release of CentOS 7.2 at that time, or maybe still now.

Okay, thanks: so, basically a v3.10 kernel, with lots of added patches,
but also lacking many more recent fixes.

> I've tried to print the value of anon_vma from other three dumps, but the content is not available in the dump. 
> "gdb: page excluded: kernel virtual address: ffff882b47ddadc0"
> I guess it is not copied out because it has already been put into some unused list.

Useful info: that suggests that the anon_vma was rightly freed, and that
it's the page->_mapcount that's wrong.  The page isn't really mapped
anywhere now, but appearing to be still page_mapped(), it has tricked
page_lock_anon_vma_read() into thinking the stale anon_vma pointer is
safe to access.

That can happen if there's a race, and a page gets mapped with one pte
on top of another: only one of them will be unmapped later.  Incorrect
handling of page table entries.  But I cannot remember anywhere that
was shown to happen - beyond a project of my own, which never reached
the tree.

If it's a file page, that usually ends up as BUG_ON(page_mapped(page))
in __delete_from_page_cache() (in v3.10, changed a little later on),
when truncating or unlinking the file or unmounting the filesystem.
Those have been seen in the past, on rare occasions, but I don't
remember actually root-causing any of them.  If it's an anon page,
there is no equivalent place for such a BUG_ON.

mremap move has a tricky job to do, and might cause such a problem
if its locking were inadequate: but the only example I see since
v3.10 was dd18dbc2d42a "mm, thp: close race between mremap() and
split_huge_page()", and that used to crash in __split_huge_page().

Or see c0d73261f5c1 "mm/memory.c: use entry = ACCESS_ONCE(*pte)
in handle_pte_fault()", which brings us back to Peter's topic of
over-imaginative compilers; but none of us believed that change
really made a difference in practice.

Cc'ing Sasha Levin, long-time trinity-runner, just in case he might
remember any time when a BUG_ON(page_mapped(page)) was really solved:
if so, there's a chance the explanation might also apply to anonymous
pages, and be responsible for your page_lock_anon_vma_read() crashes.

Hugh

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-23  2:02       ` Dashi DS1 Cao
  2016-12-23  3:34         ` Hugh Dickins
@ 2016-12-23 14:19         ` Peter Zijlstra
  2016-12-25  2:22           ` Dashi DS1 Cao
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2016-12-23 14:19 UTC (permalink / raw)
  To: Dashi DS1 Cao; +Cc: Hugh Dickins, Michal Hocko, linux-mm, linux-kernel

On Fri, Dec 23, 2016 at 02:02:14AM +0000, Dashi DS1 Cao wrote:
> The kernel version is "RELEASE: 3.10.0-327.36.3.el7.x86_64". It was the latest kernel release of CentOS 7.2 at that time, or maybe still now.

This would be the point where we ask you to run a recent upstream kernel
and try and reproduce the problem with that, or contact RHT for support
on their franken-kernel ;-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-23 14:19         ` Peter Zijlstra
@ 2016-12-25  2:22           ` Dashi DS1 Cao
  0 siblings, 0 replies; 13+ messages in thread
From: Dashi DS1 Cao @ 2016-12-25  2:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Hugh Dickins, Michal Hocko, linux-mm, linux-kernel

It's a CentOS 7.2, so there is point in asking RHT. I'll try to persuade the customer to have a try with kernel version 4.9, if only I can get it work with CentOS 7.2.

Dashi Cao

-----Original Message-----
From: Peter Zijlstra [mailto:peterz@infradead.org] 
Sent: Friday, December 23, 2016 10:20 PM
To: Dashi DS1 Cao <caods1@lenovo.com>
Cc: Hugh Dickins <hughd@google.com>; Michal Hocko <mhocko@kernel.org>; linux-mm@kvack.org; linux-kernel@vger.kernel.org
Subject: Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read

On Fri, Dec 23, 2016 at 02:02:14AM +0000, Dashi DS1 Cao wrote:
> The kernel version is "RELEASE: 3.10.0-327.36.3.el7.x86_64". It was the latest kernel release of CentOS 7.2 at that time, or maybe still now.

This would be the point where we ask you to run a recent upstream kernel and try and reproduce the problem with that, or contact RHT for support on their franken-kernel ;-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
  2016-12-23  2:38       ` Dashi DS1 Cao
@ 2017-04-22 12:07         ` zhong jiang
  0 siblings, 0 replies; 13+ messages in thread
From: zhong jiang @ 2017-04-22 12:07 UTC (permalink / raw)
  To: Dashi DS1 Cao
  Cc: Hugh Dickins, Peter Zijlstra, Michal Hocko, linux-mm, linux-kernel

Hi,  Dashi
The same issue I had occured every other week.  Do you have solve it .
 I want to know how it is fixed.  The patch exist in the mainline.

Thanks
zhongjiang
On 2016/12/23 10:38, Dashi DS1 Cao wrote:
> I'd expected that one or more tasks doing the free were the current task of other cpu cores, but only one of the four dumps has two swapd task that are concurrently at execution on different cpu.
> This is the task leading to the crash:
> PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
>  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
>  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
>  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
>  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
>  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
>  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
>     [exception RIP: down_read_trylock+9]
>     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
>     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
>     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 91550b2b32f5a3e8
>     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
>     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
>     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
>  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
>  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
>  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
> #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
> #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
> #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
>
> And this is the one at the same time:
> PID: 246    TASK: ffff881fd27af300  CPU: 20  COMMAND: "kswapd0"
>  #0 [ffff881fffd05e70] crash_nmi_callback at ffffffff81045982
>  #1 [ffff881fffd05e80] nmi_handle at ffffffff8163f5d9
>  #2 [ffff881fffd05ec8] do_nmi at ffffffff8163f6f0
>  #3 [ffff881fffd05ef0] end_repeat_nmi at ffffffff8163ea13
>     [exception RIP: free_pcppages_bulk+529]
>     RIP: ffffffff81171ae1  RSP: ffff881fcfad38f0  RFLAGS: 00000087
>     RAX: 002fffff0000002c  RBX: ffffea007606ae40  RCX: 0000000000000000
>     RDX: ffffea007606ae00  RSI: 00000000000002b9  RDI: 0000000000000000
>     RBP: ffff881fcfad3978   R8: 0000000000000000   R9: 0000000000000001
>     R10: ffff88207ffda000  R11: 0000000000000002  R12: ffffea007606ae40
>     R13: 0000000000000002  R14: ffff88207ffda000  R15: 00000000000002b8
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #4 [ffff881fcfad38f0] free_pcppages_bulk at ffffffff81171ae1
>  #5 [ffff881fcfad3980] free_hot_cold_page at ffffffff81171f08
>  #6 [ffff881fcfad39b8] free_hot_cold_page_list at ffffffff81171f76
>  #7 [ffff881fcfad39f0] shrink_page_list at ffffffff8117d71e
>  #8 [ffff881fcfad3b28] shrink_inactive_list at ffffffff8117e37a
>  #9 [ffff881fcfad3bf0] shrink_lruvec at ffffffff8117ee45
> #10 [ffff881fcfad3cf0] shrink_zone at ffffffff8117f2a6
> #11 [ffff881fcfad3d48] balance_pgdat at ffffffff8118054c
> #12 [ffff881fcfad3e20] kswapd at ffffffff81180813
> #13 [ffff881fcfad3ec8] kthread at ffffffff810a5b8f
> #14 [ffff881fcfad3f50] ret_from_fork at ffffffff81646a98
>
> I hope the information would be useful.
> Dashi Cao
>
> -----Original Message-----
> From: Hugh Dickins [mailto:hughd@google.com] 
> Sent: Friday, December 23, 2016 6:27 AM
> To: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>; Dashi DS1 Cao <caods1@lenovo.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org; Hugh Dickins <hughd@google.com>
> Subject: Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
>
> On Thu, 22 Dec 2016, Peter Zijlstra wrote:
>> On Wed, Dec 21, 2016 at 03:43:43PM +0100, Michal Hocko wrote:
>>> anon_vma locking is clever^Wsubtle as hell. CC Peter...
>>>
>>> On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
>>>> I've collected four crash dumps with similar backtrace. 
>>>>
>>>> PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
>>>>  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
>>>>  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
>>>>  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
>>>>  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
>>>>  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
>>>>  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
>>>>     [exception RIP: down_read_trylock+9]
>>>>     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
>>>>     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
>>>>     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 
>>>> 91550b2b32f5a3e8
>>> rdi is obviously a mess - smells like a string. So either sombody 
>>> has overwritten root_anon_vma or this is really a use after free...
>> e8 - ???
>> a3 - ???
>> f5 - ???
>> 32 - 2
>> 2b - +
>>  b -
>>
>> 55 - U
>> 91 - ???
>>
>> Not a string..
>>
>>>>     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
>>>>     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
>>>>     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
>>>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>>>>  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
>>>>  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
>>>>  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
>>>>  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
>>>> #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
>>>> #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
>>>> #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
>>>>
>>>> I suspect my customer hits into a small window of a race condition in mm/rmap.c: page_lock_anon_vma_read.
>>>> struct anon_vma *page_lock_anon_vma_read(struct page *page) {
>>>>         struct anon_vma *anon_vma = NULL;
>>>>         struct anon_vma *root_anon_vma;
>>>>         unsigned long anon_mapping;
>>>>
>>>>         rcu_read_lock();
>>>>         anon_mapping = (unsigned long)READ_ONCE(page->mapping);
>>>>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>>>>                 goto out;
>>>>         if (!page_mapped(page))
>>>>                 goto out;
>>>>
>>>>         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>>>>         root_anon_vma = READ_ONCE(anon_vma->root);
>>> Could you dump the anon_vma and struct page as well?
>>>
>>>>         if (down_read_trylock(&root_anon_vma->rwsem)) {
>>>>                 /*
>>>>                  * If the page is still mapped, then this anon_vma is still
>>>>                  * its anon_vma, and holding the mutex ensures that it will
>>>>                  * not go away, see anon_vma_free().
>>>>                  */
>>>>                 if (!page_mapped(page)) {
>>>>                         up_read(&root_anon_vma->rwsem);
>>>>                         anon_vma = NULL;
>>>>                 }
>>>>                 goto out;
>>>>         }
>>>> ...
>>>> }
>>>>
>>>> Between the time the two "page_mapped(page)" are checked, the 
>>>> address (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However it 
>>>> seems that anon_vma->root could still be read in but the value is 
>>>> wild. So the kernel crashes in down_read_trylock. But it's weird 
>>>> that all the "struct page" has its member "_mapcount" still with 
>>>> value 0, not -1, in the four crashes.
>> So the point is that while we hold rcu_read_lock() the actual memory 
>> backing the anon_vmas cannot be freed. It can be reused, but only for 
>> another anon_vma.
>>
>> Now, anon_vma_alloc() sets ->root to self, while anon_vma_free() 
>> leaves
>> ->root set to whatever. And any other ->root assignment is to a valid
>> anon_vma.
>>
>> Therefore, the same rules that ensure anon_vma stays valid, should 
>> also ensure anon_vma->root stays valid.
>>
>> Now, one thing that might go wobbly is that ->root assignments are not 
>> done using WRITE_ONCE(), this means a naughty compiler can miscompile 
>> those stores and introduce store-tearing, if our READ_ONCE() would 
>> observe such a tear, we'd be up some creek without a paddle.
> We would indeed.  And this being the season of goodwill, I'm biting my tongue not to say what I think of the prospect of store tearing.
> But that zeroed anon_vma implies tearing not the problem here anyway.
>
>> Now, its been a long time since I looked at any of this code, and I 
>> see that Hugh has fixed at least two wobblies in my original code.
> Nothing much, and this (admittedly subtle) technique has been working well for years, so I'm sceptical about "a small window for a race condition".
>
> But Dashi's right to point out that the struct page has _mapcount 0 (not -1 for logical 0) in these cases: it looks as if something is freeing (or corrupting) the anon_vma despite it still having pages mapped, or something is misaccounting (or corrupting) the _mapcount.
>
> But I've no idea what, and we have not heard such reports elsewhere.
> We don't even know what kernel this is - something special, perhaps?
>
> Hugh
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=ilto:"dont@kvack.org"> email@kvack.org </a>
>
> .
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-04-22 12:09 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-20  9:32 A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read Dashi DS1 Cao
2016-12-21 14:43 ` Michal Hocko
2016-12-22 11:53   ` Dashi DS1 Cao
2016-12-22 12:03     ` Michal Hocko
2016-12-22 12:43     ` Dashi DS1 Cao
2016-12-22 13:51   ` Peter Zijlstra
2016-12-22 22:26     ` Hugh Dickins
2016-12-23  2:02       ` Dashi DS1 Cao
2016-12-23  3:34         ` Hugh Dickins
2016-12-23 14:19         ` Peter Zijlstra
2016-12-25  2:22           ` Dashi DS1 Cao
2016-12-23  2:38       ` Dashi DS1 Cao
2017-04-22 12:07         ` zhong jiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).