linux-mm.kvack.org archive mirror
* [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
       [not found] <820667266.20994189.1557058281210.JavaMail.zimbra@redhat.com>
@ 2019-05-05 14:10 ` Jan Stancek
  2019-05-06 19:04   ` Yang Shi
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Stancek @ 2019-05-05 14:10 UTC (permalink / raw)
  To: linux-mm, linux-arm-kernel
  Cc: yang shi, kirill, willy, kirill shutemov, vbabka,
	Andrea Arcangeli, akpm, Waiman Long, Jan Stancek

Hi,

I'm seeing a userspace program getting stuck on aarch64, on kernels 4.20 and newer.
It stalls for anywhere from seconds to hours.

I have simplified it to the following scenario (reproducer linked below [1]):
  while (1):
    spawn Thread 1: mmap, write, munmap
    spawn Thread 2: <nothing>

Thread 1 sporadically gets stuck on the write to the mapped area. User space stops
moving forward - stdout output stops. Observed CPU usage is, however, 100%.

At this time, the kernel appears to be busy handling page faults (~700k per second):

# perf top -a -g
-   98.97%     8.30%  a.out                     [.] map_write_unmap
   - 23.52% map_write_unmap
      - 24.29% el0_sync
         - 10.42% do_mem_abort
            - 17.81% do_translation_fault
               - 33.01% do_page_fault
                  - 56.18% handle_mm_fault
                       40.26% __handle_mm_fault
                       2.19% __ll_sc___cmpxchg_case_acq_4
                       0.87% mem_cgroup_from_task
                  - 6.18% find_vma
                       5.38% vmacache_find
                    1.35% __ll_sc___cmpxchg_case_acq_8
                    1.23% __ll_sc_atomic64_sub_return_release
                    0.78% down_read_trylock
           0.93% do_translation_fault
   + 8.30% thread_start

#  perf stat -p 8189 -d 
^C
 Performance counter stats for process id '8189':

        984.311350      task-clock (msec)         #    1.000 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
           723,641      page-faults               #    0.735 M/sec
     2,559,199,434      cycles                    #    2.600 GHz
       711,933,112      instructions              #    0.28  insn per cycle
   <not supported>      branches
           757,658      branch-misses
       205,840,557      L1-dcache-loads           #  209.121 M/sec
        40,561,529      L1-dcache-load-misses     #   19.71% of all L1-dcache hits
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       0.984454892 seconds time elapsed

With some extra traces, it appears to be looping in the page fault handler for the same address, over and over:
  do_page_fault // mm_flags: 0x55
    __do_page_fault
      __handle_mm_fault
        handle_pte_fault
          ptep_set_access_flags
            if (pte_same(pte, entry))  // pte: e8000805060f53, entry: e8000805060f53

I had traces in mmap() and munmap() as well; they don't get hit when the reproducer
hits the bad state.
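The pte_same() check in the trace is the crux: when the PTE in memory already carries every bit the fault wants set, nothing is written and no TLB maintenance is issued, so a stale TLB entry can keep generating the same fault forever. A simplified model of that logic (a sketch of the shape of the generic access-flags path, not the kernel's actual code):

```c
typedef unsigned long pte_t;   /* toy model of the kernel's pte_t */

static int pte_same(pte_t a, pte_t b)
{
        return a == b;
}

/* Returns 1 if the PTE was changed (the real code then flushes the TLB),
 * 0 if it was left alone.  In the trace above, pte == entry ==
 * 0xe8000805060f53, so nothing is written, no flush happens, and the
 * CPU faults again on the same stale TLB entry. */
static int set_access_flags(pte_t *ptep, pte_t entry)
{
        int changed = !pte_same(*ptep, entry);

        if (changed)
                *ptep = entry;  /* real code also does TLB maintenance here */
        return changed;
}
```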

Notes:
- I'm not able to reproduce this on x86.
- Attaching GDB or strace immediately recovers the application from the stall.
- It also seems to recover faster when the system is busy with other tasks.
- MAP_SHARED vs. MAP_PRIVATE makes no difference.
- Turning off THP makes no difference.
- Reproducer [1] usually hits it within about a minute on the HW described below.
- Longman mentioned that "When the rwsem becomes reader-owned, it causes
  all the spinning writers to go to sleep adding wakeup latency to
  the time required to finish the critical sections", but this looks
  like a busy loop, so I'm not sure if it's related to the rwsem issues
  identified in: https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
- I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3 and
  Qualcomm 65-LA-115-151.
  I can reproduce it on both with v5.1-rc7. It's easier to reproduce
  on the latter one (it stalls for longer periods of time), which has 46 CPUs.
- Sample output of reproducer on otherwise idle system:
  # ./a.out
  [00000314] map_write_unmap took: 26305 ms
  [00000867] map_write_unmap took: 13642 ms
  [00002200] map_write_unmap took: 44237 ms
  [00002851] map_write_unmap took: 992 ms
  [00004725] map_write_unmap took: 542 ms
  [00006443] map_write_unmap took: 5333 ms
  [00006593] map_write_unmap took: 21162 ms
  [00007435] map_write_unmap took: 16982 ms
  [00007488] map_write unmap took: 13 ms^C

I ran a bisect, which identified the following commit as the first bad one:
  dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

I can also make the issue go away with the following change:
diff --git a/mm/mmap.c b/mm/mmap.c
index 330f12c17fa1..13ce465740e2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
        profile_munmap(addr);
-       return __vm_munmap(addr, len, true);
+       return __vm_munmap(addr, len, false);
 }
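For context, the boolean passed to __vm_munmap() selects the optimization introduced by the bisected commit. A toy model of the locking decision (names taken from mm/mmap.c, but this is a sketch, not the kernel source):

```c
#include <stdbool.h>

enum lock_state { READ_LOCKED, WRITE_LOCKED };

/* Models which mmap_sem state __vm_munmap() holds while zapping pages
 * after dd2283f2605e.  With downgrade == false (the one-line diff above),
 * zapping stays under the write lock, as it did before the commit. */
static enum lock_state zap_lock_state(bool downgrade)
{
        enum lock_state s = WRITE_LOCKED;   /* down_write(&mm->mmap_sem) */
        /* ... vmas are detached from the rbtree here ... */
        if (downgrade)
                s = READ_LOCKED;            /* downgrade_write(&mm->mmap_sem) */
        /* ... unmap_region() zaps pages while holding `s` ... */
        return s;
}
```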

# cat /proc/cpuinfo  | head
processor       : 0
BogoMIPS        : 40.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm
CPU implementer : 0x51
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xc00
CPU revision    : 1

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
node 0 size: 97938 MB
node 0 free: 95732 MB
node distances:
node   0 
  0:  10 

Regards,
Jan

[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
[2] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
  2019-05-05 14:10 ` [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Jan Stancek
@ 2019-05-06 19:04   ` Yang Shi
  2019-05-06 21:35     ` Jan Stancek
  0 siblings, 1 reply; 10+ messages in thread
From: Yang Shi @ 2019-05-06 19:04 UTC (permalink / raw)
  To: Jan Stancek, linux-mm, linux-arm-kernel
  Cc: kirill, willy, kirill shutemov, vbabka, Andrea Arcangeli, akpm,
	Waiman Long



On 5/5/19 7:10 AM, Jan Stancek wrote:
> Hi,
>
> I'm seeing userspace program getting stuck on aarch64, on kernels 4.20 and newer.
> It stalls from seconds to hours.
>
> I have simplified it to following scenario (reproducer linked below [1]):
>    while (1):
>      spawn Thread 1: mmap, write, munmap
>      spawn Thread 2: <nothing>
>
> Thread 1 is sporadically getting stuck on write to mapped area. User-space is not
> moving forward - stdout output stops. Observed CPU usage is however 100%.
>
> At this time, kernel appears to be busy handling page faults (~700k per second):
>
> # perf top -a -g
> -   98.97%     8.30%  a.out                     [.] map_write_unmap
>     - 23.52% map_write_unmap
>        - 24.29% el0_sync
>           - 10.42% do_mem_abort
>              - 17.81% do_translation_fault
>                 - 33.01% do_page_fault
>                    - 56.18% handle_mm_fault
>                         40.26% __handle_mm_fault
>                         2.19% __ll_sc___cmpxchg_case_acq_4
>                         0.87% mem_cgroup_from_task
>                    - 6.18% find_vma
>                         5.38% vmacache_find
>                      1.35% __ll_sc___cmpxchg_case_acq_8
>                      1.23% __ll_sc_atomic64_sub_return_release
>                      0.78% down_read_trylock
>             0.93% do_translation_fault
>     + 8.30% thread_start
>
> #  perf stat -p 8189 -d
> ^C
>   Performance counter stats for process id '8189':
>
>          984.311350      task-clock (msec)         #    1.000 CPUs utilized
>                   0      context-switches          #    0.000 K/sec
>                   0      cpu-migrations            #    0.000 K/sec
>             723,641      page-faults               #    0.735 M/sec
>       2,559,199,434      cycles                    #    2.600 GHz
>         711,933,112      instructions              #    0.28  insn per cycle
>     <not supported>      branches
>             757,658      branch-misses
>         205,840,557      L1-dcache-loads           #  209.121 M/sec
>          40,561,529      L1-dcache-load-misses     #   19.71% of all L1-dcache hits
>     <not supported>      LLC-loads
>     <not supported>      LLC-load-misses
>
>         0.984454892 seconds time elapsed
>
> With some extra traces, it appears looping in page fault for same address, over and over:
>    do_page_fault // mm_flags: 0x55
>      __do_page_fault
>        __handle_mm_fault
>          handle_pte_fault
>            ptep_set_access_flags
>              if (pte_same(pte, entry))  // pte: e8000805060f53, entry: e8000805060f53
>
> I had traces in mmap() and munmap() as well, they don't get hit when reproducer
> hits the bad state.
>
> Notes:
> - I'm not able to reproduce this on x86.
> - Attaching GDB or strace immediately recovers the application from the stall.
> - It also seems to recover faster when system is busy with other tasks.
> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
> - Turning off THP makes no difference.
> - Reproducer [1] usually hits it within ~minute on HW described below.
> - Longman mentioned that "When the rwsem becomes reader-owned, it causes
>    all the spinning writers to go to sleep adding wakeup latency to
>    the time required to finish the critical sections", but this looks
>    like busy loop, so I'm not sure if it's related to rwsem issues identified
>    in: https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/

It sounds possible to me. The optimization done by the commit ("mm: 
mmap: zap pages with read mmap_sem in munmap") is to downgrade the write 
rwsem to read when zapping pages and page tables in munmap(), after the 
vmas have been detached from the rbtree.

So the mmap() in your test, which is a writer, may steal the lock and 
execute in parallel with the munmap(), which is a reader after the 
downgrade, breaking the mutual exclusion.

In this case, the parallel mmap() may map the same area, since the vmas 
have already been detached by munmap(); mmap() may then create the exact 
same vmas, and the page fault happens on the same vma at the same address.
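The suspicion can be written out as a timeline (a sketch using the function names from mm/mmap.c after the bisected commit; whether the writer can really get in alongside the downgraded reader is exactly the open question):

```
munmap() thread                           mmap() thread
---------------                           -------------
down_write(&mm->mmap_sem)
detach vmas from the rbtree
downgrade_write(&mm->mmap_sem)
  -> rwsem now reader-owned               down_write(&mm->mmap_sem)
unmap_region()  (zap pages, read-held)      suspected: writer steals the lock
up_read(&mm->mmap_sem)                      while the reader still holds it
                                          map the same range again
                                          up_write(&mm->mmap_sem)
```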

I'm not sure why gdb or strace could recover this, but they use ptrace, 
which may acquire mmap_sem and inadvertently break the parallelism.

Could you please try Waiman's patch to see if it makes any difference?

> - I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3 and
>    Qualcomm 65-LA-115-151.
>    I can reproduce it on both with v5.1-rc7. It's easier to reproduce
>    on latter one (for longer periods of time), which has 46 CPUs.
> - Sample output of reproducer on otherwise idle system:
>    # ./a.out
>    [00000314] map_write_unmap took: 26305 ms
>    [00000867] map_write_unmap took: 13642 ms
>    [00002200] map_write_unmap took: 44237 ms
>    [00002851] map_write_unmap took: 992 ms
>    [00004725] map_write_unmap took: 542 ms
>    [00006443] map_write_unmap took: 5333 ms
>    [00006593] map_write_unmap took: 21162 ms
>    [00007435] map_write_unmap took: 16982 ms
>    [00007488] map_write unmap took: 13 ms^C
>
> I ran a bisect, which identified following commit as first bad one:
>    dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
>
> I can also make the issue go away with following change:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 330f12c17fa1..13ce465740e2 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
>   SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>   {
>          profile_munmap(addr);
> -       return __vm_munmap(addr, len, true);
> +       return __vm_munmap(addr, len, false);
>   }
>
> # cat /proc/cpuinfo  | head
> processor       : 0
> BogoMIPS        : 40.00
> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm
> CPU implementer : 0x51
> CPU architecture: 8
> CPU variant     : 0x0
> CPU part        : 0xc00
> CPU revision    : 1
>
> # numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
> node 0 size: 97938 MB
> node 0 free: 95732 MB
> node distances:
> node   0
>    0:  10
>
> Regards,
> Jan
>
> [1] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
> [2] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config



* Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
  2019-05-06 19:04   ` Yang Shi
@ 2019-05-06 21:35     ` Jan Stancek
  2019-05-06 23:07       ` Yang Shi
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Stancek @ 2019-05-06 21:35 UTC (permalink / raw)
  To: Yang Shi, linux-mm, linux-arm-kernel
  Cc: kirill, willy, kirill shutemov, vbabka, Andrea Arcangeli, akpm,
	Waiman Long, Jan Stancek



----- Original Message -----
> 
> 
> On 5/5/19 7:10 AM, Jan Stancek wrote:
> > Hi,
> >
> > I'm seeing userspace program getting stuck on aarch64, on kernels 4.20 and
> > newer.
> > It stalls from seconds to hours.
> >
> > I have simplified it to following scenario (reproducer linked below [1]):
> >    while (1):
> >      spawn Thread 1: mmap, write, munmap
> >      spawn Thread 2: <nothing>
> >
> > Thread 1 is sporadically getting stuck on write to mapped area. User-space
> > is not
> > moving forward - stdout output stops. Observed CPU usage is however 100%.
> >
> > At this time, kernel appears to be busy handling page faults (~700k per
> > second):
> >
> > # perf top -a -g
> > -   98.97%     8.30%  a.out                     [.] map_write_unmap
> >     - 23.52% map_write_unmap
> >        - 24.29% el0_sync
> >           - 10.42% do_mem_abort
> >              - 17.81% do_translation_fault
> >                 - 33.01% do_page_fault
> >                    - 56.18% handle_mm_fault
> >                         40.26% __handle_mm_fault
> >                         2.19% __ll_sc___cmpxchg_case_acq_4
> >                         0.87% mem_cgroup_from_task
> >                    - 6.18% find_vma
> >                         5.38% vmacache_find
> >                      1.35% __ll_sc___cmpxchg_case_acq_8
> >                      1.23% __ll_sc_atomic64_sub_return_release
> >                      0.78% down_read_trylock
> >             0.93% do_translation_fault
> >     + 8.30% thread_start
> >
> > #  perf stat -p 8189 -d
> > ^C
> >   Performance counter stats for process id '8189':
> >
> >          984.311350      task-clock (msec)         #    1.000 CPUs utilized
> >                   0      context-switches          #    0.000 K/sec
> >                   0      cpu-migrations            #    0.000 K/sec
> >             723,641      page-faults               #    0.735 M/sec
> >       2,559,199,434      cycles                    #    2.600 GHz
> >         711,933,112      instructions              #    0.28  insn per
> >         cycle
> >     <not supported>      branches
> >             757,658      branch-misses
> >         205,840,557      L1-dcache-loads           #  209.121 M/sec
> >          40,561,529      L1-dcache-load-misses     #   19.71% of all
> >          L1-dcache hits
> >     <not supported>      LLC-loads
> >     <not supported>      LLC-load-misses
> >
> >         0.984454892 seconds time elapsed
> >
> > With some extra traces, it appears looping in page fault for same address,
> > over and over:
> >    do_page_fault // mm_flags: 0x55
> >      __do_page_fault
> >        __handle_mm_fault
> >          handle_pte_fault
> >            ptep_set_access_flags
> >              if (pte_same(pte, entry))  // pte: e8000805060f53, entry:
> >              e8000805060f53
> >
> > I had traces in mmap() and munmap() as well, they don't get hit when
> > reproducer
> > hits the bad state.
> >
> > Notes:
> > - I'm not able to reproduce this on x86.
> > - Attaching GDB or strace immediately recovers the application from the stall.
> > - It also seems to recover faster when system is busy with other tasks.
> > - MAP_SHARED vs. MAP_PRIVATE makes no difference.
> > - Turning off THP makes no difference.
> > - Reproducer [1] usually hits it within ~minute on HW described below.
> > - Longman mentioned that "When the rwsem becomes reader-owned, it causes
> >    all the spinning writers to go to sleep adding wakeup latency to
> >    the time required to finish the critical sections", but this looks
> >    like busy loop, so I'm not sure if it's related to rwsem issues
> >    identified
> >    in:
> >    https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
> 
> It sounds possible to me. The optimization done by the commit ("mm:
> mmap: zap pages with read mmap_sem in munmap") is to downgrade the write
> rwsem to read when zapping pages and page tables in munmap(), after the
> vmas have been detached from the rbtree.
> 
> So the mmap() in your test, which is a writer, may steal the lock and
> execute in parallel with the munmap(), which is a reader after the
> downgrade, breaking the mutual exclusion.
> 
> In this case, the parallel mmap() may map the same area, since the vmas
> have already been detached by munmap(); mmap() may then create the exact
> same vmas, and the page fault happens on the same vma at the same address.
> 
> I'm not sure why gdb or strace could recover this, but they use ptrace,
> which may acquire mmap_sem and inadvertently break the parallelism.
> 
> Could you please try Waiman's patch to see if it makes any difference?

I don't see any difference in behaviour after applying:
  [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment
The issue is still easily reproducible for me.

I'm including the output of mem_abort_decode() / show_pte() for a sample PTE that
I see in the page fault loop. (I went through all the bits, but couldn't find anything invalid about them.)

  mem_abort_decode: Mem abort info:
  mem_abort_decode:   ESR = 0x92000047
  mem_abort_decode:   Exception class = DABT (lower EL), IL = 32 bits
  mem_abort_decode:   SET = 0, FnV = 0
  mem_abort_decode:   EA = 0, S1PTW = 0
  mem_abort_decode: Data abort info:
  mem_abort_decode:   ISV = 0, ISS = 0x00000047
  mem_abort_decode:   CM = 0, WnR = 1
  show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp = 0000000067027567
  show_pte: [0000ffff6dff0000] pgd=000000176bae0003
  show_pte: , pud=000000176bae0003
  show_pte: , pmd=000000174ad60003
  show_pte: , pte=00e80008023a0f53
  show_pte: , pte_pfn: 8023a

  >>> print bin(0x47)
  0b1000111

  Per the ARM Architecture Reference Manual (D12),
      ISS encoding for an exception from a Data Abort:
    DFSC, bits [5:0], Data Fault Status Code
    0b000111 Translation fault, level 3
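The fields quoted above can be pulled out of the raw ESR value with a few shifts. A minimal decoder for just these bits (field positions per the ARM ARM; not a general ESR decoder):

```c
#include <stdint.h>

struct esr_fields {
        unsigned ec;    /* exception class, bits [31:26] */
        unsigned wnr;   /* write-not-read, ISS bit [6]   */
        unsigned dfsc;  /* data fault status, bits [5:0] */
};

/* Decode the three fields discussed above from a raw ESR_EL1 value. */
static struct esr_fields decode_esr(uint64_t esr)
{
        struct esr_fields f;

        f.ec   = (unsigned)((esr >> 26) & 0x3f); /* 0x24 = Data Abort, lower EL */
        f.wnr  = (unsigned)((esr >> 6) & 1);     /* 1 = abort on a write */
        f.dfsc = (unsigned)(esr & 0x3f);         /* 0b000111 = translation fault, level 3 */
        return f;
}
```

Running it on the ESR from the trace, 0x92000047, yields EC = 0x24 (DABT from a lower EL), WnR = 1, and DFSC = 0b000111, matching the mem_abort_decode() output.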

---

My theory is that the TLB is holding a stale entry.

I made a dummy kernel module that exports a debugfs file, which on read triggers:
  flush_tlb_all();

Any time the reproducer stalls and I read the debugfs file, it recovers
immediately and resumes printing to stdout.
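Such a module can be very small. A sketch of what it might look like (not the exact module used here; it builds only against a kernel tree, and on arm64 flush_tlb_all() is an inline from asm/tlbflush.h, so no extra export is needed):

```c
#include <linux/module.h>
#include <linux/debugfs.h>
#include <asm/tlbflush.h>

static struct dentry *dir;

/* Reading the file performs the flush as a side effect and reports EOF. */
static ssize_t flush_read(struct file *file, char __user *buf,
			  size_t len, loff_t *off)
{
	flush_tlb_all();	/* invalidate TLB entries on all CPUs */
	return 0;
}

static const struct file_operations flush_fops = {
	.read = flush_read,
};

static int __init tlbflush_test_init(void)
{
	dir = debugfs_create_dir("tlbflush_test", NULL);
	debugfs_create_file("flush", 0400, dir, NULL, &flush_fops);
	return 0;
}

static void __exit tlbflush_test_exit(void)
{
	debugfs_remove_recursive(dir);
}

module_init(tlbflush_test_init);
module_exit(tlbflush_test_exit);
MODULE_LICENSE("GPL");
```

With the module loaded, `cat /sys/kernel/debug/tlbflush_test/flush` triggers the flush.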

> 
> > - I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3 and
> >    Qualcomm 65-LA-115-151.
> >    I can reproduce it on both with v5.1-rc7. It's easier to reproduce
> >    on latter one (for longer periods of time), which has 46 CPUs.
> > - Sample output of reproducer on otherwise idle system:
> >    # ./a.out
> >    [00000314] map_write_unmap took: 26305 ms
> >    [00000867] map_write_unmap took: 13642 ms
> >    [00002200] map_write_unmap took: 44237 ms
> >    [00002851] map_write_unmap took: 992 ms
> >    [00004725] map_write_unmap took: 542 ms
> >    [00006443] map_write_unmap took: 5333 ms
> >    [00006593] map_write_unmap took: 21162 ms
> >    [00007435] map_write_unmap took: 16982 ms
> >    [00007488] map_write unmap took: 13 ms^C
> >
> > I ran a bisect, which identified following commit as first bad one:
> >    dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> >
> > I can also make the issue go away with following change:
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 330f12c17fa1..13ce465740e2 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
> >   SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
> >   {
> >          profile_munmap(addr);
> > -       return __vm_munmap(addr, len, true);
> > +       return __vm_munmap(addr, len, false);
> >   }
> >
> > # cat /proc/cpuinfo  | head
> > processor       : 0
> > BogoMIPS        : 40.00
> > Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm
> > CPU implementer : 0x51
> > CPU architecture: 8
> > CPU variant     : 0x0
> > CPU part        : 0xc00
> > CPU revision    : 1
> >
> > # numactl -H
> > available: 1 nodes (0)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> > 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
> > node 0 size: 97938 MB
> > node 0 free: 95732 MB
> > node distances:
> > node   0
> >    0:  10
> >
> > Regards,
> > Jan
> >
> > [1]
> > https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
> > [2]
> > https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config
> 
> 



* Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
  2019-05-06 21:35     ` Jan Stancek
@ 2019-05-06 23:07       ` Yang Shi
  2019-05-06 23:15         ` Yang Shi
  2019-05-07 11:51         ` Jan Stancek
  0 siblings, 2 replies; 10+ messages in thread
From: Yang Shi @ 2019-05-06 23:07 UTC (permalink / raw)
  To: Jan Stancek, linux-mm, linux-arm-kernel
  Cc: kirill, willy, kirill shutemov, vbabka, Andrea Arcangeli, akpm,
	Waiman Long, Mel Gorman, Rik van Riel, will.deacon,
	catalin.marinas



On 5/6/19 2:35 PM, Jan Stancek wrote:
>
> ----- Original Message -----
>>
>> On 5/5/19 7:10 AM, Jan Stancek wrote:
>>> Hi,
>>>
>>> I'm seeing userspace program getting stuck on aarch64, on kernels 4.20 and
>>> newer.
>>> It stalls from seconds to hours.
>>>
>>> I have simplified it to following scenario (reproducer linked below [1]):
>>>     while (1):
>>>       spawn Thread 1: mmap, write, munmap
>>>       spawn Thread 2: <nothing>
>>>
>>> Thread 1 is sporadically getting stuck on write to mapped area. User-space
>>> is not
>>> moving forward - stdout output stops. Observed CPU usage is however 100%.
>>>
>>> At this time, kernel appears to be busy handling page faults (~700k per
>>> second):
>>>
>>> # perf top -a -g
>>> -   98.97%     8.30%  a.out                     [.] map_write_unmap
>>>      - 23.52% map_write_unmap
>>>         - 24.29% el0_sync
>>>            - 10.42% do_mem_abort
>>>               - 17.81% do_translation_fault
>>>                  - 33.01% do_page_fault
>>>                     - 56.18% handle_mm_fault
>>>                          40.26% __handle_mm_fault
>>>                          2.19% __ll_sc___cmpxchg_case_acq_4
>>>                          0.87% mem_cgroup_from_task
>>>                     - 6.18% find_vma
>>>                          5.38% vmacache_find
>>>                       1.35% __ll_sc___cmpxchg_case_acq_8
>>>                       1.23% __ll_sc_atomic64_sub_return_release
>>>                       0.78% down_read_trylock
>>>              0.93% do_translation_fault
>>>      + 8.30% thread_start
>>>
>>> #  perf stat -p 8189 -d
>>> ^C
>>>    Performance counter stats for process id '8189':
>>>
>>>           984.311350      task-clock (msec)         #    1.000 CPUs utilized
>>>                    0      context-switches          #    0.000 K/sec
>>>                    0      cpu-migrations            #    0.000 K/sec
>>>              723,641      page-faults               #    0.735 M/sec
>>>        2,559,199,434      cycles                    #    2.600 GHz
>>>          711,933,112      instructions              #    0.28  insn per
>>>          cycle
>>>      <not supported>      branches
>>>              757,658      branch-misses
>>>          205,840,557      L1-dcache-loads           #  209.121 M/sec
>>>           40,561,529      L1-dcache-load-misses     #   19.71% of all
>>>           L1-dcache hits
>>>      <not supported>      LLC-loads
>>>      <not supported>      LLC-load-misses
>>>
>>>          0.984454892 seconds time elapsed
>>>
>>> With some extra traces, it appears looping in page fault for same address,
>>> over and over:
>>>     do_page_fault // mm_flags: 0x55
>>>       __do_page_fault
>>>         __handle_mm_fault
>>>           handle_pte_fault
>>>             ptep_set_access_flags
>>>               if (pte_same(pte, entry))  // pte: e8000805060f53, entry:
>>>               e8000805060f53
>>>
>>> I had traces in mmap() and munmap() as well, they don't get hit when
>>> reproducer
>>> hits the bad state.
>>>
>>> Notes:
>>> - I'm not able to reproduce this on x86.
>>> - Attaching GDB or strace immediately recovers the application from the stall.
>>> - It also seems to recover faster when system is busy with other tasks.
>>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
>>> - Turning off THP makes no difference.
>>> - Reproducer [1] usually hits it within ~minute on HW described below.
>>> - Longman mentioned that "When the rwsem becomes reader-owned, it causes
>>>     all the spinning writers to go to sleep adding wakeup latency to
>>>     the time required to finish the critical sections", but this looks
>>>     like busy loop, so I'm not sure if it's related to rwsem issues
>>>     identified
>>>     in:
>>>     https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
>> It sounds possible to me. The optimization done by the commit ("mm:
>> mmap: zap pages with read mmap_sem in munmap") is to downgrade the write
>> rwsem to read when zapping pages and page tables in munmap(), after the
>> vmas have been detached from the rbtree.
>>
>> So the mmap() in your test, which is a writer, may steal the lock and
>> execute in parallel with the munmap(), which is a reader after the
>> downgrade, breaking the mutual exclusion.
>>
>> In this case, the parallel mmap() may map the same area, since the vmas
>> have already been detached by munmap(); mmap() may then create the exact
>> same vmas, and the page fault happens on the same vma at the same address.
>>
>> I'm not sure why gdb or strace could recover this, but they use ptrace,
>> which may acquire mmap_sem and inadvertently break the parallelism.
>>
>> Could you please try Waiman's patch to see if it makes any difference?
> I don't see any difference in behaviour after applying:
>    [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment
> Issue is still easily reproducible for me.
>
> I'm including output of mem_abort_decode() / show_pte() for sample PTE, that
> I see in page fault loop. (I went through all bits, but couldn't find anything invalid about it)
>
>    mem_abort_decode: Mem abort info:
>    mem_abort_decode:   ESR = 0x92000047
>    mem_abort_decode:   Exception class = DABT (lower EL), IL = 32 bits
>    mem_abort_decode:   SET = 0, FnV = 0
>    mem_abort_decode:   EA = 0, S1PTW = 0
>    mem_abort_decode: Data abort info:
>    mem_abort_decode:   ISV = 0, ISS = 0x00000047
>    mem_abort_decode:   CM = 0, WnR = 1
>    show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp = 0000000067027567
>    show_pte: [0000ffff6dff0000] pgd=000000176bae0003
>    show_pte: , pud=000000176bae0003
>    show_pte: , pmd=000000174ad60003
>    show_pte: , pte=00e80008023a0f53
>    show_pte: , pte_pfn: 8023a
>
>    >>> print bin(0x47)
>    0b1000111
>
>    Per the ARM Architecture Reference Manual (D12),
>        ISS encoding for an exception from a Data Abort:
>      DFSC, bits [5:0], Data Fault Status Code
>      0b000111 Translation fault, level 3
>
> ---
>
> My theory is that the TLB is holding a stale entry.
>
> I made a dummy kernel module that exports a debugfs file, which on read triggers:
>    flush_tlb_all();
>
> Any time the reproducer stalls and I read the debugfs file, it recovers
> immediately and resumes printing to stdout.

That commit doesn't change anything about TLB flushing; it just moves the 
zapping of pages under the read mmap_sem, as MADV_DONTNEED does.

I don't have an aarch64 board to reproduce and debug this on, and I'm not 
familiar with the aarch64 architecture either. But some history tells me 
that parallel page zapping may run into a stale TLB and defer a flush, 
meaning a caller may observe pte_none and fail to flush the TLB. That 
should have been solved by commits 56236a59556c ("mm: refactor TLB 
gathering API") and 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush 
miss problem").

For more detail, please refer to commit 4647706ebeee ("mm: always flush 
VMA ranges affected by zap_page_range"). I've copied Mel and Rik into 
this thread, and also added Will Deacon and Catalin Marinas, the aarch64 
maintainers.

But your test (triggering a TLB flush) does demonstrate that a TLB flush 
is *not* done as expected at some point on aarch64. Could you please give 
the below patch a try?

diff --git a/mm/memory.c b/mm/memory.c
index ab650c2..ef41ad5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,

         mmu_notifier_range_init(&range, vma->vm_mm, start_addr, end_addr);
         mmu_notifier_invalidate_range_start(&range);
-       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
+       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
                 unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+               flush_tlb_range(vma, start_addr, end_addr);
+       }
         mmu_notifier_invalidate_range_end(&range);
  }

>
>>> - I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3 and
>>>     Qualcomm 65-LA-115-151.
>>>     I can reproduce it on both with v5.1-rc7. It's easier to reproduce
>>>     on latter one (for longer periods of time), which has 46 CPUs.
>>> - Sample output of reproducer on otherwise idle system:
>>>     # ./a.out
>>>     [00000314] map_write_unmap took: 26305 ms
>>>     [00000867] map_write_unmap took: 13642 ms
>>>     [00002200] map_write_unmap took: 44237 ms
>>>     [00002851] map_write_unmap took: 992 ms
>>>     [00004725] map_write_unmap took: 542 ms
>>>     [00006443] map_write_unmap took: 5333 ms
>>>     [00006593] map_write_unmap took: 21162 ms
>>>     [00007435] map_write_unmap took: 16982 ms
>>>     [00007488] map_write unmap took: 13 ms^C
>>>
>>> I ran a bisect, which identified following commit as first bad one:
>>>     dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
>>>
>>> I can also make the issue go away with following change:
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 330f12c17fa1..13ce465740e2 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
>>>    SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>>    {
>>>           profile_munmap(addr);
>>> -       return __vm_munmap(addr, len, true);
>>> +       return __vm_munmap(addr, len, false);
>>>    }
>>>
>>> # cat /proc/cpuinfo  | head
>>> processor       : 0
>>> BogoMIPS        : 40.00
>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm
>>> CPU implementer : 0x51
>>> CPU architecture: 8
>>> CPU variant     : 0x0
>>> CPU part        : 0xc00
>>> CPU revision    : 1
>>>
>>> # numactl -H
>>> available: 1 nodes (0)
>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
>>> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
>>> node 0 size: 97938 MB
>>> node 0 free: 95732 MB
>>> node distances:
>>> node   0
>>>     0:  10
>>>
>>> Regards,
>>> Jan
>>>
>>> [1]
>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
>>> [2]
>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config
>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
  2019-05-06 23:07       ` Yang Shi
@ 2019-05-06 23:15         ` Yang Shi
  2019-05-07 11:51         ` Jan Stancek
  1 sibling, 0 replies; 10+ messages in thread
From: Yang Shi @ 2019-05-06 23:15 UTC (permalink / raw)
  To: Jan Stancek, linux-mm, linux-arm-kernel
  Cc: kirill, willy, kirill shutemov, vbabka, Andrea Arcangeli, akpm,
	Waiman Long, Mel Gorman, Rik van Riel, will.deacon,
	catalin.marinas



On 5/6/19 4:07 PM, Yang Shi wrote:
>
>
> On 5/6/19 2:35 PM, Jan Stancek wrote:
>>
>> ----- Original Message -----
>>>
>>> On 5/5/19 7:10 AM, Jan Stancek wrote:
>>>> Hi,
>>>>
>>>> I'm seeing userspace program getting stuck on aarch64, on kernels 
>>>> 4.20 and
>>>> newer.
>>>> It stalls from seconds to hours.
>>>>
>>>> I have simplified it to following scenario (reproducer linked below 
>>>> [1]):
>>>>     while (1):
>>>>       spawn Thread 1: mmap, write, munmap
>>>>       spawn Thread 2: <nothing>
>>>>
>>>> Thread 1 is sporadically getting stuck on write to mapped area. 
>>>> User-space
>>>> is not
>>>> moving forward - stdout output stops. Observed CPU usage is however 
>>>> 100%.
>>>>
>>>> At this time, kernel appears to be busy handling page faults (~700k 
>>>> per
>>>> second):
>>>>
>>>> # perf top -a -g
>>>> -   98.97%     8.30%  a.out                     [.] map_write_unmap
>>>>      - 23.52% map_write_unmap
>>>>         - 24.29% el0_sync
>>>>            - 10.42% do_mem_abort
>>>>               - 17.81% do_translation_fault
>>>>                  - 33.01% do_page_fault
>>>>                     - 56.18% handle_mm_fault
>>>>                          40.26% __handle_mm_fault
>>>>                          2.19% __ll_sc___cmpxchg_case_acq_4
>>>>                          0.87% mem_cgroup_from_task
>>>>                     - 6.18% find_vma
>>>>                          5.38% vmacache_find
>>>>                       1.35% __ll_sc___cmpxchg_case_acq_8
>>>>                       1.23% __ll_sc_atomic64_sub_return_release
>>>>                       0.78% down_read_trylock
>>>>              0.93% do_translation_fault
>>>>      + 8.30% thread_start
>>>>
>>>> #  perf stat -p 8189 -d
>>>> ^C
>>>>    Performance counter stats for process id '8189':
>>>>
>>>>           984.311350      task-clock (msec)         # 1.000 CPUs 
>>>> utilized
>>>>                    0      context-switches          # 0.000 K/sec
>>>>                    0      cpu-migrations            # 0.000 K/sec
>>>>              723,641      page-faults               # 0.735 M/sec
>>>>        2,559,199,434      cycles                    # 2.600 GHz
>>>>          711,933,112      instructions              # 0.28  insn per
>>>>          cycle
>>>>      <not supported>      branches
>>>>              757,658      branch-misses
>>>>          205,840,557      L1-dcache-loads           # 209.121 M/sec
>>>>           40,561,529      L1-dcache-load-misses     # 19.71% of all
>>>>           L1-dcache hits
>>>>      <not supported>      LLC-loads
>>>>      <not supported>      LLC-load-misses
>>>>
>>>>          0.984454892 seconds time elapsed
>>>>
>>>> With some extra traces, it appears looping in page fault for same 
>>>> address,
>>>> over and over:
>>>>     do_page_fault // mm_flags: 0x55
>>>>       __do_page_fault
>>>>         __handle_mm_fault
>>>>           handle_pte_fault
>>>>             ptep_set_access_flags
>>>>               if (pte_same(pte, entry))  // pte: e8000805060f53, 
>>>> entry:
>>>>               e8000805060f53
>>>>
>>>> I had traces in mmap() and munmap() as well, they don't get hit when
>>>> reproducer
>>>> hits the bad state.
>>>>
>>>> Notes:
>>>> - I'm not able to reproduce this on x86.
>>>> - Attaching GDB or strace immediately recovers the application from
>>>> stall.
>>>> - It also seems to recover faster when system is busy with other 
>>>> tasks.
>>>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
>>>> - Turning off THP makes no difference.
>>>> - Reproducer [1] usually hits it within ~minute on HW described below.
>>>> - Longman mentioned that "When the rwsem becomes reader-owned, it 
>>>> causes
>>>>     all the spinning writers to go to sleep adding wakeup latency to
>>>>     the time required to finish the critical sections", but this looks
>>>>     like busy loop, so I'm not sure if it's related to rwsem issues
>>>>     identified
>>>>     in:
>>>> https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/ 
>>>>
>>> It sounds possible to me. The optimization done by the commit ("mm:
>>> mmap: zap pages with read mmap_sem in munmap") is to downgrade the write
>>> rwsem to read when zapping pages and page tables in munmap(), after the
>>> vmas have been detached from the rbtree.
>>>
>>> So the mmap(), which is writer, in your test may steal the lock and
>>> execute with the munmap(), which is the reader after the downgrade, in
>>> parallel to break the mutual exclusion.
>>>
>>> In this case, the parallel mmap() may map to the same area since vmas
>>> have been detached by munmap(), then mmap() may create exactly the same
>>> vmas, and the page fault happens on the same vma at the same address.
>>>
>>> I'm not sure why gdb or strace could recover this, but they use ptrace
>>> which may acquire mmap_sem to break the parallel inadvertently.
>>>
>>> May you please try Waiman's patch to see if it makes any difference?
>> I don't see any difference in behaviour after applying:
>>    [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader 
>> count before increment
>> Issue is still easily reproducible for me.
>>
>> I'm including output of mem_abort_decode() / show_pte() for sample 
>> PTE, that
>> I see in page fault loop. (I went through all bits, but couldn't find 
>> anything invalid about it)
>>
>>    mem_abort_decode: Mem abort info:
>>    mem_abort_decode:   ESR = 0x92000047
>>    mem_abort_decode:   Exception class = DABT (lower EL), IL = 32 bits
>>    mem_abort_decode:   SET = 0, FnV = 0
>>    mem_abort_decode:   EA = 0, S1PTW = 0
>>    mem_abort_decode: Data abort info:
>>    mem_abort_decode:   ISV = 0, ISS = 0x00000047
>>    mem_abort_decode:   CM = 0, WnR = 1
>>    show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp = 
>> 0000000067027567
>>    show_pte: [0000ffff6dff0000] pgd=000000176bae0003
>>    show_pte: , pud=000000176bae0003
>>    show_pte: , pmd=000000174ad60003
>>    show_pte: , pte=00e80008023a0f53
>>    show_pte: , pte_pfn: 8023a
>>
>>    >>> print bin(0x47)
>>    0b1000111
>>
>>    Per D12-2779 (ARM Architecture Reference Manual),
>>        ISS encoding for an exception from a Data Abort:
>>      DFSC, bits [5:0], Data Fault Status Code
>>      0b000111 Translation fault, level 3
>>
>> ---
>>
>> My theory is that TLB is getting broken.
>>
>> I made a dummy kernel module that exports debugfs file, which on read 
>> triggers:
>>    flush_tlb_all();
>>
>> Any time reproducer stalls and I read debugfs file, it recovers
>> immediately and resumes printing to stdout.
>
> That commit doesn't change anything about TLB flush, just moves zapping
> pages under the read mmap_sem, as MADV_DONTNEED does.
>
> I don't have an aarch64 board to reproduce and debug it, and I'm not
> familiar with the aarch64 architecture either. But, some history tells
> me that parallel page zapping may run into a stale TLB when a flush is
> deferred, meaning this path may observe pte_none and fail to flush the
> TLB. That, however, has been solved by commit 56236a59556c ("mm: refactor
> TLB gathering API") and 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED]
> TLB flush miss problem").
>
> For more detail, please refer to commit 4647706ebeee ("mm: always 
> flush VMA ranges affected by zap_page_range"). 

Actually, that commit has since been reverted; I mean refer to the revert
commit, 50c150f26261 ("Revert "mm: always flush VMA ranges
affected by zap_page_range""). Its commit log explains why the two
commits above are sufficient and this one could be reverted.

> Copied Mel and Rik in this thread. Also added Will Deacon and Catalin 
> Marinas, who are aarch64 maintainers, in this loop
>
> But, your test (triggering TLB flush) does demonstrate TLB flush is 
> *not* done properly at some point as expected for aarch64. Could you 
> please give the below patch a try?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index ab650c2..ef41ad5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,
>
>         mmu_notifier_range_init(&range, vma->vm_mm, start_addr, 
> end_addr);
>         mmu_notifier_invalidate_range_start(&range);
> -       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> +       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
>                 unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> +               flush_tlb_range(vma, start_addr, end_addr);
> +       }
>         mmu_notifier_invalidate_range_end(&range);
>  }
>
>>
>>>> - I tried 2 different aarch64 systems so far: APM X-Gene CPU 
>>>> Potenza A3 and
>>>>     Qualcomm 65-LA-115-151.
>>>>     I can reproduce it on both with v5.1-rc7. It's easier to reproduce
>>>>     on latter one (for longer periods of time), which has 46 CPUs.
>>>> - Sample output of reproducer on otherwise idle system:
>>>>     # ./a.out
>>>>     [00000314] map_write_unmap took: 26305 ms
>>>>     [00000867] map_write_unmap took: 13642 ms
>>>>     [00002200] map_write_unmap took: 44237 ms
>>>>     [00002851] map_write_unmap took: 992 ms
>>>>     [00004725] map_write_unmap took: 542 ms
>>>>     [00006443] map_write_unmap took: 5333 ms
>>>>     [00006593] map_write_unmap took: 21162 ms
>>>>     [00007435] map_write_unmap took: 16982 ms
>>>>     [00007488] map_write_unmap took: 13 ms^C
>>>>
>>>> I ran a bisect, which identified following commit as first bad one:
>>>>     dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
>>>>
>>>> I can also make the issue go away with following change:
>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>> index 330f12c17fa1..13ce465740e2 100644
>>>> --- a/mm/mmap.c
>>>> +++ b/mm/mmap.c
>>>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
>>>>    SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>>>    {
>>>>           profile_munmap(addr);
>>>> -       return __vm_munmap(addr, len, true);
>>>> +       return __vm_munmap(addr, len, false);
>>>>    }
>>>>
>>>> # cat /proc/cpuinfo  | head
>>>> processor       : 0
>>>> BogoMIPS        : 40.00
>>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid 
>>>> asimdrdm
>>>> CPU implementer : 0x51
>>>> CPU architecture: 8
>>>> CPU variant     : 0x0
>>>> CPU part        : 0xc00
>>>> CPU revision    : 1
>>>>
>>>> # numactl -H
>>>> available: 1 nodes (0)
>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
>>>> 21 22 23
>>>> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
>>>> node 0 size: 97938 MB
>>>> node 0 free: 95732 MB
>>>> node distances:
>>>> node   0
>>>>     0:  10
>>>>
>>>> Regards,
>>>> Jan
>>>>
>>>> [1]
>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c 
>>>>
>>>> [2]
>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config 
>>>>
>>>
>



* Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
  2019-05-06 23:07       ` Yang Shi
  2019-05-06 23:15         ` Yang Shi
@ 2019-05-07 11:51         ` Jan Stancek
  2019-05-07 16:42           ` Yang Shi
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Stancek @ 2019-05-07 11:51 UTC (permalink / raw)
  To: Yang Shi, will deacon, linux-arm-kernel, linux-mm
  Cc: kirill, willy, kirill shutemov, vbabka, Andrea Arcangeli, akpm,
	Waiman Long, Mel Gorman, Rik van Riel, catalin marinas,
	Jan Stancek


----- Original Message -----
> 
> 
> On 5/6/19 2:35 PM, Jan Stancek wrote:
> >
> > ----- Original Message -----
> >>
> >> On 5/5/19 7:10 AM, Jan Stancek wrote:
> >>> Hi,
> >>>
> >>> I'm seeing userspace program getting stuck on aarch64, on kernels 4.20
> >>> and
> >>> newer.
> >>> It stalls from seconds to hours.
> >>>
> >>> I have simplified it to following scenario (reproducer linked below [1]):
> >>>     while (1):
> >>>       spawn Thread 1: mmap, write, munmap
> >>>       spawn Thread 2: <nothing>
> >>>
> >>> Thread 1 is sporadically getting stuck on write to mapped area.
> >>> User-space
> >>> is not
> >>> moving forward - stdout output stops. Observed CPU usage is however 100%.
> >>>
> >>> At this time, kernel appears to be busy handling page faults (~700k per
> >>> second):
> >>>
> >>> # perf top -a -g
> >>> -   98.97%     8.30%  a.out                     [.] map_write_unmap
> >>>      - 23.52% map_write_unmap
> >>>         - 24.29% el0_sync
> >>>            - 10.42% do_mem_abort
> >>>               - 17.81% do_translation_fault
> >>>                  - 33.01% do_page_fault
> >>>                     - 56.18% handle_mm_fault
> >>>                          40.26% __handle_mm_fault
> >>>                          2.19% __ll_sc___cmpxchg_case_acq_4
> >>>                          0.87% mem_cgroup_from_task
> >>>                     - 6.18% find_vma
> >>>                          5.38% vmacache_find
> >>>                       1.35% __ll_sc___cmpxchg_case_acq_8
> >>>                       1.23% __ll_sc_atomic64_sub_return_release
> >>>                       0.78% down_read_trylock
> >>>              0.93% do_translation_fault
> >>>      + 8.30% thread_start
> >>>
> >>> #  perf stat -p 8189 -d
> >>> ^C
> >>>    Performance counter stats for process id '8189':
> >>>
> >>>           984.311350      task-clock (msec)         #    1.000 CPUs
> >>>           utilized
> >>>                    0      context-switches          #    0.000 K/sec
> >>>                    0      cpu-migrations            #    0.000 K/sec
> >>>              723,641      page-faults               #    0.735 M/sec
> >>>        2,559,199,434      cycles                    #    2.600 GHz
> >>>          711,933,112      instructions              #    0.28  insn per
> >>>          cycle
> >>>      <not supported>      branches
> >>>              757,658      branch-misses
> >>>          205,840,557      L1-dcache-loads           #  209.121 M/sec
> >>>           40,561,529      L1-dcache-load-misses     #   19.71% of all
> >>>           L1-dcache hits
> >>>      <not supported>      LLC-loads
> >>>      <not supported>      LLC-load-misses
> >>>
> >>>          0.984454892 seconds time elapsed
> >>>
> >>> With some extra traces, it appears looping in page fault for same
> >>> address,
> >>> over and over:
> >>>     do_page_fault // mm_flags: 0x55
> >>>       __do_page_fault
> >>>         __handle_mm_fault
> >>>           handle_pte_fault
> >>>             ptep_set_access_flags
> >>>               if (pte_same(pte, entry))  // pte: e8000805060f53, entry:
> >>>               e8000805060f53
> >>>
> >>> I had traces in mmap() and munmap() as well, they don't get hit when
> >>> reproducer
> >>> hits the bad state.
> >>>
> >>> Notes:
> >>> - I'm not able to reproduce this on x86.
> >>> - Attaching GDB or strace immediately recovers the application from stall.
> >>> - It also seems to recover faster when system is busy with other tasks.
> >>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
> >>> - Turning off THP makes no difference.
> >>> - Reproducer [1] usually hits it within ~minute on HW described below.
> >>> - Longman mentioned that "When the rwsem becomes reader-owned, it causes
> >>>     all the spinning writers to go to sleep adding wakeup latency to
> >>>     the time required to finish the critical sections", but this looks
> >>>     like busy loop, so I'm not sure if it's related to rwsem issues
> >>>     identified
> >>>     in:
> >>>     https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
> >> It sounds possible to me. The optimization done by the commit ("mm:
> >> mmap: zap pages with read mmap_sem in munmap") is to downgrade the write
> >> rwsem to read when zapping pages and page tables in munmap(), after the
> >> vmas have been detached from the rbtree.
> >>
> >> So the mmap(), which is writer, in your test may steal the lock and
> >> execute with the munmap(), which is the reader after the downgrade, in
> >> parallel to break the mutual exclusion.
> >>
> >> In this case, the parallel mmap() may map to the same area since vmas
> >> have been detached by munmap(), then mmap() may create exactly the same
> >> vmas, and the page fault happens on the same vma at the same address.
> >>
> >> I'm not sure why gdb or strace could recover this, but they use ptrace
> >> which may acquire mmap_sem to break the parallel inadvertently.
> >>
> >> May you please try Waiman's patch to see if it makes any difference?
> > I don't see any difference in behaviour after applying:
> >    [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count
> >    before increment
> > Issue is still easily reproducible for me.
> >
> > I'm including output of mem_abort_decode() / show_pte() for sample PTE,
> > that
> > I see in page fault loop. (I went through all bits, but couldn't find
> > anything invalid about it)
> >
> >    mem_abort_decode: Mem abort info:
> >    mem_abort_decode:   ESR = 0x92000047
> >    mem_abort_decode:   Exception class = DABT (lower EL), IL = 32 bits
> >    mem_abort_decode:   SET = 0, FnV = 0
> >    mem_abort_decode:   EA = 0, S1PTW = 0
> >    mem_abort_decode: Data abort info:
> >    mem_abort_decode:   ISV = 0, ISS = 0x00000047
> >    mem_abort_decode:   CM = 0, WnR = 1
> >    show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp = 0000000067027567
> >    show_pte: [0000ffff6dff0000] pgd=000000176bae0003
> >    show_pte: , pud=000000176bae0003
> >    show_pte: , pmd=000000174ad60003
> >    show_pte: , pte=00e80008023a0f53
> >    show_pte: , pte_pfn: 8023a
> >
> >    >>> print bin(0x47)
> >    0b1000111
> >
> >    Per D12-2779 (ARM Architecture Reference Manual),
> >        ISS encoding for an exception from a Data Abort:
> >      DFSC, bits [5:0], Data Fault Status Code
> >      0b000111 Translation fault, level 3
> >
> > ---
> >
> > My theory is that TLB is getting broken.

Theory continued:

unmap_region() is batching updates to TLB (for vmas and page tables).
And at the same time another thread handles page fault for same mm,
which increases "tlb_flush_pending".

tlb_finish_mmu() called from unmap_region() will thus set 'force = 1'.
And arch_tlb_finish_mmu() will in turn reset the TLB range, presumably making
it smaller than it would be if force == 0.

Change below appears to fix it:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index f2f03c655807..a4cef21bd62b 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -93,7 +93,7 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
        struct mmu_gather_batch *batch, *next;
 
        if (force) {
-               __tlb_reset_range(tlb);
                __tlb_adjust_range(tlb, start, end - start);
        }

> >
> > I made a dummy kernel module that exports debugfs file, which on read
> > triggers:
> >    flush_tlb_all();
> >
> > Any time reproducer stalls and I read debugfs file, it recovers
> > immediately and resumes printing to stdout.
> 
> That commit doesn't change anything about TLB flush, just moves zapping
> pages under the read mmap_sem, as MADV_DONTNEED does.
> 
> I don't have an aarch64 board to reproduce and debug it, and I'm not
> familiar with the aarch64 architecture either. But, some history tells me
> that parallel page zapping may run into a stale TLB when a flush is
> deferred, meaning this path may observe pte_none and fail to flush the
> TLB. That, however, has been solved by commit 56236a59556c ("mm: refactor
> TLB gathering API") and 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB
> flush miss problem").
> 
> For more detail, please refer to commit 4647706ebeee ("mm: always flush
> VMA ranges affected by zap_page_range"). Copied Mel and Rik in this
> thread. Also added Will Deacon and Catalin Marinas, who are aarch64
> maintainers, in this loop

Thanks

> 
> But, your test (triggering TLB flush) does demonstrate TLB flush is
> *not* done properly at some point as expected for aarch64. Could you
> please give the below patch a try?

Your patch also fixes my reproducer.

> 
> diff --git a/mm/memory.c b/mm/memory.c
> index ab650c2..ef41ad5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,
> 
>          mmu_notifier_range_init(&range, vma->vm_mm, start_addr, end_addr);
>          mmu_notifier_invalidate_range_start(&range);
> -       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> +       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
>                  unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> +               flush_tlb_range(vma, start_addr, end_addr);
> +       }
>          mmu_notifier_invalidate_range_end(&range);
>   }
> 
> >
> >>> - I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3
> >>> and
> >>>     Qualcomm 65-LA-115-151.
> >>>     I can reproduce it on both with v5.1-rc7. It's easier to reproduce
> >>>     on latter one (for longer periods of time), which has 46 CPUs.
> >>> - Sample output of reproducer on otherwise idle system:
> >>>     # ./a.out
> >>>     [00000314] map_write_unmap took: 26305 ms
> >>>     [00000867] map_write_unmap took: 13642 ms
> >>>     [00002200] map_write_unmap took: 44237 ms
> >>>     [00002851] map_write_unmap took: 992 ms
> >>>     [00004725] map_write_unmap took: 542 ms
> >>>     [00006443] map_write_unmap took: 5333 ms
> >>>     [00006593] map_write_unmap took: 21162 ms
> >>>     [00007435] map_write_unmap took: 16982 ms
> >>>     [00007488] map_write_unmap took: 13 ms^C
> >>>
> >>> I ran a bisect, which identified following commit as first bad one:
> >>>     dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> >>>
> >>> I can also make the issue go away with following change:
> >>> diff --git a/mm/mmap.c b/mm/mmap.c
> >>> index 330f12c17fa1..13ce465740e2 100644
> >>> --- a/mm/mmap.c
> >>> +++ b/mm/mmap.c
> >>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
> >>>    SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
> >>>    {
> >>>           profile_munmap(addr);
> >>> -       return __vm_munmap(addr, len, true);
> >>> +       return __vm_munmap(addr, len, false);
> >>>    }
> >>>
> >>> # cat /proc/cpuinfo  | head
> >>> processor       : 0
> >>> BogoMIPS        : 40.00
> >>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
> >>> asimdrdm
> >>> CPU implementer : 0x51
> >>> CPU architecture: 8
> >>> CPU variant     : 0x0
> >>> CPU part        : 0xc00
> >>> CPU revision    : 1
> >>>
> >>> # numactl -H
> >>> available: 1 nodes (0)
> >>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> >>> 23
> >>> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
> >>> node 0 size: 97938 MB
> >>> node 0 free: 95732 MB
> >>> node distances:
> >>> node   0
> >>>     0:  10
> >>>
> >>> Regards,
> >>> Jan
> >>>
> >>> [1]
> >>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
> >>> [2]
> >>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config
> >>
> 
> 



* Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
  2019-05-07 11:51         ` Jan Stancek
@ 2019-05-07 16:42           ` Yang Shi
  2019-05-07 17:19             ` Jan Stancek
  0 siblings, 1 reply; 10+ messages in thread
From: Yang Shi @ 2019-05-07 16:42 UTC (permalink / raw)
  To: Jan Stancek, will deacon, linux-arm-kernel, linux-mm
  Cc: kirill, willy, kirill shutemov, vbabka, Andrea Arcangeli, akpm,
	Waiman Long, Mel Gorman, Rik van Riel, catalin marinas



On 5/7/19 4:51 AM, Jan Stancek wrote:
> ----- Original Message -----
>>
>> On 5/6/19 2:35 PM, Jan Stancek wrote:
>>> ----- Original Message -----
>>>> On 5/5/19 7:10 AM, Jan Stancek wrote:
>>>>> Hi,
>>>>>
>>>>> I'm seeing userspace program getting stuck on aarch64, on kernels 4.20
>>>>> and
>>>>> newer.
>>>>> It stalls from seconds to hours.
>>>>>
>>>>> I have simplified it to following scenario (reproducer linked below [1]):
>>>>>      while (1):
>>>>>        spawn Thread 1: mmap, write, munmap
>>>>>        spawn Thread 2: <nothing>
>>>>>
>>>>> Thread 1 is sporadically getting stuck on write to mapped area.
>>>>> User-space
>>>>> is not
>>>>> moving forward - stdout output stops. Observed CPU usage is however 100%.
>>>>>
>>>>> At this time, kernel appears to be busy handling page faults (~700k per
>>>>> second):
>>>>>
>>>>> # perf top -a -g
>>>>> -   98.97%     8.30%  a.out                     [.] map_write_unmap
>>>>>       - 23.52% map_write_unmap
>>>>>          - 24.29% el0_sync
>>>>>             - 10.42% do_mem_abort
>>>>>                - 17.81% do_translation_fault
>>>>>                   - 33.01% do_page_fault
>>>>>                      - 56.18% handle_mm_fault
>>>>>                           40.26% __handle_mm_fault
>>>>>                           2.19% __ll_sc___cmpxchg_case_acq_4
>>>>>                           0.87% mem_cgroup_from_task
>>>>>                      - 6.18% find_vma
>>>>>                           5.38% vmacache_find
>>>>>                        1.35% __ll_sc___cmpxchg_case_acq_8
>>>>>                        1.23% __ll_sc_atomic64_sub_return_release
>>>>>                        0.78% down_read_trylock
>>>>>               0.93% do_translation_fault
>>>>>       + 8.30% thread_start
>>>>>
>>>>> #  perf stat -p 8189 -d
>>>>> ^C
>>>>>     Performance counter stats for process id '8189':
>>>>>
>>>>>            984.311350      task-clock (msec)         #    1.000 CPUs
>>>>>            utilized
>>>>>                     0      context-switches          #    0.000 K/sec
>>>>>                     0      cpu-migrations            #    0.000 K/sec
>>>>>               723,641      page-faults               #    0.735 M/sec
>>>>>         2,559,199,434      cycles                    #    2.600 GHz
>>>>>           711,933,112      instructions              #    0.28  insn per
>>>>>           cycle
>>>>>       <not supported>      branches
>>>>>               757,658      branch-misses
>>>>>           205,840,557      L1-dcache-loads           #  209.121 M/sec
>>>>>            40,561,529      L1-dcache-load-misses     #   19.71% of all
>>>>>            L1-dcache hits
>>>>>       <not supported>      LLC-loads
>>>>>       <not supported>      LLC-load-misses
>>>>>
>>>>>           0.984454892 seconds time elapsed
>>>>>
>>>>> With some extra traces, it appears looping in page fault for same
>>>>> address,
>>>>> over and over:
>>>>>      do_page_fault // mm_flags: 0x55
>>>>>        __do_page_fault
>>>>>          __handle_mm_fault
>>>>>            handle_pte_fault
>>>>>              ptep_set_access_flags
>>>>>                if (pte_same(pte, entry))  // pte: e8000805060f53, entry:
>>>>>                e8000805060f53
>>>>>
>>>>> I had traces in mmap() and munmap() as well, they don't get hit when
>>>>> reproducer
>>>>> hits the bad state.
>>>>>
>>>>> Notes:
>>>>> - I'm not able to reproduce this on x86.
>>>>> - Attaching GDB or strace immediately recovers the application from stall.
>>>>> - It also seems to recover faster when system is busy with other tasks.
>>>>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
>>>>> - Turning off THP makes no difference.
>>>>> - Reproducer [1] usually hits it within ~minute on HW described below.
>>>>> - Longman mentioned that "When the rwsem becomes reader-owned, it causes
>>>>>      all the spinning writers to go to sleep adding wakeup latency to
>>>>>      the time required to finish the critical sections", but this looks
>>>>>      like busy loop, so I'm not sure if it's related to rwsem issues
>>>>>      identified
>>>>>      in:
>>>>>      https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
>>>> It sounds possible to me. The optimization done by the commit ("mm:
>>>> mmap: zap pages with read mmap_sem in munmap") is to downgrade the write
>>>> rwsem to read when zapping pages and page tables in munmap(), after the
>>>> vmas have been detached from the rbtree.
>>>>
>>>> So the mmap(), which is writer, in your test may steal the lock and
>>>> execute with the munmap(), which is the reader after the downgrade, in
>>>> parallel to break the mutual exclusion.
>>>>
>>>> In this case, the parallel mmap() may map to the same area since vmas
>>>> have been detached by munmap(), then mmap() may create exactly the same
>>>> vmas, and the page fault happens on the same vma at the same address.
>>>>
>>>> I'm not sure why gdb or strace could recover this, but they use ptrace
>>>> which may acquire mmap_sem to break the parallel inadvertently.
>>>>
>>>> May you please try Waiman's patch to see if it makes any difference?
>>> I don't see any difference in behaviour after applying:
>>>     [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count
>>>     before increment
>>> Issue is still easily reproducible for me.
>>>
>>> I'm including output of mem_abort_decode() / show_pte() for sample PTE,
>>> that
>>> I see in page fault loop. (I went through all bits, but couldn't find
>>> anything invalid about it)
>>>
>>>     mem_abort_decode: Mem abort info:
>>>     mem_abort_decode:   ESR = 0x92000047
>>>     mem_abort_decode:   Exception class = DABT (lower EL), IL = 32 bits
>>>     mem_abort_decode:   SET = 0, FnV = 0
>>>     mem_abort_decode:   EA = 0, S1PTW = 0
>>>     mem_abort_decode: Data abort info:
>>>     mem_abort_decode:   ISV = 0, ISS = 0x00000047
>>>     mem_abort_decode:   CM = 0, WnR = 1
>>>     show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp = 0000000067027567
>>>     show_pte: [0000ffff6dff0000] pgd=000000176bae0003
>>>     show_pte: , pud=000000176bae0003
>>>     show_pte: , pmd=000000174ad60003
>>>     show_pte: , pte=00e80008023a0f53
>>>     show_pte: , pte_pfn: 8023a
>>>
>>>     >>> print bin(0x47)
>>>     0b1000111
>>>
>>>     Per D12-2779 (ARM Architecture Reference Manual),
>>>         ISS encoding for an exception from a Data Abort:
>>>       DFSC, bits [5:0], Data Fault Status Code
>>>       0b000111 Translation fault, level 3
>>>
>>> ---
>>>
>>> My theory is that TLB is getting broken.
> Theory continued:
>
> unmap_region() is batching updates to TLB (for vmas and page tables).
> And at the same time another thread handles page fault for same mm,
> which increases "tlb_flush_pending".
>
> tlb_finish_mmu() called from unmap_region() will thus set 'force = 1'.
> And arch_tlb_finish_mmu() will in turn reset the TLB range, presumably making
> it smaller than it would be if force == 0.
>
> Change below appears to fix it:
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index f2f03c655807..a4cef21bd62b 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -93,7 +93,7 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
>          struct mmu_gather_batch *batch, *next;
>   
>          if (force) {
> -               __tlb_reset_range(tlb);
>                  __tlb_adjust_range(tlb, start, end - start);

I don't get why the change could fix it. __tlb_reset_range() just resets
start and end to TASK_SIZE and 0, then __tlb_adjust_range() sets the proper
start and end. I don't see why "force" would flush a smaller range.

>          }
>
>>> I made a dummy kernel module that exports debugfs file, which on read
>>> triggers:
>>>     flush_tlb_all();
>>>
>>> Any time reproducer stalls and I read debugfs file, it recovers
>>> immediately and resumes printing to stdout.
>> That commit doesn't change anything about TLB flushing; it just moves the
>> zapping of pages under the read mmap_sem, as MADV_DONTNEED does.
>>
>> I don't have an aarch64 board to reproduce and debug it, and I'm not
>> familiar with the aarch64 architecture either. But some history tells me
>> that parallel page zapping may run into a stale TLB and defer a flush,
>> meaning a caller may observe pte_none and fail to flush the TLB. But
>> this has been solved by commit 56236a59556c ("mm: refactor TLB gathering
>> API") and 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss
>> problem").
>>
>> For more detail, please refer to commit 4647706ebeee ("mm: always flush
>> VMA ranges affected by zap_page_range"). Copied Mel and Rik in this
>> thread. Also added Will Deacon and Catalin Marinas, the aarch64
>> maintainers, to this loop.
> Thanks
>
>> But, your test (triggering TLB flush) does demonstrate TLB flush is
>> *not* done properly at some point as expected for aarch64. Could you
>> please give the below patch a try?
> Your patch also fixes my reproducer.

Thanks for testing it.

>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index ab650c2..ef41ad5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,
>>
>>           mmu_notifier_range_init(&range, vma->vm_mm, start_addr, end_addr);
>>           mmu_notifier_invalidate_range_start(&range);
>> -       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>> +       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
>>                   unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
>> +               flush_tlb_range(vma, start_addr, end_addr);
>> +       }
>>           mmu_notifier_invalidate_range_end(&range);
>>    }
>>
>>>>> - I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3
>>>>> and
>>>>>      Qualcomm 65-LA-115-151.
>>>>>      I can reproduce it on both with v5.1-rc7. It's easier to reproduce
>>>>>      on latter one (for longer periods of time), which has 46 CPUs.
>>>>> - Sample output of reproducer on otherwise idle system:
>>>>>      # ./a.out
>>>>>      [00000314] map_write_unmap took: 26305 ms
>>>>>      [00000867] map_write_unmap took: 13642 ms
>>>>>      [00002200] map_write_unmap took: 44237 ms
>>>>>      [00002851] map_write_unmap took: 992 ms
>>>>>      [00004725] map_write_unmap took: 542 ms
>>>>>      [00006443] map_write_unmap took: 5333 ms
>>>>>      [00006593] map_write_unmap took: 21162 ms
>>>>>      [00007435] map_write_unmap took: 16982 ms
>>>>>      [00007488] map_write unmap took: 13 ms^C
>>>>>
>>>>> I ran a bisect, which identified following commit as first bad one:
>>>>>      dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
>>>>>
>>>>> I can also make the issue go away with following change:
>>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>>> index 330f12c17fa1..13ce465740e2 100644
>>>>> --- a/mm/mmap.c
>>>>> +++ b/mm/mmap.c
>>>>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
>>>>>     SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>>>>     {
>>>>>            profile_munmap(addr);
>>>>> -       return __vm_munmap(addr, len, true);
>>>>> +       return __vm_munmap(addr, len, false);
>>>>>     }
>>>>>
>>>>> # cat /proc/cpuinfo  | head
>>>>> processor       : 0
>>>>> BogoMIPS        : 40.00
>>>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
>>>>> asimdrdm
>>>>> CPU implementer : 0x51
>>>>> CPU architecture: 8
>>>>> CPU variant     : 0x0
>>>>> CPU part        : 0xc00
>>>>> CPU revision    : 1
>>>>>
>>>>> # numactl -H
>>>>> available: 1 nodes (0)
>>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
>>>>> 23
>>>>> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
>>>>> node 0 size: 97938 MB
>>>>> node 0 free: 95732 MB
>>>>> node distances:
>>>>> node   0
>>>>>      0:  10
>>>>>
>>>>> Regards,
>>>>> Jan
>>>>>
>>>>> [1]
>>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
>>>>> [2]
>>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config
>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
  2019-05-07 16:42           ` Yang Shi
@ 2019-05-07 17:19             ` Jan Stancek
  2019-05-07 19:55               ` Yang Shi
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Stancek @ 2019-05-07 17:19 UTC (permalink / raw)
  To: Yang Shi
  Cc: will deacon, linux-arm-kernel, linux-mm, kirill, willy,
	kirill shutemov, vbabka, Andrea Arcangeli, akpm, Waiman Long,
	Mel Gorman, Rik van Riel, catalin marinas, Jan Stancek


----- Original Message -----
> 
> 
> On 5/7/19 4:51 AM, Jan Stancek wrote:
> > ----- Original Message -----
> >>
> >> On 5/6/19 2:35 PM, Jan Stancek wrote:
> >>> ----- Original Message -----
> >>>> On 5/5/19 7:10 AM, Jan Stancek wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm seeing userspace program getting stuck on aarch64, on kernels 4.20
> >>>>> and
> >>>>> newer.
> >>>>> It stalls from seconds to hours.
> >>>>>
> >>>>> I have simplified it to following scenario (reproducer linked below
> >>>>> [1]):
> >>>>>      while (1):
> >>>>>        spawn Thread 1: mmap, write, munmap
> >>>>>        spawn Thread 2: <nothing>
> >>>>>
> >>>>> Thread 1 is sporadically getting stuck on write to mapped area.
> >>>>> User-space
> >>>>> is not
> >>>>> moving forward - stdout output stops. Observed CPU usage is however
> >>>>> 100%.
> >>>>>
> >>>>> At this time, kernel appears to be busy handling page faults (~700k per
> >>>>> second):
> >>>>>
> >>>>> # perf top -a -g
> >>>>> -   98.97%     8.30%  a.out                     [.] map_write_unmap
> >>>>>       - 23.52% map_write_unmap
> >>>>>          - 24.29% el0_sync
> >>>>>             - 10.42% do_mem_abort
> >>>>>                - 17.81% do_translation_fault
> >>>>>                   - 33.01% do_page_fault
> >>>>>                      - 56.18% handle_mm_fault
> >>>>>                           40.26% __handle_mm_fault
> >>>>>                           2.19% __ll_sc___cmpxchg_case_acq_4
> >>>>>                           0.87% mem_cgroup_from_task
> >>>>>                      - 6.18% find_vma
> >>>>>                           5.38% vmacache_find
> >>>>>                        1.35% __ll_sc___cmpxchg_case_acq_8
> >>>>>                        1.23% __ll_sc_atomic64_sub_return_release
> >>>>>                        0.78% down_read_trylock
> >>>>>               0.93% do_translation_fault
> >>>>>       + 8.30% thread_start
> >>>>>
> >>>>> #  perf stat -p 8189 -d
> >>>>> ^C
> >>>>>     Performance counter stats for process id '8189':
> >>>>>
> >>>>>            984.311350      task-clock (msec)         #    1.000 CPUs
> >>>>>            utilized
> >>>>>                     0      context-switches          #    0.000 K/sec
> >>>>>                     0      cpu-migrations            #    0.000 K/sec
> >>>>>               723,641      page-faults               #    0.735 M/sec
> >>>>>         2,559,199,434      cycles                    #    2.600 GHz
> >>>>>           711,933,112      instructions              #    0.28  insn
> >>>>>           per
> >>>>>           cycle
> >>>>>       <not supported>      branches
> >>>>>               757,658      branch-misses
> >>>>>           205,840,557      L1-dcache-loads           #  209.121 M/sec
> >>>>>            40,561,529      L1-dcache-load-misses     #   19.71% of all
> >>>>>            L1-dcache hits
> >>>>>       <not supported>      LLC-loads
> >>>>>       <not supported>      LLC-load-misses
> >>>>>
> >>>>>           0.984454892 seconds time elapsed
> >>>>>
> >>>>> With some extra traces, it appears looping in page fault for same
> >>>>> address,
> >>>>> over and over:
> >>>>>      do_page_fault // mm_flags: 0x55
> >>>>>        __do_page_fault
> >>>>>          __handle_mm_fault
> >>>>>            handle_pte_fault
> >>>>>              ptep_set_access_flags
> >>>>>                if (pte_same(pte, entry))  // pte: e8000805060f53,
> >>>>>                entry:
> >>>>>                e8000805060f53
> >>>>>
> >>>>> I had traces in mmap() and munmap() as well, they don't get hit when
> >>>>> reproducer
> >>>>> hits the bad state.
> >>>>>
> >>>>> Notes:
> >>>>> - I'm not able to reproduce this on x86.
> >>>>> - Attaching GDB or strace immediately recovers the application from the stall.
> >>>>> - It also seems to recover faster when system is busy with other tasks.
> >>>>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
> >>>>> - Turning off THP makes no difference.
> >>>>> - Reproducer [1] usually hits it within ~minute on HW described below.
> >>>>> - Longman mentioned that "When the rwsem becomes reader-owned, it
> >>>>> causes
> >>>>>      all the spinning writers to go to sleep adding wakeup latency to
> >>>>>      the time required to finish the critical sections", but this looks
> >>>>>      like busy loop, so I'm not sure if it's related to rwsem issues
> >>>>>      identified
> >>>>>      in:
> >>>>>      https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
> >>>> It sounds possible to me. The optimization done by the commit ("mm:
> >>>> mmap: zap pages with read mmap_sem in munmap") is to downgrade the write
> >>>> rwsem to read when zapping pages and page tables in munmap(), after the
> >>>> vmas have been detached from the rbtree.
> >>>>
> >>>> So the mmap(), which is a writer, in your test may steal the lock and
> >>>> execute in parallel with the munmap(), which is a reader after the
> >>>> downgrade, breaking the mutual exclusion.
> >>>>
> >>>> In this case, the parallel mmap() may map to the same area, since the
> >>>> vmas have been detached by munmap(); mmap() may then create identical
> >>>> vmas, and the page fault happens on the same vma at the same address.
> >>>>
> >>>> I'm not sure why gdb or strace can recover this, but they use ptrace,
> >>>> which may acquire the mmap_sem and inadvertently break the parallelism.
> >>>>
> >>>> May you please try Waiman's patch to see if it makes any difference?
> >>> I don't see any difference in behaviour after applying:
> >>>     [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count
> >>>     before increment
> >>> Issue is still easily reproducible for me.
> >>>
> >>> I'm including output of mem_abort_decode() / show_pte() for sample PTE,
> >>> that
> >>> I see in page fault loop. (I went through all bits, but couldn't find
> >>> anything invalid about it)
> >>>
> >>>     mem_abort_decode: Mem abort info:
> >>>     mem_abort_decode:   ESR = 0x92000047
> >>>     mem_abort_decode:   Exception class = DABT (lower EL), IL = 32 bits
> >>>     mem_abort_decode:   SET = 0, FnV = 0
> >>>     mem_abort_decode:   EA = 0, S1PTW = 0
> >>>     mem_abort_decode: Data abort info:
> >>>     mem_abort_decode:   ISV = 0, ISS = 0x00000047
> >>>     mem_abort_decode:   CM = 0, WnR = 1
> >>>     show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp =
> >>>     0000000067027567
> >>>     show_pte: [0000ffff6dff0000] pgd=000000176bae0003
> >>>     show_pte: , pud=000000176bae0003
> >>>     show_pte: , pmd=000000174ad60003
> >>>     show_pte: , pte=00e80008023a0f53
> >>>     show_pte: , pte_pfn: 8023a
> >>>
> >>>     >>> print bin(0x47)
> >>>     0b1000111
> >>>
> >>>     Per D12-2779 (ARM Architecture Reference Manual),
> >>>         ISS encoding for an exception from a Data Abort
> >>>         (the EC above is DABT, so DFSC applies, not IFSC):
> >>>       DFSC, bits [5:0], Data Fault Status Code
> >>>       0b000111 Translation fault, level 3
> >>>
> >>> ---
> >>>
> >>> My theory is that TLB is getting broken.
> > Theory continued:
> >
> > unmap_region() is batching updates to TLB (for vmas and page tables).
> > And at the same time another thread handles page fault for same mm,
> > which increases "tlb_flush_pending".
> >
> > tlb_finish_mmu() called from unmap_region() will thus set 'force = 1'.
> > And arch_tlb_finish_mmu() will in turn reset the TLB range, presumably making
> > it smaller than it would be if force == 0.
> >
> > Change below appears to fix it:
> >
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index f2f03c655807..a4cef21bd62b 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -93,7 +93,7 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
> >          struct mmu_gather_batch *batch, *next;
> >   
> >          if (force) {
> > -               __tlb_reset_range(tlb);
> >                  __tlb_adjust_range(tlb, start, end - start);
> 
> I don't see why this change could fix it.

My guess is that the reset clears "tlb->freed_tables", which changes how
tlb_flush() operates, see "bool last_level = !tlb->freed_tables;" in
arch/arm64/include/asm/tlb.h. Maybe a last-level-only flush then fails to
invalidate some cached intermediate (table) entries? No clue.

If I let it reset the range, but preserve "freed_tables", it also
seems to solve the problem:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index f2f03c655807..17fb0d7edc03 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -93,8 +93,20 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
        struct mmu_gather_batch *batch, *next;
 
        if (force) {
-               __tlb_reset_range(tlb);
+               if (tlb->fullmm) {
+                       tlb->start = tlb->end = ~0;
+               } else {
+                       tlb->start = TASK_SIZE;
+                       tlb->end = 0;
+               }
                __tlb_adjust_range(tlb, start, end - start);

> __tlb_reset_range() just resets
> start and end to TASK_SIZE and 0, then __tlb_adjust_range() sets the proper
> start and end. Why would "force" flush a smaller range?

I'm still trying to understand this part. It's actually not smaller, but it changes:

unmap_region()
  # vm_start: ffff49bd0000 vm_end: ffff49be0000
  ...
  # tlb.start, tlb.end: 1000000000000 0
  free_pgtables()
  # tlb.start, tlb.end: ffff40000000 ffff40010000
  tlb_finish_mmu()
    arch_tlb_finish_mmu()
      # will see force == 1
      # resets tlb.start, tlb.end to: ffff49bd0000 ffff49be0000

> 
> >          }
> >
> >>> I made a dummy kernel module that exports debugfs file, which on read
> >>> triggers:
> >>>     flush_tlb_all();
> >>>
> >>> Any time reproducer stalls and I read debugfs file, it recovers
> >>> immediately and resumes printing to stdout.
> >> That commit doesn't change anything about TLB flushing; it just moves the
> >> zapping of pages under the read mmap_sem, as MADV_DONTNEED does.
> >>
> >> I don't have an aarch64 board to reproduce and debug it, and I'm not
> >> familiar with the aarch64 architecture either. But some history tells me
> >> that parallel page zapping may run into a stale TLB and defer a flush,
> >> meaning a caller may observe pte_none and fail to flush the TLB. But
> >> this has been solved by commit 56236a59556c ("mm: refactor TLB gathering
> >> API") and 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss
> >> problem").
> >>
> >> For more detail, please refer to commit 4647706ebeee ("mm: always flush
> >> VMA ranges affected by zap_page_range"). Copied Mel and Rik in this
> >> thread. Also added Will Deacon and Catalin Marinas, the aarch64
> >> maintainers, to this loop.
> > Thanks
> >
> >> But, your test (triggering TLB flush) does demonstrate TLB flush is
> >> *not* done properly at some point as expected for aarch64. Could you
> >> please give the below patch a try?
> > Your patch also fixes my reproducer.
> 
> Thanks for testing it.
> 
> >
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index ab650c2..ef41ad5 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,
> >>
> >>           mmu_notifier_range_init(&range, vma->vm_mm, start_addr,
> >>           end_addr);
> >>           mmu_notifier_invalidate_range_start(&range);
> >> -       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> >> +       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
> >>                   unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> >> +               flush_tlb_range(vma, start_addr, end_addr);
> >> +       }
> >>           mmu_notifier_invalidate_range_end(&range);
> >>    }
> >>
> >>>>> - I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3
> >>>>> and
> >>>>>      Qualcomm 65-LA-115-151.
> >>>>>      I can reproduce it on both with v5.1-rc7. It's easier to reproduce
> >>>>>      on latter one (for longer periods of time), which has 46 CPUs.
> >>>>> - Sample output of reproducer on otherwise idle system:
> >>>>>      # ./a.out
> >>>>>      [00000314] map_write_unmap took: 26305 ms
> >>>>>      [00000867] map_write_unmap took: 13642 ms
> >>>>>      [00002200] map_write_unmap took: 44237 ms
> >>>>>      [00002851] map_write_unmap took: 992 ms
> >>>>>      [00004725] map_write_unmap took: 542 ms
> >>>>>      [00006443] map_write_unmap took: 5333 ms
> >>>>>      [00006593] map_write_unmap took: 21162 ms
> >>>>>      [00007435] map_write_unmap took: 16982 ms
> >>>>>      [00007488] map_write unmap took: 13 ms^C
> >>>>>
> >>>>> I ran a bisect, which identified following commit as first bad one:
> >>>>>      dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> >>>>>
> >>>>> I can also make the issue go away with following change:
> >>>>> diff --git a/mm/mmap.c b/mm/mmap.c
> >>>>> index 330f12c17fa1..13ce465740e2 100644
> >>>>> --- a/mm/mmap.c
> >>>>> +++ b/mm/mmap.c
> >>>>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
> >>>>>     SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
> >>>>>     {
> >>>>>            profile_munmap(addr);
> >>>>> -       return __vm_munmap(addr, len, true);
> >>>>> +       return __vm_munmap(addr, len, false);
> >>>>>     }
> >>>>>
> >>>>> # cat /proc/cpuinfo  | head
> >>>>> processor       : 0
> >>>>> BogoMIPS        : 40.00
> >>>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
> >>>>> asimdrdm
> >>>>> CPU implementer : 0x51
> >>>>> CPU architecture: 8
> >>>>> CPU variant     : 0x0
> >>>>> CPU part        : 0xc00
> >>>>> CPU revision    : 1
> >>>>>
> >>>>> # numactl -H
> >>>>> available: 1 nodes (0)
> >>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> >>>>> 23
> >>>>> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
> >>>>> node 0 size: 97938 MB
> >>>>> node 0 free: 95732 MB
> >>>>> node distances:
> >>>>> node   0
> >>>>>      0:  10
> >>>>>
> >>>>> Regards,
> >>>>> Jan
> >>>>>
> >>>>> [1]
> >>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
> >>>>> [2]
> >>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config
> >>
> 
> 



* Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
  2019-05-07 17:19             ` Jan Stancek
@ 2019-05-07 19:55               ` Yang Shi
  2019-05-07 21:04                 ` Yang Shi
  0 siblings, 1 reply; 10+ messages in thread
From: Yang Shi @ 2019-05-07 19:55 UTC (permalink / raw)
  To: Jan Stancek
  Cc: will deacon, linux-arm-kernel, linux-mm, kirill, willy,
	kirill shutemov, vbabka, Andrea Arcangeli, akpm, Waiman Long,
	Mel Gorman, Rik van Riel, catalin marinas



On 5/7/19 10:19 AM, Jan Stancek wrote:
> ----- Original Message -----
>>
>> On 5/7/19 4:51 AM, Jan Stancek wrote:
>>> ----- Original Message -----
>>>> On 5/6/19 2:35 PM, Jan Stancek wrote:
>>>>> ----- Original Message -----
>>>>>> On 5/5/19 7:10 AM, Jan Stancek wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm seeing userspace program getting stuck on aarch64, on kernels 4.20
>>>>>>> and
>>>>>>> newer.
>>>>>>> It stalls from seconds to hours.
>>>>>>>
>>>>>>> I have simplified it to following scenario (reproducer linked below
>>>>>>> [1]):
>>>>>>>       while (1):
>>>>>>>         spawn Thread 1: mmap, write, munmap
>>>>>>>         spawn Thread 2: <nothing>
>>>>>>>
>>>>>>> Thread 1 is sporadically getting stuck on write to mapped area.
>>>>>>> User-space
>>>>>>> is not
>>>>>>> moving forward - stdout output stops. Observed CPU usage is however
>>>>>>> 100%.
>>>>>>>
>>>>>>> At this time, kernel appears to be busy handling page faults (~700k per
>>>>>>> second):
>>>>>>>
>>>>>>> # perf top -a -g
>>>>>>> -   98.97%     8.30%  a.out                     [.] map_write_unmap
>>>>>>>        - 23.52% map_write_unmap
>>>>>>>           - 24.29% el0_sync
>>>>>>>              - 10.42% do_mem_abort
>>>>>>>                 - 17.81% do_translation_fault
>>>>>>>                    - 33.01% do_page_fault
>>>>>>>                       - 56.18% handle_mm_fault
>>>>>>>                            40.26% __handle_mm_fault
>>>>>>>                            2.19% __ll_sc___cmpxchg_case_acq_4
>>>>>>>                            0.87% mem_cgroup_from_task
>>>>>>>                       - 6.18% find_vma
>>>>>>>                            5.38% vmacache_find
>>>>>>>                         1.35% __ll_sc___cmpxchg_case_acq_8
>>>>>>>                         1.23% __ll_sc_atomic64_sub_return_release
>>>>>>>                         0.78% down_read_trylock
>>>>>>>                0.93% do_translation_fault
>>>>>>>        + 8.30% thread_start
>>>>>>>
>>>>>>> #  perf stat -p 8189 -d
>>>>>>> ^C
>>>>>>>      Performance counter stats for process id '8189':
>>>>>>>
>>>>>>>             984.311350      task-clock (msec)         #    1.000 CPUs
>>>>>>>             utilized
>>>>>>>                      0      context-switches          #    0.000 K/sec
>>>>>>>                      0      cpu-migrations            #    0.000 K/sec
>>>>>>>                723,641      page-faults               #    0.735 M/sec
>>>>>>>          2,559,199,434      cycles                    #    2.600 GHz
>>>>>>>            711,933,112      instructions              #    0.28  insn
>>>>>>>            per
>>>>>>>            cycle
>>>>>>>        <not supported>      branches
>>>>>>>                757,658      branch-misses
>>>>>>>            205,840,557      L1-dcache-loads           #  209.121 M/sec
>>>>>>>             40,561,529      L1-dcache-load-misses     #   19.71% of all
>>>>>>>             L1-dcache hits
>>>>>>>        <not supported>      LLC-loads
>>>>>>>        <not supported>      LLC-load-misses
>>>>>>>
>>>>>>>            0.984454892 seconds time elapsed
>>>>>>>
>>>>>>> With some extra traces, it appears looping in page fault for same
>>>>>>> address,
>>>>>>> over and over:
>>>>>>>       do_page_fault // mm_flags: 0x55
>>>>>>>         __do_page_fault
>>>>>>>           __handle_mm_fault
>>>>>>>             handle_pte_fault
>>>>>>>               ptep_set_access_flags
>>>>>>>                 if (pte_same(pte, entry))  // pte: e8000805060f53,
>>>>>>>                 entry:
>>>>>>>                 e8000805060f53
>>>>>>>
>>>>>>> I had traces in mmap() and munmap() as well, they don't get hit when
>>>>>>> reproducer
>>>>>>> hits the bad state.
>>>>>>>
>>>>>>> Notes:
>>>>>>> - I'm not able to reproduce this on x86.
>>>>>>> - Attaching GDB or strace immediately recovers the application from the stall.
>>>>>>> - It also seems to recover faster when system is busy with other tasks.
>>>>>>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
>>>>>>> - Turning off THP makes no difference.
>>>>>>> - Reproducer [1] usually hits it within ~minute on HW described below.
>>>>>>> - Longman mentioned that "When the rwsem becomes reader-owned, it
>>>>>>> causes
>>>>>>>       all the spinning writers to go to sleep adding wakeup latency to
>>>>>>>       the time required to finish the critical sections", but this looks
>>>>>>>       like busy loop, so I'm not sure if it's related to rwsem issues
>>>>>>>       identified
>>>>>>>       in:
>>>>>>>       https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
>>>>>> It sounds possible to me. The optimization done by the commit ("mm:
>>>>>> mmap: zap pages with read mmap_sem in munmap") is to downgrade the write
>>>>>> rwsem to read when zapping pages and page tables in munmap(), after the
>>>>>> vmas have been detached from the rbtree.
>>>>>>
>>>>>> So the mmap(), which is a writer, in your test may steal the lock and
>>>>>> execute in parallel with the munmap(), which is a reader after the
>>>>>> downgrade, breaking the mutual exclusion.
>>>>>>
>>>>>> In this case, the parallel mmap() may map to the same area, since the
>>>>>> vmas have been detached by munmap(); mmap() may then create identical
>>>>>> vmas, and the page fault happens on the same vma at the same address.
>>>>>>
>>>>>> I'm not sure why gdb or strace can recover this, but they use ptrace,
>>>>>> which may acquire the mmap_sem and inadvertently break the parallelism.
>>>>>>
>>>>>> May you please try Waiman's patch to see if it makes any difference?
>>>>> I don't see any difference in behaviour after applying:
>>>>>      [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count
>>>>>      before increment
>>>>> Issue is still easily reproducible for me.
>>>>>
>>>>> I'm including output of mem_abort_decode() / show_pte() for sample PTE,
>>>>> that
>>>>> I see in page fault loop. (I went through all bits, but couldn't find
>>>>> anything invalid about it)
>>>>>
>>>>>      mem_abort_decode: Mem abort info:
>>>>>      mem_abort_decode:   ESR = 0x92000047
>>>>>      mem_abort_decode:   Exception class = DABT (lower EL), IL = 32 bits
>>>>>      mem_abort_decode:   SET = 0, FnV = 0
>>>>>      mem_abort_decode:   EA = 0, S1PTW = 0
>>>>>      mem_abort_decode: Data abort info:
>>>>>      mem_abort_decode:   ISV = 0, ISS = 0x00000047
>>>>>      mem_abort_decode:   CM = 0, WnR = 1
>>>>>      show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp =
>>>>>      0000000067027567
>>>>>      show_pte: [0000ffff6dff0000] pgd=000000176bae0003
>>>>>      show_pte: , pud=000000176bae0003
>>>>>      show_pte: , pmd=000000174ad60003
>>>>>      show_pte: , pte=00e80008023a0f53
>>>>>      show_pte: , pte_pfn: 8023a
>>>>>
>>>>>      >>> print bin(0x47)
>>>>>      0b1000111
>>>>>
>>>>>      Per D12-2779 (ARM Architecture Reference Manual),
>>>>>          ISS encoding for an exception from a Data Abort
>>>>>          (the EC above is DABT, so DFSC applies, not IFSC):
>>>>>        DFSC, bits [5:0], Data Fault Status Code
>>>>>        0b000111 Translation fault, level 3
>>>>>
>>>>> ---
>>>>>
>>>>> My theory is that TLB is getting broken.
>>> Theory continued:
>>>
>>> unmap_region() is batching updates to TLB (for vmas and page tables).
>>> And at the same time another thread handles page fault for same mm,
>>> which increases "tlb_flush_pending".
>>>
>>> tlb_finish_mmu() called from unmap_region() will thus set 'force = 1'.
>>> And arch_tlb_finish_mmu() will in turn reset the TLB range, presumably making
>>> it smaller than it would be if force == 0.
>>>
>>> Change below appears to fix it:
>>>
>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>> index f2f03c655807..a4cef21bd62b 100644
>>> --- a/mm/mmu_gather.c
>>> +++ b/mm/mmu_gather.c
>>> @@ -93,7 +93,7 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
>>>           struct mmu_gather_batch *batch, *next;
>>>    
>>>           if (force) {
>>> -               __tlb_reset_range(tlb);
>>>                   __tlb_adjust_range(tlb, start, end - start);
>> I don't see why this change could fix it.
> My guess is that the reset clears "tlb->freed_tables", which changes how
> tlb_flush() operates, see "bool last_level = !tlb->freed_tables;" in
> arch/arm64/include/asm/tlb.h. Maybe a last-level-only flush then fails to
> invalidate some cached intermediate (table) entries? No clue.
>
> If I let it reset the range, but preserve "freed_tables", it also
> seems to solve the problem:

This makes sense. munmap() does free page tables, so "freed_tables" 
should be 1 instead of 0. So, in this case, __tlb_reset_range() should 
not be called.

>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index f2f03c655807..17fb0d7edc03 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -93,8 +93,20 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
>          struct mmu_gather_batch *batch, *next;
>   
>          if (force) {
> -               __tlb_reset_range(tlb);
> +               if (tlb->fullmm) {
> +                       tlb->start = tlb->end = ~0;
> +               } else {
> +                       tlb->start = TASK_SIZE;
> +                       tlb->end = 0;
> +               }
>                  __tlb_adjust_range(tlb, start, end - start);
>
>> __tlb_reset_range() just resets
>> start and end to TASK_SIZE and 0, then __tlb_adjust_range() sets the proper
>> start and end. Why would "force" flush a smaller range?
> I'm still trying to understand this part. It's actually not smaller, but it changes:
>
> unmap_region()
>    # vm_start: ffff49bd0000 vm_end: ffff49be0000
>    ...
>    # tlb.start, tlb.end: 1000000000000 0
>    free_pgtables()
>    # tlb.start, tlb.end: ffff40000000 ffff40010000
>    tlb_finish_mmu()
>      arch_tlb_finish_mmu()
>        # will see force == 1
>        # resets tlb.start, tlb.end to: ffff49bd0000 ffff49be0000
>
>>>           }
>>>
>>>>> I made a dummy kernel module that exports debugfs file, which on read
>>>>> triggers:
>>>>>      flush_tlb_all();
>>>>>
>>>>> Any time reproducer stalls and I read debugfs file, it recovers
>>>>> immediately and resumes printing to stdout.
>>>> That commit doesn't change anything about TLB flushing; it just moves the
>>>> zapping of pages under the read mmap_sem, as MADV_DONTNEED does.
>>>>
>>>> I don't have an aarch64 board to reproduce and debug it, and I'm not
>>>> familiar with the aarch64 architecture either. But some history tells me
>>>> that parallel page zapping may run into a stale TLB and defer a flush,
>>>> meaning a caller may observe pte_none and fail to flush the TLB. But
>>>> this has been solved by commit 56236a59556c ("mm: refactor TLB gathering
>>>> API") and 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss
>>>> problem").
>>>>
>>>> For more detail, please refer to commit 4647706ebeee ("mm: always flush
>>>> VMA ranges affected by zap_page_range"). Copied Mel and Rik in this
>>>> thread. Also added Will Deacon and Catalin Marinas, the aarch64
>>>> maintainers, to this loop.
>>> Thanks
>>>
>>>> But, your test (triggering TLB flush) does demonstrate TLB flush is
>>>> *not* done properly at some point as expected for aarch64. Could you
>>>> please give the below patch a try?
>>> Your patch also fixes my reproducer.
>> Thanks for testing it.
>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index ab650c2..ef41ad5 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,
>>>>
>>>>            mmu_notifier_range_init(&range, vma->vm_mm, start_addr,
>>>>            end_addr);
>>>>            mmu_notifier_invalidate_range_start(&range);
>>>> -       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>>>> +       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
>>>>                    unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
>>>> +               flush_tlb_range(vma, start_addr, end_addr);
>>>> +       }
>>>>            mmu_notifier_invalidate_range_end(&range);
>>>>     }
>>>>
>>>>>>> - I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3
>>>>>>> and
>>>>>>>       Qualcomm 65-LA-115-151.
>>>>>>>       I can reproduce it on both with v5.1-rc7. It's easier to reproduce
>>>>>>>       on latter one (for longer periods of time), which has 46 CPUs.
>>>>>>> - Sample output of reproducer on otherwise idle system:
>>>>>>>       # ./a.out
>>>>>>>       [00000314] map_write_unmap took: 26305 ms
>>>>>>>       [00000867] map_write_unmap took: 13642 ms
>>>>>>>       [00002200] map_write_unmap took: 44237 ms
>>>>>>>       [00002851] map_write_unmap took: 992 ms
>>>>>>>       [00004725] map_write_unmap took: 542 ms
>>>>>>>       [00006443] map_write_unmap took: 5333 ms
>>>>>>>       [00006593] map_write_unmap took: 21162 ms
>>>>>>>       [00007435] map_write_unmap took: 16982 ms
>>>>>>>       [00007488] map_write_unmap took: 13 ms^C
>>>>>>>
>>>>>>> I ran a bisect, which identified the following commit as the first bad one:
>>>>>>>       dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
>>>>>>>
>>>>>>> I can also make the issue go away with the following change:
>>>>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>>>>> index 330f12c17fa1..13ce465740e2 100644
>>>>>>> --- a/mm/mmap.c
>>>>>>> +++ b/mm/mmap.c
>>>>>>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
>>>>>>>      SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>>>>>>      {
>>>>>>>             profile_munmap(addr);
>>>>>>> -       return __vm_munmap(addr, len, true);
>>>>>>> +       return __vm_munmap(addr, len, false);
>>>>>>>      }
>>>>>>>
>>>>>>> # cat /proc/cpuinfo  | head
>>>>>>> processor       : 0
>>>>>>> BogoMIPS        : 40.00
>>>>>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
>>>>>>> asimdrdm
>>>>>>> CPU implementer : 0x51
>>>>>>> CPU architecture: 8
>>>>>>> CPU variant     : 0x0
>>>>>>> CPU part        : 0xc00
>>>>>>> CPU revision    : 1
>>>>>>>
>>>>>>> # numactl -H
>>>>>>> available: 1 nodes (0)
>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
>>>>>>> 23
>>>>>>> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
>>>>>>> node 0 size: 97938 MB
>>>>>>> node 0 free: 95732 MB
>>>>>>> node distances:
>>>>>>> node   0
>>>>>>>       0:  10
>>>>>>>
>>>>>>> Regards,
>>>>>>> Jan
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
>>>>>>> [2]
>>>>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config
>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
  2019-05-07 19:55               ` Yang Shi
@ 2019-05-07 21:04                 ` Yang Shi
  0 siblings, 0 replies; 10+ messages in thread
From: Yang Shi @ 2019-05-07 21:04 UTC (permalink / raw)
  To: Jan Stancek
  Cc: will deacon, linux-arm-kernel, linux-mm, kirill, willy,
	kirill shutemov, vbabka, Andrea Arcangeli, akpm, Waiman Long,
	Mel Gorman, Rik van Riel, catalin marinas



On 5/7/19 12:55 PM, Yang Shi wrote:
>
>
> On 5/7/19 10:19 AM, Jan Stancek wrote:
>> ----- Original Message -----
>>>
>>> On 5/7/19 4:51 AM, Jan Stancek wrote:
>>>> ----- Original Message -----
>>>>> On 5/6/19 2:35 PM, Jan Stancek wrote:
>>>>>> ----- Original Message -----
>>>>>>> On 5/5/19 7:10 AM, Jan Stancek wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm seeing a userspace program getting stuck on aarch64 on
>>>>>>>> kernels 4.20 and newer.
>>>>>>>> It stalls for anywhere from seconds to hours.
>>>>>>>>
>>>>>>>> I have simplified it to the following scenario (reproducer
>>>>>>>> linked below [1]):
>>>>>>>>       while (1):
>>>>>>>>         spawn Thread 1: mmap, write, munmap
>>>>>>>>         spawn Thread 2: <nothing>
>>>>>>>>
>>>>>>>> Thread 1 sporadically gets stuck on the write to the mapped
>>>>>>>> area. User-space is not moving forward - stdout output stops.
>>>>>>>> Observed CPU usage, however, is 100%.
>>>>>>>>
>>>>>>>> At this time, kernel appears to be busy handling page faults 
>>>>>>>> (~700k per
>>>>>>>> second):
>>>>>>>>
>>>>>>>> # perf top -a -g
>>>>>>>> -   98.97%     8.30%  a.out                     [.] 
>>>>>>>> map_write_unmap
>>>>>>>>        - 23.52% map_write_unmap
>>>>>>>>           - 24.29% el0_sync
>>>>>>>>              - 10.42% do_mem_abort
>>>>>>>>                 - 17.81% do_translation_fault
>>>>>>>>                    - 33.01% do_page_fault
>>>>>>>>                       - 56.18% handle_mm_fault
>>>>>>>>                            40.26% __handle_mm_fault
>>>>>>>>                            2.19% __ll_sc___cmpxchg_case_acq_4
>>>>>>>>                            0.87% mem_cgroup_from_task
>>>>>>>>                       - 6.18% find_vma
>>>>>>>>                            5.38% vmacache_find
>>>>>>>>                         1.35% __ll_sc___cmpxchg_case_acq_8
>>>>>>>>                         1.23% __ll_sc_atomic64_sub_return_release
>>>>>>>>                         0.78% down_read_trylock
>>>>>>>>                0.93% do_translation_fault
>>>>>>>>        + 8.30% thread_start
>>>>>>>>
>>>>>>>> #  perf stat -p 8189 -d
>>>>>>>> ^C
>>>>>>>>      Performance counter stats for process id '8189':
>>>>>>>>
>>>>>>>>             984.311350      task-clock (msec)         #    
>>>>>>>> 1.000 CPUs
>>>>>>>>             utilized
>>>>>>>>                      0 context-switches          #    0.000 K/sec
>>>>>>>>                      0 cpu-migrations            #    0.000 K/sec
>>>>>>>>                723,641 page-faults               #    0.735 M/sec
>>>>>>>>          2,559,199,434 cycles                    #    2.600 GHz
>>>>>>>>            711,933,112 instructions              #    0.28  insn
>>>>>>>>            per
>>>>>>>>            cycle
>>>>>>>>        <not supported>      branches
>>>>>>>>                757,658      branch-misses
>>>>>>>>            205,840,557 L1-dcache-loads           #  209.121 M/sec
>>>>>>>>             40,561,529 L1-dcache-load-misses     #   19.71% of all
>>>>>>>>             L1-dcache hits
>>>>>>>>        <not supported>      LLC-loads
>>>>>>>>        <not supported>      LLC-load-misses
>>>>>>>>
>>>>>>>>            0.984454892 seconds time elapsed
>>>>>>>>
>>>>>>>> With some extra traces, it appears looping in page fault for same
>>>>>>>> address,
>>>>>>>> over and over:
>>>>>>>>       do_page_fault // mm_flags: 0x55
>>>>>>>>         __do_page_fault
>>>>>>>>           __handle_mm_fault
>>>>>>>>             handle_pte_fault
>>>>>>>>               ptep_set_access_flags
>>>>>>>>                 if (pte_same(pte, entry))  // pte: e8000805060f53,
>>>>>>>>                 entry:
>>>>>>>>                 e8000805060f53
>>>>>>>>
>>>>>>>> I had traces in mmap() and munmap() as well; they don't get hit
>>>>>>>> when the reproducer is in the bad state.
>>>>>>>>
>>>>>>>> Notes:
>>>>>>>> - I'm not able to reproduce this on x86.
>>>>>>>> - Attaching GDB or strace immediately recovers the application
>>>>>>>> from the stall.
>>>>>>>> - It also seems to recover faster when the system is busy with
>>>>>>>> other tasks.
>>>>>>>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
>>>>>>>> - Turning off THP makes no difference.
>>>>>>>> - Reproducer [1] usually hits it within ~minute on HW described 
>>>>>>>> below.
>>>>>>>> - Longman mentioned that "When the rwsem becomes reader-owned,
>>>>>>>>   it causes all the spinning writers to go to sleep, adding
>>>>>>>>   wakeup latency to the time required to finish the critical
>>>>>>>>   sections", but this looks like a busy loop, so I'm not sure
>>>>>>>>   if it's related to the rwsem issues identified in:
>>>>>>>> https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/ 
>>>>>>>>
>>>>>>> It sounds possible to me. The optimization done by the commit
>>>>>>> ("mm: mmap: zap pages with read mmap_sem in munmap") is to
>>>>>>> downgrade the write rwsem to read while zapping pages and page
>>>>>>> tables in munmap(), after the vmas have been detached from the
>>>>>>> rbtree.
>>>>>>>
>>>>>>> So the mmap() in your test, which is a writer, may steal the lock
>>>>>>> and execute in parallel with the munmap(), which is a reader
>>>>>>> after the downgrade, breaking the mutual exclusion.
>>>>>>>
>>>>>>> In this case, the parallel mmap() may map the same area, since
>>>>>>> the vmas have been detached by munmap(); mmap() may then create
>>>>>>> the exact same vmas, and the page fault happens on the same vma
>>>>>>> at the same address.
>>>>>>>
>>>>>>> I'm not sure why gdb or strace could recover this, but they use
>>>>>>> ptrace, which may acquire mmap_sem and inadvertently break the
>>>>>>> parallelism.
>>>>>>>
>>>>>>> May you please try Waiman's patch to see if it makes any 
>>>>>>> difference?
>>>>>> I don't see any difference in behaviour after applying:
>>>>>>      [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of 
>>>>>> reader count
>>>>>>      before increment
>>>>>> The issue is still easily reproducible for me.
>>>>>>
>>>>>> I'm including the output of mem_abort_decode() / show_pte() for a
>>>>>> sample PTE that I see in the page fault loop. (I went through all
>>>>>> the bits, but couldn't find anything invalid about it.)
>>>>>>
>>>>>>      mem_abort_decode: Mem abort info:
>>>>>>      mem_abort_decode:   ESR = 0x92000047
>>>>>>      mem_abort_decode:   Exception class = DABT (lower EL), IL = 
>>>>>> 32 bits
>>>>>>      mem_abort_decode:   SET = 0, FnV = 0
>>>>>>      mem_abort_decode:   EA = 0, S1PTW = 0
>>>>>>      mem_abort_decode: Data abort info:
>>>>>>      mem_abort_decode:   ISV = 0, ISS = 0x00000047
>>>>>>      mem_abort_decode:   CM = 0, WnR = 1
>>>>>>      show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp =
>>>>>>      0000000067027567
>>>>>>      show_pte: [0000ffff6dff0000] pgd=000000176bae0003
>>>>>>      show_pte: , pud=000000176bae0003
>>>>>>      show_pte: , pmd=000000174ad60003
>>>>>>      show_pte: , pte=00e80008023a0f53
>>>>>>      show_pte: , pte_pfn: 8023a
>>>>>>
>>>>>>      >>> print bin(0x47)
>>>>>>      0b1000111
>>>>>>
>>>>>>      Per D12-2779 (ARM Architecture Reference Manual),
>>>>>>          ISS encoding for an exception from a Data Abort:
>>>>>>        DFSC, bits [5:0], Data Fault Status Code
>>>>>>        0b000111 Translation fault, level 3
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> My theory is that TLB is getting broken.
>>>> Theory continued:
>>>>
>>>> unmap_region() batches TLB updates (for vmas and page tables).
>>>> Meanwhile, another thread handles a page fault for the same mm,
>>>> which increments "tlb_flush_pending".
>>>>
>>>> tlb_finish_mmu() called from unmap_region() will thus set 'force = 1',
>>>> and arch_tlb_finish_mmu() will in turn reset the TLB range,
>>>> presumably making it smaller than it would be if force == 0.
>>>>
>>>> Change below appears to fix it:
>>>>
>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>>> index f2f03c655807..a4cef21bd62b 100644
>>>> --- a/mm/mmu_gather.c
>>>> +++ b/mm/mmu_gather.c
>>>> @@ -93,7 +93,7 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
>>>>           struct mmu_gather_batch *batch, *next;
>>>>              if (force) {
>>>> -               __tlb_reset_range(tlb);
>>>>                   __tlb_adjust_range(tlb, start, end - start);
>>> I don't see how the change could fix it.
>> My guess is that reset clears "tlb->freed_tables", which changes how
>> tlb_flush() operates, see "bool last_level = !tlb->freed_tables;" in
>> arch/arm64/include/asm/tlb.h. Maybe that doesn't clear some intermediate
>> entries? No clue.
>>
>> If I let it reset the range, but preserve "freed_tables", it also
>> seems to solve the problem:
>
> This makes sense. munmap() does free page tables, so "freed_tables" 
> should be 1 instead of 0. So, in this case, __tlb_reset_range() should 
> not be called.

Further investigation shows the __tlb_reset_range() call was added by 
commit a6d60245d6d9 ("asm-generic/tlb: Track which levels of the page 
tables have been cleared"), which aims to make the TLB flush smarter 
for huge pages.

That commit was also merged in 4.20, alongside my munmap optimization. 
So it looks like nobody noticed that munmap(), unlike MADV_DONTNEED, 
could run into the "force" TLB flush while also removing page tables. 
So, I believe removing __tlb_reset_range() should be the right fix.
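The lossy step can be modeled outside the kernel in a few lines. This is a standalone sketch, not kernel code: the field and function names mirror struct mmu_gather and the arm64 tlb_flush() logic ("bool last_level = !tlb->freed_tables;"), but everything here is a simplified stand-in. The point it demonstrates: __tlb_reset_range() clears freed_tables along with the range, so the forced flush in arch_tlb_finish_mmu() looks like a last-level-only flush even though munmap() freed page tables.

```c
#include <stdbool.h>

/* Standalone model of the relevant struct mmu_gather fields. */
struct tlb_model {
	unsigned long start, end;
	bool fullmm;
	bool freed_tables;
};

/* Models __tlb_reset_range(): note it clears freed_tables too. */
void model_tlb_reset_range(struct tlb_model *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0UL;
	} else {
		tlb->start = ~0UL;	/* stand-in for TASK_SIZE */
		tlb->end = 0;
	}
	tlb->freed_tables = false;	/* the information that is lost */
}

/* Models __tlb_adjust_range(): grow the range to cover [address, address+size). */
void model_tlb_adjust_range(struct tlb_model *tlb,
			    unsigned long address, unsigned long size)
{
	if (address < tlb->start)
		tlb->start = address;
	if (address + size > tlb->end)
		tlb->end = address + size;
}

/*
 * Models the arm64 tlb_flush() decision: a last-level-only invalidation
 * is used iff no page tables were freed.
 */
bool model_flush_is_last_level(const struct tlb_model *tlb)
{
	return !tlb->freed_tables;
}

/*
 * The munmap() force path as merged in v4.20: free_pgtables() set
 * freed_tables, then the force branch resets and re-adjusts the range.
 * Returns true when the final flush would wrongly skip intermediate
 * (table) entries.
 */
bool model_munmap_force_path(bool keep_reset_range)
{
	struct tlb_model tlb = { .start = ~0UL, .end = 0 };

	tlb.freed_tables = true;	/* free_pgtables() ran */
	model_tlb_adjust_range(&tlb, 0xffff40000000UL, 0x10000UL);

	if (keep_reset_range)		/* force == 1 branch */
		model_tlb_reset_range(&tlb);
	model_tlb_adjust_range(&tlb, 0xffff49bd0000UL, 0x10000UL);

	return model_flush_is_last_level(&tlb);
}
```

With the __tlb_reset_range() call in place (keep_reset_range == true) the model reports a last-level-only flush despite the freed page tables; dropping the call preserves freed_tables and the full flush.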

>
>>
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index f2f03c655807..17fb0d7edc03 100644
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -93,8 +93,20 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
>>          struct mmu_gather_batch *batch, *next;
>>            if (force) {
>> -               __tlb_reset_range(tlb);
>> +               if (tlb->fullmm) {
>> +                       tlb->start = tlb->end = ~0;
>> +               } else {
>> +                       tlb->start = TASK_SIZE;
>> +                       tlb->end = 0;
>> +               }
>>                  __tlb_adjust_range(tlb, start, end - start);
>>
>>> __tlb_reset_range() just resets
>>> start and end to TASK_SIZE and 0, then __tlb_adjust_range() sets the
>>> proper start and end. I don't get why "force" would flush a smaller
>>> range?
>> I'm still trying to understand this part. It's actually not smaller, 
>> but it changes:
>>
>> unmap_region()
>>    # vm_start: ffff49bd0000 vm_end: ffff49be0000
>>    ...
>>    # tlb.start, tlb.end: 1000000000000 0
>>    free_pgtables()
>>    # tlb.start, tlb.end: ffff40000000 ffff40010000
>>    tlb_finish_mmu()
>>      arch_tlb_finish_mmu()
>>        # will see force == 1
>>        # resets tlb.start, tlb.end to: ffff49bd0000 ffff49be0000
>>
>>>>           }
>>>>
>>>>>> I made a dummy kernel module that exports a debugfs file, which
>>>>>> on read triggers:
>>>>>>      flush_tlb_all();
>>>>>>
>>>>>> Any time the reproducer stalls and I read the debugfs file, it
>>>>>> recovers immediately and resumes printing to stdout.
>>>>> That commit doesn't change anything about TLB flushing; it just
>>>>> moves the zapping of pages under the read mmap_sem, as
>>>>> MADV_DONTNEED does.
>>>>>
>>>>> I don't have an aarch64 board to reproduce and debug it, and I'm
>>>>> not familiar with the aarch64 architecture either. But history
>>>>> tells me that parallel page zapping may run into a stale TLB and
>>>>> defer a flush, meaning a caller may observe pte_none and fail to
>>>>> flush the TLB. But this has been solved by commit 56236a59556c
>>>>> ("mm: refactor TLB gathering API") and 99baac21e458 ("mm: fix
>>>>> MADV_[FREE|DONTNEED] TLB flush miss problem").
>>>>>
>>>>> For more detail, please refer to commit 4647706ebeee ("mm: always 
>>>>> flush
>>>>> VMA ranges affected by zap_page_range"). Copied Mel and Rik in this
>>>>> thread. Also added Will Deacon and Catalin Marinas, the aarch64
>>>>> maintainers, to this loop.
>>>> Thanks
>>>>
>>>>> But your test (triggering a TLB flush) does demonstrate that the
>>>>> TLB flush is *not* done properly at some point on aarch64, as
>>>>> expected. Could you please give the patch below a try?
>>>> Your patch also fixes my reproducer.
>>> Thanks for testing it.
>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index ab650c2..ef41ad5 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,
>>>>>
>>>>>            mmu_notifier_range_init(&range, vma->vm_mm, start_addr,
>>>>>            end_addr);
>>>>> mmu_notifier_invalidate_range_start(&range);
>>>>> -       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>>>>> +       for ( ; vma && vma->vm_start < end_addr; vma = 
>>>>> vma->vm_next) {
>>>>>                    unmap_single_vma(tlb, vma, start_addr, 
>>>>> end_addr, NULL);
>>>>> +               flush_tlb_range(vma, start_addr, end_addr);
>>>>> +       }
>>>>>            mmu_notifier_invalidate_range_end(&range);
>>>>>     }
>>>>>
>>>>>>>> - I tried 2 different aarch64 systems so far: APM X-Gene CPU 
>>>>>>>> Potenza A3
>>>>>>>> and
>>>>>>>>       Qualcomm 65-LA-115-151.
>>>>>>>>       I can reproduce it on both with v5.1-rc7. It's easier to 
>>>>>>>> reproduce
>>>>>>>>       on latter one (for longer periods of time), which has 46 
>>>>>>>> CPUs.
>>>>>>>> - Sample output of reproducer on otherwise idle system:
>>>>>>>>       # ./a.out
>>>>>>>>       [00000314] map_write_unmap took: 26305 ms
>>>>>>>>       [00000867] map_write_unmap took: 13642 ms
>>>>>>>>       [00002200] map_write_unmap took: 44237 ms
>>>>>>>>       [00002851] map_write_unmap took: 992 ms
>>>>>>>>       [00004725] map_write_unmap took: 542 ms
>>>>>>>>       [00006443] map_write_unmap took: 5333 ms
>>>>>>>>       [00006593] map_write_unmap took: 21162 ms
>>>>>>>>       [00007435] map_write_unmap took: 16982 ms
>>>>>>>>       [00007488] map_write_unmap took: 13 ms^C
>>>>>>>>
>>>>>>>> I ran a bisect, which identified the following commit as the
>>>>>>>> first bad one:
>>>>>>>>       dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in 
>>>>>>>> munmap")
>>>>>>>>
>>>>>>>> I can also make the issue go away with the following change:
>>>>>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>>>>>> index 330f12c17fa1..13ce465740e2 100644
>>>>>>>> --- a/mm/mmap.c
>>>>>>>> +++ b/mm/mmap.c
>>>>>>>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
>>>>>>>>      SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>>>>>>>      {
>>>>>>>>             profile_munmap(addr);
>>>>>>>> -       return __vm_munmap(addr, len, true);
>>>>>>>> +       return __vm_munmap(addr, len, false);
>>>>>>>>      }
>>>>>>>>
>>>>>>>> # cat /proc/cpuinfo  | head
>>>>>>>> processor       : 0
>>>>>>>> BogoMIPS        : 40.00
>>>>>>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
>>>>>>>> asimdrdm
>>>>>>>> CPU implementer : 0x51
>>>>>>>> CPU architecture: 8
>>>>>>>> CPU variant     : 0x0
>>>>>>>> CPU part        : 0xc00
>>>>>>>> CPU revision    : 1
>>>>>>>>
>>>>>>>> # numactl -H
>>>>>>>> available: 1 nodes (0)
>>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 
>>>>>>>> 20 21 22
>>>>>>>> 23
>>>>>>>> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
>>>>>>>> node 0 size: 97938 MB
>>>>>>>> node 0 free: 95732 MB
>>>>>>>> node distances:
>>>>>>>> node   0
>>>>>>>>       0:  10
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Jan
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c 
>>>>>>>>
>>>>>>>> [2]
>>>>>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config 
>>>>>>>>
>>>
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-05-07 21:05 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <820667266.20994189.1557058281210.JavaMail.zimbra@redhat.com>
2019-05-05 14:10 ` [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Jan Stancek
2019-05-06 19:04   ` Yang Shi
2019-05-06 21:35     ` Jan Stancek
2019-05-06 23:07       ` Yang Shi
2019-05-06 23:15         ` Yang Shi
2019-05-07 11:51         ` Jan Stancek
2019-05-07 16:42           ` Yang Shi
2019-05-07 17:19             ` Jan Stancek
2019-05-07 19:55               ` Yang Shi
2019-05-07 21:04                 ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).