From: Jan Stancek <jstancek@redhat.com>
To: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Cc: yang shi <yang.shi@linux.alibaba.com>,
	kirill@shutemov.name,  willy@infradead.org,
	 kirill shutemov <kirill.shutemov@linux.intel.com>,
	vbabka@suse.cz,  Andrea Arcangeli <aarcange@redhat.com>,
	akpm@linux-foundation.org,  Waiman Long <longman@redhat.com>,
	Jan Stancek <jstancek@redhat.com>
Subject: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Date: Sun, 5 May 2019 10:10:45 -0400 (EDT)
Message-ID: <1817839533.20996552.1557065445233.JavaMail.zimbra@redhat.com>
In-Reply-To: <820667266.20994189.1557058281210.JavaMail.zimbra@redhat.com>

Hi,

I'm seeing a userspace program getting stuck on aarch64, on kernels 4.20 and newer.
The stall lasts anywhere from seconds to hours.

I have simplified it to the following scenario (reproducer linked below [1];
a rough C sketch of the loop follows the pseudocode):
  while (1):
    spawn Thread 1: mmap, write, munmap
    spawn Thread 2: <nothing>
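
A minimal, self-contained C sketch of that loop is below; the mapping size,
the reporting threshold and the exact thread layout are illustrative and do
not exactly match the linked mmap5.c (build with: gcc -O2 -pthread):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define MAP_SIZE (16 * 4096UL)

static void *map_write_unmap(void *arg)
{
	char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	memset(p, 'x', MAP_SIZE);	/* the write that sporadically stalls */
	munmap(p, MAP_SIZE);
	return NULL;
}

static void *do_nothing(void *arg)
{
	return NULL;
}

int main(void)
{
	struct timespec ts_start, ts_end;
	long i, ms;

	for (i = 0; ; i++) {
		pthread_t t1, t2;

		clock_gettime(CLOCK_MONOTONIC, &ts_start);
		pthread_create(&t1, NULL, map_write_unmap, NULL);
		pthread_create(&t2, NULL, do_nothing, NULL);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		clock_gettime(CLOCK_MONOTONIC, &ts_end);

		ms = (ts_end.tv_sec - ts_start.tv_sec) * 1000 +
		     (ts_end.tv_nsec - ts_start.tv_nsec) / 1000000;
		if (ms > 100)	/* only report iterations that took suspiciously long */
			printf("[%08ld] map_write_unmap took: %ld ms\n", i, ms);
	}
	return 0;
}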

Thread 1 sporadically gets stuck on the write to the mapped area. Userspace makes
no forward progress - stdout output stops - yet observed CPU usage is 100%.

While stalled, the kernel appears to be busy handling page faults (~700k per second):

# perf top -a -g
-   98.97%     8.30%  a.out                     [.] map_write_unmap
   - 23.52% map_write_unmap
      - 24.29% el0_sync
         - 10.42% do_mem_abort
            - 17.81% do_translation_fault
               - 33.01% do_page_fault
                  - 56.18% handle_mm_fault
                       40.26% __handle_mm_fault
                       2.19% __ll_sc___cmpxchg_case_acq_4
                       0.87% mem_cgroup_from_task
                  - 6.18% find_vma
                       5.38% vmacache_find
                    1.35% __ll_sc___cmpxchg_case_acq_8
                    1.23% __ll_sc_atomic64_sub_return_release
                    0.78% down_read_trylock
           0.93% do_translation_fault
   + 8.30% thread_start

#  perf stat -p 8189 -d 
^C
 Performance counter stats for process id '8189':

        984.311350      task-clock (msec)         #    1.000 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
           723,641      page-faults               #    0.735 M/sec
     2,559,199,434      cycles                    #    2.600 GHz
       711,933,112      instructions              #    0.28  insn per cycle
   <not supported>      branches
           757,658      branch-misses
       205,840,557      L1-dcache-loads           #  209.121 M/sec
        40,561,529      L1-dcache-load-misses     #   19.71% of all L1-dcache hits
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       0.984454892 seconds time elapsed

With some extra traces, it appears to be looping in the page fault handler for the same address, over and over:
  do_page_fault // mm_flags: 0x55
    __do_page_fault
      __handle_mm_fault
        handle_pte_fault
          ptep_set_access_flags
            if (pte_same(pte, entry))  // pte: e8000805060f53, entry: e8000805060f53

I had traces in mmap() and munmap() as well; they are not hit while the reproducer
is stuck in this state.
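
For context, here is my (possibly incomplete) reading of that path - a trimmed
paraphrase of handle_pte_fault() in mm/memory.c and ptep_set_access_flags() in
arch/arm64/mm/fault.c around v5.1, not verbatim kernel source. When the PTE
already matches what the fault handler wants to install, ptep_set_access_flags()
returns 0, nothing in the page tables changes, and for a write fault the handler
falls back to flush_tlb_fix_spurious_fault() - so if the CPU keeps using a stale
translation, the same fault can fire again immediately:

/* arch/arm64/mm/fault.c (paraphrased, trimmed) */
int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
			  pte_t *ptep, pte_t entry, int dirty)
{
	pte_t pte = READ_ONCE(*ptep);

	if (pte_same(pte, entry))
		return 0;	/* PTE already up to date, nothing is written */

	/* ... otherwise fold in the access/dirty bits and return 1 ... */
}

/* mm/memory.c, handle_pte_fault() (paraphrased, trimmed) */
	if (ptep_set_access_flags(vma, address, ptep, entry, write_fault)) {
		update_mmu_cache(vma, address, ptep);
	} else {
		/*
		 * PTE was not changed; for a write fault the only remaining
		 * action is a TLB flush for a possibly spurious fault.
		 */
		if (write_fault)
			flush_tlb_fix_spurious_fault(vma, address);
	}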

Notes:
- I'm not able to reproduce this on x86.
- Attaching GDB or strace immediately recovers the application from the stall.
- It also seems to recover faster when the system is busy with other tasks.
- MAP_SHARED vs. MAP_PRIVATE makes no difference.
- Turning off THP makes no difference.
- Reproducer [1] usually hits it within about a minute on the HW described below.
- Longman mentioned that "When the rwsem becomes reader-owned, it causes
  all the spinning writers to go to sleep adding wakeup latency to
  the time required to finish the critical sections", but this looks
  like a busy loop, so I'm not sure it's related to the rwsem issues
  identified in: https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
- I have tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3 and
  Qualcomm 65-LA-115-151.
  I can reproduce it on both with v5.1-rc7. It's easier to reproduce
  (and it stays stuck for longer) on the latter one, which has 46 CPUs.
- Sample output of the reproducer on an otherwise idle system:
  # ./a.out
  [00000314] map_write_unmap took: 26305 ms
  [00000867] map_write_unmap took: 13642 ms
  [00002200] map_write_unmap took: 44237 ms
  [00002851] map_write_unmap took: 992 ms
  [00004725] map_write_unmap took: 542 ms
  [00006443] map_write_unmap took: 5333 ms
  [00006593] map_write_unmap took: 21162 ms
  [00007435] map_write_unmap took: 16982 ms
  [00007488] map_write_unmap took: 13 ms^C

I ran a bisect, which identified the following commit as the first bad one:
  dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

I can also make the issue go away with the following change, which keeps munmap()
from downgrading mmap_sem to a read lock (a paraphrase of the flag's effect follows the diff):
diff --git a/mm/mmap.c b/mm/mmap.c
index 330f12c17fa1..13ce465740e2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
        profile_munmap(addr);
-       return __vm_munmap(addr, len, true);
+       return __vm_munmap(addr, len, false);
 }
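
For anyone not following mm closely: as far as I understand it, the third
argument selects whether munmap() downgrades mmap_sem to a read lock for the
page-zapping phase, so passing false restores the old fully write-locked
behaviour. A trimmed paraphrase of __vm_munmap() from mm/mmap.c after
dd2283f2605e (not verbatim kernel source):

static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
{
	int ret;
	struct mm_struct *mm = current->mm;
	LIST_HEAD(uf);

	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;

	/*
	 * With downgrade == true, __do_munmap() detaches the VMAs under the
	 * write lock, downgrades mmap_sem to a read lock before zapping the
	 * pages, and signals that by returning 1.
	 */
	ret = __do_munmap(mm, start, len, &uf, downgrade);
	if (ret == 1) {
		up_read(&mm->mmap_sem);
		ret = 0;
	} else {
		up_write(&mm->mmap_sem);
	}

	userfaultfd_unmap_complete(mm, &uf);
	return ret;
}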

# cat /proc/cpuinfo  | head
processor       : 0
BogoMIPS        : 40.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm
CPU implementer : 0x51
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xc00
CPU revision    : 1

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
node 0 size: 97938 MB
node 0 free: 95732 MB
node distances:
node   0 
  0:  10 

Regards,
Jan

[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
[2] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config

