Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

From: Jan Stancek <jstancek@redhat.com>
To: Yang Shi <yang.shi@linux.alibaba.com>,
	will deacon <will.deacon@arm.com>,
	 linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org
Cc: kirill@shutemov.name, willy@infradead.org,
	 kirill shutemov <kirill.shutemov@linux.intel.com>,
	vbabka@suse.cz,  Andrea Arcangeli <aarcange@redhat.com>,
	akpm@linux-foundation.org,  Waiman Long <longman@redhat.com>,
	 Mel Gorman <mgorman@techsingularity.net>,
	 Rik van Riel <riel@surriel.com>,
	 catalin marinas <catalin.marinas@arm.com>,
	 Jan Stancek <jstancek@redhat.com>
Subject: Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Date: Tue, 7 May 2019 07:51:29 -0400 (EDT)	[thread overview]
Message-ID: <756571293.21386229.1557229889545.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <2b2006bf-753b-c4b8-e9a2-fd27ae65fe14@linux.alibaba.com>


----- Original Message -----
> 
> 
> On 5/6/19 2:35 PM, Jan Stancek wrote:
> >
> > ----- Original Message -----
> >>
> >> On 5/5/19 7:10 AM, Jan Stancek wrote:
> >>> Hi,
> >>>
> >>> I'm seeing userspace program getting stuck on aarch64, on kernels 4.20
> >>> and
> >>> newer.
> >>> It stalls from seconds to hours.
> >>>
> >>> I have simplified it to following scenario (reproducer linked below [1]):
> >>>     while (1):
> >>>       spawn Thread 1: mmap, write, munmap
> >>>       spawn Thread 2: <nothing>
> >>>
> >>> Thread 1 is sporadically getting stuck on write to mapped area.
> >>> User-space
> >>> is not
> >>> moving forward - stdout output stops. Observed CPU usage is however 100%.
> >>>
> >>> At this time, kernel appears to be busy handling page faults (~700k per
> >>> second):
> >>>
> >>> # perf top -a -g
> >>> -   98.97%     8.30%  a.out                     [.] map_write_unmap
> >>>      - 23.52% map_write_unmap
> >>>         - 24.29% el0_sync
> >>>            - 10.42% do_mem_abort
> >>>               - 17.81% do_translation_fault
> >>>                  - 33.01% do_page_fault
> >>>                     - 56.18% handle_mm_fault
> >>>                          40.26% __handle_mm_fault
> >>>                          2.19% __ll_sc___cmpxchg_case_acq_4
> >>>                          0.87% mem_cgroup_from_task
> >>>                     - 6.18% find_vma
> >>>                          5.38% vmacache_find
> >>>                       1.35% __ll_sc___cmpxchg_case_acq_8
> >>>                       1.23% __ll_sc_atomic64_sub_return_release
> >>>                       0.78% down_read_trylock
> >>>              0.93% do_translation_fault
> >>>      + 8.30% thread_start
> >>>
> >>> #  perf stat -p 8189 -d
> >>> ^C
> >>>    Performance counter stats for process id '8189':
> >>>
> >>>           984.311350      task-clock (msec)         #    1.000 CPUs
> >>>           utilized
> >>>                    0      context-switches          #    0.000 K/sec
> >>>                    0      cpu-migrations            #    0.000 K/sec
> >>>              723,641      page-faults               #    0.735 M/sec
> >>>        2,559,199,434      cycles                    #    2.600 GHz
> >>>          711,933,112      instructions              #    0.28  insn per
> >>>          cycle
> >>>      <not supported>      branches
> >>>              757,658      branch-misses
> >>>          205,840,557      L1-dcache-loads           #  209.121 M/sec
> >>>           40,561,529      L1-dcache-load-misses     #   19.71% of all
> >>>           L1-dcache hits
> >>>      <not supported>      LLC-loads
> >>>      <not supported>      LLC-load-misses
> >>>
> >>>          0.984454892 seconds time elapsed
> >>>
> >>> With some extra traces, it appears looping in page fault for same
> >>> address,
> >>> over and over:
> >>>     do_page_fault // mm_flags: 0x55
> >>>       __do_page_fault
> >>>         __handle_mm_fault
> >>>           handle_pte_fault
> >>>             ptep_set_access_flags
> >>>               if (pte_same(pte, entry))  // pte: e8000805060f53, entry:
> >>>               e8000805060f53
> >>>
> >>> I had traces in mmap() and munmap() as well, they don't get hit when
> >>> reproducer
> >>> hits the bad state.
> >>>
> >>> Notes:
> >>> - I'm not able to reproduce this on x86.
> >>> - Attaching GDB or strace immediatelly recovers application from stall.
> >>> - It also seems to recover faster when system is busy with other tasks.
> >>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
> >>> - Turning off THP makes no difference.
> >>> - Reproducer [1] usually hits it within ~minute on HW described below.
> >>> - Longman mentioned that "When the rwsem becomes reader-owned, it causes
> >>>     all the spinning writers to go to sleep adding wakeup latency to
> >>>     the time required to finish the critical sections", but this looks
> >>>     like busy loop, so I'm not sure if it's related to rwsem issues
> >>>     identified
> >>>     in:
> >>>     https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
> >> It sounds possible to me. What the optimization done by the commit ("mm:
> >> mmap: zap pages with read mmap_sem in munmap") is to downgrade write
> >> rwsem to read when zapping pages and page table in munmap() after the
> >> vmas have been detached from the rbtree.
> >>
> >> So the mmap(), which is writer, in your test may steal the lock and
> >> execute with the munmap(), which is the reader after the downgrade, in
> >> parallel to break the mutual exclusion.
> >>
> >> In this case, the parallel mmap() may map to the same area since vmas
> >> have been detached by munmap(), then mmap() may create the complete same
> >> vmas, and page fault happens on the same vma at the same address.
> >>
> >> I'm not sure why gdb or strace could recover this, but they use ptrace
> >> which may acquire mmap_sem to break the parallel inadvertently.
> >>
> >> May you please try Waiman's patch to see if it makes any difference?
> > I don't see any difference in behaviour after applying:
> >    [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count
> >    before increment
> > Issue is still easily reproducible for me.
> >
> > I'm including output of mem_abort_decode() / show_pte() for sample PTE,
> > that
> > I see in page fault loop. (I went through all bits, but couldn't find
> > anything invalid about it)
> >
> >    mem_abort_decode: Mem abort info:
> >    mem_abort_decode:   ESR = 0x92000047
> >    mem_abort_decode:   Exception class = DABT (lower EL), IL = 32 bits
> >    mem_abort_decode:   SET = 0, FnV = 0
> >    mem_abort_decode:   EA = 0, S1PTW = 0
> >    mem_abort_decode: Data abort info:
> >    mem_abort_decode:   ISV = 0, ISS = 0x00000047
> >    mem_abort_decode:   CM = 0, WnR = 1
> >    show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp = 0000000067027567
> >    show_pte: [0000ffff6dff0000] pgd=000000176bae0003
> >    show_pte: , pud=000000176bae0003
> >    show_pte: , pmd=000000174ad60003
> >    show_pte: , pte=00e80008023a0f53
> >    show_pte: , pte_pfn: 8023a
> >
> >    >>> print bin(0x47)
> >    0b1000111
> >
> >    Per D12-2779 (ARM Architecture Reference Manual),
> >        ISS encoding for an exception from an Instruction Abort:
> >      IFSC, bits [5:0], Instruction Fault Status Code
> >      0b000111 Translation fault, level 3
> >
> > ---
> >
> > My theory is that TLB is getting broken.

Theory continued:

unmap_region() is batching updates to TLB (for vmas and page tables).
And at the same time another thread handles page fault for same mm,
which increases "tlb_flush_pending".

tlb_finish_mmu() called from unmap_region() will thus set 'force = 1'.
And arch_tlb_finish_mmu() will in turn reset TLB range, presumably making
it smaller then it would be if force == 0.

Change below appears to fix it:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index f2f03c655807..a4cef21bd62b 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -93,7 +93,7 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
        struct mmu_gather_batch *batch, *next;
 
        if (force) {
-               __tlb_reset_range(tlb);
                __tlb_adjust_range(tlb, start, end - start);
        }

> >
> > I made a dummy kernel module that exports debugfs file, which on read
> > triggers:
> >    flush_tlb_all();
> >
> > Any time reproducer stalls and I read debugfs file, it recovers
> > immediately and resumes printing to stdout.
> 
> That commit doesn't change anything about TLB flush, just move zapping
> pages under read mmap_sem as what MADV_DONTNEED does.
> 
> I don't have aarch64 board to reproduce and debug it. And, I'm not
> familiar with aarch64 architecture either. But, some history told me the
> parallel zapping page may run into stale TLB and defer a flush meaning
> that this call may observe pte_none and fails to flush the TLB. But,
> this has been solved by commit 56236a59556c ("mm: refactor TLB gathering
> API") and 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss
> problem").
> 
> For more detail, please refer to commit 4647706ebeee ("mm: always flush
> VMA ranges affected by zap_page_range"). Copied Mel and Rik in this
> thread. Also added Will Deacon and Catalin Marinas, who are aarch64
> maintainers, in this loop

Thanks

> 
> But, your test (triggering TLB flush) does demonstrate TLB flush is
> *not* done properly at some point as expected for aarch64. Could you
> please give the below patch a try?

Your patch also fixes my reproducer.

> 
> diff --git a/mm/memory.c b/mm/memory.c
> index ab650c2..ef41ad5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,
> 
>          mmu_notifier_range_init(&range, vma->vm_mm, start_addr, end_addr);
>          mmu_notifier_invalidate_range_start(&range);
> -       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> +       for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
>                  unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> +               flush_tlb_range(vma, start_addr, end_addr);
> +       }
>          mmu_notifier_invalidate_range_end(&range);
>   }
> 
> >
> >>> - I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3
> >>> and
> >>>     Qualcomm 65-LA-115-151.
> >>>     I can reproduce it on both with v5.1-rc7. It's easier to reproduce
> >>>     on latter one (for longer periods of time), which has 46 CPUs.
> >>> - Sample output of reproducer on otherwise idle system:
> >>>     # ./a.out
> >>>     [00000314] map_write_unmap took: 26305 ms
> >>>     [00000867] map_write_unmap took: 13642 ms
> >>>     [00002200] map_write_unmap took: 44237 ms
> >>>     [00002851] map_write_unmap took: 992 ms
> >>>     [00004725] map_write_unmap took: 542 ms
> >>>     [00006443] map_write_unmap took: 5333 ms
> >>>     [00006593] map_write_unmap took: 21162 ms
> >>>     [00007435] map_write_unmap took: 16982 ms
> >>>     [00007488] map_write unmap took: 13 ms^C
> >>>
> >>> I ran a bisect, which identified following commit as first bad one:
> >>>     dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> >>>
> >>> I can also make the issue go away with following change:
> >>> diff --git a/mm/mmap.c b/mm/mmap.c
> >>> index 330f12c17fa1..13ce465740e2 100644
> >>> --- a/mm/mmap.c
> >>> +++ b/mm/mmap.c
> >>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
> >>>    SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
> >>>    {
> >>>           profile_munmap(addr);
> >>> -       return __vm_munmap(addr, len, true);
> >>> +       return __vm_munmap(addr, len, false);
> >>>    }
> >>>
> >>> # cat /proc/cpuinfo  | head
> >>> processor       : 0
> >>> BogoMIPS        : 40.00
> >>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
> >>> asimdrdm
> >>> CPU implementer : 0x51
> >>> CPU architecture: 8
> >>> CPU variant     : 0x0
> >>> CPU part        : 0xc00
> >>> CPU revision    : 1
> >>>
> >>> # numactl -H
> >>> available: 1 nodes (0)
> >>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> >>> 23
> >>> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
> >>> node 0 size: 97938 MB
> >>> node 0 free: 95732 MB
> >>> node distances:
> >>> node   0
> >>>     0:  10
> >>>
> >>> Regards,
> >>> Jan
> >>>
> >>> [1]
> >>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
> >>> [2]
> >>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config
> >>
> 
>