From: Jan Stancek <jstancek@redhat.com>
To: Yang Shi <yang.shi@linux.alibaba.com>
Cc: will deacon <will.deacon@arm.com>,
linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
kirill@shutemov.name, willy@infradead.org,
kirill shutemov <kirill.shutemov@linux.intel.com>,
vbabka@suse.cz, Andrea Arcangeli <aarcange@redhat.com>,
akpm@linux-foundation.org, Waiman Long <longman@redhat.com>,
Mel Gorman <mgorman@techsingularity.net>,
Rik van Riel <riel@surriel.com>,
catalin marinas <catalin.marinas@arm.com>,
Jan Stancek <jstancek@redhat.com>
Subject: Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Date: Tue, 7 May 2019 13:19:28 -0400 (EDT) [thread overview]
Message-ID: <2058828796.21479120.1557249568244.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <3d0843fa-1a34-1d5a-ca4d-abe4032bad8b@linux.alibaba.com>
----- Original Message -----
>
>
> On 5/7/19 4:51 AM, Jan Stancek wrote:
> > ----- Original Message -----
> >>
> >> On 5/6/19 2:35 PM, Jan Stancek wrote:
> >>> ----- Original Message -----
> >>>> On 5/5/19 7:10 AM, Jan Stancek wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm seeing userspace program getting stuck on aarch64, on kernels 4.20
> >>>>> and
> >>>>> newer.
> >>>>> It stalls from seconds to hours.
> >>>>>
> >>>>> I have simplified it to following scenario (reproducer linked below
> >>>>> [1]):
> >>>>> while (1):
> >>>>> spawn Thread 1: mmap, write, munmap
> >>>>> spawn Thread 2: <nothing>
> >>>>>
> >>>>> Thread 1 is sporadically getting stuck on write to mapped area.
> >>>>> User-space
> >>>>> is not
> >>>>> moving forward - stdout output stops. Observed CPU usage is however
> >>>>> 100%.
> >>>>>
> >>>>> At this time, kernel appears to be busy handling page faults (~700k per
> >>>>> second):
> >>>>>
> >>>>> # perf top -a -g
> >>>>> - 98.97% 8.30% a.out [.] map_write_unmap
> >>>>> - 23.52% map_write_unmap
> >>>>> - 24.29% el0_sync
> >>>>> - 10.42% do_mem_abort
> >>>>> - 17.81% do_translation_fault
> >>>>> - 33.01% do_page_fault
> >>>>> - 56.18% handle_mm_fault
> >>>>> 40.26% __handle_mm_fault
> >>>>> 2.19% __ll_sc___cmpxchg_case_acq_4
> >>>>> 0.87% mem_cgroup_from_task
> >>>>> - 6.18% find_vma
> >>>>> 5.38% vmacache_find
> >>>>> 1.35% __ll_sc___cmpxchg_case_acq_8
> >>>>> 1.23% __ll_sc_atomic64_sub_return_release
> >>>>> 0.78% down_read_trylock
> >>>>> 0.93% do_translation_fault
> >>>>> + 8.30% thread_start
> >>>>>
> >>>>> # perf stat -p 8189 -d
> >>>>> ^C
> >>>>> Performance counter stats for process id '8189':
> >>>>>
> >>>>> 984.311350 task-clock (msec) # 1.000 CPUs
> >>>>> utilized
> >>>>> 0 context-switches # 0.000 K/sec
> >>>>> 0 cpu-migrations # 0.000 K/sec
> >>>>> 723,641 page-faults # 0.735 M/sec
> >>>>> 2,559,199,434 cycles # 2.600 GHz
> >>>>> 711,933,112 instructions # 0.28 insn
> >>>>> per
> >>>>> cycle
> >>>>> <not supported> branches
> >>>>> 757,658 branch-misses
> >>>>> 205,840,557 L1-dcache-loads # 209.121 M/sec
> >>>>> 40,561,529 L1-dcache-load-misses # 19.71% of all
> >>>>> L1-dcache hits
> >>>>> <not supported> LLC-loads
> >>>>> <not supported> LLC-load-misses
> >>>>>
> >>>>> 0.984454892 seconds time elapsed
> >>>>>
> >>>>> With some extra traces, it appears looping in page fault for same
> >>>>> address,
> >>>>> over and over:
> >>>>> do_page_fault // mm_flags: 0x55
> >>>>> __do_page_fault
> >>>>> __handle_mm_fault
> >>>>> handle_pte_fault
> >>>>> ptep_set_access_flags
> >>>>> if (pte_same(pte, entry)) // pte: e8000805060f53,
> >>>>> entry:
> >>>>> e8000805060f53
> >>>>>
> >>>>> I had traces in mmap() and munmap() as well, they don't get hit when
> >>>>> reproducer
> >>>>> hits the bad state.
> >>>>>
> >>>>> Notes:
> >>>>> - I'm not able to reproduce this on x86.
> >>>>> - Attaching GDB or strace immediatelly recovers application from stall.
> >>>>> - It also seems to recover faster when system is busy with other tasks.
> >>>>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
> >>>>> - Turning off THP makes no difference.
> >>>>> - Reproducer [1] usually hits it within ~minute on HW described below.
> >>>>> - Longman mentioned that "When the rwsem becomes reader-owned, it
> >>>>> causes
> >>>>> all the spinning writers to go to sleep adding wakeup latency to
> >>>>> the time required to finish the critical sections", but this looks
> >>>>> like busy loop, so I'm not sure if it's related to rwsem issues
> >>>>> identified
> >>>>> in:
> >>>>> https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
> >>>> It sounds possible to me. What the optimization done by the commit ("mm:
> >>>> mmap: zap pages with read mmap_sem in munmap") is to downgrade write
> >>>> rwsem to read when zapping pages and page table in munmap() after the
> >>>> vmas have been detached from the rbtree.
> >>>>
> >>>> So the mmap(), which is writer, in your test may steal the lock and
> >>>> execute with the munmap(), which is the reader after the downgrade, in
> >>>> parallel to break the mutual exclusion.
> >>>>
> >>>> In this case, the parallel mmap() may map to the same area since vmas
> >>>> have been detached by munmap(), then mmap() may create the complete same
> >>>> vmas, and page fault happens on the same vma at the same address.
> >>>>
> >>>> I'm not sure why gdb or strace could recover this, but they use ptrace
> >>>> which may acquire mmap_sem to break the parallel inadvertently.
> >>>>
> >>>> May you please try Waiman's patch to see if it makes any difference?
> >>> I don't see any difference in behaviour after applying:
> >>> [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count
> >>> before increment
> >>> Issue is still easily reproducible for me.
> >>>
> >>> I'm including output of mem_abort_decode() / show_pte() for sample PTE,
> >>> that
> >>> I see in page fault loop. (I went through all bits, but couldn't find
> >>> anything invalid about it)
> >>>
> >>> mem_abort_decode: Mem abort info:
> >>> mem_abort_decode: ESR = 0x92000047
> >>> mem_abort_decode: Exception class = DABT (lower EL), IL = 32 bits
> >>> mem_abort_decode: SET = 0, FnV = 0
> >>> mem_abort_decode: EA = 0, S1PTW = 0
> >>> mem_abort_decode: Data abort info:
> >>> mem_abort_decode: ISV = 0, ISS = 0x00000047
> >>> mem_abort_decode: CM = 0, WnR = 1
> >>> show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp =
> >>> 0000000067027567
> >>> show_pte: [0000ffff6dff0000] pgd=000000176bae0003
> >>> show_pte: , pud=000000176bae0003
> >>> show_pte: , pmd=000000174ad60003
> >>> show_pte: , pte=00e80008023a0f53
> >>> show_pte: , pte_pfn: 8023a
> >>>
> >>> >>> print bin(0x47)
> >>> 0b1000111
> >>>
> >>> Per D12-2779 (ARM Architecture Reference Manual),
> >>> ISS encoding for an exception from an Instruction Abort:
> >>> IFSC, bits [5:0], Instruction Fault Status Code
> >>> 0b000111 Translation fault, level 3
> >>>
> >>> ---
> >>>
> >>> My theory is that TLB is getting broken.
> > Theory continued:
> >
> > unmap_region() is batching updates to TLB (for vmas and page tables).
> > And at the same time another thread handles page fault for same mm,
> > which increases "tlb_flush_pending".
> >
> > tlb_finish_mmu() called from unmap_region() will thus set 'force = 1'.
> > And arch_tlb_finish_mmu() will in turn reset TLB range, presumably making
> > it smaller then it would be if force == 0.
> >
> > Change below appears to fix it:
> >
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index f2f03c655807..a4cef21bd62b 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -93,7 +93,7 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
> > struct mmu_gather_batch *batch, *next;
> >
> > if (force) {
> > - __tlb_reset_range(tlb);
> > __tlb_adjust_range(tlb, start, end - start);
>
> I don't get why the change could fix it?
My guess is that reset clears "tlb->freed_tables", which changes how
tlb_flush() operates, see "bool last_level = !tlb->freed_tables;" in
arch/arm64/include/asm/tlb.h. Maybe that doesn't clear some intermediate
entries? No clue.
If I let it reset the range, but preserve "freed_tables", it also
seems to solve the problem:
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index f2f03c655807..17fb0d7edc03 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -93,8 +93,20 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
struct mmu_gather_batch *batch, *next;
if (force) {
- __tlb_reset_range(tlb);
+ if (tlb->fullmm) {
+ tlb->start = tlb->end = ~0;
+ } else {
+ tlb->start = TASK_SIZE;
+ tlb->end = 0;
+ }
__tlb_adjust_range(tlb, start, end - start);
> __tlb_reset_range() just reset
> start and end to TASK_SIZE and 0, then __tlb_adjust_range() set proper
> start and end. I don't get why "force" flush smaller range?
I'm still trying to understand this part. It's actually not smaller, but it changes:
unmap_region()
# vm_start: ffff49bd0000 vm_end: ffff49be0000
...
# tlb.start, tlb.end: 1000000000000 0
free_pgtables()
# tlb.start, tlb.end: ffff40000000 ffff40010000
tlb_finish_mmu()
arch_tlb_finish_mmu()
# will see force == 1
# resets tlb.start, tlb.end to: ffff49bd0000 ffff49be0000
>
> > }
> >
> >>> I made a dummy kernel module that exports debugfs file, which on read
> >>> triggers:
> >>> flush_tlb_all();
> >>>
> >>> Any time reproducer stalls and I read debugfs file, it recovers
> >>> immediately and resumes printing to stdout.
> >> That commit doesn't change anything about TLB flush, just move zapping
> >> pages under read mmap_sem as what MADV_DONTNEED does.
> >>
> >> I don't have aarch64 board to reproduce and debug it. And, I'm not
> >> familiar with aarch64 architecture either. But, some history told me the
> >> parallel zapping page may run into stale TLB and defer a flush meaning
> >> that this call may observe pte_none and fails to flush the TLB. But,
> >> this has been solved by commit 56236a59556c ("mm: refactor TLB gathering
> >> API") and 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss
> >> problem").
> >>
> >> For more detail, please refer to commit 4647706ebeee ("mm: always flush
> >> VMA ranges affected by zap_page_range"). Copied Mel and Rik in this
> >> thread. Also added Will Deacon and Catalin Marinas, who are aarch64
> >> maintainers, in this loop
> > Thanks
> >
> >> But, your test (triggering TLB flush) does demonstrate TLB flush is
> >> *not* done properly at some point as expected for aarch64. Could you
> >> please give the below patch a try?
> > Your patch also fixes my reproducer.
>
> Thanks for testing it.
>
> >
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index ab650c2..ef41ad5 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,
> >>
> >> mmu_notifier_range_init(&range, vma->vm_mm, start_addr,
> >> end_addr);
> >> mmu_notifier_invalidate_range_start(&range);
> >> - for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> >> + for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
> >> unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> >> + flush_tlb_range(vma, start_addr, end_addr);
> >> + }
> >> mmu_notifier_invalidate_range_end(&range);
> >> }
> >>
> >>>>> - I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3
> >>>>> and
> >>>>> Qualcomm 65-LA-115-151.
> >>>>> I can reproduce it on both with v5.1-rc7. It's easier to reproduce
> >>>>> on latter one (for longer periods of time), which has 46 CPUs.
> >>>>> - Sample output of reproducer on otherwise idle system:
> >>>>> # ./a.out
> >>>>> [00000314] map_write_unmap took: 26305 ms
> >>>>> [00000867] map_write_unmap took: 13642 ms
> >>>>> [00002200] map_write_unmap took: 44237 ms
> >>>>> [00002851] map_write_unmap took: 992 ms
> >>>>> [00004725] map_write_unmap took: 542 ms
> >>>>> [00006443] map_write_unmap took: 5333 ms
> >>>>> [00006593] map_write_unmap took: 21162 ms
> >>>>> [00007435] map_write_unmap took: 16982 ms
> >>>>> [00007488] map_write unmap took: 13 ms^C
> >>>>>
> >>>>> I ran a bisect, which identified following commit as first bad one:
> >>>>> dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> >>>>>
> >>>>> I can also make the issue go away with following change:
> >>>>> diff --git a/mm/mmap.c b/mm/mmap.c
> >>>>> index 330f12c17fa1..13ce465740e2 100644
> >>>>> --- a/mm/mmap.c
> >>>>> +++ b/mm/mmap.c
> >>>>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
> >>>>> SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
> >>>>> {
> >>>>> profile_munmap(addr);
> >>>>> - return __vm_munmap(addr, len, true);
> >>>>> + return __vm_munmap(addr, len, false);
> >>>>> }
> >>>>>
> >>>>> # cat /proc/cpuinfo | head
> >>>>> processor : 0
> >>>>> BogoMIPS : 40.00
> >>>>> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
> >>>>> asimdrdm
> >>>>> CPU implementer : 0x51
> >>>>> CPU architecture: 8
> >>>>> CPU variant : 0x0
> >>>>> CPU part : 0xc00
> >>>>> CPU revision : 1
> >>>>>
> >>>>> # numactl -H
> >>>>> available: 1 nodes (0)
> >>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> >>>>> 23
> >>>>> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
> >>>>> node 0 size: 97938 MB
> >>>>> node 0 free: 95732 MB
> >>>>> node distances:
> >>>>> node 0
> >>>>> 0: 10
> >>>>>
> >>>>> Regards,
> >>>>> Jan
> >>>>>
> >>>>> [1]
> >>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
> >>>>> [2]
> >>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config
> >>
>
>
next prev parent reply other threads:[~2019-05-07 17:19 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <820667266.20994189.1557058281210.JavaMail.zimbra@redhat.com>
2019-05-05 14:10 ` [bug] aarch64: userspace stalls on page fault after dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Jan Stancek
2019-05-06 19:04 ` Yang Shi
2019-05-06 21:35 ` Jan Stancek
2019-05-06 23:07 ` Yang Shi
2019-05-06 23:15 ` Yang Shi
2019-05-07 11:51 ` Jan Stancek
2019-05-07 16:42 ` Yang Shi
2019-05-07 17:19 ` Jan Stancek [this message]
2019-05-07 19:55 ` Yang Shi
2019-05-07 21:04 ` Yang Shi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2058828796.21479120.1557249568244.JavaMail.zimbra@redhat.com \
--to=jstancek@redhat.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=catalin.marinas@arm.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kirill@shutemov.name \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-mm@kvack.org \
--cc=longman@redhat.com \
--cc=mgorman@techsingularity.net \
--cc=riel@surriel.com \
--cc=vbabka@suse.cz \
--cc=will.deacon@arm.com \
--cc=willy@infradead.org \
--cc=yang.shi@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).