Date: Tue, 7 May 2019 13:19:28 -0400 (EDT)
From: Jan Stancek <jstancek@redhat.com>
To: Yang Shi
Cc: Will Deacon, linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
    kirill@shutemov.name, willy@infradead.org, Kirill Shutemov, vbabka@suse.cz,
    Andrea Arcangeli, akpm@linux-foundation.org, Waiman Long, Mel Gorman,
    Rik van Riel, Catalin Marinas, Jan Stancek
Message-ID: <2058828796.21479120.1557249568244.JavaMail.zimbra@redhat.com>
In-Reply-To: <3d0843fa-1a34-1d5a-ca4d-abe4032bad8b@linux.alibaba.com>
References: <1817839533.20996552.1557065445233.JavaMail.zimbra@redhat.com>
 <1928544225.21255545.1557178548494.JavaMail.zimbra@redhat.com>
 <2b2006bf-753b-c4b8-e9a2-fd27ae65fe14@linux.alibaba.com>
 <756571293.21386229.1557229889545.JavaMail.zimbra@redhat.com>
 <3d0843fa-1a34-1d5a-ca4d-abe4032bad8b@linux.alibaba.com>
Subject: Re: [bug] aarch64: userspace stalls on page fault after dd2283f2605e
 ("mm: mmap: zap pages with read mmap_sem in munmap")

----- Original Message -----
>
>
> On 5/7/19 4:51 AM, Jan Stancek wrote:
> > ----- Original Message -----
> >>
> >> On 5/6/19 2:35 PM, Jan Stancek wrote:
> >>> ----- Original Message -----
> >>>> On 5/5/19 7:10 AM, Jan Stancek wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm seeing a userspace program getting stuck on aarch64, on kernels
> >>>>> 4.20 and newer. It stalls from seconds to hours.
> >>>>>
> >>>>> I have simplified it to the following scenario (reproducer linked
> >>>>> below [1]):
> >>>>>   while (1):
> >>>>>     spawn Thread 1: mmap, write, munmap
> >>>>>     spawn Thread 2:
> >>>>>
> >>>>> Thread 1 is sporadically getting stuck on a write to the mapped area.
> >>>>> User-space is not moving forward - stdout output stops. Observed CPU
> >>>>> usage is however 100%.
> >>>>>
> >>>>> At this time, the kernel appears to be busy handling page faults
> >>>>> (~700k per second):
> >>>>>
> >>>>> # perf top -a -g
> >>>>> -   98.97%  8.30%  a.out  [.] map_write_unmap
> >>>>>    - 23.52% map_write_unmap
> >>>>>       - 24.29% el0_sync
> >>>>>          - 10.42% do_mem_abort
> >>>>>             - 17.81% do_translation_fault
> >>>>>                - 33.01% do_page_fault
> >>>>>                   - 56.18% handle_mm_fault
> >>>>>                        40.26% __handle_mm_fault
> >>>>>                         2.19% __ll_sc___cmpxchg_case_acq_4
> >>>>>                         0.87% mem_cgroup_from_task
> >>>>>                   - 6.18% find_vma
> >>>>>                        5.38% vmacache_find
> >>>>>                     1.35% __ll_sc___cmpxchg_case_acq_8
> >>>>>                     1.23% __ll_sc_atomic64_sub_return_release
> >>>>>                     0.78% down_read_trylock
> >>>>>               0.93% do_translation_fault
> >>>>>    + 8.30% thread_start
> >>>>>
> >>>>> # perf stat -p 8189 -d
> >>>>> ^C
> >>>>>  Performance counter stats for process id '8189':
> >>>>>
> >>>>>        984.311350      task-clock (msec)       #    1.000 CPUs utilized
> >>>>>                 0      context-switches        #    0.000 K/sec
> >>>>>                 0      cpu-migrations          #    0.000 K/sec
> >>>>>           723,641      page-faults             #    0.735 M/sec
> >>>>>     2,559,199,434      cycles                  #    2.600 GHz
> >>>>>       711,933,112      instructions            #    0.28  insn per cycle
> >>>>>                        branches
> >>>>>           757,658      branch-misses
> >>>>>       205,840,557      L1-dcache-loads         #  209.121 M/sec
> >>>>>        40,561,529      L1-dcache-load-misses   #   19.71% of all L1-dcache hits
> >>>>>                        LLC-loads
> >>>>>                        LLC-load-misses
> >>>>>
> >>>>>       0.984454892 seconds time elapsed
> >>>>>
> >>>>> With some extra traces, it appears to be looping in the page fault
> >>>>> handler for the same address, over and over:
> >>>>>   do_page_fault  // mm_flags: 0x55
> >>>>>     __do_page_fault
> >>>>>       __handle_mm_fault
> >>>>>         handle_pte_fault
> >>>>>           ptep_set_access_flags
> >>>>>             if (pte_same(pte, entry))  // pte: e8000805060f53, entry: e8000805060f53
> >>>>>
> >>>>> I had traces in mmap() and munmap() as well; they don't get hit when
> >>>>> the reproducer hits the bad state.
> >>>>>
> >>>>> Notes:
> >>>>> - I'm not able to reproduce this on x86.
> >>>>> - Attaching GDB or strace immediately recovers the application from stall.
> >>>>> - It also seems to recover faster when the system is busy with other tasks.
> >>>>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
> >>>>> - Turning off THP makes no difference.
> >>>>> - Reproducer [1] usually hits it within ~minute on HW described below.
> >>>>> - Longman mentioned that "When the rwsem becomes reader-owned, it causes
> >>>>>   all the spinning writers to go to sleep adding wakeup latency to
> >>>>>   the time required to finish the critical sections", but this looks
> >>>>>   like a busy loop, so I'm not sure if it's related to the rwsem issues
> >>>>>   identified in:
> >>>>>   https://lore.kernel.org/lkml/20190428212557.13482-2-longman@redhat.com/
> >>>> It sounds possible to me. What the commit ("mm: mmap: zap pages with
> >>>> read mmap_sem in munmap") does is downgrade the mmap_sem from write to
> >>>> read when zapping pages and page tables in munmap(), after the vmas
> >>>> have been detached from the rbtree.
> >>>>
> >>>> So the mmap(), which is a writer, in your test may steal the lock and
> >>>> execute in parallel with the munmap(), which is a reader after the
> >>>> downgrade, breaking the mutual exclusion.
> >>>>
> >>>> In this case, the parallel mmap() may map the same area since the vmas
> >>>> have been detached by munmap(); then mmap() may create the exact same
> >>>> vmas, and the page fault happens on the same vma at the same address.
> >>>> I'm not sure why gdb or strace could recover this, but they use ptrace,
> >>>> which may acquire mmap_sem and break the parallelism inadvertently.
> >>>>
> >>>> Could you please try Waiman's patch to see if it makes any difference?
> >>> I don't see any difference in behaviour after applying:
> >>>   [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count
> >>>   before increment
> >>> The issue is still easily reproducible for me.
> >>>
> >>> I'm including the output of mem_abort_decode() / show_pte() for a sample
> >>> PTE that I see in the page fault loop. (I went through all the bits, but
> >>> couldn't find anything invalid about it.)
> >>>
> >>>   mem_abort_decode: Mem abort info:
> >>>   mem_abort_decode:   ESR = 0x92000047
> >>>   mem_abort_decode:   Exception class = DABT (lower EL), IL = 32 bits
> >>>   mem_abort_decode:   SET = 0, FnV = 0
> >>>   mem_abort_decode:   EA = 0, S1PTW = 0
> >>>   mem_abort_decode: Data abort info:
> >>>   mem_abort_decode:   ISV = 0, ISS = 0x00000047
> >>>   mem_abort_decode:   CM = 0, WnR = 1
> >>>   show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp = 0000000067027567
> >>>   show_pte: [0000ffff6dff0000] pgd=000000176bae0003
> >>>   show_pte: , pud=000000176bae0003
> >>>   show_pte: , pmd=000000174ad60003
> >>>   show_pte: , pte=00e80008023a0f53
> >>>   show_pte: , pte_pfn: 8023a
> >>>
> >>>   >>> print bin(0x47)
> >>>   0b1000111
> >>>
> >>> Per D12-2779 (ARM Architecture Reference Manual),
> >>> ISS encoding for an exception from an Instruction Abort:
> >>>   IFSC, bits [5:0], Instruction Fault Status Code
> >>>   0b000111  Translation fault, level 3
> >>>
> >>> ---
> >>>
> >>> My theory is that the TLB is getting broken.
> > Theory continued:
> >
> > unmap_region() is batching updates to the TLB (for vmas and page tables),
> > and at the same time another thread handles a page fault for the same mm,
> > which increases "tlb_flush_pending".
> >
> > tlb_finish_mmu() called from unmap_region() will thus set 'force = 1'.
> > And arch_tlb_finish_mmu() will in turn reset the TLB range, presumably
> > making it smaller than it would be if force == 0.
> >
> > Change below appears to fix it:
> >
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index f2f03c655807..a4cef21bd62b 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -93,7 +93,7 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
> >  	struct mmu_gather_batch *batch, *next;
> >
> >  	if (force) {
> > -		__tlb_reset_range(tlb);
> >  		__tlb_adjust_range(tlb, start, end - start);
>
> I don't get why the change could fix it?

My guess is that the reset clears "tlb->freed_tables", which changes how
tlb_flush() operates, see "bool last_level = !tlb->freed_tables;" in
arch/arm64/include/asm/tlb.h. Maybe that doesn't clear some intermediate
entries? No clue.

If I let it reset the range, but preserve "freed_tables", it also seems
to solve the problem:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index f2f03c655807..17fb0d7edc03 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -93,8 +93,20 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
 	struct mmu_gather_batch *batch, *next;
 
 	if (force) {
-		__tlb_reset_range(tlb);
+		if (tlb->fullmm) {
+			tlb->start = tlb->end = ~0;
+		} else {
+			tlb->start = TASK_SIZE;
+			tlb->end = 0;
+		}
 		__tlb_adjust_range(tlb, start, end - start);

> __tlb_reset_range() just resets
> start and end to TASK_SIZE and 0, then __tlb_adjust_range() sets the proper
> start and end. I don't get why "force" would flush a smaller range?

I'm still trying to understand this part. It's actually not smaller, but it
changes:

  unmap_region()   # vm_start: ffff49bd0000 vm_end: ffff49be0000
  ...
  # tlb.start, tlb.end: 1000000000000 0
  free_pgtables()
  # tlb.start, tlb.end: ffff40000000 ffff40010000
  tlb_finish_mmu()
    arch_tlb_finish_mmu()
    # will see force == 1
    # resets tlb.start, tlb.end to: ffff49bd0000 ffff49be0000

> > 	}
> >
> >>> I made a dummy kernel module that exports a debugfs file, which on read
> >>> triggers:
> >>>   flush_tlb_all();
> >>>
> >>> Any time the reproducer stalls and I read the debugfs file, it recovers
> >>> immediately and resumes printing to stdout.
> >> That commit doesn't change anything about TLB flushing; it just moves
> >> zapping pages under the read mmap_sem, as MADV_DONTNEED does.
> >>
> >> I don't have an aarch64 board to reproduce and debug it, and I'm not
> >> familiar with the aarch64 architecture either. But some history told me
> >> that parallel zapping of pages may run into a stale TLB and defer a
> >> flush, meaning that this call may observe pte_none and fail to flush
> >> the TLB. But this has been solved by commit 56236a59556c ("mm: refactor
> >> TLB gathering API") and 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB
> >> flush miss problem").
> >>
> >> For more detail, please refer to commit 4647706ebeee ("mm: always flush
> >> VMA ranges affected by zap_page_range"). Copied Mel and Rik on this
> >> thread. Also added Will Deacon and Catalin Marinas, the aarch64
> >> maintainers, to this loop.
> > Thanks
> >
> >> But your test (triggering a TLB flush) does demonstrate that the TLB
> >> flush is *not* done properly at some point for aarch64, as expected.
> >> Could you please give the below patch a try?
> > Your patch also fixes my reproducer.
>
> Thanks for testing it.
>
> >
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index ab650c2..ef41ad5 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,
> >>
> >>  	mmu_notifier_range_init(&range, vma->vm_mm, start_addr, end_addr);
> >>  	mmu_notifier_invalidate_range_start(&range);
> >> -	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> >> +	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
> >>  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> >> +		flush_tlb_range(vma, start_addr, end_addr);
> >> +	}
> >>  	mmu_notifier_invalidate_range_end(&range);
> >>  }
> >>
> >>>>> - I tried 2 different aarch64 systems so far: APM X-Gene CPU Potenza A3
> >>>>>   and Qualcomm 65-LA-115-151.
> >>>>>   I can reproduce it on both with v5.1-rc7. It's easier to reproduce
> >>>>>   on the latter one (for longer periods of time), which has 46 CPUs.
> >>>>> - Sample output of reproducer on otherwise idle system:
> >>>>>   # ./a.out
> >>>>>   [00000314] map_write_unmap took: 26305 ms
> >>>>>   [00000867] map_write_unmap took: 13642 ms
> >>>>>   [00002200] map_write_unmap took: 44237 ms
> >>>>>   [00002851] map_write_unmap took: 992 ms
> >>>>>   [00004725] map_write_unmap took: 542 ms
> >>>>>   [00006443] map_write_unmap took: 5333 ms
> >>>>>   [00006593] map_write_unmap took: 21162 ms
> >>>>>   [00007435] map_write_unmap took: 16982 ms
> >>>>>   [00007488] map_write_unmap took: 13 ms^C
> >>>>>
> >>>>> I ran a bisect, which identified the following commit as the first bad one:
> >>>>>   dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> >>>>>
> >>>>> I can also make the issue go away with the following change:
> >>>>> diff --git a/mm/mmap.c b/mm/mmap.c
> >>>>> index 330f12c17fa1..13ce465740e2 100644
> >>>>> --- a/mm/mmap.c
> >>>>> +++ b/mm/mmap.c
> >>>>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
> >>>>>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
> >>>>>  {
> >>>>>  	profile_munmap(addr);
> >>>>> -	return __vm_munmap(addr, len, true);
> >>>>> +	return __vm_munmap(addr, len, false);
> >>>>>  }
> >>>>>
> >>>>> # cat /proc/cpuinfo | head
> >>>>> processor	: 0
> >>>>> BogoMIPS	: 40.00
> >>>>> Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm
> >>>>> CPU implementer	: 0x51
> >>>>> CPU architecture: 8
> >>>>> CPU variant	: 0x0
> >>>>> CPU part	: 0xc00
> >>>>> CPU revision	: 1
> >>>>>
> >>>>> # numactl -H
> >>>>> available: 1 nodes (0)
> >>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> >>>>> 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
> >>>>> node 0 size: 97938 MB
> >>>>> node 0 free: 95732 MB
> >>>>> node distances:
> >>>>> node   0
> >>>>>   0:  10
> >>>>>
> >>>>> Regards,
> >>>>> Jan
> >>>>>
> >>>>> [1]
> >>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
> >>>>> [2]
> >>>>> https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config
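The detail being debated above, that __tlb_reset_range() clears more than just the range, can be modeled in a few lines of standalone C. The struct, function names and TASK_SIZE value below are stand-ins that mirror the kernel code for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the relevant mmu_gather fields (not the kernel struct). */
struct tlb_gather {
	unsigned long start, end;
	bool fullmm;
	bool freed_tables;
};

#define TASK_SIZE (1UL << 48)   /* 48-bit VAs, matching the show_pte output */

/* Models __tlb_reset_range(): besides resetting the range to "empty"
 * (start > end), it also clears freed_tables - the step Jan's second
 * experiment preserves. */
static void tlb_reset_range(struct tlb_gather *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0UL;
	} else {
		tlb->start = TASK_SIZE;
		tlb->end = 0;
	}
	tlb->freed_tables = false;
}

/* Models __tlb_adjust_range(): grow the tracked range to cover [addr, addr+size). */
static void tlb_adjust_range(struct tlb_gather *tlb,
			     unsigned long addr, unsigned long size)
{
	tlb->start = addr < tlb->start ? addr : tlb->start;
	tlb->end = addr + size > tlb->end ? addr + size : tlb->end;
}
```

In this model, once free_pgtables() has set freed_tables, the force path's reset drops that flag before tlb_adjust_range() rebuilds the range, so a final flush keyed on "last_level = !freed_tables" would skip invalidating walk-cache entries for the freed page tables. The empty range (start = TASK_SIZE, end = 0) also explains the "1000000000000 0" value in the trace above.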