Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Yang Shi <yang.shi@linux.alibaba.com>,
	linux-arch <linux-arch@vger.kernel.org>
Cc: Yu Xu <xuyu@linux.alibaba.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	 Hillf Danton <hdanton@sina.com>, Hugh Dickins <hughd@google.com>,
	Josef Bacik <josef@toxicpanda.com>,
	 "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Linux-MM <linux-mm@kvack.org>,
	 mm-commits@vger.kernel.org, Will Deacon <will.deacon@arm.com>,
	 Matthew Wilcox <willy@infradead.org>
Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
Date: Mon, 27 Jul 2020 11:37:42 -0700
Message-ID: <CAHk-=wha6f0gF1SJg96R77h0oTuc_oO7-37wD=mYGy6TyJOwbQ@mail.gmail.com> (raw)
In-Reply-To: <eb1f5cb4-7c3d-df42-f4aa-804e12df45e2@linux.alibaba.com>

[ Adding linux-arch, just to make other architectures aware of this issue too.

  We have a "flush_tlb_fix_spurious_fault()" thing to take care of the
"TLB may contain stale entries, we can't take the same fault over and
over again" situation.

  On x86, it's a no-op, because x86 doesn't do that. x86 will re-walk
the page tables - or possibly just always invalidate the faulting TLB
entry - before taking a fault, so there can be no long-term stale
TLB's.

  Other architectures may or may not need explicit "invalidate this
TLB entry, because if you make no changes to the page tables, I'll
just otherwise take this fault again. Forever". That is what
"flush_tlb_fix_spurious_fault()" does.

  NOTE! One reason for a stale TLB entry is that another CPU has
already done the change, and is just _about_ to flush the TLB, but the
hardware took the fault before it did so. The code is under the page
table lock, but the hardware fault handler doesn't know or care. So by
the time we get to "flush_tlb_fix_spurious_fault()", we _will_ have
synchronized (because we took the page table lock), and it's entirely
possible that the architecture thus has nothing to do. Make it a
no-op.

  The other reason for a stale TLB entry is if you don't do the
cross-CPU flush for "minor" events that don't matter (ie turning
things dirty, things like that). Rather than flush the TLB, you _want_
the other CPU to take the fault in the (presumabl;y unlikely) case
that it had that old TLB entry in the first place, and thought _it_
needed to do mark it dirty.

  Anyway, theres' a reason for "flush_tlb_fix_spurious_fault()", but
not all architectures need it.

  HOWEVER.

  On architectures that don't explicitly define it, it falls back to a
default of "flush_tlb_page()", which sounds sane, but in fact is
completely insane and horribly horribly wrong.

  It's completely insane and horribly wrong, because that fallback
predates the "everybody is SMP" days. On UP, it's fine and sane.

  But on SMP, it's absolutely horrendously bad. Because
flush_tlb_fix_spurious_fault() should not do any cross-CPU
invalidates.

  It looks like arm64 got this nasty performance problem because of
this all, with the cross-CPU invalidates being insanely expensive, and
completely pointless  - and easy to hit in some circumstances.

  It looks like powerpc people at least thought about this, and only
do it if there is a coprocessor. Which sounds a bit confused, but I
don't know the rules.

  It looks like a lot of others are ok mainly because they don't do
SMP, or they don't have the kinds of loads where this matters.

  But I wanted to cc the arch mailing list, to make people more aware
of it. And we *should* change the default. It shouldn't be
"flush_tlb_page()". It _should_ be "local_flush_tlb_page()", but we
don't have that, although many architectures implement something like
that as part of their SMP invalidation support ]

On Mon, Jul 27, 2020 at 11:04 AM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> It looks Linus's patch has better data. It seems sane to me since
> Catalin's patch still needs flush TLB in the shared domain.

Well, my patch as posted never built at all, I think.

Looking back at that patch, I used FAULT_FLAG_RETRY. But that's not
the correct name for any of the bits.

So you must have fixed it. Did you make it use "FAULT_FLAG_TRIED"?
Because that's the right bit - don't flush if this is actually the
second (or more) attempt.

But I'm a bit worried that you would have used one of the other bits
(FAULT_FLAG_ALLOW_RETRY or FAULT_FLAG_RETRY_NOWAIT), and that would be
wrong. Those get set on the first attempt to say "you _may_ retry",
but they get set on the first one.

That just shows how much I tested the patch I sent out. It was
whitespace-damaged on purpose, but I still want to check.

The "FAULT_FLAG_TRIED" bit I believe is reasonable to test. That one
literally says "I've gone through this once already, don't bother with
spurious faults".  But I don't think it triggers much in practice. We
seldom actually retry faults, it needs a page that we actually start
IO on (and dropped the mmap lock for) to happen. It wouldn't happen on
the "turn existing page dirty" case, for example.

The "FAULT_FLAG_WRITE" bit is what we test right now. I think it's
wrong. I think it is a "this happens to work" bit, and cuts down on a
lot of common cases, by simply skipping something that might be needed
but basically never is.

So I think a lot of this is dodgy. It doesn't matter on x86, and
nobody cared. Because x86 will always re-walk the page tables before
taking an architectural fault (the same way it walks them for
dirty/accessed bit updates - you could think of x86 as doing all the
things everybody else does in software, they just do in the hw walker
micro-fault logic instead).

A local TLB invalidate of a single virtual address should be basically
free. We're talking single cycles kind of free. The problem here isn't
the flush_tlb_fix_spurious_fault() itself, the problem here is that
arm64 (and pretty much everybody else who uses the default fallback)
does something horribly horribly wrong, and doesn't do the free
version.

               Linus


  reply index

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-24  4:14 incoming Andrew Morton
2020-07-24  4:15 ` [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault Andrew Morton
2020-07-24  4:38   ` Yang Shi
2020-07-24  4:56     ` Andrew Morton
2020-07-24 19:27   ` Linus Torvalds
2020-07-24 20:22     ` Linus Torvalds
2020-07-25  0:36       ` Yang Shi
2020-07-25  1:29         ` Linus Torvalds
2020-07-25 15:58           ` Catalin Marinas
2020-07-28  9:22             ` Will Deacon
2020-07-28  9:39               ` Catalin Marinas
2020-07-28 10:07                 ` Yu Xu
2020-07-28 11:46                   ` Catalin Marinas
2020-07-28 10:21                 ` Will Deacon
2020-07-28 18:28                 ` Linus Torvalds
2020-07-27 17:52           ` Yang Shi
2020-07-27 18:04             ` Linus Torvalds
2020-07-27 18:42               ` Catalin Marinas
2020-07-27 20:56                 ` Linus Torvalds
2020-07-27 22:34               ` Yang Shi
2020-07-27  7:31       ` Yu Xu
2020-07-27 11:05         ` Catalin Marinas
2020-07-27 17:01           ` Linus Torvalds
2020-07-28 11:19             ` Catalin Marinas
2020-07-27 17:12           ` Yu Xu
2020-07-27 18:04             ` Yang Shi
2020-07-27 18:37               ` Linus Torvalds [this message]
2020-07-27 22:43                 ` Yang Shi
2020-07-28  0:38                   ` Linus Torvalds
2020-07-28  0:13                 ` Yu Xu
2020-07-28 10:53                 ` Nicholas Piggin
2020-07-28 19:02                   ` Linus Torvalds
2020-07-28 22:53                     ` Nicholas Piggin
2020-07-29 13:58                       ` Michael Ellerman
2020-07-28  6:41             ` Yu Xu
2020-07-24  4:15 ` [patch 02/15] mm/mmap.c: close race between munmap() and expand_upwards()/downwards() Andrew Morton
2020-07-24  4:15 ` [patch 03/15] vfs/xattr: mm/shmem: kernfs: release simple xattr entry in a right way Andrew Morton
2020-07-24  4:15 ` [patch 04/15] mm: initialize return of vm_insert_pages Andrew Morton
2020-07-24  4:15 ` [patch 05/15] mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages() Andrew Morton
2020-07-24  4:15 ` [patch 06/15] mm/memcg: fix refcount error while moving and swapping Andrew Morton
2020-07-24 13:41   ` Alex Shi
2020-07-24  4:15 ` [patch 07/15] mm: memcg/slab: fix memory leak at non-root kmem_cache destroy Andrew Morton
2020-07-24  4:15 ` [patch 08/15] mm/hugetlb: avoid hardcoding while checking if cma is enabled Andrew Morton
2020-07-24  4:15 ` [patch 09/15] khugepaged: fix null-pointer dereference due to race Andrew Morton
2020-07-24  4:15 ` [patch 10/15] mailmap: add entry for Mike Rapoport Andrew Morton
2020-07-24  4:15 ` [patch 11/15] squashfs: fix length field overlap check in metadata reading Andrew Morton
2020-07-24  4:15 ` [patch 12/15] scripts/decode_stacktrace: strip basepath from all paths Andrew Morton
2020-07-24  4:15 ` [patch 13/15] io-mapping: indicate mapping failure Andrew Morton
2020-07-24  4:15 ` [patch 14/15] MAINTAINERS: add KCOV section Andrew Morton
2020-07-24  4:15 ` [patch 15/15] scripts/gdb: fix lx-symbols 'gdb.error' while loading modules Andrew Morton
2020-07-28  1:19 ` mmotm 2020-07-27-18-18 uploaded Andrew Morton
2020-07-28  2:14   ` Stephen Rothwell
2020-07-28  3:22   ` mmotm 2020-07-27-18-18 uploaded (drivers/scsi/ufs/: SCSI_UFS_EXYNOS) Randy Dunlap
2020-07-28  8:23     ` Alim Akhtar
2020-07-28 12:33   ` mmotm 2020-07-27-18-18 uploaded (mm/page_alloc.c) Randy Dunlap
2020-07-28 21:55     ` Andrew Morton
2020-07-28 22:20       ` Stephen Rothwell
2020-07-28 22:31         ` Andrew Morton
2020-07-29 14:18           ` Michael S. Tsirkin
2020-07-29 14:38             ` David Hildenbrand
2020-07-29 16:14               ` David Hildenbrand
2020-07-29 17:29                 ` Randy Dunlap
2020-07-28 22:39       ` Randy Dunlap
2020-07-29  1:43         ` Nathan Chancellor
2020-07-29  1:44         ` Andrew Morton
2020-07-29  2:04           ` Randy Dunlap
2020-07-29 14:09           ` make oldconfig (Re: mmotm 2020-07-27-18-18 uploaded (mm/page_alloc.c)) Alexey Dobriyan
2020-07-31 23:46 ` mmotm 2020-07-31-16-45 uploaded Andrew Morton
2020-08-01  5:24   ` mmotm 2020-07-31-16-45 uploaded (drivers/staging/vc04_services/) Randy Dunlap

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=wha6f0gF1SJg96R77h0oTuc_oO7-37wD=mYGy6TyJOwbQ@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=hughd@google.com \
    --cc=josef@toxicpanda.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=will.deacon@arm.com \
    --cc=willy@infradead.org \
    --cc=xuyu@linux.alibaba.com \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git