Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Yang Shi <yang.shi@linux.alibaba.com>
Cc: linux-arch <linux-arch@vger.kernel.org>,
	Yu Xu <xuyu@linux.alibaba.com>,
	 Catalin Marinas <catalin.marinas@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Hillf Danton <hdanton@sina.com>, Hugh Dickins <hughd@google.com>,
	 Josef Bacik <josef@toxicpanda.com>,
	 "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Linux-MM <linux-mm@kvack.org>,
	 mm-commits@vger.kernel.org, Will Deacon <will.deacon@arm.com>,
	 Matthew Wilcox <willy@infradead.org>
Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
Date: Mon, 27 Jul 2020 17:38:01 -0700
Message-ID: <CAHk-=wg8vswrKBhKSzfyLJ-UVENeydK-RMVEHT+puDPyWr+bnQ@mail.gmail.com> (raw)
In-Reply-To: <89c6671a-39ba-d1cc-9bac-2e6717916220@linux.alibaba.com>

On Mon, Jul 27, 2020 at 3:43 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> With the commit ("mm: drop mmap_sem before calling balance_dirty_pages()
> in write fault") the retried fault may happen much more frequently than
> before since it would drop mmap lock as long as dirty throttling happens.

Sure. And that probably explains why it shows up as a regression.

That said, the fact that it showed up as a regression that clearly
probably means that the whole spurious TLB flush has always been a
low-grade problem, and the extra retries just made it much more
noticeable because now there was a change to it.

The fact that we have that (very questionable) optimization to only do
it for writes kind of reinforces that notion - it has happened before,
it's just never been fixed properly, and it's just never been
noticeable on most machines because this is all a no-op on x86.

I think Catalin's patch - with some way to fix the problem with KVM -
is the way to go.

That said, testing FAULT_FLAG_TRIED and suppressing the spurious TLB
fill for that case is certainly always safe. At worst, we'll take
another fault, and then do the TLB flush at _that_ point when not
retrying.

So it's the FAULT_FLAG_WRITE test that I think is bogus, or at least
should be protected by some architecture decision (with a comment
about why it's ok for that architecture, ie the ARM kind of "old PTE's
will never be in the TLB, and if it's not a write fault we know it
doesn't depend on the dirty bit either")

Of course, it may be that on every architecture that requires SW
accessed bits, the "old PTE's will never be in the TLB is true".

Except I think I know at least one architecture where that isn't true.
On alpha, the way the acccessed bit works is exactly the same way the
dirty bit works - except it's done for reads, instead of writes.

So on at least one architecture, access faults and dirty faults are
100% equivalent, just using read/write bits respectively.

Of course, alpha doesn't really matter any more. But it's an example
of an architecture where "old" does not necessarily mean "cannot be in
the TLB", and where testing for FAULT_FLAG_WRITE looks buggy.

Again: I think in practice, it's really *really* hard to hit the
problem with accessed bits, unlike dirty bits. Normally, PTE's are all
instantiated young if they are in the TLB. You have to kind of work at
it to get an old PTE _and_ then hit the "now access it exactly at the
same time from two different CPU's, and watch one CPU keep taking page
faults forever because it never flushes its TLB entry".

Of course, it is so long since I worked with alpha that maybe there's
some other reason this can't happen. Like "PAL-code always flushes the
TLB entry of the faulting address".

Which all hardware should do, dammit. It's all kinds of stupid to
cache a faulting TLB entry. The fault is thousands of times more
expensive than a reload would be, even if  it were intentional and
done repeatedly (which sounds like an insane thing to optimize for
anyway).

So one way to fix this problem would be to just specify that "every
pagefault handler _must_ flush the local-CPU TLB entry that the fault
happened for if the architecture doesn't already do that in hardware
or microcode".

And then we'd just remove the spurious TLB flush code entirely.

                     Linus


  reply index

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-24  4:14 incoming Andrew Morton
2020-07-24  4:15 ` [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault Andrew Morton
2020-07-24  4:38   ` Yang Shi
2020-07-24  4:56     ` Andrew Morton
2020-07-24 19:27   ` Linus Torvalds
2020-07-24 20:22     ` Linus Torvalds
2020-07-25  0:36       ` Yang Shi
2020-07-25  1:29         ` Linus Torvalds
2020-07-25 15:58           ` Catalin Marinas
2020-07-28  9:22             ` Will Deacon
2020-07-28  9:39               ` Catalin Marinas
2020-07-28 10:07                 ` Yu Xu
2020-07-28 11:46                   ` Catalin Marinas
2020-07-28 10:21                 ` Will Deacon
2020-07-28 18:28                 ` Linus Torvalds
2020-07-27 17:52           ` Yang Shi
2020-07-27 18:04             ` Linus Torvalds
2020-07-27 18:42               ` Catalin Marinas
2020-07-27 20:56                 ` Linus Torvalds
2020-07-27 22:34               ` Yang Shi
2020-07-27  7:31       ` Yu Xu
2020-07-27 11:05         ` Catalin Marinas
2020-07-27 17:01           ` Linus Torvalds
2020-07-28 11:19             ` Catalin Marinas
2020-07-27 17:12           ` Yu Xu
2020-07-27 18:04             ` Yang Shi
2020-07-27 18:37               ` Linus Torvalds
2020-07-27 22:43                 ` Yang Shi
2020-07-28  0:38                   ` Linus Torvalds [this message]
2020-07-28  0:13                 ` Yu Xu
2020-07-28 10:53                 ` Nicholas Piggin
2020-07-28 19:02                   ` Linus Torvalds
2020-07-28 22:53                     ` Nicholas Piggin
2020-07-29 13:58                       ` Michael Ellerman
2020-07-28  6:41             ` Yu Xu
2020-07-24  4:15 ` [patch 02/15] mm/mmap.c: close race between munmap() and expand_upwards()/downwards() Andrew Morton
2020-07-24  4:15 ` [patch 03/15] vfs/xattr: mm/shmem: kernfs: release simple xattr entry in a right way Andrew Morton
2020-07-24  4:15 ` [patch 04/15] mm: initialize return of vm_insert_pages Andrew Morton
2020-07-24  4:15 ` [patch 05/15] mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages() Andrew Morton
2020-07-24  4:15 ` [patch 06/15] mm/memcg: fix refcount error while moving and swapping Andrew Morton
2020-07-24 13:41   ` Alex Shi
2020-07-24  4:15 ` [patch 07/15] mm: memcg/slab: fix memory leak at non-root kmem_cache destroy Andrew Morton
2020-07-24  4:15 ` [patch 08/15] mm/hugetlb: avoid hardcoding while checking if cma is enabled Andrew Morton
2020-07-24  4:15 ` [patch 09/15] khugepaged: fix null-pointer dereference due to race Andrew Morton
2020-07-24  4:15 ` [patch 10/15] mailmap: add entry for Mike Rapoport Andrew Morton
2020-07-24  4:15 ` [patch 11/15] squashfs: fix length field overlap check in metadata reading Andrew Morton
2020-07-24  4:15 ` [patch 12/15] scripts/decode_stacktrace: strip basepath from all paths Andrew Morton
2020-07-24  4:15 ` [patch 13/15] io-mapping: indicate mapping failure Andrew Morton
2020-07-24  4:15 ` [patch 14/15] MAINTAINERS: add KCOV section Andrew Morton
2020-07-24  4:15 ` [patch 15/15] scripts/gdb: fix lx-symbols 'gdb.error' while loading modules Andrew Morton
2020-07-28  1:19 ` mmotm 2020-07-27-18-18 uploaded Andrew Morton
2020-07-28  2:14   ` Stephen Rothwell
2020-07-28  3:22   ` mmotm 2020-07-27-18-18 uploaded (drivers/scsi/ufs/: SCSI_UFS_EXYNOS) Randy Dunlap
2020-07-28  8:23     ` Alim Akhtar
2020-07-28 12:33   ` mmotm 2020-07-27-18-18 uploaded (mm/page_alloc.c) Randy Dunlap
2020-07-28 21:55     ` Andrew Morton
2020-07-28 22:20       ` Stephen Rothwell
2020-07-28 22:31         ` Andrew Morton
2020-07-29 14:18           ` Michael S. Tsirkin
2020-07-29 14:38             ` David Hildenbrand
2020-07-29 16:14               ` David Hildenbrand
2020-07-29 17:29                 ` Randy Dunlap
2020-07-28 22:39       ` Randy Dunlap
2020-07-29  1:43         ` Nathan Chancellor
2020-07-29  1:44         ` Andrew Morton
2020-07-29  2:04           ` Randy Dunlap
2020-07-29 14:09           ` make oldconfig (Re: mmotm 2020-07-27-18-18 uploaded (mm/page_alloc.c)) Alexey Dobriyan
2020-07-31 23:46 ` mmotm 2020-07-31-16-45 uploaded Andrew Morton
2020-08-01  5:24   ` mmotm 2020-07-31-16-45 uploaded (drivers/staging/vc04_services/) Randy Dunlap

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=wg8vswrKBhKSzfyLJ-UVENeydK-RMVEHT+puDPyWr+bnQ@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=hughd@google.com \
    --cc=josef@toxicpanda.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=will.deacon@arm.com \
    --cc=willy@infradead.org \
    --cc=xuyu@linux.alibaba.com \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git