Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Catalin Marinas <catalin.marinas@arm.com>,
	linux-arch <linux-arch@vger.kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Hillf Danton <hdanton@sina.com>, Hugh Dickins <hughd@google.com>,
	 Josef Bacik <josef@toxicpanda.com>,
	 "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Linux-MM <linux-mm@kvack.org>,
	 mm-commits@vger.kernel.org, Will Deacon <will@kernel.org>,
	 Matthew Wilcox <willy@infradead.org>,
	Yu Xu <xuyu@linux.alibaba.com>
Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
Date: Mon, 27 Jul 2020 13:56:41 -0700
Message-ID: <CAHk-=wjX+DjY+ww5_SJ5UKJ3hUy-8S7HZPp6vj7Hbwc5zou3zQ@mail.gmail.com> (raw)
In-Reply-To: <20200727184239.GA21230@gaia>

[ The arch list is going to be missing some of the emails in this
thread, but they are all on lore:

      https://lore.kernel.org/linux-mm/20200727184239.GA21230@gaia/

   and I think the context is probably sufficient even without that. ]

On Mon, Jul 27, 2020 at 11:42 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> At least on arm64 (and arm32), old ptes are not cached in the TLB, so
> there is no need to flush if the only action was to make the pte young
> from old. However, this may not be the same on other architectures.
>
> Also not sure about races with making a pte old then young again, some
> CPU could get confused.

Hmm. I'd love to extend the interface (at the same time we fix the
bogus traditional default) to show what the old pte state was, but the
whole point is that we don't even know.

It is, by definition, gone.

We got a fault for something that is no longer the case, and we didn't
modify anything in the page tables. And so we know that the TLB didn't
- at the time of the fault - match what we now see in the page tables.

So we don't even know if we "made the pte young from old". Somebody
*else* did that part. Maybe two CPU's both hit the HW page table walk
at roughly the same time, both saw and old entry and triggered a sw
fault, one CPU then won the race to the page table spinlock, and
marked it young.

And then the other CPU comes along, says "ok, nothing seems to have
changed, it's a spurious fault as far as I can tell, now what?"

It *could* be that "old->young" transition. But it *could* also have
been that the other CPU did a write access and turned things writable
(in addition to turning it young). We don't even know.

So if arm doesn't cache old ptes in the TLB, then I guess for ARM, the
"only try to flush for write faults" is fine, because the only
possible stale bit in the TLB would be the dirty bit.

And _maybe_ that ends up being true on other architectures too. But it
does sound very very dodgy.

Maybe we should just pass in the fault information we have (ie just
give flush_tlb_fix_spurious_fault() the whole vmf pointer), and then
the architecture can make their own decision based on that.

So if the architecture can say "the only case that might be cached is
a non-dirty old PTE that I need to flush now because it's a write
fault, and not flushing it would possibly cause an endless loop", then
that test for

        if (vmf->flags & FAULT_FLAG_WRITE)

is the right thing.

NOTE! The vmf does have a bit that is called "orig_pte", and has the
comment "Value of PTE at the time of fault" associated with it. That
comment is bogus.

We don't _really_ know what the original pte was, and that "orig_pte"
is just the one we loaded fairly early, and before we took the page
table lock. We've made decisions based on the value, but we've also
already checked that after taking the page table lock, the pte still
matches.

So that vmf structure may be less useful than you'd think. The only
really useful information in there is likely just the address and the
fault flags.

Even the vma is by definition not really useful. The vma we looked up
may not be the same vma that the original hardware fault happened
with. If we took a fault, and some other CPU got around to do a mmap()
before we got the mmap semaphore in the fault handler, we'll have the
*new* vma, but the spurious fault might have come from another source
entirely.

But again - any *serious* page table updates we should have
synchronized against, and the other CPU will have done the TLB
shootdown etc. So we shouldn't need to do anything. The only thing
that matters is the trivial bits that _may_ have been changed without
bothering with a cross-CPU TLB flush.

So it's likely only dirty/accessed bits. But I really do have this
strong memory of us at least at some point deciding that we can avoid
it for some other "this operation only ever _adds_ permissions,
doesn't take them away" case.

I can't find that code, though, so it might be either early-onset
Alzheimer's, or some historical footnote that just isn't true any
longer.

That said, I *can* find places where we delay TLB flushes a _lot_. So
another CPU may be modifying the page tables, and the flush happens
much much later.

For example: look at fork(). We'll mark the source page table as being
read-only for COW purposes, but we'll delay the actual TLB flush to
long long after we did so (but we'll do so with the mmap lock held for
writing to protect against stack growing).

So it's not even like the page table lock really synchronizes the page
table changes with the faults and the TLB flush. The mmap lock for
writing may do that too.

So there's a fairly large window for these "spurious" faults, where
the fault may have happened relatively much earlier, and things have
changed a *lot* by the time we actually got all our locks, and saw
"hey, I see nothing to change in the page tables, the fault is
spurious".

                Linus


  reply index

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-24  4:14 incoming Andrew Morton
2020-07-24  4:15 ` [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault Andrew Morton
2020-07-24  4:38   ` Yang Shi
2020-07-24  4:56     ` Andrew Morton
2020-07-24 19:27   ` Linus Torvalds
2020-07-24 20:22     ` Linus Torvalds
2020-07-25  0:36       ` Yang Shi
2020-07-25  1:29         ` Linus Torvalds
2020-07-25 15:58           ` Catalin Marinas
2020-07-28  9:22             ` Will Deacon
2020-07-28  9:39               ` Catalin Marinas
2020-07-28 10:07                 ` Yu Xu
2020-07-28 11:46                   ` Catalin Marinas
2020-07-28 10:21                 ` Will Deacon
2020-07-28 18:28                 ` Linus Torvalds
2020-07-27 17:52           ` Yang Shi
2020-07-27 18:04             ` Linus Torvalds
2020-07-27 18:42               ` Catalin Marinas
2020-07-27 20:56                 ` Linus Torvalds [this message]
2020-07-27 22:34               ` Yang Shi
2020-07-27  7:31       ` Yu Xu
2020-07-27 11:05         ` Catalin Marinas
2020-07-27 17:01           ` Linus Torvalds
2020-07-28 11:19             ` Catalin Marinas
2020-07-27 17:12           ` Yu Xu
2020-07-27 18:04             ` Yang Shi
2020-07-27 18:37               ` Linus Torvalds
2020-07-27 22:43                 ` Yang Shi
2020-07-28  0:38                   ` Linus Torvalds
2020-07-28  0:13                 ` Yu Xu
2020-07-28 10:53                 ` Nicholas Piggin
2020-07-28 19:02                   ` Linus Torvalds
2020-07-28 22:53                     ` Nicholas Piggin
2020-07-29 13:58                       ` Michael Ellerman
2020-07-28  6:41             ` Yu Xu
2020-07-24  4:15 ` [patch 02/15] mm/mmap.c: close race between munmap() and expand_upwards()/downwards() Andrew Morton
2020-07-24  4:15 ` [patch 03/15] vfs/xattr: mm/shmem: kernfs: release simple xattr entry in a right way Andrew Morton
2020-07-24  4:15 ` [patch 04/15] mm: initialize return of vm_insert_pages Andrew Morton
2020-07-24  4:15 ` [patch 05/15] mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages() Andrew Morton
2020-07-24  4:15 ` [patch 06/15] mm/memcg: fix refcount error while moving and swapping Andrew Morton
2020-07-24 13:41   ` Alex Shi
2020-07-24  4:15 ` [patch 07/15] mm: memcg/slab: fix memory leak at non-root kmem_cache destroy Andrew Morton
2020-07-24  4:15 ` [patch 08/15] mm/hugetlb: avoid hardcoding while checking if cma is enabled Andrew Morton
2020-07-24  4:15 ` [patch 09/15] khugepaged: fix null-pointer dereference due to race Andrew Morton
2020-07-24  4:15 ` [patch 10/15] mailmap: add entry for Mike Rapoport Andrew Morton
2020-07-24  4:15 ` [patch 11/15] squashfs: fix length field overlap check in metadata reading Andrew Morton
2020-07-24  4:15 ` [patch 12/15] scripts/decode_stacktrace: strip basepath from all paths Andrew Morton
2020-07-24  4:15 ` [patch 13/15] io-mapping: indicate mapping failure Andrew Morton
2020-07-24  4:15 ` [patch 14/15] MAINTAINERS: add KCOV section Andrew Morton
2020-07-24  4:15 ` [patch 15/15] scripts/gdb: fix lx-symbols 'gdb.error' while loading modules Andrew Morton
2020-07-28  1:19 ` mmotm 2020-07-27-18-18 uploaded Andrew Morton
2020-07-28  2:14   ` Stephen Rothwell
2020-07-28  3:22   ` mmotm 2020-07-27-18-18 uploaded (drivers/scsi/ufs/: SCSI_UFS_EXYNOS) Randy Dunlap
2020-07-28  8:23     ` Alim Akhtar
2020-07-28 12:33   ` mmotm 2020-07-27-18-18 uploaded (mm/page_alloc.c) Randy Dunlap
2020-07-28 21:55     ` Andrew Morton
2020-07-28 22:20       ` Stephen Rothwell
2020-07-28 22:31         ` Andrew Morton
2020-07-29 14:18           ` Michael S. Tsirkin
2020-07-29 14:38             ` David Hildenbrand
2020-07-29 16:14               ` David Hildenbrand
2020-07-29 17:29                 ` Randy Dunlap
2020-07-28 22:39       ` Randy Dunlap
2020-07-29  1:43         ` Nathan Chancellor
2020-07-29  1:44         ` Andrew Morton
2020-07-29  2:04           ` Randy Dunlap
2020-07-29 14:09           ` make oldconfig (Re: mmotm 2020-07-27-18-18 uploaded (mm/page_alloc.c)) Alexey Dobriyan
2020-07-31 23:46 ` mmotm 2020-07-31-16-45 uploaded Andrew Morton
2020-08-01  5:24   ` mmotm 2020-07-31-16-45 uploaded (drivers/staging/vc04_services/) Randy Dunlap

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=wjX+DjY+ww5_SJ5UKJ3hUy-8S7HZPp6vj7Hbwc5zou3zQ@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=hughd@google.com \
    --cc=josef@toxicpanda.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=xuyu@linux.alibaba.com \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git