Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Yang Shi <yang.shi@linux.alibaba.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hillf Danton <hdanton@sina.com>, Hugh Dickins <hughd@google.com>,
	Josef Bacik <josef@toxicpanda.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Linux-MM <linux-mm@kvack.org>,
	mm-commits@vger.kernel.org, Will Deacon <will.deacon@arm.com>,
	Matthew Wilcox <willy@infradead.org>,
	xuyu@linux.alibaba.com
Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
Date: Fri, 24 Jul 2020 17:36:57 -0700
Message-ID: <7de20d4a-f86c-8e1f-b238-65f02b560325@linux.alibaba.com> (raw)
In-Reply-To: <CAHk-=wh4kmU5FdT=Yy7N9wA=se=ALbrquCrOkjCMhiQnOBLvDA@mail.gmail.com>



On 7/24/20 1:22 PM, Linus Torvalds wrote:
> On Fri, Jul 24, 2020 at 12:27 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> It *may* make sense to say "ok, don't bother flushing the TLB if this
>> is a retry, because we already did that originally". MAYBE.
> That sounds wrong to me too.
>
> Maybe a *BIG* and understandable comment about why the
> FAULT_FLAG_TRIED case shouldn't actually do anything at all.
>
> But it does smell like the real issue is that the "else" case for
> ptep_set_access_flags() is simply wrong.
>
> Or maybe something else. But this thing from the changelog really
> raises my hackles:
>
>         "But it seems not necessary to modify those
>    bits again for #3 since they should be already set by the first page fault
>    try"
>
> since we just verified that we know _exactly_ what the pte is:
>
>          if (unlikely(!pte_same(*vmf->pte, entry))) {
>                  update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>                  goto unlock;
>          }
>
> so there is no "should be already set" case. We have 100% information
> about what the current state is.
>
> And if we don't make any changes, then that's exactly when
> ptep_set_access_flags() returns zero.
>
> So the real question is "why do we need the
> flush_tlb_fix_spurious_fault() thing".
>
> We could say that we never need it at all for FAULT_FLAG_RETRY. That
> makes a lot of sense to me.

Thanks a lot for looking into this. It does make sense to me.

A follow-up question about your comment in the previous email "The 
notion of "this is a retry, so let's do nothing" is fundamentally 
wrong.", do you mean it is not safe? Or since we have pte_same check, we 
should just rely on it to skip unnecessary TLB flush?

>
> So a patch that does something like the appended (intentionally
> whitespace-damaged) seems sensible.
>
> But note the XYZ in that commit. When do we actually have stale TLB
> entries? Do we actually do the lazy "avoid TLB flushing when loosening
> the rules" anywhere?
>
> I think that "when it's a write fault" is actually bogus. I could
> imagine that code pages could get the same issue. So the
> "FAULT_FLAG_RETRY" part of the check makes perfect sense to me, but
> the legacy "FAULT_FLAG_WRITE" case I'd actually want to document more.
>
> On x86, we never care about lazy faults. Page faulting will always
> update the TLB.
>
> On other architectures, I can see spurious faults happening either due
> to lazy reasons, or simply because another core is modifying the page
> table right now (ie the concurrent fault thing), but hasn't actually
> flushed yet.
>
> Can somebody flesh out the comment about the
> "spurious_protection_fault()" thing? Because something like this I
> wouldn't mind, but I'd like that comment to explain the
> FAULT_FLAG_WRITE part too.

I'm not quite familiar with other architectures, my wild guess is 
FAULT_FLAG_WRITE is a cheap way to tell us if this is a .text page or 
not. The original commit which introduced the check said so as well:

commit 1a44e149084d772a1bcf4cdbdde8a013a8a1cfde
Author: Andrea Arcangeli <andrea@suse.de>
Date:   Sat Oct 29 18:16:48 2005 -0700

     [PATCH] .text page fault SMP scalability optimization

     We had a problem on ppc64 where with more than 4 threads a large system
     wouldn't scale well while faulting in the .text (most of the time 
was spent
     in the kernel despite it was an userland compute intensive app).  The
     reason is the useless overwrite of the same pte from all cpu.

     I fixed it this way (verified on an older kernel but the forward 
port is
     almost identical).  This will benefit all archs not just ppc64.

     Signed-off-by: Andrea Arcangeli <andrea@suse.de>
     Cc: Hugh Dickins <hugh@veritas.com>
     Signed-off-by: Andrew Morton <akpm@osdl.org>
     Signed-off-by: Linus Torvalds <torvalds@osdl.org>

>
>                Linus
>
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ecad55103ad..9994c98d88c3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4163,6 +4163,26 @@ static vm_fault_t wp_huge_pud(struct vm_fault
> *vmf, pud_t orig_pud)
>           return VM_FAULT_FALLBACK;
>   }
>
> +/*
> + * If ptep_set_access_flags() returns zero, that means that
> + * it made no changes. Why did we get a fault?
> + *
> + * It might be a spurious protection fault because we at
> + * some point lazily didn't flush a TLB when we only loosened
> + * the protection rules. But it might also be because a
> + * concurrent fault on another CPU had already marked things
> + * young, and our young/dirty changes didn't change anything.
> + *
> + * The lazy TLB optimization only happens when we make things
> + * writable. See XYZ.
> + */
> +static inline bool spurious_protection_fault(unsigned int flags)
> +{
> +        if (flags & FAULT_FLAG_RETRY)
> +                return false;
> +        return flags & FAULT_FLAG_WRITE;
> +}
> +
>   /*
>    * These routines also need to handle stuff like marking pages dirty
>    * and/or accessed for architectures that don't do it in hardware (most
> @@ -4247,15 +4267,8 @@ static vm_fault_t handle_pte_fault(struct
> vm_fault *vmf)
>           if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
>                                   vmf->flags & FAULT_FLAG_WRITE)) {
>                   update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
> -        } else {
> -                /*
> -                 * This is needed only for protection faults but the
> arch code
> -                 * is not yet telling us if this is a protection
> fault or not.
> -                 * This still avoids useless tlb flushes for .text
> page faults
> -                 * with threads.
> -                 */
> -                if (vmf->flags & FAULT_FLAG_WRITE)
> -                        flush_tlb_fix_spurious_fault(vmf->vma,
> vmf->address);
> +        } else if (spurious_protection_fault(vmf->flags)) {
> +                flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
>           }
>   unlock:
>           pte_unmap_unlock(vmf->pte, vmf->ptl);



  reply index

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-24  4:14 incoming Andrew Morton
2020-07-24  4:15 ` [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault Andrew Morton
2020-07-24  4:38   ` Yang Shi
2020-07-24  4:56     ` Andrew Morton
2020-07-24 19:27   ` Linus Torvalds
2020-07-24 20:22     ` Linus Torvalds
2020-07-25  0:36       ` Yang Shi [this message]
2020-07-25  1:29         ` Linus Torvalds
2020-07-25 15:58           ` Catalin Marinas
2020-07-28  9:22             ` Will Deacon
2020-07-28  9:39               ` Catalin Marinas
2020-07-28 10:07                 ` Yu Xu
2020-07-28 11:46                   ` Catalin Marinas
2020-07-28 10:21                 ` Will Deacon
2020-07-28 18:28                 ` Linus Torvalds
2020-07-27 17:52           ` Yang Shi
2020-07-27 18:04             ` Linus Torvalds
2020-07-27 18:42               ` Catalin Marinas
2020-07-27 20:56                 ` Linus Torvalds
2020-07-27 22:34               ` Yang Shi
2020-07-27  7:31       ` Yu Xu
2020-07-27 11:05         ` Catalin Marinas
2020-07-27 17:01           ` Linus Torvalds
2020-07-28 11:19             ` Catalin Marinas
2020-07-27 17:12           ` Yu Xu
2020-07-27 18:04             ` Yang Shi
2020-07-27 18:37               ` Linus Torvalds
2020-07-27 22:43                 ` Yang Shi
2020-07-28  0:38                   ` Linus Torvalds
2020-07-28  0:13                 ` Yu Xu
2020-07-28 10:53                 ` Nicholas Piggin
2020-07-28 19:02                   ` Linus Torvalds
2020-07-28 22:53                     ` Nicholas Piggin
2020-07-29 13:58                       ` Michael Ellerman
2020-07-28  6:41             ` Yu Xu
2020-07-24  4:15 ` [patch 02/15] mm/mmap.c: close race between munmap() and expand_upwards()/downwards() Andrew Morton
2020-07-24  4:15 ` [patch 03/15] vfs/xattr: mm/shmem: kernfs: release simple xattr entry in a right way Andrew Morton
2020-07-24  4:15 ` [patch 04/15] mm: initialize return of vm_insert_pages Andrew Morton
2020-07-24  4:15 ` [patch 05/15] mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages() Andrew Morton
2020-07-24  4:15 ` [patch 06/15] mm/memcg: fix refcount error while moving and swapping Andrew Morton
2020-07-24 13:41   ` Alex Shi
2020-07-24  4:15 ` [patch 07/15] mm: memcg/slab: fix memory leak at non-root kmem_cache destroy Andrew Morton
2020-07-24  4:15 ` [patch 08/15] mm/hugetlb: avoid hardcoding while checking if cma is enabled Andrew Morton
2020-07-24  4:15 ` [patch 09/15] khugepaged: fix null-pointer dereference due to race Andrew Morton
2020-07-24  4:15 ` [patch 10/15] mailmap: add entry for Mike Rapoport Andrew Morton
2020-07-24  4:15 ` [patch 11/15] squashfs: fix length field overlap check in metadata reading Andrew Morton
2020-07-24  4:15 ` [patch 12/15] scripts/decode_stacktrace: strip basepath from all paths Andrew Morton
2020-07-24  4:15 ` [patch 13/15] io-mapping: indicate mapping failure Andrew Morton
2020-07-24  4:15 ` [patch 14/15] MAINTAINERS: add KCOV section Andrew Morton
2020-07-24  4:15 ` [patch 15/15] scripts/gdb: fix lx-symbols 'gdb.error' while loading modules Andrew Morton
2020-07-28  1:19 ` mmotm 2020-07-27-18-18 uploaded Andrew Morton
2020-07-28  2:14   ` Stephen Rothwell
2020-07-28  3:22   ` mmotm 2020-07-27-18-18 uploaded (drivers/scsi/ufs/: SCSI_UFS_EXYNOS) Randy Dunlap
2020-07-28  8:23     ` Alim Akhtar
2020-07-28 12:33   ` mmotm 2020-07-27-18-18 uploaded (mm/page_alloc.c) Randy Dunlap
2020-07-28 21:55     ` Andrew Morton
2020-07-28 22:20       ` Stephen Rothwell
2020-07-28 22:31         ` Andrew Morton
2020-07-29 14:18           ` Michael S. Tsirkin
2020-07-29 14:38             ` David Hildenbrand
2020-07-29 16:14               ` David Hildenbrand
2020-07-29 17:29                 ` Randy Dunlap
2020-07-28 22:39       ` Randy Dunlap
2020-07-29  1:43         ` Nathan Chancellor
2020-07-29  1:44         ` Andrew Morton
2020-07-29  2:04           ` Randy Dunlap
2020-07-29 14:09           ` make oldconfig (Re: mmotm 2020-07-27-18-18 uploaded (mm/page_alloc.c)) Alexey Dobriyan
2020-07-31 23:46 ` mmotm 2020-07-31-16-45 uploaded Andrew Morton
2020-08-01  5:24   ` mmotm 2020-07-31-16-45 uploaded (drivers/staging/vc04_services/) Randy Dunlap

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7de20d4a-f86c-8e1f-b238-65f02b560325@linux.alibaba.com \
    --to=yang.shi@linux.alibaba.com \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=hughd@google.com \
    --cc=josef@toxicpanda.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-mm@kvack.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=will.deacon@arm.com \
    --cc=willy@infradead.org \
    --cc=xuyu@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git