Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Yang Shi <yang.shi@linux.alibaba.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hillf Danton <hdanton@sina.com>, Hugh Dickins <hughd@google.com>,
	Josef Bacik <josef@toxicpanda.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Linux-MM <linux-mm@kvack.org>,
	mm-commits@vger.kernel.org, Will Deacon <will.deacon@arm.com>,
	Matthew Wilcox <willy@infradead.org>,
	xuyu@linux.alibaba.com
Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
Date: Mon, 27 Jul 2020 10:52:01 -0700
Message-ID: <45015c63-e719-a58a-7d07-6c156273a890@linux.alibaba.com> (raw)
In-Reply-To: <CAHk-=wi8ztx7E3fmU3anHSUzxC9g+ac7kdBOop9UWdvdG-jFOg@mail.gmail.com>



On 7/24/20 6:29 PM, Linus Torvalds wrote:
> On Fri, Jul 24, 2020 at 5:37 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>> A follow-up question about your comment in the previous email "The
>> notion of "this is a retry, so let's do nothing" is fundamentally
>> wrong.", do you mean it is not safe?
> I mean it fails my "smell test".
>
> The patch didn't just avoid the TLB flush, it avoided all the other
> "mark it dirty and young" things too. And that made me go "why would
> RETRY be different in this regard"?
>
> It sounds unsafe, because it basically means that a retry does
> something else than the initial page fault handling would do.
>
> See what worries me and makes me go "that's not safe"?
>
>> Or since we have pte_same check, we
>> should just rely on it to skip unnecessary TLB flush?
> Right. That makes me much happier, because if the retry flag is only
> used to avoid a TLB flush (when the pte's are identical, of course),
> then I feel that the retry path is _logically_ all the same. The page
> tables end up looking exactly the same, and the only difference is
> whether we do that TLB invalidate for a spurious fault.
>
> And that, in turn, makes me feel it is safe, because even if it turns
> out that "yes, we keep getting a spurious fault because we have some
> stale TLB entries", then checking the RETRY bit is fine: we'll do a
> full page fault next time around without the retry bit set.
>
> So that's why I feel that your patch is sketchy and unsafe, but I
> don't worry about testing the RETRY bit in that "clear spurious TLB
> entries" case.
>
> See?

Yes, I got your point. Thanks for elaborating.

>
>>> Can somebody flesh out the comment about the
>>> "spurious_protection_fault()" thing? Because something like this I
>>> wouldn't mind, but I'd like that comment to explain the
>>> FAULT_FLAG_WRITE part too.
>> I'm not quite familiar with other architectures, my wild guess is
>> FAULT_FLAG_WRITE is a cheap way to tell us if this is a .text page or
>> not.
> Yes. However, I'm not seeing why a text page would be so special.
>
> IOW, if it's ok to skip the TLB flush fo ra text page, then why isn't
> it ok to skip for a normal page?

It looks normal page is skipped too unless it is a write fault. The 
comment might be a little bit misleading.

Read fault should just change young bit and typically TLB won't get 
flushed if just young bit is changed and TLB flush can be deferred again 
to write fault which may change access permission and/or dirty bit.

>
> My suspicion is that we have stale TLB entries for potentially
> multiple different reasons:
>
>   - software optimizations, where we decide "skip the TLB flush,
> because it's expensive and it is likely to never matter".
>
>     I have a _memory_ of us doing this when we have a pure "loosening"
> of the protections (IOW, make something writable that wasn't writable
> before), but I can't actually find the code. I'm thinking things like
> the wp_page_reuse() case.
>
>   - temporarily stale TLB entries because we've _just_updated them on
> another CPU, but it hasn't gotten to the actual TLB flush yet.
>
>     By the time we actually get to this point, we'll have serialized
> with the page table lock, but the *fault* happened when the CPU saw
> the original stale TLB entry, so we took the fault with what is now a
> stale TLB entry.
>
>   - actual software bugs where we've not flushed the TLB properly.
>
> Anyway, the _reason_ for that "flush_tlb_fix_spurious_fault()" is that
> some architectures don't flush their TLB on a fault.
>
> So if you don't flush the TLB when talking a page fault, and you may
> have these stale TLB entries around, you'll just keep faulting until
> enough other system event happens that just ends up flushing the TLB
> sufficiently.
>
> On an otherwise idle system, that "keep faulting until enough other
> system event happens" might be effectively forever.
>
> For any architecture that guarantees that a page fault will always
> flush the old TLB entry for this kind of situation, that
> flush_tlb_fix_spurious_fault() thing can be a no-op.
>
> So that's why on x86, we just do
>
>    #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
>
> and have no issues.
>
> Note that it does *not* need to do any cross-CPU flushing or anything
> like that. So it's actually wrong (I think) to have that default
> fallback for
>
>     #define flush_tlb_fix_spurious_fault(vma, address)
> flush_tlb_page(vma, address)
>
> because flush_tlb_page() is the serious "do cross CPU etc".
>
> Does the arm64 flush_tlb_page() perhaps do the whole expensive
> cross-CPU thing rather than the much cheaper "just local invalidate"
> version?
>
> The "random letter combination" thing that ARM documentation uses for
> these things is really confusing, but I think the "is" in "vale1is"
> means that it's broadcast to all "inner sharable" - ie CPU cores.
>
> I get the feeling that on arm64, flush_tlb_fix_spurious_fault() should
> either be a no-op, or it should perhaps be a non-broadcasting version
> of the TLB invalidates, and use just "vale1"
>
> Catalin? Will?
>
>                  Linus



  parent reply index

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-24  4:14 incoming Andrew Morton
2020-07-24  4:15 ` [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault Andrew Morton
2020-07-24  4:38   ` Yang Shi
2020-07-24  4:56     ` Andrew Morton
2020-07-24 19:27   ` Linus Torvalds
2020-07-24 20:22     ` Linus Torvalds
2020-07-25  0:36       ` Yang Shi
2020-07-25  1:29         ` Linus Torvalds
2020-07-25 15:58           ` Catalin Marinas
2020-07-28  9:22             ` Will Deacon
2020-07-28  9:39               ` Catalin Marinas
2020-07-28 10:07                 ` Yu Xu
2020-07-28 11:46                   ` Catalin Marinas
2020-07-28 10:21                 ` Will Deacon
2020-07-28 18:28                 ` Linus Torvalds
2020-07-27 17:52           ` Yang Shi [this message]
2020-07-27 18:04             ` Linus Torvalds
2020-07-27 18:42               ` Catalin Marinas
2020-07-27 20:56                 ` Linus Torvalds
2020-07-27 22:34               ` Yang Shi
2020-07-27  7:31       ` Yu Xu
2020-07-27 11:05         ` Catalin Marinas
2020-07-27 17:01           ` Linus Torvalds
2020-07-28 11:19             ` Catalin Marinas
2020-07-27 17:12           ` Yu Xu
2020-07-27 18:04             ` Yang Shi
2020-07-27 18:37               ` Linus Torvalds
2020-07-27 22:43                 ` Yang Shi
2020-07-28  0:38                   ` Linus Torvalds
2020-07-28  0:13                 ` Yu Xu
2020-07-28 10:53                 ` Nicholas Piggin
2020-07-28 19:02                   ` Linus Torvalds
2020-07-28 22:53                     ` Nicholas Piggin
2020-07-29 13:58                       ` Michael Ellerman
2020-07-28  6:41             ` Yu Xu
2020-07-24  4:15 ` [patch 02/15] mm/mmap.c: close race between munmap() and expand_upwards()/downwards() Andrew Morton
2020-07-24  4:15 ` [patch 03/15] vfs/xattr: mm/shmem: kernfs: release simple xattr entry in a right way Andrew Morton
2020-07-24  4:15 ` [patch 04/15] mm: initialize return of vm_insert_pages Andrew Morton
2020-07-24  4:15 ` [patch 05/15] mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages() Andrew Morton
2020-07-24  4:15 ` [patch 06/15] mm/memcg: fix refcount error while moving and swapping Andrew Morton
2020-07-24 13:41   ` Alex Shi
2020-07-24  4:15 ` [patch 07/15] mm: memcg/slab: fix memory leak at non-root kmem_cache destroy Andrew Morton
2020-07-24  4:15 ` [patch 08/15] mm/hugetlb: avoid hardcoding while checking if cma is enabled Andrew Morton
2020-07-24  4:15 ` [patch 09/15] khugepaged: fix null-pointer dereference due to race Andrew Morton
2020-07-24  4:15 ` [patch 10/15] mailmap: add entry for Mike Rapoport Andrew Morton
2020-07-24  4:15 ` [patch 11/15] squashfs: fix length field overlap check in metadata reading Andrew Morton
2020-07-24  4:15 ` [patch 12/15] scripts/decode_stacktrace: strip basepath from all paths Andrew Morton
2020-07-24  4:15 ` [patch 13/15] io-mapping: indicate mapping failure Andrew Morton
2020-07-24  4:15 ` [patch 14/15] MAINTAINERS: add KCOV section Andrew Morton
2020-07-24  4:15 ` [patch 15/15] scripts/gdb: fix lx-symbols 'gdb.error' while loading modules Andrew Morton
2020-07-28  1:19 ` mmotm 2020-07-27-18-18 uploaded Andrew Morton
2020-07-28  2:14   ` Stephen Rothwell
2020-07-28  3:22   ` mmotm 2020-07-27-18-18 uploaded (drivers/scsi/ufs/: SCSI_UFS_EXYNOS) Randy Dunlap
2020-07-28  8:23     ` Alim Akhtar
2020-07-28 12:33   ` mmotm 2020-07-27-18-18 uploaded (mm/page_alloc.c) Randy Dunlap
2020-07-28 21:55     ` Andrew Morton
2020-07-28 22:20       ` Stephen Rothwell
2020-07-28 22:31         ` Andrew Morton
2020-07-29 14:18           ` Michael S. Tsirkin
2020-07-29 14:38             ` David Hildenbrand
2020-07-29 16:14               ` David Hildenbrand
2020-07-29 17:29                 ` Randy Dunlap
2020-07-28 22:39       ` Randy Dunlap
2020-07-29  1:43         ` Nathan Chancellor
2020-07-29  1:44         ` Andrew Morton
2020-07-29  2:04           ` Randy Dunlap
2020-07-29 14:09           ` make oldconfig (Re: mmotm 2020-07-27-18-18 uploaded (mm/page_alloc.c)) Alexey Dobriyan
2020-07-31 23:46 ` mmotm 2020-07-31-16-45 uploaded Andrew Morton
2020-08-01  5:24   ` mmotm 2020-07-31-16-45 uploaded (drivers/staging/vc04_services/) Randy Dunlap

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=45015c63-e719-a58a-7d07-6c156273a890@linux.alibaba.com \
    --to=yang.shi@linux.alibaba.com \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=hughd@google.com \
    --cc=josef@toxicpanda.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-mm@kvack.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=will.deacon@arm.com \
    --cc=willy@infradead.org \
    --cc=xuyu@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git