From: Yu Zhao <yuzhao@google.com>
To: Will Deacon <will@kernel.org>
Cc: linux-kernel@vger.kernel.org, kernel-team@android.com,
	Catalin Marinas <catalin.marinas@arm.com>,
	Minchan Kim <minchan@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
Date: Mon, 23 Nov 2020 12:21:16 -0700
Message-ID: <20201123192116.GA3883038@google.com>
In-Reply-To: <20201121024922.GA1363491@google.com>

On Fri, Nov 20, 2020 at 07:49:22PM -0700, Yu Zhao wrote:
> On Fri, Nov 20, 2020 at 01:22:53PM -0700, Yu Zhao wrote:
> > On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > > performed by clear_refs_write() in response to a write to
> > > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > > state we can end up with entries where pte_write() is false, yet a
> > > writable mapping remains in the TLB.

I double checked my conclusion and I think it holds. But let me
correct some typos and add a summary.

> > I don't think we need a TLB flush in this context, same reason as we
                                ^^^^^ gather

> > don't have one in copy_present_pte() which uses ptep_set_wrprotect()
> > to write-protect a src PTE.
> > 
> > ptep_modify_prot_start/commit() and ptep_set_wrprotect() guarantee
> > either the dirty bit is set (when a PTE is still writable) or a PF
> > happens (when a PTE has become r/o) when h/w page table walker races
> > with kernel that modifies a PTE using the two APIs.
> 
> After we remove the writable bit, if we end up with a clean PTE, any
> subsequent write will trigger a page fault. We can't have a stale
> writable tlb entry. The architecture-specific APIs guarantee this.
> 
> If we end up with a dirty PTE, then yes, there will be a stale
> writable tlb entry. But this won't be a problem because when we
> write-protect a page (not PTE), we always check both pte_dirty()
> and pte_write(), i.e., write_protect_page() and page_mkclean_one().
> When they see this dirty PTE, they will flush. And generally, only
> callers of pte_mkclean() should flush the tlb; otherwise we end up
> with one extra flush if callers of pte_mkclean() and pte_wrprotect()
> both flush.
> 
> Now let's take a step back and see why we got
> tlb_gather/finish_mmu() here in the first place. Commit b3a81d0841a95
> ("mm: fix KSM data corruption") explains the problem clearly. But
> to fix a problem created by two threads clearing pte_write() and
> pte_dirty() independently, we only need one of them to set
> mm_tlb_flush_pending(). Given only removing the writable bit requires
                                                  ^^^^^^^^ dirty

> tlb flush, that thread should be the one, as I just explained. Adding
> tlb_gather/finish_mmu() is unnecessary in that fix. And there is no
> point in having the original flush_tlb_mm() either, given data
> integrity is already guaranteed.
(i.e., writable tlb entries are flushed when removing the dirty bit.)
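
To make the two cases above concrete, here is a toy userspace model
(not kernel code). It assumes x86-like hardware, where a write through
a TLB entry whose cached dirty bit is clear forces a page table walk
that re-checks write permission in the PTE. All toy_* names are made
up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one PTE and one cached TLB copy of it. */
struct toy_pte { bool write; bool dirty; };
struct toy_tlb { bool valid; bool write; bool dirty; };

enum toy_result { TOY_WRITE_OK, TOY_PAGE_FAULT };

/* Cache the current PTE bits in the TLB entry. */
static void toy_tlb_fill(struct toy_tlb *tlb, const struct toy_pte *pte)
{
	tlb->valid = true;
	tlb->write = pte->write;
	tlb->dirty = pte->dirty;
}

/* Clear the writable bit without flushing the TLB. */
static void toy_wrprotect(struct toy_pte *pte)
{
	pte->write = false;
}

/* A CPU store to the page (hardware behavior is an assumption). */
static enum toy_result toy_cpu_write(struct toy_tlb *tlb, struct toy_pte *pte)
{
	assert(tlb != NULL && pte != NULL);

	/* Cached writable and dirty: the entry is used as is, stale or not. */
	if (tlb->valid && tlb->write && tlb->dirty)
		return TOY_WRITE_OK;

	/* Otherwise the walker revisits the PTE and re-checks permission. */
	if (!pte->write)
		return TOY_PAGE_FAULT;

	pte->dirty = true;
	toy_tlb_fill(tlb, pte);
	return TOY_WRITE_OK;
}

/* The check a page-level write-protect path makes before flushing. */
static bool toy_mkclean_must_flush(const struct toy_pte *pte)
{
	return pte->dirty || pte->write;
}
```

With a clean PTE, toy_wrprotect() followed by toy_cpu_write() faults,
so the stale entry is harmless. With a dirty PTE the stale entry still
allows the write, but the PTE stays dirty, so toy_mkclean_must_flush()
returns true and the later clean path flushes.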

> Of course, with it we have more accurate access tracking.
> 
> Does a similar problem exist for page_mkclean_one()? Possibly. It
> checks pte_dirty() and pte_write() but not mm_tlb_flush_pending().
> At the moment, madvise_free_pte_range() only supports anonymous
> memory, which doesn't do writeback. But the missing
> mm_tlb_flush_pending() just seems to be an accident waiting to happen.
> E.g., clean_record_pte() calls pte_mkclean() and does batched flush.
> I don't know what it's for, but if it's called on file VMAs, a similar
> race involving 4 CPUs can happen. This time CPU 1 runs
> clean_record_pte() and CPU 3 runs page_mkclean_one().

To summarize, IMO, we should 1) remove tlb_gather/finish_mmu() here;
2) check mm_tlb_flush_pending() in page_mkclean_one() and
dax_entry_mkclean().
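
A rough sketch of what 2) amounts to, again as a toy userspace model
rather than a patch. The toy_* helpers only imitate the kernel's
inc_tlb_flush_pending(), dec_tlb_flush_pending() and
mm_tlb_flush_pending(); the struct layouts and the "today" vs
"proposed" split are my simplification, not the actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the mm and a PTE. */
struct toy_mm { int tlb_flush_pending; };
struct toy_pte { bool write; bool dirty; };

/* A batched PTE updater brackets its work with inc/dec. */
static void toy_inc_tlb_flush_pending(struct toy_mm *mm)
{
	mm->tlb_flush_pending++;
}

static void toy_dec_tlb_flush_pending(struct toy_mm *mm)
{
	assert(mm->tlb_flush_pending > 0);
	mm->tlb_flush_pending--;
}

static bool toy_mm_tlb_flush_pending(const struct toy_mm *mm)
{
	return mm->tlb_flush_pending > 0;
}

/* Check as described in the thread: PTE bits only. */
static bool toy_mkclean_needed_today(const struct toy_pte *pte)
{
	return pte->dirty || pte->write;
}

/* Proposed: a pending batched flush is also a reason to flush. */
static bool toy_mkclean_needed_proposed(const struct toy_pte *pte,
					const struct toy_mm *mm)
{
	return pte->dirty || pte->write || toy_mm_tlb_flush_pending(mm);
}
```

The case that matters is a PTE that is already clean and r/o while
another CPU has cleared the dirty bit but not yet done its batched
flush: the first check skips the PTE, the second one does not.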
