From: Yu Zhao <yuzhao@google.com>
To: Will Deacon <will@kernel.org>
Cc: linux-kernel@vger.kernel.org, kernel-team@android.com,
	Catalin Marinas <catalin.marinas@arm.com>,
	Minchan Kim <minchan@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
Date: Mon, 23 Nov 2020 18:13:34 -0700
Message-ID: <20201124011334.GA140483@google.com>
In-Reply-To: <20201123211750.GA12069@willie-the-truck>

On Mon, Nov 23, 2020 at 09:17:51PM +0000, Will Deacon wrote:
> On Mon, Nov 23, 2020 at 01:04:03PM -0700, Yu Zhao wrote:
> > On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> > > On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > > > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > > > updating the page-tables for the current mm. However, since the mm is not
> > > > > being freed, this can result in stale TLB entries on architectures which
> > > > > elide 'fullmm' invalidation.
> > > > >
> > > > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > > >
> > > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > > ---
> > > > >  fs/proc/task_mmu.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index a76d339b5754..316af047f1aa 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > > > >  			count = -EINTR;
> > > > >  			goto out_mm;
> > > > >  		}
> > > > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > > >
> > > > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > > > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > > > architectures other than ARM the opportunity to optimize based on the
> > > > fact it's a full-mm flush?
> >
> > I double checked my conclusion on patch 4, and aside from a couple
> > of typos, it still seems correct after the weekend.
>
> I still need to digest that, but I would prefer that we restore the
> invalidation first, and then have a subsequent commit to relax it. I find
> it hard to believe that the behaviour in mainline at the moment is deliberate.
>
> That is, I'm not against optimising this, but I'd rather get it "obviously
> correct" first and the current code is definitely not that.

I wouldn't mind having this patch and patch 4 if the invalidation they
restore were in a correct state -- b3a81d0841a9 ("mm: fix KSM data
corruption") isn't correct to start with.

It is complicated, so please bear with me. Let's study this by looking
at examples this time.

> > > Only for the soft-dirty case, but I think TLB invalidation is required
> > > there because we are write-protecting the entries and I don't see any
> > > mechanism to handle lazy invalidation for that (compared with the aging
> > > case, which is handled via pte_accessible()).
> >
> > The lazy invalidation for that is done when we write-protect a page,
> > not an individual PTE. When we do so, our decision is based on both
> > the dirty bit and the writable bit on each PTE mapping this page. So
> > we only need to make sure we don't lose both on a PTE. And we don't
> > here.
>
> Sorry, I don't follow what you're getting at here (page vs pte). Please can
> you point me to the code you're referring to? The case I'm worried about is
> code that holds sufficient locks (e.g. mmap_sem + ptl) finding an entry
> where !pte_write() and assuming (despite pte_dirty()) that there can't be
> any concurrent modifications to the mapped page. Granted, I haven't found
> anything doing that, but I could not convince myself that it would be a bug
> to write such code, either.

Example 1: memory corruption is still possible with patch 4 & 6

CPU0              CPU1              CPU2              CPU3
----              ----              ----              ----
userspace                                             page writeback

[cache writable
 PTE in TLB]

                  inc_tlb_flush_pending()
                  clean_record_pte()
                  pte_mkclean()

                                    tlb_gather_mmu()
                                    [set mm_tlb_flush_pending()]
                                    clear_refs_write()
                                    pte_wrprotect()

                                                      page_mkclean_one()
                                                      !pte_dirty() && !pte_write()
                                                      [true, no flush]

                                                      write page to disk

Write to page
[using stale PTE]

                                                      drop clean page
                                                      [data integrity compromised]

                  flush_tlb_range()

                                    tlb_finish_mmu()
                                    [flush (with patch 4)]

Example 2: why no flush when write-protecting is not a problem (after
we fix the problem correctly by adding mm_tlb_flush_pending()).

Case a:

CPU0              CPU1              CPU2              CPU3
----              ----              ----              ----
userspace                                             page writeback

[cache writable
 PTE in TLB]

                  inc_tlb_flush_pending()
                  clean_record_pte()
                  pte_mkclean()

                                    clear_refs_write()
                                    pte_wrprotect()

                                                      page_mkclean_one()
                                                      !pte_dirty() && !pte_write() &&
                                                        !mm_tlb_flush_pending()
                                                      [false: flush]

                                                      write page to disk

Write to page
[page fault]

                                                      drop clean page
                                                      [data integrity guaranteed]

                  flush_tlb_range()

Case b:

CPU0              CPU1              CPU2
----              ----              ----
userspace                           page writeback

[cache writable
 PTE in TLB]

                  clear_refs_write()
                  pte_wrprotect()
                  [pte_dirty() is false]

                                    page_mkclean_one()
                                    !pte_dirty() && !pte_write() &&
                                      !mm_tlb_flush_pending()
                                    [true: no flush]

                                    write page to disk

Write to page
[h/w tries to set
 the dirty bit
 but sees write-protected
 PTE, page fault]

                                    drop clean page
                                    [data integrity guaranteed]

Case c:

CPU0              CPU1              CPU2
----              ----              ----
userspace                           page writeback

[cache writable
 PTE in TLB]

                  clear_refs_write()
                  pte_wrprotect()
                  [pte_dirty() is true]

                                    page_mkclean_one()
                                    !pte_dirty() && !pte_write() &&
                                      !mm_tlb_flush_pending()
                                    [false: flush]

                                    write page to disk

Write to page
[page fault]

                                    drop clean page
                                    [data integrity guaranteed]

> > > Furthermore, If we decide that we can relax the TLB invalidation
> > > requirements here, then I'd much rather than was done deliberately, rather
> > > than as an accidental side-effect of another commit (since I think the
> > > current behaviour was a consequence of 7a30df49f63a).
> >
> > Nope. tlb_gather/finish_mmu() should be added by b3a81d0841a9
                                  ^^^^^^ shouldn't

Another typo, I apologize.

> > ("mm: fix KSM data corruption") in the first place.
>
> Sure, but if you check out b3a81d0841a9 then you have a fullmm TLB
> invalidation in tlb_finish_mmu(). 7a30df49f63a is what removed that, no?
>
> Will
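
For readers who want to replay the diagrams, the check that page
writeback performs in Example 1 and Example 2 can be modelled as a
small standalone C sketch. This is not kernel code and is only a
rough illustration: struct toy_pte, skip_flush_without_pending(),
skip_flush_with_pending() and check_examples() are hypothetical names
introduced here, and the boolean flush_pending merely stands in for
mm_tlb_flush_pending(). The sketch records only the PTE bits that
page_mkclean_one() observes in each scenario; it does not model the
TLB itself or any locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the two PTE bits the examples care about (not kernel code). */
struct toy_pte {
	bool dirty;	/* models pte_dirty() */
	bool write;	/* models pte_write() */
};

/* Check shown in Example 1: skip the flush for a clean, write-protected PTE. */
bool skip_flush_without_pending(struct toy_pte pte)
{
	return !pte.dirty && !pte.write;
}

/* Check shown in Example 2: additionally refuse to skip while a flush is pending. */
bool skip_flush_with_pending(struct toy_pte pte, bool flush_pending)
{
	return !pte.dirty && !pte.write && !flush_pending;
}

/* Hypothetical helper: replays the PTE state page writeback sees in each diagram. */
void check_examples(void)
{
	/* Example 1: cleaned by CPU1 and write-protected by CPU2, flush still outstanding. */
	struct toy_pte cleaned = { .dirty = false, .write = false };
	assert(skip_flush_without_pending(cleaned));	/* [true, no flush] */

	/* Case a: same PTE state, but the pending flush is taken into account. */
	assert(!skip_flush_with_pending(cleaned, true));	/* [false: flush] */

	/* Case b: write-protected, dirty bit was never set, nothing pending. */
	struct toy_pte case_b = { .dirty = false, .write = false };
	assert(skip_flush_with_pending(case_b, false));	/* [true: no flush] */

	/* Case c: write-protected, dirty bit still set, nothing pending. */
	struct toy_pte case_c = { .dirty = true, .write = false };
	assert(!skip_flush_with_pending(case_c, false));	/* [false: flush] */
}
```

Calling check_examples() from a main() exercises all four scenarios.
The only difference between Example 1 and Case a is the extra
flush_pending term, which is the point the diagrams are making.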