From mboxrd@z Thu Jan  1 00:00:00 1970
From: Minchan Kim
Date: Fri, 28 Jul 2017 15:12:24 +0000
Subject: Re: [PATCH 2/3] mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
Message-Id: <20170728151224.GA24201@bbox>
List-Id:
References: <1501224112-23656-1-git-send-email-minchan@kernel.org> <1501224112-23656-3-git-send-email-minchan@kernel.org> <20170728084634.foo3wjhsyydml6yj@techsingularity.net>
In-Reply-To: <20170728084634.foo3wjhsyydml6yj@techsingularity.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Mel Gorman
Cc: Andrew Morton, kernel-team, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Rik van Riel, Ingo Molnar, x86@kernel.org, Russell King,
 linux-arm-kernel@lists.infradead.org, Tony Luck, linux-ia64@vger.kernel.org,
 Martin Schwidefsky, "David S. Miller", Heiko Carstens, linux-s390@vger.kernel.org,
 Yoshinori Sato, linux-sh@vger.kernel.org, Jeff Dike,
 user-mode-linux-devel@lists.sourceforge.net, linux-arch@vger.kernel.org, Nadav Amit

On Fri, Jul 28, 2017 at 09:46:34AM +0100, Mel Gorman wrote:
> On Fri, Jul 28, 2017 at 03:41:51PM +0900, Minchan Kim wrote:
> > Nadav reported that parallel MADV_DONTNEED on the same range has a stale
> > TLB problem; Mel fixed it[1] and found the same problem in MADV_FREE[2].
> >
> > Quote from Mel Gorman
> >
> > "The race in question is CPU 0 running madv_free and updating some PTEs
> > while CPU 1 is also running madv_free and looking at the same PTEs.
> > CPU 1 may have writable TLB entries for a page but fail the pte_dirty
> > check (because CPU 0 has updated it already) and potentially fail to flush.
> > Hence, when madv_free on CPU 1 returns, there are still potentially writable
> > TLB entries and the underlying PTE is still present so that a subsequent write
> > does not necessarily propagate the dirty bit to the underlying PTE any more.
> > Reclaim at some unknown time in the future may then see that the PTE is still
> > clean and discard the page even though a write has happened in the meantime.
> > I think this is possible but I could have missed some protection in madv_free
> > that prevents it happening."
> >
> > This patch aims to solve both problems at once and is also ready for the
> > other problem with the KSM, MADV_FREE and soft-dirty story[3].
> >
> > The TLB batch API (tlb_[gather|finish]_mmu) uses [set|clear]_tlb_flush_pending
> > and mm_tlb_flush_pending so that when tlb_finish_mmu is called, we can detect
> > that parallel threads are going on. In that case, flush the TLB to prevent
> > the user from accessing memory via a stale TLB entry even though we failed
> > to gather the pte entry.
> >
> > I confirmed this patch works with the test program [4] Nadav gave, so this
> > patch supersedes "mm: Always flush VMA ranges affected by zap_page_range v2"
> > in current mmotm.
> >
> > NOTE:
> > This patch modifies the arch-specific TLB gathering interface (x86, ia64,
> > s390, sh, um). Most architectures are straightforward, but s390 needs care
> > because tlb_flush_mmu works only if mm->context.flush_mm is set to non-zero,
> > which happens only when a pte entry really is cleared by ptep_get_and_clear
> > and friends. However, this problem never changes the pte entries, yet we
> > still need to flush to prevent memory access through a stale TLB.
> >
> > Any thoughts?
> >
>
> The cc list is somewhat ..... extensive, given the topic. Trim it if
> there is another version.

Most of them are the maintainers and mailing lists for the architectures I
am changing, so I'm not sure what I can trim. As you said it's rather
extensive, though, so I will trim the per-arch mailing lists but keep the
maintainers and linux-arch.

> > index 3f2eb76243e3..8c26961f0503 100644
> > --- a/arch/arm/include/asm/tlb.h
> > +++ b/arch/arm/include/asm/tlb.h
> > @@ -163,13 +163,26 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
> >  #ifdef CONFIG_HAVE_RCU_TABLE_FREE
> >  	tlb->batch = NULL;
> >  #endif
> > +	set_tlb_flush_pending(tlb->mm);
> >  }
> >
> >  static inline void
> >  tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
> >  {
> > -	tlb_flush_mmu(tlb);
> > +	/*
> > +	 * If there are parallel threads are doing PTE changes on same range
> > +	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> > +	 * flush by batching, a thread has stable TLB entry can fail to flush
> > +	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> > +	 * if we detect parallel PTE batching threads.
> > +	 */
> > +	if (mm_tlb_flush_pending(tlb->mm, false) > 1) {
> > +		tlb->range_start = start;
> > +		tlb->range_end = end;
> > +	}
> >
> > +	tlb_flush_mmu(tlb);
> > +	clear_tlb_flush_pending(tlb->mm);
> >  	/* keep the page table cache within bounds */
> >  	check_pgt_cache();
> >
>
> mm_tlb_flush_pending shouldn't be taking a barrier specific arg. I expect
> this to change in the future and cause a conflict. At least I think in
> this context, it's the conditional barrier stuff.
>

Yes. I saw your comment to Nadav, so I expect you want mm_tlb_flush_pending
to be called under the pte lock. However, I will use it outside the pte lock,
in tlb_finish_mmu; in that case an atomic op plus a barrier that prevents
reordering between the TLB flush and the atomic_read in mm_tlb_flush_pending
is enough to make it work.

> That aside, it's very unfortunate that the return value of
> mm_tlb_flush_pending really matters. Knowing why 1 is magic there requires
> knowledge of the internals on a per-arch basis which is a bit nuts.
> Consider renaming this to mm_tlb_flush_parallel() to return true if there
> is a nr_pending > 1 with comments explaining why. I don't think any of
> the callers expect a nr_pending of 0 ever. That removes some knowledge of
> the specifics.

Okay. If you don't object strongly, I prefer mm_tlb_flush_nested, which
returns true if nr_pending > 1.
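Just so we are talking about the same thing, here is a rough, untested sketch
of the helper I have in mind (it assumes mm->tlb_flush_pending ends up as an
atomic counter rather than a bool; the exact name and barrier pairing are
still open):

static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
{
	/*
	 * More than one tlb_gather_mmu() is in flight on this mm, i.e.
	 * some other thread is batching PTE changes on the same range
	 * in parallel with us.
	 */
	return atomic_read(&mm->tlb_flush_pending) > 1;
}

Then tlb_finish_mmu() would call mm_tlb_flush_nested(tlb->mm) instead of
open-coding the "> 1" comparison, so the magic constant stays in one place.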
> The arch-specific changes to tlb_gather_mmu are almost all identical.
> It's a little tricky to split the arch layer and core mm to have all
> the set/clear of mm_tlb_flush_pending handled by the core mm. It's not
> required but it would be preferred. The set one is obvious. rename
> tlb_gather_mmu to arch_tlb_gather_mmu (including the generic implementation)
> and create a tlb_gather_mmu alias that calls arch_tlb_gather_mmu and
> set_tlb_flush_pending.
>
> The clear is not as straight-forward but can be done by creating a new
> arch helper that handles this hunk on a per-arch basis
>
> > +	if (mm_tlb_flush_pending(tlb->mm, false) > 1) {
> > +		tlb->start = start;
> > +		tlb->end = end;
> > +	}
>
> It'll be churn initially but it means any different handling in the TLB
> batching area will be mostly a core concern.

Fair enough. For the 'set' side I have something like the sketch below in
mind; I will respin next week.

Thanks for the review, Mel.
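A rough sketch of that 'set' side (untested; the arch_tlb_gather_mmu name
just follows your suggestion, and the generic implementation would get the
same renaming):

void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
		    unsigned long start, unsigned long end)
{
	/* arch hook: what every architecture's tlb_gather_mmu body does today */
	arch_tlb_gather_mmu(tlb, mm, start, end);
	/* pending accounting handled once, in core mm */
	set_tlb_flush_pending(mm);
}

so each architecture only keeps arch_tlb_gather_mmu and the set/clear
bookkeeping moves out of arch code entirely.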