From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <1501208919.4073.58.camel@redhat.com> Subject: Re: [PATCH v3 1/2] mm: migrate: prevent racy access to tlb_flush_pending From: Rik van Riel To: Nadav Amit , linux-mm@kvack.org Cc: sergey.senozhatsky@gmail.com, minchan@kernel.org, nadav.amit@gmail.com, mgorman@suse.de, luto@kernel.org, stable@vger.kernel.org Date: Thu, 27 Jul 2017 22:28:39 -0400 In-Reply-To: <20170727114015.3452-2-namit@vmware.com> References: <20170727114015.3452-1-namit@vmware.com> <20170727114015.3452-2-namit@vmware.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On Thu, 2017-07-27 at 04:40 -0700, Nadav Amit wrote: > From: Nadav Amit > > Setting and clearing mm->tlb_flush_pending can be performed by > multiple > threads, since mmap_sem may only be acquired for read in > task_numa_work(). If this happens, tlb_flush_pending might be cleared > while one of the threads still changes PTEs and batches TLB flushes. > > This can lead to the same race between migration and > change_protection_range() that led to the introduction of > tlb_flush_pending. The result of this race was data corruption, which > means that this patch also addresses a theoretically possible data > corruption. > > An actual data corruption was not observed, yet the race was > was confirmed by adding assertion to check tlb_flush_pending is not > set > by two threads, adding artificial latency in > change_protection_range() > and using sysctl to reduce kernel.numa_balancing_scan_delay_ms. > > Fixes: 20841405940e ("mm: fix TLB flush race between migration, and > change_protection_range") > > Cc: stable@vger.kernel.org > > Signed-off-by: Nadav Amit > Acked-by: Mel Gorman > Acked-by: Rik van Riel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: [PATCH v3 1/2] mm: migrate: prevent racy access to tlb_flush_pending From: Nadav Amit In-Reply-To: <20170728013423.GA358@jagdpanzerIV.localdomain> Date: Thu, 27 Jul 2017 18:44:34 -0700 Cc: "open list:MEMORY MANAGEMENT" , Sergey Senozhatsky , Minchan Kim , Mel Gorman , Rik van Riel , Andy Lutomirski , stable@vger.kernel.org, Andrew Morton Content-Transfer-Encoding: quoted-printable Message-Id: <41EF8C12-A581-4514-AAEA-5F5DAA08D322@gmail.com> References: <20170727114015.3452-1-namit@vmware.com> <20170727114015.3452-2-namit@vmware.com> <20170728013423.GA358@jagdpanzerIV.localdomain> To: Sergey Senozhatsky Sender: owner-linux-mm@kvack.org List-ID: Sergey Senozhatsky wrote: > just my 5 silly cents, > > On (07/27/17 04:40), Nadav Amit wrote: > [..] >> static inline void set_tlb_flush_pending(struct mm_struct *mm) >> { >> - mm->tlb_flush_pending = true; >> + atomic_inc(&mm->tlb_flush_pending); >> >> /* >> * Guarantee that the tlb_flush_pending store does not leak into the >> @@ -544,7 +544,7 @@ static inline void set_tlb_flush_pending(struct mm_struct *mm) >> static inline void clear_tlb_flush_pending(struct mm_struct *mm) >> { >> barrier(); >> - mm->tlb_flush_pending = false; >> + atomic_dec(&mm->tlb_flush_pending); >> } > > so, _technically_, set_tlb_flush_pending() can be nested, right?
IOW, > > set_tlb_flush_pending() > set_tlb_flush_pending() > flush_tlb_range() > clear_tlb_flush_pending() > clear_tlb_flush_pending() // if we miss this one, then > // ->tlb_flush_pending is !clear, > // even though we called > // clear_tlb_flush_pending() > > if so then set_ and clear_ are a bit misleading names for something > that does atomic_inc()/atomic_dec() internally. > > especially when one sees this part > >> - clear_tlb_flush_pending(mm); >> +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) >> + atomic_set(&mm->tlb_flush_pending, 0); >> +#endif > > so we have clear_tlb_flush_pending() function which probably should > set it to 0 as the name suggests (I see what you did tho), yet still > do atomic_set() under ifdef-s. > > well, just nitpicks. I see your point. Initially, I tried to keep exactly the same interface to reduce the number of changes, but it might be misleading. I will change the names (inc/dec_tlb_flush_pending). I will create init_tlb_flush_pending() for initialization since you ask, but Minchan's changes would likely remove the ifdef's, making it a bit strange for a single use. Anyhow, I'll wait to see if there are any other comments and then do it for v4. Thanks, Nadav -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Fri, 28 Jul 2017 10:42:04 +0900 From: Sergey Senozhatsky To: Nadav Amit Cc: linux-mm@kvack.org, sergey.senozhatsky@gmail.com, minchan@kernel.org, nadav.amit@gmail.com, mgorman@suse.de, riel@redhat.com, luto@kernel.org, stable@vger.kernel.org, Andrew Morton Subject: Re: [PATCH v3 1/2] mm: migrate: prevent racy access to tlb_flush_pending Message-ID: <20170728014204.GA26322@jagdpanzerIV.localdomain> References: <20170727114015.3452-1-namit@vmware.com> <20170727114015.3452-2-namit@vmware.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170727114015.3452-2-namit@vmware.com> Sender: owner-linux-mm@kvack.org List-ID: On (07/27/17 04:40), Nadav Amit wrote: [..] > --- a/mm/debug.c > +++ b/mm/debug.c > @@ -159,7 +159,7 @@ void dump_mm(const struct mm_struct *mm) > mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq, > #endif > #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) > - mm->tlb_flush_pending, > + atomic_read(&mm->tlb_flush_pending), > #endif can we use mm_tlb_flush_pending() here and get rid of ifdef-s? /* I understand that this is a -stable patch, so we can do it in a separate patch. */ -ss -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
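A minimal sketch of where the renaming discussed above might land, roughly what a v4 could look like. The names inc/dec/init_tlb_flush_pending come from the message above, but the exact placement and barriers here are assumptions, not the posted patch:

static inline void init_tlb_flush_pending(struct mm_struct *mm)
{
	atomic_set(&mm->tlb_flush_pending, 0);
}

static inline void inc_tlb_flush_pending(struct mm_struct *mm)
{
	atomic_inc(&mm->tlb_flush_pending);
	/* Keep the increment ordered before the PTE updates that follow. */
	smp_mb__before_spinlock();
}

static inline void dec_tlb_flush_pending(struct mm_struct *mm)
{
	/* Keep the PTE updates ordered before the decrement. */
	smp_mb__before_atomic();
	atomic_dec(&mm->tlb_flush_pending);
}

With such helpers, mm_init() could call init_tlb_flush_pending() instead of the open-coded atomic_set() under ifdefs, and dump_mm() could print mm_tlb_flush_pending(mm) to drop the ifdefs there, as suggested above.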
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Fri, 28 Jul 2017 10:34:23 +0900 From: Sergey Senozhatsky To: Nadav Amit Cc: linux-mm@kvack.org, sergey.senozhatsky@gmail.com, minchan@kernel.org, nadav.amit@gmail.com, mgorman@suse.de, riel@redhat.com, luto@kernel.org, stable@vger.kernel.org, Andrew Morton Subject: Re: [PATCH v3 1/2] mm: migrate: prevent racy access to tlb_flush_pending Message-ID: <20170728013423.GA358@jagdpanzerIV.localdomain> References: <20170727114015.3452-1-namit@vmware.com> <20170727114015.3452-2-namit@vmware.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170727114015.3452-2-namit@vmware.com> Sender: owner-linux-mm@kvack.org List-ID: just my 5 silly cents, On (07/27/17 04:40), Nadav Amit wrote: [..] > static inline void set_tlb_flush_pending(struct mm_struct *mm) > { > - mm->tlb_flush_pending = true; > + atomic_inc(&mm->tlb_flush_pending); > > /* > * Guarantee that the tlb_flush_pending store does not leak into the > @@ -544,7 +544,7 @@ static inline void set_tlb_flush_pending(struct mm_struct *mm) > static inline void clear_tlb_flush_pending(struct mm_struct *mm) > { > barrier(); > - mm->tlb_flush_pending = false; > + atomic_dec(&mm->tlb_flush_pending); > } so, _technically_, set_tlb_flush_pending() can be nested, right? IOW, set_tlb_flush_pending() set_tlb_flush_pending() flush_tlb_range() clear_tlb_flush_pending() clear_tlb_flush_pending() // if we miss this one, then // ->tlb_flush_pending is !clear, // even though we called // clear_tlb_flush_pending() if so then set_ and clear_ are a bit misleading names for something that does atomic_inc()/atomic_dec() internally. especially when one sees this part > - clear_tlb_flush_pending(mm); > +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) > + atomic_set(&mm->tlb_flush_pending, 0); > +#endif so we have clear_tlb_flush_pending() function which probably should set it to 0 as the name suggests (I see what you did tho), yet still do atomic_set() under ifdef-s. well, just nitpicks. -ss -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Nadav Amit To: CC: , , , , , , , Nadav Amit Subject: [PATCH v3 1/2] mm: migrate: prevent racy access to tlb_flush_pending Date: Thu, 27 Jul 2017 04:40:14 -0700 Message-ID: <20170727114015.3452-2-namit@vmware.com> In-Reply-To: <20170727114015.3452-1-namit@vmware.com> References: <20170727114015.3452-1-namit@vmware.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: From: Nadav Amit Setting and clearing mm->tlb_flush_pending can be performed by multiple threads, since mmap_sem may only be acquired for read in task_numa_work(). If this happens, tlb_flush_pending might be cleared while one of the threads still changes PTEs and batches TLB flushes. This can lead to the same race between migration and change_protection_range() that led to the introduction of tlb_flush_pending. The result of this race was data corruption, which means that this patch also addresses a theoretically possible data corruption. 
An actual data corruption was not observed, yet the race was was confirmed by adding assertion to check tlb_flush_pending is not set by two threads, adding artificial latency in change_protection_range() and using sysctl to reduce kernel.numa_balancing_scan_delay_ms. Fixes: 20841405940e ("mm: fix TLB flush race between migration, and change_protection_range") Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit Acked-by: Mel Gorman --- include/linux/mm_types.h | 8 ++++---- kernel/fork.c | 4 +++- mm/debug.c | 2 +- 3 files changed, 8 insertions(+), 6 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 45cdb27791a3..36f4ec589544 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -493,7 +493,7 @@ struct mm_struct { * can move process memory needs to flush the TLB when moving a * PROT_NONE or PROT_NUMA mapped page. */ - bool tlb_flush_pending; + atomic_t tlb_flush_pending; #endif struct uprobes_state uprobes_state; #ifdef CONFIG_HUGETLB_PAGE @@ -528,11 +528,11 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) static inline bool mm_tlb_flush_pending(struct mm_struct *mm) { barrier(); - return mm->tlb_flush_pending; + return atomic_read(&mm->tlb_flush_pending) > 0; } static inline void set_tlb_flush_pending(struct mm_struct *mm) { - mm->tlb_flush_pending = true; + atomic_inc(&mm->tlb_flush_pending); /* * Guarantee that the tlb_flush_pending store does not leak into the @@ -544,7 +544,7 @@ static inline void set_tlb_flush_pending(struct mm_struct *mm) static inline void clear_tlb_flush_pending(struct mm_struct *mm) { barrier(); - mm->tlb_flush_pending = false; + atomic_dec(&mm->tlb_flush_pending); } #else static inline bool mm_tlb_flush_pending(struct mm_struct *mm) diff --git a/kernel/fork.c b/kernel/fork.c index e53770d2bf95..d6bf35b1cf31 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -809,7 +809,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm_init_aio(mm); mm_init_owner(mm, p); mmu_notifier_mm_init(mm); - clear_tlb_flush_pending(mm); +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) + atomic_set(&mm->tlb_flush_pending, 0); +#endif #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS mm->pmd_huge_pte = NULL; #endif diff --git a/mm/debug.c b/mm/debug.c index db1cd26d8752..d70103bb4731 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -159,7 +159,7 @@ void dump_mm(const struct mm_struct *mm) mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq, #endif #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) - mm->tlb_flush_pending, + atomic_read(&mm->tlb_flush_pending), #endif mm->def_flags, &mm->def_flags ); -- 2.11.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 8A0FF6B04B0 for ; Thu, 27 Jul 2017 14:50:10 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id u199so157685026pgb.13 for ; Thu, 27 Jul 2017 11:50:10 -0700 (PDT) Received: from EX13-EDG-OU-001.vmware.com (ex13-edg-ou-001.vmware.com. 
[208.91.0.189]) by mx.google.com with ESMTPS id m3si8751825pgc.963.2017.07.27.11.50.09 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Thu, 27 Jul 2017 11:50:09 -0700 (PDT) From: Nadav Amit Subject: [PATCH v3 2/2] mm: migrate: fix barriers around tlb_flush_pending Date: Thu, 27 Jul 2017 04:40:15 -0700 Message-ID: <20170727114015.3452-3-namit@vmware.com> In-Reply-To: <20170727114015.3452-1-namit@vmware.com> References: <20170727114015.3452-1-namit@vmware.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: sergey.senozhatsky@gmail.com, minchan@kernel.org, nadav.amit@gmail.com, mgorman@suse.de, riel@redhat.com, luto@kernel.org, Nadav Amit Reading tlb_flush_pending while the page-table lock is taken does not require a barrier, since the lock/unlock already acts as a barrier. Removing the barrier in mm_tlb_flush_pending() to address this issue. However, migrate_misplaced_transhuge_page() calls mm_tlb_flush_pending() while the page-table lock is already released, which may present a problem on architectures with weak memory model (PPC). To deal with this case, a new parameter is added to mm_tlb_flush_pending() to indicate if it is read without the page-table lock taken, and calling smp_mb__after_unlock_lock() in this case. Signed-off-by: Nadav Amit --- arch/arm/include/asm/pgtable.h | 3 ++- arch/arm64/include/asm/pgtable.h | 3 ++- arch/x86/include/asm/pgtable.h | 2 +- include/linux/mm_types.h | 31 +++++++++++++++++++++++-------- mm/migrate.c | 2 +- 5 files changed, 29 insertions(+), 12 deletions(-) diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h index 1c462381c225..2e0608a8049d 100644 --- a/arch/arm/include/asm/pgtable.h +++ b/arch/arm/include/asm/pgtable.h @@ -223,7 +223,8 @@ static inline pte_t *pmd_page_vaddr(pmd_t pmd) #define pte_none(pte) (!pte_val(pte)) #define pte_present(pte) (pte_isset((pte), L_PTE_PRESENT)) #define pte_valid(pte) (pte_isset((pte), L_PTE_VALID)) -#define pte_accessible(mm, pte) (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte)) +#define pte_accessible(mm, pte) (mm_tlb_flush_pending(mm, true) ? \ + pte_present(pte) : pte_valid(pte)) #define pte_write(pte) (pte_isclear((pte), L_PTE_RDONLY)) #define pte_dirty(pte) (pte_isset((pte), L_PTE_DIRTY)) #define pte_young(pte) (pte_isset((pte), L_PTE_YOUNG)) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index c213fdbd056c..47f934d378ca 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -108,7 +108,8 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]; * remapped as PROT_NONE but are yet to be flushed from the TLB. */ #define pte_accessible(mm, pte) \ - (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte)) + (mm_tlb_flush_pending(mm, true) ? 
pte_present(pte) : \ + pte_valid_young(pte)) static inline pte_t clear_pte_bit(pte_t pte, pgprot_t prot) { diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index f5af95a0c6b8..da16793203dd 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -642,7 +642,7 @@ static inline bool pte_accessible(struct mm_struct *mm, pte_t a) return true; if ((pte_flags(a) & _PAGE_PROTNONE) && - mm_tlb_flush_pending(mm)) + mm_tlb_flush_pending(mm, true)) return true; return false; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 36f4ec589544..57ab8061a2c0 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -522,12 +522,21 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) /* * Memory barriers to keep this state in sync are graciously provided by * the page table locks, outside of which no page table modifications happen. - * The barriers below prevent the compiler from re-ordering the instructions - * around the memory barriers that are already present in the code. + * The barriers are used to ensure the order between tlb_flush_pending updates, + * which happen while the lock is not taken, and the PTE updates, which happen + * while the lock is taken, are serialized. */ -static inline bool mm_tlb_flush_pending(struct mm_struct *mm) +static inline bool mm_tlb_flush_pending(struct mm_struct *mm, bool pt_locked) { - barrier(); + /* + * mm_tlb_flush_pending() is safe if it is executed while the page-table + * lock is taken. But if the lock was already released, there does not + * seem to be a guarantee that a memory barrier. A memory barrier is + * therefore needed on architectures with weak memory models. + */ + if (!pt_locked) + smp_mb__after_unlock_lock(); + return atomic_read(&mm->tlb_flush_pending) > 0; } static inline void set_tlb_flush_pending(struct mm_struct *mm) @@ -535,19 +544,25 @@ static inline void set_tlb_flush_pending(struct mm_struct *mm) atomic_inc(&mm->tlb_flush_pending); /* - * Guarantee that the tlb_flush_pending store does not leak into the + * Guarantee that the tlb_flush_pending increase does not leak into the * critical section updating the page tables */ smp_mb__before_spinlock(); } -/* Clearing is done after a TLB flush, which also provides a barrier. */ + static inline void clear_tlb_flush_pending(struct mm_struct *mm) { - barrier(); + /* + * Guarantee that the tlb_flush_pending does not not leak into the + * critical section, since we must order the PTE change and changes to + * the pending TLB flush indication. We could have relied on TLB flush + * as a memory barrier, but this behavior is not clearly documented. + */ + smp_mb__before_atomic(); atomic_dec(&mm->tlb_flush_pending); } #else -static inline bool mm_tlb_flush_pending(struct mm_struct *mm) +static inline bool mm_tlb_flush_pending(struct mm_struct *mm, bool pt_locked) { return false; } diff --git a/mm/migrate.c b/mm/migrate.c index 89a0a1707f4c..169c3165be41 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1939,7 +1939,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, * We are not sure a pending tlb flush here is for a huge page * mapping or not. Hence use the tlb range variant */ - if (mm_tlb_flush_pending(mm)) + if (mm_tlb_flush_pending(mm, false)) flush_tlb_range(vma, mmun_start, mmun_end); /* Prepare a page as a migration target */ -- 2.11.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
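The pt_locked argument added above exists only to recover, for the one lockless caller, the ordering that the page-table lock otherwise provides. Written out as a separate helper purely for illustration (the name is hypothetical and this is not part of the posted series), the lockless path amounts to:

static inline bool tlb_flush_pending_lockless(struct mm_struct *mm)
{
	/*
	 * The caller has already dropped the page-table lock, so the
	 * lock/unlock pair no longer orders this read against the PTE
	 * updates made while the lock was held.  smp_mb__after_unlock_lock()
	 * upgrades the earlier unlock+lock sequence to a full barrier on
	 * architectures such as PPC where it is not one already, pairing
	 * with the barrier issued after atomic_inc(&mm->tlb_flush_pending)
	 * on the update side.
	 */
	smp_mb__after_unlock_lock();
	return atomic_read(&mm->tlb_flush_pending) > 0;
}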
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 968716B04AE for ; Thu, 27 Jul 2017 14:50:10 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id o82so13352976pfj.11 for ; Thu, 27 Jul 2017 11:50:10 -0700 (PDT) Received: from EX13-EDG-OU-001.vmware.com (ex13-edg-ou-001.vmware.com. [208.91.0.189]) by mx.google.com with ESMTPS id m3si8751825pgc.963.2017.07.27.11.50.09 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Thu, 27 Jul 2017 11:50:09 -0700 (PDT) From: Nadav Amit Subject: [PATCH v3 0/2] mm: fixes of tlb_flush_pending races Date: Thu, 27 Jul 2017 04:40:13 -0700 Message-ID: <20170727114015.3452-1-namit@vmware.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: sergey.senozhatsky@gmail.com, minchan@kernel.org, nadav.amit@gmail.com, mgorman@suse.de, riel@redhat.com, luto@kernel.org, Nadav Amit These two patches address tlb_flush_pending issues. The first one address a race when accessing tlb_flush_pending and is the important one. The second patch addresses Andrew Morton question regarding the barriers. This patch is not really related to the first one: the atomic operations atomic_read() and atomic_inc() do not act as a memory barrier, and replacing existing barriers with smp_mb__after_atomic() did not seem beneficial. Yet, while reviewing the memory barriers around the use of tlb_flush_pending, few issues were identified. v2 -> v3: - Do not init tlb_flush_pending if it is not defined without (Sergey) - Internalize memory barriers to mm_tlb_flush_pending (Minchan) v1 -> v2: - Explain the implications of the implications of the race (Andrew) - Mark the patch that address the race as stable (Andrew) - Add another patch to clean the use of barriers (Andrew) Nadav Amit (2): mm: migrate: prevent racy access to tlb_flush_pending mm: migrate: fix barriers around tlb_flush_pending arch/arm/include/asm/pgtable.h | 3 ++- arch/arm64/include/asm/pgtable.h | 3 ++- arch/x86/include/asm/pgtable.h | 2 +- include/linux/mm_types.h | 39 +++++++++++++++++++++++++++------------ kernel/fork.c | 4 +++- mm/debug.c | 2 +- mm/migrate.c | 2 +- 7 files changed, 37 insertions(+), 18 deletions(-) -- 2.11.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id 572FC280393 for ; Fri, 28 Jul 2017 03:43:03 -0400 (EDT) Received: by mail-wm0-f71.google.com with SMTP id g71so15326405wmg.13 for ; Fri, 28 Jul 2017 00:43:03 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id d19si16501263wrb.486.2017.07.28.00.43.01 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 28 Jul 2017 00:43:02 -0700 (PDT) Date: Fri, 28 Jul 2017 08:42:56 +0100 From: Mel Gorman Subject: Re: [PATCH v3 2/2] mm: migrate: fix barriers around tlb_flush_pending Message-ID: <20170728074256.7xsnoldtfuh7ywir@suse.de> References: <20170727114015.3452-1-namit@vmware.com> <20170727114015.3452-3-namit@vmware.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20170727114015.3452-3-namit@vmware.com> Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: linux-mm@kvack.org, sergey.senozhatsky@gmail.com, minchan@kernel.org, nadav.amit@gmail.com, riel@redhat.com, luto@kernel.org On Thu, Jul 27, 2017 at 04:40:15AM -0700, Nadav Amit wrote: > Reading tlb_flush_pending while the page-table lock is taken does not > require a barrier, since the lock/unlock already acts as a barrier. > Removing the barrier in mm_tlb_flush_pending() to address this issue. > > However, migrate_misplaced_transhuge_page() calls mm_tlb_flush_pending() > while the page-table lock is already released, which may present a > problem on architectures with weak memory model (PPC). To deal with this > case, a new parameter is added to mm_tlb_flush_pending() to indicate > if it is read without the page-table lock taken, and calling > smp_mb__after_unlock_lock() in this case. > > Signed-off-by: Nadav Amit Conditional locking based on function arguments are often considered extremely hazardous. Conditional barriers are even more troublesome because it's simply too easy to get wrong. Revert b0943d61b8fa420180f92f64ef67662b4f6cc493 instead of this patch. It's not a clean revert but conflicts are due to comment changes. It moves the check back under the PTL and the impact is marginal given that a spurious TLB flush will only occur when potentially racing with change_prot_range. Since that commit went in, a lot of changes have happened that alter the scan rate of automatic NUMA balancing so it shouldn't be a serious issue. It's certainly a nicer option than using conditional barriers. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 51C516B04AE for ; Thu, 27 Jul 2017 14:50:10 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id u199so157684976pgb.13 for ; Thu, 27 Jul 2017 11:50:10 -0700 (PDT) Received: from EX13-EDG-OU-001.vmware.com (ex13-edg-ou-001.vmware.com.
[208.91.0.189]) by mx.google.com with ESMTPS id m3si8751825pgc.963.2017.07.27.11.50.08 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Thu, 27 Jul 2017 11:50:08 -0700 (PDT) From: Nadav Amit Subject: [PATCH v3 1/2] mm: migrate: prevent racy access to tlb_flush_pending Date: Thu, 27 Jul 2017 04:40:14 -0700 Message-ID: <20170727114015.3452-2-namit@vmware.com> In-Reply-To: <20170727114015.3452-1-namit@vmware.com> References: <20170727114015.3452-1-namit@vmware.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: sergey.senozhatsky@gmail.com, minchan@kernel.org, nadav.amit@gmail.com, mgorman@suse.de, riel@redhat.com, luto@kernel.org, stable@vger.kernel.org, Nadav Amit From: Nadav Amit Setting and clearing mm->tlb_flush_pending can be performed by multiple threads, since mmap_sem may only be acquired for read in task_numa_work(). If this happens, tlb_flush_pending might be cleared while one of the threads still changes PTEs and batches TLB flushes. This can lead to the same race between migration and change_protection_range() that led to the introduction of tlb_flush_pending. The result of this race was data corruption, which means that this patch also addresses a theoretically possible data corruption. An actual data corruption was not observed, yet the race was was confirmed by adding assertion to check tlb_flush_pending is not set by two threads, adding artificial latency in change_protection_range() and using sysctl to reduce kernel.numa_balancing_scan_delay_ms. Fixes: 20841405940e ("mm: fix TLB flush race between migration, and change_protection_range") Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit Acked-by: Mel Gorman --- include/linux/mm_types.h | 8 ++++---- kernel/fork.c | 4 +++- mm/debug.c | 2 +- 3 files changed, 8 insertions(+), 6 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 45cdb27791a3..36f4ec589544 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -493,7 +493,7 @@ struct mm_struct { * can move process memory needs to flush the TLB when moving a * PROT_NONE or PROT_NUMA mapped page. 
*/ - bool tlb_flush_pending; + atomic_t tlb_flush_pending; #endif struct uprobes_state uprobes_state; #ifdef CONFIG_HUGETLB_PAGE @@ -528,11 +528,11 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) static inline bool mm_tlb_flush_pending(struct mm_struct *mm) { barrier(); - return mm->tlb_flush_pending; + return atomic_read(&mm->tlb_flush_pending) > 0; } static inline void set_tlb_flush_pending(struct mm_struct *mm) { - mm->tlb_flush_pending = true; + atomic_inc(&mm->tlb_flush_pending); /* * Guarantee that the tlb_flush_pending store does not leak into the @@ -544,7 +544,7 @@ static inline void set_tlb_flush_pending(struct mm_struct *mm) static inline void clear_tlb_flush_pending(struct mm_struct *mm) { barrier(); - mm->tlb_flush_pending = false; + atomic_dec(&mm->tlb_flush_pending); } #else static inline bool mm_tlb_flush_pending(struct mm_struct *mm) diff --git a/kernel/fork.c b/kernel/fork.c index e53770d2bf95..d6bf35b1cf31 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -809,7 +809,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm_init_aio(mm); mm_init_owner(mm, p); mmu_notifier_mm_init(mm); - clear_tlb_flush_pending(mm); +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) + atomic_set(&mm->tlb_flush_pending, 0); +#endif #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS mm->pmd_huge_pte = NULL; #endif diff --git a/mm/debug.c b/mm/debug.c index db1cd26d8752..d70103bb4731 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -159,7 +159,7 @@ void dump_mm(const struct mm_struct *mm) mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq, #endif #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) - mm->tlb_flush_pending, + atomic_read(&mm->tlb_flush_pending), #endif mm->def_flags, &mm->def_flags ); -- 2.11.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 1508D6B0561 for ; Fri, 28 Jul 2017 12:40:30 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id v62so243816261pfd.10 for ; Fri, 28 Jul 2017 09:40:30 -0700 (PDT) Received: from mail-pg0-x241.google.com (mail-pg0-x241.google.com. 
[2607:f8b0:400e:c05::241]) by mx.google.com with ESMTPS id d6si1631247plj.353.2017.07.28.09.40.28 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 28 Jul 2017 09:40:28 -0700 (PDT) Received: by mail-pg0-x241.google.com with SMTP id k190so8354383pgk.4 for ; Fri, 28 Jul 2017 09:40:28 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: [PATCH v3 2/2] mm: migrate: fix barriers around tlb_flush_pending From: Nadav Amit In-Reply-To: <20170728074256.7xsnoldtfuh7ywir@suse.de> Date: Fri, 28 Jul 2017 09:40:24 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <4F474FD8-82BE-4881-AE47-40AA6A4091C1@gmail.com> References: <20170727114015.3452-1-namit@vmware.com> <20170727114015.3452-3-namit@vmware.com> <20170728074256.7xsnoldtfuh7ywir@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: "open list:MEMORY MANAGEMENT" , Sergey Senozhatsky , Minchan Kim , Rik van Riel , luto@kernel.org Mel Gorman wrote: > On Thu, Jul 27, 2017 at 04:40:15AM -0700, Nadav Amit wrote: >> Reading tlb_flush_pending while the page-table lock is taken does not >> require a barrier, since the lock/unlock already acts as a barrier. >> Removing the barrier in mm_tlb_flush_pending() to address this issue. >> >> However, migrate_misplaced_transhuge_page() calls mm_tlb_flush_pending() >> while the page-table lock is already released, which may present a >> problem on architectures with weak memory model (PPC). To deal with this >> case, a new parameter is added to mm_tlb_flush_pending() to indicate >> if it is read without the page-table lock taken, and calling >> smp_mb__after_unlock_lock() in this case. >> >> Signed-off-by: Nadav Amit > > Conditional locking based on function arguments are often considered > extremely hazardous. Conditional barriers are even more troublesome because > it's simply too easy to get wrong. > > Revert b0943d61b8fa420180f92f64ef67662b4f6cc493 instead of this patch. It's > not a clean revert but conflicts are due to comment changes. It moves > the check back under the PTL and the impact is marginal given that > a spurious TLB flush will only occur when potentially racing with > change_prot_range. Since that commit went in, a lot of changes have happened > that alter the scan rate of automatic NUMA balancing so it shouldn't be a > serious issue. It's certainly a nicer option than using conditional barriers. Ok. Initially, I added a memory barrier only in migrate_misplaced_transhuge_page(), and included a detailed comment about it - I still think it is better. But since you feel confident the impact will be relatively small, I will do the revert. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
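For completeness, the alternative Nadav describes above, a barrier with a detailed comment only at the lockless call site, might have looked roughly like the sketch below. It is reconstructed from the description in this thread (the helper name is invented, the one-argument mm_tlb_flush_pending() form is assumed), not taken from any posted patch; the thread instead ends with the revert of b0943d61b8fa.

static void thp_migration_flush_if_pending(struct vm_area_struct *vma,
					   struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end)
{
	/*
	 * Hypothetical call-site-only barrier: the page-table lock was
	 * dropped earlier in migrate_misplaced_transhuge_page(), so it no
	 * longer orders this read of tlb_flush_pending against the PTE
	 * updates that a concurrent change_protection_range() makes under
	 * the lock.  A full barrier is needed on weakly ordered
	 * architectures; it pairs with the barrier on the increment side.
	 */
	smp_mb();
	if (mm_tlb_flush_pending(mm))
		flush_tlb_range(vma, start, end);
}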