From: Nadav Amit <namit@vmware.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>, Ingo Molnar <mingo@kernel.org>,
Stephen Rothwell <sfr@canb.auug.org.au>,
Andrew Morton <akpm@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
"H. Peter Anvin" <hpa@zytor.com>,
Linux-Next Mailing List <linux-next@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linus <torvalds@linux-foundation.org>
Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree
Date: Tue, 15 Aug 2017 07:51:57 +0000
Message-ID: <1138ED5D-AA95-48D0-86D4-75F20DFE0E0B@vmware.com>
In-Reply-To: <20170814193828.GN6524@worktop.programming.kicks-ass.net>
Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Aug 14, 2017 at 05:07:19AM +0000, Nadav Amit wrote:
>>>> So I'm not entirely clear about this yet.
>>>>
>>>> How about:
>>>>
>>>>
>>>> CPU0                            CPU1
>>>>
>>>> tlb_gather_mmu()
>>>>
>>>> lock PTLn
>>>> no mod
>>>> unlock PTLn
>>>>
>>>>                                 tlb_gather_mmu()
>>>>
>>>>                                 lock PTLm
>>>>                                 mod
>>>>                                 include in tlb range
>>>>                                 unlock PTLm
>>>>
>>>> lock PTLn
>>>> mod
>>>> unlock PTLn
>>>>
>>>>                                 tlb_finish_mmu()
>>>>                                   force = mm_tlb_flush_nested(tlb->mm);
>>>>                                   arch_tlb_finish_mmu(force);
>>>>
>>>>
>>>> ... more ...
>>>>
>>>> tlb_finish_mmu()
>>>>
>>>>
>>>>
>>>> In this case you also want CPU1's mm_tlb_flush_nested() call to return
>>>> true, right?
>>>
>>> No, because CPU 1 modified pte and added it into tlb range
>>> so regardless of nested, it will flush TLB so there is no stale
>>> TLB problem.
>
>> To clarify: the main problem that these patches address is when the first
>> CPU updates the PTE, and second CPU sees the updated value and thinks: “the
>> PTE is already what I wanted - no flush is needed”.
>
> OK, that simplifies things.
>
>> For some reason (I would assume intentional), all the examples here first
>> “do not modify” the PTE, and then modify it - which is not an “interesting”
>> case.
>
> Depends on what you call 'interesting' :-) They are 'interesting' to
> make work from a memory ordering POV. And since I didn't get they were
> excluded from the set, I worried.
>
> In fact, if they were to be included, I couldn't make it work at all. So
> I'm really glad to hear we can disregard them.
>
>> However, based on what I understand on the memory barriers, I think
>> there is indeed a missing barrier before reading it in
>> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,
>> before reading, would solve the problem with least impact on systems with
>> strong memory ordering.
>
> No, all is well. If, as you say, we're naturally constrained to the case
> where we only care about prior modification we can rely on the RCpc PTL
> locks.
>
> Consider:
>
>
> CPU0                            CPU1
>
>                                 tlb_gather_mmu()
>
> tlb_gather_mmu()
>   inc --------.
>               | (inc is constrained by RELEASE)
> lock PTLn     |
> mod           ^
> unlock PTLn ------------------> lock PTLn
>               v                 no mod
>               |                 unlock PTLn
>               |
>               |                 lock PTLm
>               |                 mod
>               |                 include in tlb range
>               |                 unlock PTLm
>               |
> (read is      |
>  constrained  |
>  by ACQUIRE)  |
>               |                 tlb_finish_mmu()
>               `----------------   force = mm_tlb_flush_nested(tlb->mm);
>                                   arch_tlb_finish_mmu(force);
>
>
> ... more ...
>
> tlb_finish_mmu()
>
>
> Then CPU1's acquire of PTLn orders against CPU0's release of that same
> PTLn which guarantees we observe both its (prior) modified PTE and the
> mm->tlb_flush_pending increment from tlb_gather_mmu().
>
> So all we need for mm_tlb_flush_nested() to work is having acquired the
> right PTL at least once before calling it.
>
> At the same time, the decrements need to be after the TLB invalidate is
> complete, this ensures that _IF_ we observe the decrement, we must've
> also observed the corresponding invalidate.
>
> Something like the below is then sufficient.
>
> ---
> Subject: mm: Clarify tlb_flush_pending barriers
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 11 Aug 2017 16:04:50 +0200
>
> Better document the ordering around tlb_flush_pending.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> include/linux/mm_types.h | 78 +++++++++++++++++++++++++++--------------------
> 1 file changed, 45 insertions(+), 33 deletions(-)
>
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
> extern void tlb_finish_mmu(struct mmu_gather *tlb,
> unsigned long start, unsigned long end);
>
> -/*
> - * Memory barriers to keep this state in sync are graciously provided by
> - * the page table locks, outside of which no page table modifications happen.
> - * The barriers are used to ensure the order between tlb_flush_pending updates,
> - * which happen while the lock is not taken, and the PTE updates, which happen
> - * while the lock is taken, are serialized.
> - */
> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> -{
> - /*
> - * Must be called with PTL held; such that our PTL acquire will have
> - * observed the store from set_tlb_flush_pending().
> - */
> - return atomic_read(&mm->tlb_flush_pending) > 0;
> -}
> -
> -/*
> - * Returns true if there are two above TLB batching threads in parallel.
> - */
> -static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> -{
> - return atomic_read(&mm->tlb_flush_pending) > 1;
> -}
> -
> static inline void init_tlb_flush_pending(struct mm_struct *mm)
> {
> atomic_set(&mm->tlb_flush_pending, 0);
> @@ -558,7 +534,6 @@ static inline void init_tlb_flush_pendin
> static inline void inc_tlb_flush_pending(struct mm_struct *mm)
> {
> atomic_inc(&mm->tlb_flush_pending);
> -
> /*
> * The only time this value is relevant is when there are indeed pages
> * to flush. And we'll only flush pages after changing them, which
> @@ -580,24 +555,61 @@ static inline void inc_tlb_flush_pending
> * flush_tlb_range();
> * atomic_dec(&mm->tlb_flush_pending);
> *
> - * So the =true store is constrained by the PTL unlock, and the =false
> - * store is constrained by the TLB invalidate.
> + * Where the increment is constrained by the PTL unlock, it thus
> + * ensures that the increment is visible if the PTE modification is
> + * visible. After all, if there is no PTE modification, nobody cares
> + * about TLB flushes either.
> + *
> + * This very much relies on users (mm_tlb_flush_pending() and
> + * mm_tlb_flush_nested()) only caring about _specific_ PTEs (and
> + * therefore specific PTLs), because with SPLIT_PTE_PTLOCKS and RCpc
> + * locks (PPC) the unlock of one doesn't order against the lock of
> + * another PTL.
> + *
> + * The decrement is ordered by the flush_tlb_range(), such that
> + * mm_tlb_flush_pending() will not return false unless all flushes have
> + * completed.
> */
> }
>
> -/* Clearing is done after a TLB flush, which also provides a barrier. */
> static inline void dec_tlb_flush_pending(struct mm_struct *mm)
> {
> /*
> - * Guarantee that the tlb_flush_pending does not not leak into the
> - * critical section, since we must order the PTE change and changes to
> - * the pending TLB flush indication. We could have relied on TLB flush
> - * as a memory barrier, but this behavior is not clearly documented.
> + * See inc_tlb_flush_pending().
> + *
> + * This cannot be smp_mb__before_atomic() because smp_mb() simply does
> + * not order against TLB invalidate completion, which is what we need.
> + *
> + * Therefore we must rely on tlb_flush_*() to guarantee order.
> */
> - smp_mb__before_atomic();
> atomic_dec(&mm->tlb_flush_pending);
> }
>
> +static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> +{
> + /*
> + * Must be called after having acquired the PTL; orders against that
> + * PTLs release and therefore ensures that if we observe the modified
> + * PTE we must also observe the increment from inc_tlb_flush_pending().
> + *
> + * That is, it only guarantees to return true if there is a flush
> + * pending for _this_ PTL.
> + */
> + return atomic_read(&mm->tlb_flush_pending);
> +}
> +
> +static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> +{
> + /*
> + * Similar to mm_tlb_flush_pending(), we must have acquired the PTL
> + * for which there is a TLB flush pending in order to guarantee
> + * we've seen both that PTE modification and the increment.
> + *
> + * (no requirement on actually still holding the PTL, that is irrelevant)
> + */
> + return atomic_read(&mm->tlb_flush_pending) > 1;
> +}
> +
> struct vm_fault;
>
> struct vm_special_mapping {
Thanks for the detailed explanation. I will pay more attention next time.
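
For anyone following the ordering argument quoted above, here is a minimal
userspace sketch of it. It is only a model under stated assumptions, not the
kernel code: a pthread mutex stands in for one PTL, a C11 atomic_int stands
in for mm->tlb_flush_pending, and the names model_mm, cpu0_modify(),
cpu1_check() and run_model() are all hypothetical. It shows that once the
second thread has acquired the same lock the first thread released, it cannot
see the modified "PTE" without also seeing the increment, even though both
counter accesses are relaxed and the read happens after the lock is dropped.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these names are kernel APIs. */
struct model_mm {
	atomic_int tlb_flush_pending;	/* models mm->tlb_flush_pending */
	pthread_mutex_t ptl;		/* models one page table lock (PTLn) */
	int pte;			/* models one PTE; accessed only under ptl */
};

/* CPU0: increment first, then modify the PTE under the PTL. */
static void *cpu0_modify(void *arg)
{
	struct model_mm *mm = arg;

	atomic_fetch_add_explicit(&mm->tlb_flush_pending, 1,
				  memory_order_relaxed);
	pthread_mutex_lock(&mm->ptl);
	mm->pte = 1;
	pthread_mutex_unlock(&mm->ptl);	/* release: publishes pte and the inc */
	return NULL;
}

/*
 * CPU1: take the same PTL, sample the PTE, drop the lock, then read the
 * counter. Returns false only for the forbidden outcome: modified PTE
 * observed, increment not observed.
 */
static bool cpu1_check(struct model_mm *mm)
{
	int pte, pending;

	pthread_mutex_lock(&mm->ptl);	/* acquire: pairs with CPU0's unlock */
	pte = mm->pte;
	pthread_mutex_unlock(&mm->ptl);

	/* The PTL need not still be held here, only acquired beforehand. */
	pending = atomic_load_explicit(&mm->tlb_flush_pending,
				       memory_order_relaxed);
	return pte == 0 || pending > 0;
}

/* Race the two sides 'rounds' times; returns the number of violations. */
static int run_model(int rounds)
{
	int violations = 0;

	for (int i = 0; i < rounds; i++) {
		struct model_mm mm = { .pte = 0 };
		pthread_t t;

		atomic_init(&mm.tlb_flush_pending, 0);
		pthread_mutex_init(&mm.ptl, NULL);
		if (pthread_create(&t, NULL, cpu0_modify, &mm) != 0)
			return -1;
		if (!cpu1_check(&mm))
			violations++;
		pthread_join(t, NULL);
		pthread_mutex_destroy(&mm.ptl);
	}
	return violations;
}
```

Built with -pthread, run_model(1000) should report zero violations. The model
deliberately omits the decrement side, since ordering against TLB invalidate
completion has no userspace equivalent.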