linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Andrew Lutomirski <luto@kernel.org>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	Borislav Petkov <bp@alien8.de>, Will Deacon <will.deacon@arm.com>,
	Rik van Riel <riel@surriel.com>, Jann Horn <jannh@google.com>,
	Adin Scannell <ascannell@google.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>, David Miller <davem@davemloft.net>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Michael Ellerman <mpe@ellerman.id.au>
Subject: Re: [PATCH 2/4] mm/tlb: Remove tlb_remove_table() non-concurrent condition
Date: Wed, 22 Aug 2018 21:54:48 -0700	[thread overview]
Message-ID: <CA+55aFzfnWv3JoB0mR7iCX322KsiE+uRq3HcmOpciEAiTw-oLw@mail.gmail.com> (raw)
In-Reply-To: <20180823141642.38b53175@roar.ozlabs.ibm.com>

On Wed, Aug 22, 2018 at 9:16 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> On Wed, 22 Aug 2018 20:35:16 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > And yes, those lazy tlbs are all kernel threads, but they can still
> > speculatively load user addresses.
>
> So?
>
> If the arch does not shoot those all down after the user page tables
> are removed then it's buggy regardless of this short cut.

Read the code.

That shortcut frees the pages *WITHOUT* doing that TLB flush. It just
does __tlb_remove_table(), which does *not* do that whole page
queueing so that we can batch flush and then free later.

So the fast-case it's buggy, exactly because of the reason you state.

The logic used to be that if you were the only cpu that had that tlb,
and it was a mid-level table, it didn't need that synchronization at
all.

And that logic is simply wrong. Exactly because even if mm_users is 1,
there can be things looking up TLB entries on other CPU's. Either
because of a lazy mm and a hw walker with speculation, or because of
use_mm() and a software walker.

So the whole "free immediately" shortcut was bogus. You *always* need
to queue, then flush the tlb, and then free.

That said, that wasn't actually the bug that Jann found. He found the
bug a few lines lower down, where the "I can't allocate memory for
queueing" ended up *also* not flushing the TLB.

> The only real problem I could see would be if a page walk cache still
> points to the freed table, then the table gets re-allocated and used
> elsewhere, and meanwhile a speculative access tries to load an entry
> from the page that is an invalid form of page table that might cause
> a machine check or something. That would be (u)arch specific, but if
> that's what we're concerned with here it's a different issue and needs
> to be documented as such.

We've *seen* that case, we had exactly that when we were being
aggressive about trying to avoid flushes for the lazy mmu case
entirely, because "we can just flush when we activate the lazy mm
state instead".

The logic was actually solid from a TLB case - who cares what garbage
TLB entries we might speculatively fill, when we're going to flush
them before they can possibly be used.

It turns out that that logic is solid, but hits another problem: at
least some AMD CPU's will cause machine checks when the TLB entries
are inconsistent with the machine setup. That can't happen with a
*good* page table, but when you speculatively walk already freed
tables, you can get any garbage.

I forget what the exact trigger was, but this was actually reported.
So you can't free page directory pages without flushing the tlb first
(to make that internal tlb node caches are flushed).

So the logic for freeing leaf pages and freeing middle nodes has to be
exactly the same: you make the modification to the page table to
remove the node/leaf, you queue up the memory for freeing, you flush
the tlb, and you then free the page.

That's the ordering that tlb_remove_page() has always honored, but
that tlb_remove_tabl() didn't.

It honored it for the *normal* case, which is why it took so long to
notice that the TLB shootdown had been broken on x86 when it moved to
the "generic" code. The *normal* case does this all right, and batches
things up, and then when the batch fills up it does a
tlb_table_flush() which does the TLB flush and schedules the actual
freeing.

But there were two cases that *didn't* do that. The special "I'm the
only thread" fast case, and the "oops I ran out of memory, so now I'll
fake it, and just synchronize with page twalkers manually, and then do
that special direct remove without flushing the tlb".

NOTE! Jann triggered that one by
 (a) forcing the machine low on memory
 (b) force-poisoning the page tables immediately after free

I suspect it's really hard to trigger under normal loads, exactly
because the *normal* case gets it right. It's literally the "oops, I
can't batch up because I ran out of memory" case that gets it wrong.

(And the special single-thread + lazy or use_mm() case, but that's
going to be entirely impossible to trigger, because in practice it's
just a single thread, and you won't ever hit the magic timing needed
that frees the page in the single thread at exactly the same time that
some optimistic lazy mm on another cpu happens to speculatively load
that address).

So the "atomic_read(&mm_users)" case is likely entirely impossible to
trigger any sane way. But because Jann found the problem with the 'ran
out of memory" case, we started looking at the more theoretical cases
that matched the same kind of "no tlb flush before free" pattern.

              Linus

  reply	other threads:[~2018-08-23  4:56 UTC|newest]

Thread overview: 96+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-22 15:30 [PATCH 0/4] x86: TLB invalidate fixes Peter Zijlstra
2018-08-22 15:30 ` [PATCH 1/4] x86/mm/tlb: Revert the recent lazy TLB patches Peter Zijlstra
2018-08-22 21:37   ` Linus Torvalds
2018-08-22 22:11     ` Rik van Riel
2018-08-22 15:30 ` [PATCH 2/4] mm/tlb: Remove tlb_remove_table() non-concurrent condition Peter Zijlstra
2018-08-23  3:31   ` Nicholas Piggin
2018-08-23  3:35     ` Linus Torvalds
2018-08-23  3:44       ` Linus Torvalds
2018-08-23  4:16       ` Nicholas Piggin
2018-08-23  4:54         ` Linus Torvalds [this message]
2018-08-23  5:15           ` Nicholas Piggin
2018-08-24  8:42           ` Peter Zijlstra
2018-08-23 13:40   ` Will Deacon
2018-08-22 15:30 ` [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE Peter Zijlstra
2018-08-22 15:55   ` Peter Zijlstra
2018-08-23  3:45     ` Nicholas Piggin
2018-08-23  3:59       ` Linus Torvalds
2018-08-23  4:33         ` Nicholas Piggin
2018-08-23  5:03           ` Linus Torvalds
2018-08-23  5:58             ` Nicholas Piggin
2018-08-23  4:54         ` Benjamin Herrenschmidt
2018-08-23  5:11           ` Linus Torvalds
2018-08-23  5:20             ` Linus Torvalds
2018-08-23  6:48               ` Martin Schwidefsky
2018-08-23  5:21             ` Benjamin Herrenschmidt
2018-08-23  6:15               ` Nicholas Piggin
2018-08-23 13:39             ` Will Deacon
2018-08-24  8:47               ` Peter Zijlstra
2018-08-24 11:32                 ` Peter Zijlstra
2018-08-24 11:39                   ` Peter Zijlstra
2018-08-27  5:00                     ` Nicholas Piggin
2018-08-27  7:47                       ` Peter Zijlstra
2018-08-27  8:04                         ` Nicholas Piggin
2018-08-27  8:09                           ` Benjamin Herrenschmidt
2018-08-27  8:20                             ` Peter Zijlstra
2018-08-27  8:54                               ` Nicholas Piggin
2018-08-27  9:02                             ` Nicholas Piggin
2018-08-27 22:13                               ` Benjamin Herrenschmidt
2018-08-27 13:36                           ` Rik van Riel
2018-08-27 14:29                             ` Nicholas Piggin
2018-08-27  8:57                         ` removig ia64, was: " Christoph Hellwig
2018-08-27 11:28                           ` Peter Zijlstra
2018-08-27 11:45                           ` Jason Duerstock
2018-08-27 11:00                         ` Peter Zijlstra
2018-08-30  0:13                           ` Vineet Gupta
2018-08-30 10:23                             ` Peter Zijlstra
2018-08-24 17:26                 ` Nadav Amit
2018-08-24 18:04                   ` Peter Zijlstra
2018-08-24 18:35                     ` TLB flushes on fixmap changes Nadav Amit
2018-08-24 19:31                       ` Linus Torvalds
2018-08-24 20:24                         ` Nadav Amit
2018-08-25  0:58                           ` Linus Torvalds
2018-08-25  2:29                             ` nadav.amit
2018-08-25  4:23                               ` Andy Lutomirski
2018-08-26  2:23                                 ` Masami Hiramatsu
2018-08-26  4:21                                   ` Andy Lutomirski
2018-08-26  4:43                                     ` Kees Cook
2018-08-26  5:53                                       ` Nadav Amit
2018-08-26 14:20                                       ` Andy Lutomirski
2018-08-26 16:47                                         ` Kees Cook
2018-08-26 17:25                                           ` Andy Lutomirski
2018-08-26 20:15                                             ` Thomas Gleixner
2018-08-26 22:03                                               ` Kees Cook
2018-08-26 22:15                                                 ` Matthew Wilcox
2018-08-26 22:29                                                 ` Jann Horn
2018-08-26  9:09                                     ` Peter Zijlstra
2018-08-27  3:03                                       ` Masami Hiramatsu
2018-08-27  3:26                                         ` Nadav Amit
2018-08-27  8:05                                           ` Masami Hiramatsu
2018-08-27 17:34                                             ` Nadav Amit
2018-08-27 18:45                                               ` Andy Lutomirski
2018-08-27 18:54                                                 ` Nadav Amit
2018-08-27 18:58                                                   ` Andy Lutomirski
2018-08-27 19:10                                                     ` Nadav Amit
2018-08-27 19:43                                                       ` Nadav Amit
2018-08-27 19:58                                                         ` Andy Lutomirski
2018-08-27 20:16                                                           ` Nadav Amit
2018-08-27 21:55                                                             ` Nadav Amit
2018-08-27 22:32                                                               ` Andy Lutomirski
2018-08-27 22:54                                                                 ` Nadav Amit
2018-08-27 23:01                                                                   ` Andy Lutomirski
2018-08-28  8:49                                                                     ` Masami Hiramatsu
2018-08-28 17:33                                                                       ` Nadav Amit
2018-08-27  8:13                                         ` Peter Zijlstra
2018-08-27  9:39                                           ` Masami Hiramatsu
2018-08-27  9:55                                           ` Jann Horn
2018-08-26 22:48                                     ` Jann Horn
2018-08-24  8:35           ` [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE Peter Zijlstra
2018-08-24 13:13             ` Peter Zijlstra
2018-08-24 13:14               ` Peter Zijlstra
2018-08-24 15:49               ` Will Deacon
2018-08-23 23:31     ` Will Deacon
2018-08-22 21:34   ` Linus Torvalds
2018-08-23  8:46   ` Nicholas Piggin
2018-08-22 15:30 ` [PATCH 4/4] x86/mm: Only use tlb_remove_table() for paravirt Peter Zijlstra
2018-08-22 22:12   ` Eduardo Valentin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+55aFzfnWv3JoB0mR7iCX322KsiE+uRq3HcmOpciEAiTw-oLw@mail.gmail.com \
    --to=torvalds@linux-foundation.org \
    --cc=ascannell@google.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@intel.com \
    --cc=davem@davemloft.net \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=npiggin@gmail.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=schwidefsky@de.ibm.com \
    --cc=will.deacon@arm.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).