All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jann Horn <jannh@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	John Hubbard <jhubbard@nvidia.com>,
	x86@kernel.org, willy@infradead.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	aarcange@redhat.com, kirill.shutemov@linux.intel.com,
	jroedel@suse.de, ubizjak@gmail.com
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
Date: Mon, 24 Oct 2022 22:23:51 +0200	[thread overview]
Message-ID: <CAG48ez3KMuM2aLpc8SvGiqZvv1KFc7E2DBZ4CLzxQeFhrMN_-w@mail.gmail.com> (raw)
In-Reply-To: <CAHk-=wjoGnHVVp5pcb69C6HtnJ_avqQo_b5NK-X9bgoDDw_XbA@mail.gmail.com>

On Mon, Oct 24, 2022 at 10:19 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Oct 24, 2022 at 12:58 PM Jann Horn <jannh@google.com> wrote:
> >
> > Unless I'm completely misunderstanding what's going on here, the whole
> > "remove_table" thing only happens when you "remove a table", meaning
> > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > logic.
>
> I do have to admit that I'd be happier if this code - and the GUP code
> that also relies on "interrupts off" behavior - would just use a
> sequence counter instead.
>
> Relying on blocking IPI's is clever, but also clearly very subtle and
> somewhat dangerous.
>
> I think our GUP code is a *lot* more important than some "legacy
> x86-32 has problems in case you have an incredibly unlikely race that
> re-populates the page table with a different page that just happens to
> be exactly the same MOD-4GB", so honestly, I don't think the
> load-tearing is even worth worrying about - if you have hardware that
> is good enough at virtualizing things, it's almost certainly already
> 64-bit, and running 32-bit virtual machines with PAE you really only
> have yourself to blame.
>
> So I can't find it in myself to care about the 32-bit tearing thing,
> but this discussion makes me worried about Fast GUP.
>
> Note that even with proper atomic
>
>                 pte_t pte = ptep_get_lockless(ptep);
>
> in gup_pte_range(), and even if the page tables are RCU-free'd, that
> just means that the 'ptep' access itself is safe.
>
> But then you have the whole "the lookup of the page pointer is not
> atomic" wrt that. And right now that GUP code does rely on the "block
> IPI" to make it basically valid.
>
> I don't think it matters if GUP races with munmap or madvise() or
> something like that - if you get the old page, that's still a valid
> page, and the user only has himself to blame.
>
> But if we have memory pressure that causes vmscan to push out a page,
> and it gets replaced with a new page, and GUP gets the old page with
> no serialization, that sounds like a possible source of data
> inconsistency.
>
> I don't know if this can happen, but the whole "interrupts disabled
> doesn't actually block IPI's and synchronize with TLB flushes" really
> sounds like it would affect GUP too. And be much more serious there
> than on some x86-32 platform that nobody should be using anyway.

That's why GUP-fast re-checks the PTE after it has grabbed a reference
to the page. Pasting from a longer writeup that I plan to publish
soon, describing how gup_pte_range() works:

"""
This guarantees that the page tables that are being walked
aren't freed concurrently, but at the end of the walk, we
have to grab a stable reference to the referenced page.
For this we use the grab-reference-and-revalidate trick
from above again: First we (locklessly) load the page
table entry, then we grab a reference to the page that it
points to (which can fail if the refcount is zero, in that
case we bail), then we recheck that the page table entry
is still the same, and if it changed in between, we drop the
page reference and bail.
This can, again, grab a reference to a page after it has
already been freed and reallocated. The reason why this is
fine is that the metadata structure that holds this refcount,
`struct folio` (or `struct page`, depending on which kernel
version you're looking at; in current kernels it's `folio`
but `struct page` and `struct folio` are actually aliases for
the same memory, basically, though that is supposed to maybe
change at some point) is never freed; even when a page is
freed and reallocated, the corresponding `struct folio`
stays. This does have the fun consequence that whenever a
page/folio has a non-zero refcount, the refcount can
spuriously go up and then back down for a little bit.
(Also it's technically not as simple as I just described it,
because the `struct page` that the PTE points to might be
a "tail page" of a `struct folio`.
So actually we first read the PTE, the PTE gives us the
`page*`, then from that we go to the `folio*`, then we
try to grab a reference to the `folio`, then if that worked
we check that the `page` still points to the same `folio`,
and then we recheck that the PTE is still the same.)
"""

  reply	other threads:[~2022-10-24 22:12 UTC|newest]

Thread overview: 148+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
2022-10-22 11:14 ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
2022-10-24  5:42   ` John Hubbard
2022-10-24  8:00     ` Peter Zijlstra
2022-10-24 19:58       ` Jann Horn
2022-10-24 20:19         ` Linus Torvalds
2022-10-24 20:23           ` Jann Horn [this message]
2022-10-24 20:36             ` Linus Torvalds
2022-10-25  3:21             ` Matthew Wilcox
2022-10-25  7:54               ` Alistair Popple
2022-10-25 13:33                 ` Peter Zijlstra
2022-10-25 13:44                 ` Jann Horn
2022-10-26  0:45                   ` Alistair Popple
2022-10-25 14:02         ` Peter Zijlstra
2022-10-25 14:18           ` Jann Horn
2022-10-25 15:06             ` Peter Zijlstra
2022-10-26 16:45               ` Jann Horn
2022-10-27  7:08                 ` Peter Zijlstra
2022-10-27 18:13                   ` Linus Torvalds
2022-10-27 19:35                     ` Peter Zijlstra
2022-10-27 19:43                       ` Linus Torvalds
2022-10-27 20:15                     ` Nadav Amit
2022-10-27 20:31                       ` Linus Torvalds
2022-10-27 21:44                         ` Nadav Amit
2022-10-28 23:57                           ` Nadav Amit
2022-10-29  0:42                             ` Linus Torvalds
2022-10-29 18:05                               ` Nadav Amit
2022-10-29 18:36                                 ` Linus Torvalds
2022-10-29 18:58                                   ` Linus Torvalds
2022-10-29 19:14                                     ` Linus Torvalds
2022-10-29 19:28                                       ` Nadav Amit
2022-10-30  0:18                                       ` Nadav Amit
2022-10-30  2:17                                     ` Nadav Amit
2022-10-30 18:19                                       ` Linus Torvalds
2022-10-30 18:51                                         ` Linus Torvalds
2022-10-30 22:47                                           ` Linus Torvalds
2022-10-31  1:47                                             ` Linus Torvalds
2022-10-31  4:09                                               ` Nadav Amit
2022-10-31  4:55                                                 ` Nadav Amit
2022-10-31  5:00                                                 ` Linus Torvalds
2022-10-31 15:43                                                   ` Nadav Amit
2022-10-31 17:32                                                     ` Linus Torvalds
2022-10-31  9:36                                               ` Peter Zijlstra
2022-10-31 17:28                                                 ` Linus Torvalds
2022-10-31 18:43                                                   ` mm: delay rmap removal until after TLB flush Linus Torvalds
2022-11-02  9:14                                                     ` Christian Borntraeger
2022-11-02  9:23                                                       ` Christian Borntraeger
2022-11-02 17:55                                                       ` Linus Torvalds
2022-11-02 18:28                                                         ` Linus Torvalds
2022-11-02 22:29                                                         ` Gerald Schaefer
2022-11-02 12:45                                                     ` Peter Zijlstra
2022-11-02 22:31                                                     ` Gerald Schaefer
2022-11-02 23:13                                                       ` Linus Torvalds
2022-11-03  9:52                                                     ` David Hildenbrand
2022-11-03 16:54                                                       ` Linus Torvalds
2022-11-03 17:09                                                         ` Linus Torvalds
2022-11-03 17:36                                                           ` David Hildenbrand
2022-11-04  6:33                                                     ` Alexander Gordeev
2022-11-04 17:35                                                       ` Linus Torvalds
2022-11-06 21:06                                                         ` Hugh Dickins
2022-11-06 22:34                                                           ` Linus Torvalds
2022-11-06 23:14                                                             ` Andrew Morton
2022-11-07  0:06                                                               ` Stephen Rothwell
2022-11-07 16:19                                                               ` Linus Torvalds
2022-11-07 23:02                                                                 ` Andrew Morton
2022-11-07 23:44                                                                   ` Stephen Rothwell
2022-11-07  9:12                                                           ` Peter Zijlstra
2022-11-07 20:07                                                           ` Johannes Weiner
2022-11-07 20:29                                                             ` Linus Torvalds
2022-11-07 23:47                                                               ` Linus Torvalds
2022-11-08  4:28                                                                 ` Linus Torvalds
2022-11-08 19:56                                                                   ` Linus Torvalds
2022-11-08 20:03                                                                     ` Konstantin Ryabitsev
2022-11-08 20:18                                                                       ` Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
2022-11-08 20:37                                                                   ` Nadav Amit
2022-11-08 20:46                                                                     ` Linus Torvalds
2022-11-09  6:36                                                                   ` Alexander Gordeev
2022-11-09 18:00                                                                     ` Linus Torvalds
2022-11-09 20:02                                                                       ` Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
2022-11-08 20:48                                                                   ` [lkp] [+115 bytes kernel size regression] [i386-tinyconfig] [0309f16088] " kernel test robot
2022-11-08 21:01                                                                     ` Linus Torvalds
2022-11-08 21:05                                                                   ` [PATCH 4/4] " Nadav Amit
2022-11-09 15:53                                                                   ` Johannes Weiner
2022-11-09 19:31                                                                     ` Hugh Dickins
2022-10-31  9:39                                               ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
2022-10-31 17:22                                                 ` Linus Torvalds
2022-10-31  9:46                                               ` Peter Zijlstra
2022-10-31  9:28                                             ` Peter Zijlstra
2022-10-31 17:19                                               ` Linus Torvalds
2022-10-30 19:34                                         ` Nadav Amit
2022-10-29 19:39                                   ` John Hubbard
2022-10-29 20:15                                     ` Linus Torvalds
2022-10-29 20:30                                       ` Linus Torvalds
2022-10-29 20:42                                         ` John Hubbard
2022-10-29 20:56                                       ` Nadav Amit
2022-10-29 21:03                                         ` Nadav Amit
2022-10-29 21:12                                         ` Linus Torvalds
2022-10-29 20:59                                       ` Theodore Ts'o
2022-10-26 19:43               ` Nadav Amit
2022-10-27  7:27                 ` Peter Zijlstra
2022-10-27 17:30                   ` Nadav Amit
2022-10-22 11:14 ` [PATCH 02/13] x86/mm/pae: Make pmd_t similar to pte_t Peter Zijlstra
2022-10-22 11:14 ` [PATCH 03/13] sh/mm: " Peter Zijlstra
2022-12-21 13:54   ` Guenter Roeck
2022-10-22 11:14 ` [PATCH 04/13] mm: Fix pmd_read_atomic() Peter Zijlstra
2022-10-22 17:30   ` Linus Torvalds
2022-10-24  8:09     ` Peter Zijlstra
2022-11-01 12:41     ` Peter Zijlstra
2022-11-01 17:42       ` Linus Torvalds
2022-11-02  9:12       ` [tip: x86/mm] mm: Convert __HAVE_ARCH_P..P_GET to the new style tip-bot2 for Peter Zijlstra
2022-11-03 21:15       ` tip-bot2 for Peter Zijlstra
2022-12-17 18:55       ` tip-bot2 for Peter Zijlstra
2022-10-22 11:14 ` [PATCH 05/13] mm: Rename GUP_GET_PTE_LOW_HIGH Peter Zijlstra
2022-10-22 11:14 ` [PATCH 06/13] mm: Rename pmd_read_atomic() Peter Zijlstra
2022-10-22 11:14 ` [PATCH 07/13] mm/gup: Fix the lockless PMD access Peter Zijlstra
2022-10-23  0:42   ` Hugh Dickins
2022-10-24  7:42     ` Peter Zijlstra
2022-10-25  3:58       ` Hugh Dickins
2022-10-22 11:14 ` [PATCH 08/13] x86/mm/pae: Dont (ab)use atomic64 Peter Zijlstra
2022-10-22 11:14 ` [PATCH 09/13] x86/mm/pae: Use WRITE_ONCE() Peter Zijlstra
2022-10-22 17:42   ` Linus Torvalds
2022-10-24 10:21     ` Peter Zijlstra
2022-10-22 11:14 ` [PATCH 10/13] x86/mm/pae: Be consistent with pXXp_get_and_clear() Peter Zijlstra
2022-10-22 17:53   ` Linus Torvalds
2022-10-24 11:13     ` Peter Zijlstra
2022-10-22 11:14 ` [PATCH 11/13] x86_64: Remove pointless set_64bit() usage Peter Zijlstra
2022-10-22 17:55   ` Linus Torvalds
2022-11-03 19:09   ` Nathan Chancellor
2022-11-03 19:23     ` Uros Bizjak
2022-11-03 19:35       ` Nathan Chancellor
2022-11-03 20:39         ` Linus Torvalds
2022-11-03 21:06           ` Peter Zijlstra
2022-11-04 16:01           ` Peter Zijlstra
2022-11-04 17:15             ` Linus Torvalds
2022-11-05 13:29               ` Jason A. Donenfeld
2022-11-05 15:14                 ` Peter Zijlstra
2022-11-05 20:54                   ` Jason A. Donenfeld
2022-11-07  9:14                   ` David Laight
2022-12-19 15:44               ` Peter Zijlstra
2022-10-22 11:14 ` [PATCH 12/13] x86/mm/pae: Get rid of set_64bit() Peter Zijlstra
2022-10-22 11:14 ` [PATCH 13/13] mm: Remove pointless barrier() after pmdp_get_lockless() Peter Zijlstra
2022-10-22 19:59   ` Yu Zhao
2022-10-22 17:57 ` [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Linus Torvalds
2022-10-29 12:21 ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAG48ez3KMuM2aLpc8SvGiqZvv1KFc7E2DBZ4CLzxQeFhrMN_-w@mail.gmail.com \
    --to=jannh@google.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=jhubbard@nvidia.com \
    --cc=jroedel@suse.de \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    --cc=ubizjak@gmail.com \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.