All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>,
	Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
	Nick Piggin <npiggin@gmail.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Nadav Amit <nadav.amit@gmail.com>, Jann Horn <jannh@google.com>,
	John Hubbard <jhubbard@nvidia.com>, X86 ML <x86@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Joerg Roedel <jroedel@suse.de>, Uros Bizjak <ubizjak@gmail.com>,
	Alistair Popple <apopple@nvidia.com>,
	linux-arch <linux-arch@vger.kernel.org>
Subject: Re: mm: delay rmap removal until after TLB flush
Date: Sun, 6 Nov 2022 13:06:19 -0800 (PST)	[thread overview]
Message-ID: <8a1e97c9-bd5-7473-6da8-2aa75198fbe8@google.com> (raw)
In-Reply-To: <CAHk-=wjzp65=-QE1dg8KfqG-tVHiT+yAfHXGx9sro=8yOceELg@mail.gmail.com>

Adding Stephen to Cc, for next-20221107 alert.
Adding Johannes to Cc, particularly for lock_page_memcg discussion.

On Fri, 4 Nov 2022, Linus Torvalds wrote:
> On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev
> <agordeev@linux.ibm.com> wrote:
> >
> > I rather have a question to the generic part (had to master the code quotting).
> 
> Sure.
> 
> Although now I think the series in in Andrew's -mm tree, or just about
> to get moved in there, so I'm not going to touch my actual branch any
> more.

Linus, we've been exchanging about my mm/rmap.c mods in private mail,
I need to raise some points about your mods here in public mail.

Four topics - munlock (good), anon_vma (bad), mm-unstable (bad),
lock_page_memcg (uncertain).  I'm asking Andrew here to back your
patchset out of mm-unstable for now, until we have its fixes in:
otherwise I'm worried that next-20221107 will waste everyone's time.

munlock (good)
--------------
You've separated out the munlock_vma_page() from the rest of PTE
remove rmap work.  Worried me at first, but I think that's fine:
bundling it into page_remove_rmap() was mainly so that we wouldn't
forget to do it anywhere (and other rmap funcs already took vma arg).

Certainly during development, I had been using page mapcount somewhere
inside munlock_vma_page(), either for optimization or for sanity check,
I forget; but gave up on that as unnecessary complication, and I think
it became impossible with the pagevec; so not an issue now.

If it were my change, I'd certainly do it by changing the name of the
vma arg in page_remove_rmap() from vma to munlock_vma, not doing the
munlock when that is NULL, and avoid the need for duplicating code:
whereas you're very pleased with cutting out the unnecessary stuff
in slimmed-down page_zap_pte_rmap().  Differing tastes (or perhaps
you'll say taste versus no taste).

anon_vma (bad)
--------------
My first reaction to your patchset was, that it wasn't obvious to me
that delaying the decrementation of mapcount is safe: I needed time to
think about it.  But happily, testing saved me from needing much thought.

The first problem, came immediately in the first iteration of my
huge tmpfs kbuild swapping loads, was BUG at mm/migrate.c:1105!
VM_BUG_ON_FOLIO(folio_test_anon(src) && !folio_test_ksm(src) && !anon_vma, src);
Okay, that's interesting but not necessarily fatal, we can very easily
make it a recoverable condition.  I didn't bother to think on that one,
just patched around it.

The second problem was more significant: came after nearly five hours
of load, BUG NULL dereference (address 8) in down_read_trylock() in
rmap_walk_anon(), while kswapd was doing a folio_referenced().

That one goes right to the heart of the matter, and instinct had been
correct to worry about delaying the decrementation of mapcount, that is,
extending the life of page_mapped() or folio_mapped().

See folio_lock_anon_vma_read(): folio_mapped() plays a key role in
establishing the continued validity of an anon_vma.  See comments
above folio_get_anon_vma(), some by me but most by PeterZ IIRC.

I believe what has happened is that your patchset has, very intentionally,
kept the page as "folio_mapped" until after free_pgtables() does its
unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read()
that the anon_vma is safe to use when actually it has been freed.
(It looked like a page table when I peeped at it.)

I'm not certain, but I think that you made page_zap_pte_rmap() handle
anon as well as file, just for the righteous additional simplification;
but I'm afraid that (without opening a huge anon_vma refcounting can of
worms) that unification has to be reverted, and anon left to go the
same old way it did before.

I didn't look into whether reverting one of your patches would achieve
that, I just adjusted the code in zap_pte_range() to go the old way for
PageAnon; and that has been running successfully, hitting neither BUG,
for 15 hours now.

mm-unstable (bad)
-----------------
Aside from that PageAnon issue, mm-unstable is in an understandably bad
state because you could not have foreseen my subpages_mapcount addition
to page_remove_rmap().  page_zap_pte_rmap() now needs to handle the
PageCompound (but not the "compound") case too.  I rushed you and akpm
an emergency patch for that on Friday night, but you, let's say, had
reservations about it.  So I haven't posted it, and while the PageAnon
issue remains, I think your patchset has to be removed from mm-unstable
and linux-next anyway.

What happens to mm-unstable with page_zap_pte_rmap() not handling
subpages_mapcount?  In my CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y case,
I get a "Bad page state" even before reaching first login after boot,
"page dumped because: nonzero subpages_mapcount".  Yes, I make it
worse for myself by having a BUG() in bad_page(); others will limp
along with more and more of those "Bad page" messages.

lock_page_memcg (uncertain)
---------------------------
Johannes, please help! Linus has got quite upset by lock_page_memcg(),
its calls from mm/rmap.c anyway, and most particularly by the way
in which it is called at the start of page_remove_rmap(), before
anyone's critical atomic_add_negative(), yet its use is to guarantee
the stability of page memcg while doing the stats updates, done only
when atomic_add_negative() says so.

I do have one relevant insight on this.  It (or its antecedents under
other names) date from the days when we did "reparenting" of memcg
charges from an LRU: and in those days the lock_page_memcg() before
mapcount adjustment was vital, to pair with the uses of folio_mapped()
or page_mapped() in mem_cgroup_move_account() - those "mapped" checks
are precisely around the stats which the rmap functions affect.

But nowadays mem_cgroup_move_account() is only called, with page table
lock held, on matching pages found in a task's page table: so its
"mapped" checks are redundant - I've sometimes thought in the past of
removing them, but held back, because I always have the notion (not
hope!) that "reparenting" may one day be brought back from the grave.
I'm too out of touch with memcg to know where that notion stands today.

I've gone through a multiverse of opinions on those lock_page_memcg()s
in the last day: I currently believe that Linus is right, that the
lock_page_memcg()s could and should be moved just before the stats
updates.  But I am not 100% certain of that - is there still some
reason why it's important that the page memcg at the instant of the
critical mapcount transition be kept unchanged until the stats are
updated?  I've tried running scenarios through my mind but given up.

(And note that the answer might be different with Linus's changes
than without them: since he delays the mapcount decrementation
until long after pte was removed and page table lock dropped).

And I do wish that we could settle this lock_page_memcg() question
in an entirely separate patch: as it stands, page_zap_pte_rmap()
gets to benefit from Linus's insight (or not), and all the other rmap
functions carry on with the mis?placed lock_page_memcg() as before.

Let's press Send,
Hugh

  reply	other threads:[~2022-11-06 21:06 UTC|newest]

Thread overview: 148+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-22 11:14 [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Peter Zijlstra
2022-10-22 11:14 ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
2022-10-24  5:42   ` John Hubbard
2022-10-24  8:00     ` Peter Zijlstra
2022-10-24 19:58       ` Jann Horn
2022-10-24 20:19         ` Linus Torvalds
2022-10-24 20:23           ` Jann Horn
2022-10-24 20:36             ` Linus Torvalds
2022-10-25  3:21             ` Matthew Wilcox
2022-10-25  7:54               ` Alistair Popple
2022-10-25 13:33                 ` Peter Zijlstra
2022-10-25 13:44                 ` Jann Horn
2022-10-26  0:45                   ` Alistair Popple
2022-10-25 14:02         ` Peter Zijlstra
2022-10-25 14:18           ` Jann Horn
2022-10-25 15:06             ` Peter Zijlstra
2022-10-26 16:45               ` Jann Horn
2022-10-27  7:08                 ` Peter Zijlstra
2022-10-27 18:13                   ` Linus Torvalds
2022-10-27 19:35                     ` Peter Zijlstra
2022-10-27 19:43                       ` Linus Torvalds
2022-10-27 20:15                     ` Nadav Amit
2022-10-27 20:31                       ` Linus Torvalds
2022-10-27 21:44                         ` Nadav Amit
2022-10-28 23:57                           ` Nadav Amit
2022-10-29  0:42                             ` Linus Torvalds
2022-10-29 18:05                               ` Nadav Amit
2022-10-29 18:36                                 ` Linus Torvalds
2022-10-29 18:58                                   ` Linus Torvalds
2022-10-29 19:14                                     ` Linus Torvalds
2022-10-29 19:28                                       ` Nadav Amit
2022-10-30  0:18                                       ` Nadav Amit
2022-10-30  2:17                                     ` Nadav Amit
2022-10-30 18:19                                       ` Linus Torvalds
2022-10-30 18:51                                         ` Linus Torvalds
2022-10-30 22:47                                           ` Linus Torvalds
2022-10-31  1:47                                             ` Linus Torvalds
2022-10-31  4:09                                               ` Nadav Amit
2022-10-31  4:55                                                 ` Nadav Amit
2022-10-31  5:00                                                 ` Linus Torvalds
2022-10-31 15:43                                                   ` Nadav Amit
2022-10-31 17:32                                                     ` Linus Torvalds
2022-10-31  9:36                                               ` Peter Zijlstra
2022-10-31 17:28                                                 ` Linus Torvalds
2022-10-31 18:43                                                   ` mm: delay rmap removal until after TLB flush Linus Torvalds
2022-11-02  9:14                                                     ` Christian Borntraeger
2022-11-02  9:23                                                       ` Christian Borntraeger
2022-11-02 17:55                                                       ` Linus Torvalds
2022-11-02 18:28                                                         ` Linus Torvalds
2022-11-02 22:29                                                         ` Gerald Schaefer
2022-11-02 12:45                                                     ` Peter Zijlstra
2022-11-02 22:31                                                     ` Gerald Schaefer
2022-11-02 23:13                                                       ` Linus Torvalds
2022-11-03  9:52                                                     ` David Hildenbrand
2022-11-03 16:54                                                       ` Linus Torvalds
2022-11-03 17:09                                                         ` Linus Torvalds
2022-11-03 17:36                                                           ` David Hildenbrand
2022-11-04  6:33                                                     ` Alexander Gordeev
2022-11-04 17:35                                                       ` Linus Torvalds
2022-11-06 21:06                                                         ` Hugh Dickins [this message]
2022-11-06 22:34                                                           ` Linus Torvalds
2022-11-06 23:14                                                             ` Andrew Morton
2022-11-07  0:06                                                               ` Stephen Rothwell
2022-11-07 16:19                                                               ` Linus Torvalds
2022-11-07 23:02                                                                 ` Andrew Morton
2022-11-07 23:44                                                                   ` Stephen Rothwell
2022-11-07  9:12                                                           ` Peter Zijlstra
2022-11-07 20:07                                                           ` Johannes Weiner
2022-11-07 20:29                                                             ` Linus Torvalds
2022-11-07 23:47                                                               ` Linus Torvalds
2022-11-08  4:28                                                                 ` Linus Torvalds
2022-11-08 19:56                                                                   ` Linus Torvalds
2022-11-08 20:03                                                                     ` Konstantin Ryabitsev
2022-11-08 20:18                                                                       ` Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
2022-11-08 20:37                                                                   ` Nadav Amit
2022-11-08 20:46                                                                     ` Linus Torvalds
2022-11-09  6:36                                                                   ` Alexander Gordeev
2022-11-09 18:00                                                                     ` Linus Torvalds
2022-11-09 20:02                                                                       ` Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags Linus Torvalds
2022-11-08 19:41                                                                 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
2022-11-08 20:48                                                                   ` [lkp] [+115 bytes kernel size regression] [i386-tinyconfig] [0309f16088] " kernel test robot
2022-11-08 21:01                                                                     ` Linus Torvalds
2022-11-08 21:05                                                                   ` [PATCH 4/4] " Nadav Amit
2022-11-09 15:53                                                                   ` Johannes Weiner
2022-11-09 19:31                                                                     ` Hugh Dickins
2022-10-31  9:39                                               ` [PATCH 01/13] mm: Update ptep_get_lockless()s comment Peter Zijlstra
2022-10-31 17:22                                                 ` Linus Torvalds
2022-10-31  9:46                                               ` Peter Zijlstra
2022-10-31  9:28                                             ` Peter Zijlstra
2022-10-31 17:19                                               ` Linus Torvalds
2022-10-30 19:34                                         ` Nadav Amit
2022-10-29 19:39                                   ` John Hubbard
2022-10-29 20:15                                     ` Linus Torvalds
2022-10-29 20:30                                       ` Linus Torvalds
2022-10-29 20:42                                         ` John Hubbard
2022-10-29 20:56                                       ` Nadav Amit
2022-10-29 21:03                                         ` Nadav Amit
2022-10-29 21:12                                         ` Linus Torvalds
2022-10-29 20:59                                       ` Theodore Ts'o
2022-10-26 19:43               ` Nadav Amit
2022-10-27  7:27                 ` Peter Zijlstra
2022-10-27 17:30                   ` Nadav Amit
2022-10-22 11:14 ` [PATCH 02/13] x86/mm/pae: Make pmd_t similar to pte_t Peter Zijlstra
2022-10-22 11:14 ` [PATCH 03/13] sh/mm: " Peter Zijlstra
2022-12-21 13:54   ` Guenter Roeck
2022-10-22 11:14 ` [PATCH 04/13] mm: Fix pmd_read_atomic() Peter Zijlstra
2022-10-22 17:30   ` Linus Torvalds
2022-10-24  8:09     ` Peter Zijlstra
2022-11-01 12:41     ` Peter Zijlstra
2022-11-01 17:42       ` Linus Torvalds
2022-11-02  9:12       ` [tip: x86/mm] mm: Convert __HAVE_ARCH_P..P_GET to the new style tip-bot2 for Peter Zijlstra
2022-11-03 21:15       ` tip-bot2 for Peter Zijlstra
2022-12-17 18:55       ` tip-bot2 for Peter Zijlstra
2022-10-22 11:14 ` [PATCH 05/13] mm: Rename GUP_GET_PTE_LOW_HIGH Peter Zijlstra
2022-10-22 11:14 ` [PATCH 06/13] mm: Rename pmd_read_atomic() Peter Zijlstra
2022-10-22 11:14 ` [PATCH 07/13] mm/gup: Fix the lockless PMD access Peter Zijlstra
2022-10-23  0:42   ` Hugh Dickins
2022-10-24  7:42     ` Peter Zijlstra
2022-10-25  3:58       ` Hugh Dickins
2022-10-22 11:14 ` [PATCH 08/13] x86/mm/pae: Dont (ab)use atomic64 Peter Zijlstra
2022-10-22 11:14 ` [PATCH 09/13] x86/mm/pae: Use WRITE_ONCE() Peter Zijlstra
2022-10-22 17:42   ` Linus Torvalds
2022-10-24 10:21     ` Peter Zijlstra
2022-10-22 11:14 ` [PATCH 10/13] x86/mm/pae: Be consistent with pXXp_get_and_clear() Peter Zijlstra
2022-10-22 17:53   ` Linus Torvalds
2022-10-24 11:13     ` Peter Zijlstra
2022-10-22 11:14 ` [PATCH 11/13] x86_64: Remove pointless set_64bit() usage Peter Zijlstra
2022-10-22 17:55   ` Linus Torvalds
2022-11-03 19:09   ` Nathan Chancellor
2022-11-03 19:23     ` Uros Bizjak
2022-11-03 19:35       ` Nathan Chancellor
2022-11-03 20:39         ` Linus Torvalds
2022-11-03 21:06           ` Peter Zijlstra
2022-11-04 16:01           ` Peter Zijlstra
2022-11-04 17:15             ` Linus Torvalds
2022-11-05 13:29               ` Jason A. Donenfeld
2022-11-05 15:14                 ` Peter Zijlstra
2022-11-05 20:54                   ` Jason A. Donenfeld
2022-11-07  9:14                   ` David Laight
2022-12-19 15:44               ` Peter Zijlstra
2022-10-22 11:14 ` [PATCH 12/13] x86/mm/pae: Get rid of set_64bit() Peter Zijlstra
2022-10-22 11:14 ` [PATCH 13/13] mm: Remove pointless barrier() after pmdp_get_lockless() Peter Zijlstra
2022-10-22 19:59   ` Yu Zhao
2022-10-22 17:57 ` [PATCH 00/13] Clean up pmd_get_atomic() and i386-PAE Linus Torvalds
2022-10-29 12:21 ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8a1e97c9-bd5-7473-6da8-2aa75198fbe8@google.com \
    --to=hughd@google.com \
    --cc=aarcange@redhat.com \
    --cc=agordeev@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=apopple@nvidia.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=gor@linux.ibm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hca@linux.ibm.com \
    --cc=jannh@google.com \
    --cc=jhubbard@nvidia.com \
    --cc=jroedel@suse.de \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nadav.amit@gmail.com \
    --cc=npiggin@gmail.com \
    --cc=peterz@infradead.org \
    --cc=sfr@canb.auug.org.au \
    --cc=svens@linux.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=ubizjak@gmail.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.