From: Suren Baghdasaryan <surenb@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Thorsten Leemhuis <regressions@leemhuis.info>,
Bagas Sanjaya <bagasdotme@gmail.com>,
Jacob Young <jacobly.alt@gmail.com>,
Laurent Dufour <ldufour@linux.ibm.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linux Memory Management <linux-mm@kvack.org>,
Linux PowerPC <linuxppc-dev@lists.ozlabs.org>,
Linux ARM <linux-arm-kernel@lists.infradead.org>,
Greg KH <gregkh@linuxfoundation.org>,
Linux regressions mailing list <regressions@lists.linux.dev>
Subject: Re: Fwd: Memory corruption in multithreaded user space program while calling fork
Date: Sat, 8 Jul 2023 12:17:10 -0700 [thread overview]
Message-ID: <CAJuCfpHszCAc5hDdsxry+1xh3kz+=jsYdBCXKQez-Th9LESSZA@mail.gmail.com> (raw)
In-Reply-To: <CAHk-=wi-99-DyMOGywTbjRnRRC+XfpPm=r=pei4A=MEL0QDBXA@mail.gmail.com>
On Sat, Jul 8, 2023 at 12:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 8 Jul 2023 at 11:40, Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > My understanding was that flush_cache_dup_mm() is there to ensure
> > nothing is in the cache, so locking VMAs before doing that would
> > ensure that no page faults would pollute the caches after we flushed
> > them. Is that reasoning incorrect?
>
> It is indeed incorrect.
>
> The VIVT caches are fundamentally broken, and we have various random
> hacks for them to make them work in legacy situations.
>
> And that flush_cache_dup_mm() is exactly that: a band-aid to make sure
> that when we do a fork(), any previous writes that are dirty in the
> caches will have made it to memory, so that they will show up in the
> *new* process that has a different virtual mapping.
>
> BUT!
>
> This has nothing to do with page faults, or other threads.
>
> If you have a threaded application that does fork(), it can - and will
> - dirty the VIVT caches *during* the fork, and so the whole
> "flush_cache_dup_mm()" is completely and fundamentally race wrt any
> *new* activity.
>
> It's not even what it is trying to deal with. All it tries to do is to
> make sure that the newly forked child AT LEAST sees all the changes
> that the parent did up to the point of the fork. Anything after that
> is simply not relevant at all.
>
> So think of all this not as some kind of absolute synchronization and
> cache coherency (because you will never get that on a VIVT
> architecture anyway), but as a "for the simple cases, this will at
> least get you the expected behavior".
>
> But as mentioned, for the issue of PER_VMA_LOCK, this is all *doubly*
> irrelevant. Not only was it not relevant to begin with (ie that cache
> flush only synchronizes parent -> child, not other-threads -> child),
> but VIVT caches don't even exist on any relevant architecture because
> they are fundamentally broken in so many other ways.
>
> So all our "synchronize caches by hand" is literally just band-aid for
> legacy architectures. I think it's mostly things like the old broken
> MIPS chips, some sparc32, pa-risc: the "old RISC" stuff, where people
> simplified the hardware a bit too much.
>
> VIVT is lovely for hardware people becasue they get a shortcut. But
> it's "lovely" in the same way that "PI=3" is lovely. It's simpler -
> but it's _wrong_.
>
> And it's almost entirely useless if you ever do SMP. I guarantee we
> have tons of races with it for very fundamental reasons - the problems
> it causes for software are not fixable, they are "hidable for the
> simple case".
>
> So you'll also find things like dcache_page_flush(), which flushes
> writes to a page to memory. And exactly like the fork() case, it's
> *not* real cache coherency, and it's *not* some kind of true global
> serialization.
>
> It's used in cases where we have a particular user that wants the
> changes *it* made to be made visible. And exactly like
> flush_cache_dup_mm(), it cannot deal with concurrent changes that
> other threads make.
Thanks for the explanation! It's quite educational.
>
> > Ok, I think these two are non-controversial:
> > https://lkml.kernel.org/r/20230707043211.3682710-1-surenb@google.com
> > https://lkml.kernel.org/r/20230707043211.3682710-2-surenb@google.com
>
> These look sane to me. I wonder if the vma_start_write() should have
> been somewhere else, but at least it makes sense in context, even if I
> get the feeling that maybe it should have been done in some helper
> earlier.
>
> As it is, we randomly do it in other helpers like vm_flags_set(), and
> I've often had the reaction that these vma_start_write() calls are
> randomly sprinked around without any clear _design_ for where they
> are.
We write-lock a VMA before any modification. I tried to provide
explanations for each such locking in my comments/patch descriptions
but I guess I haven't done a good job at that...
>
> > and the question now is how we fix the fork() case:
> > https://lore.kernel.org/all/20230706011400.2949242-2-surenb@google.com/
> > (if my above explanation makes sense to you)
>
> See above. That patch is nonsensical. Trying to order
> flush_cache_dup_mm() is not about page faults, and is fundamentally
> not doable with threads anyway.
>
> > https://lore.kernel.org/all/20230705063711.2670599-2-surenb@google.com/
>
> This is the one that makes sense to me.
Ok, I sent you 3-patch series with the fixes here:
https://lore.kernel.org/all/20230708191212.4147700-1-surenb@google.com/
Do you want me to disable per-VMA locks by default as well?
>
> Linus
next prev parent reply other threads:[~2023-07-08 19:17 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-02 12:27 Fwd: Memory corruption in multithreaded user space program while calling fork Bagas Sanjaya
[not found] ` <CALrpxLe2VagXEhsHPb9P4vJC97hkBYkLswFJB_jmhu1K+x_QhQ@mail.gmail.com>
2023-07-02 14:11 ` Bagas Sanjaya
2023-07-03 9:53 ` Fwd: " Linux regression tracking (Thorsten Leemhuis)
2023-07-03 18:08 ` Suren Baghdasaryan
2023-07-03 18:27 ` Suren Baghdasaryan
2023-07-03 18:44 ` Greg KH
2023-07-04 7:45 ` Suren Baghdasaryan
2023-07-04 8:00 ` Greg KH
2023-07-04 16:18 ` Andrew Morton
2023-07-04 20:22 ` Suren Baghdasaryan
2023-07-04 21:28 ` Andrew Morton
2023-07-04 22:04 ` Suren Baghdasaryan
2023-07-05 6:42 ` Suren Baghdasaryan
2023-07-05 7:08 ` Greg KH
2023-07-05 8:51 ` Linux regression tracking (Thorsten Leemhuis)
2023-07-05 9:27 ` Greg KH
2023-07-05 15:49 ` Andrew Morton
2023-07-05 16:14 ` Suren Baghdasaryan
2023-07-05 17:17 ` Suren Baghdasaryan
2023-07-08 11:35 ` Thorsten Leemhuis
2023-07-08 17:29 ` Linus Torvalds
2023-07-08 17:39 ` Andrew Morton
2023-07-08 18:04 ` Linus Torvalds
2023-07-08 18:40 ` Suren Baghdasaryan
2023-07-08 19:05 ` Linus Torvalds
2023-07-08 19:17 ` Suren Baghdasaryan [this message]
2023-07-08 19:22 ` Linus Torvalds
2023-07-08 19:41 ` Suren Baghdasaryan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAJuCfpHszCAc5hDdsxry+1xh3kz+=jsYdBCXKQez-Th9LESSZA@mail.gmail.com' \
--to=surenb@google.com \
--cc=akpm@linux-foundation.org \
--cc=bagasdotme@gmail.com \
--cc=gregkh@linuxfoundation.org \
--cc=jacobly.alt@gmail.com \
--cc=ldufour@linux.ibm.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=regressions@leemhuis.info \
--cc=regressions@lists.linux.dev \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).