Re: Fwd: Memory corruption in multithreaded user space program while calling fork

From: Suren Baghdasaryan <surenb@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Thorsten Leemhuis <regressions@leemhuis.info>,
	Bagas Sanjaya <bagasdotme@gmail.com>,
	Jacob Young <jacobly.alt@gmail.com>,
	Laurent Dufour <ldufour@linux.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux Memory Management <linux-mm@kvack.org>,
	Linux PowerPC <linuxppc-dev@lists.ozlabs.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>,
	Greg KH <gregkh@linuxfoundation.org>,
	Linux regressions mailing list <regressions@lists.linux.dev>
Subject: Re: Fwd: Memory corruption in multithreaded user space program while calling fork
Date: Sat, 8 Jul 2023 12:17:10 -0700	[thread overview]
Message-ID: <CAJuCfpHszCAc5hDdsxry+1xh3kz+=jsYdBCXKQez-Th9LESSZA@mail.gmail.com> (raw)
In-Reply-To: <CAHk-=wi-99-DyMOGywTbjRnRRC+XfpPm=r=pei4A=MEL0QDBXA@mail.gmail.com>

On Sat, Jul 8, 2023 at 12:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 8 Jul 2023 at 11:40, Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > My understanding was that flush_cache_dup_mm() is there to ensure
> > nothing is in the cache, so locking VMAs before doing that would
> > ensure that no page faults would pollute the caches after we flushed
> > them. Is that reasoning incorrect?
>
> It is indeed incorrect.
>
> The VIVT caches are fundamentally broken, and we have various random
> hacks for them to make them work in legacy situations.
>
> And that flush_cache_dup_mm() is exactly that: a band-aid to make sure
> that when we do a fork(), any previous writes that are dirty in the
> caches will have made it to memory, so that they will show up in the
> *new* process that has a different virtual mapping.
>
> BUT!
>
> This has nothing to do with page faults, or other threads.
>
> If you have a threaded application that does fork(), it can - and will
> - dirty the VIVT caches *during* the fork, and so the whole
> "flush_cache_dup_mm()" is completely and fundamentally race wrt any
> *new* activity.
>
> It's not even what it is trying to deal with. All it tries to do is to
> make sure that the newly forked child AT LEAST sees all the changes
> that the parent did up to the point of the fork. Anything after that
> is simply not relevant at all.
>
> So think of all this not as some kind of absolute synchronization and
> cache coherency (because you will never get that on a VIVT
> architecture anyway), but as a "for the simple cases, this will at
> least get you the expected behavior".
>
> But as mentioned, for the issue of PER_VMA_LOCK, this is all *doubly*
> irrelevant. Not only was it not relevant to begin with (ie that cache
> flush only synchronizes parent -> child, not other-threads -> child),
> but VIVT caches don't even exist on any relevant architecture because
> they are fundamentally broken in so many other ways.
>
> So all our "synchronize caches by hand" is literally just band-aid for
> legacy architectures. I think it's mostly things like the old broken
> MIPS chips, some sparc32, pa-risc: the "old RISC" stuff, where people
> simplified the hardware a bit too much.
>
> VIVT is lovely for hardware people becasue they get a shortcut. But
> it's "lovely" in the same way that "PI=3" is lovely. It's simpler -
> but it's _wrong_.
>
> And it's almost entirely useless if you ever do SMP. I guarantee we
> have tons of races with it for very fundamental reasons - the problems
> it causes for software are not fixable, they are "hidable for the
> simple case".
>
> So you'll also find things like dcache_page_flush(), which flushes
> writes to a page to memory. And exactly like the fork() case, it's
> *not* real cache coherency, and it's *not* some kind of true global
> serialization.
>
> It's used in cases where we have a particular user that wants the
> changes *it* made to be made visible. And exactly like
> flush_cache_dup_mm(), it cannot deal with concurrent changes that
> other threads make.

Thanks for the explanation! It's quite educational.

>
> > Ok, I think these two are non-controversial:
> > https://lkml.kernel.org/r/20230707043211.3682710-1-surenb@google.com
> > https://lkml.kernel.org/r/20230707043211.3682710-2-surenb@google.com
>
> These look sane to me. I wonder if the vma_start_write() should have
> been somewhere else, but at least it makes sense in context, even if I
> get the feeling that maybe it should have been done in some helper
> earlier.
>
> As it is, we randomly do it in other helpers like vm_flags_set(), and
> I've often had the reaction that these vma_start_write() calls are
> randomly sprinked around without any clear _design_ for where they
> are.

We write-lock a VMA before any modification. I tried to provide
explanations for each such locking in my comments/patch descriptions
but I guess I haven't done a good job at that...

>
> > and the question now is how we fix the fork() case:
> > https://lore.kernel.org/all/20230706011400.2949242-2-surenb@google.com/
> > (if my above explanation makes sense to you)
>
> See above. That patch is nonsensical. Trying to order
> flush_cache_dup_mm() is not about page faults, and is fundamentally
> not doable with threads anyway.
>
> > https://lore.kernel.org/all/20230705063711.2670599-2-surenb@google.com/
>
> This is the one that makes sense to me.

Ok, I sent you 3-patch series with the fixes here:
https://lore.kernel.org/all/20230708191212.4147700-1-surenb@google.com/
Do you want me to disable per-VMA locks by default as well?

>
>                Linus