Spurious SIGSEGV with rseq/membarrier

From: Dmitry Vyukov <dvyukov@google.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Peter Oskolkov <posk@google.com>,
	 Peter Zijlstra <peterz@infradead.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	 Boqun Feng <boqun.feng@gmail.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Chris Kennelly <ckennelly@google.com>
Subject: Spurious SIGSEGV with rseq/membarrier
Date: Mon, 12 Feb 2024 11:02:38 +0100	[thread overview]
Message-ID: <CACT4Y+bXfekygoyhO7pCctjnL15=E=Zs31BUGXU0dk8d4rc1Cw@mail.gmail.com> (raw)

Hi rseq/membarrier maintainers,

I've spent a bit debugging some spurious SIGSEGVs and it turned out to
be an interesting interaction between page faults, rseq and
membarrier. The manifestation is that membarrier(EXPEDITED_RSEQ) is
effectively not working for a thread (doesn't restart its rseq
critical section).

The real code is inside of tcmalloc and relates to the "slabs resing" procedure:

https://github.com/google/tcmalloc/blob/39775a2d57969eda9497f3673421766bc1e886a0/tcmalloc/internal/percpu_tcmalloc.cc#L176

The essence is:
Threads use a data structure inside of rseq critical section.
The resize procedure replaces the old data structure with a new one,
uses a membarrier to ensure that threads don't use the old one any
more and unmaps/mprotects pages that back the old data structure. At
this point no threads use the old data structure anymore and no
threads should get SIGSEGV.

However, what happens is as follows:
A thread gets a minor page fault on the old data structure inside of
rseq critical section.
The page fault handler re-enables preemption and allows other threads
to be scheduled (I am tno sure this is actually important, but that's
what I observed in all traces, and it makes the failure scenario much
more likely).
Now, the resize procedure is executed, replaces all pointers to the
old data structure to the new one, executes the membarrier and unmaps
the old data structure.
Now the page fault handler resumes, verifies VMA protection and finds
out that the VMA is indeed inaccessible and the page fault is not a
minor one, but rather should result in SIGSEGV and sends SIGSEGV.
Note: at this point the thread has rseq restart pending (from both
preemption and membarrier), and the restart indeed happens as part of
SIGSEGV delivery, but it's already too late.

I think the page fault handling should give the rseq restart
preference in this case, and realize the thread shouldn't be executing
the faulting instruction in the first place. In such case the thread
would be restarted, and access the new data structure after the
restart.

Unmapping/mprotecting the old data in this case is useful for 2 reasons:
1. It allows to release memory (not possible to do reliably now).
2. It allows to ensure there are no logical bugs in the user-space
code and thread don't access the old data when they shouldn't. I was
actually tracking a potential bug in user-space code, but after
mprotecting old data, started seeing more of more confusing crashes
(this spurious SIGSEGV).