* [PATCH 0/4] xfs: fix CIL shutdown UAF and shutdown hang
@ 2021-06-22  4:06 Dave Chinner
  2021-06-22  4:06 ` [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock Dave Chinner
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Dave Chinner @ 2021-06-22  4:06 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

The following patches implement an initial fix for the UAF that can
occur in the CIL push code when a racing shutdown occurs. This was a
zero-day bug in the delayed logging code, and was only recently
uncovered by the CIL pipelining changes that address a different
zero-day bug in the delayed logging code. This UAF exists in all
kernels that support delayed logging (i.e. since 2.6.36), but it is
extremely unlikely that anyone has ever hit it, as triggering it
requires a shutdown with extremely tight timing tolerances.

This is more of a problem for the current for-next tree, though,
because there is now a call to xlog_wait_on_iclog() in the UAF
window. While we don't currently reference the CIL context after
the wait, we will soon need to do so to fix the /other/ zero-day
problems found by the CIL pipelining changes.

The encapsulation of the entire CIL commit iclog processing epilogue
in the icloglock effectively serialises this code against shutdown
races and allows us to error out before attaching the context to the
iclog if a shutdown has already occurred. Callbacks used to be under
the icloglock, but were split out in 2008 because of icloglock
contention causing log scalability problems (sound familiar? :).
Delayed logging fixed those icloglock scalability issues by moving
it out of the hot transaction commit path, so we can move the
callbacks back under the icloglock without re-introducing ancient
problems and solve the initial UAF problem this way.
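
Roughly speaking, the commit iclog processing epilogue in
xlog_cil_push_work() then has the following shape. This is a heavily
simplified sketch of what patch 3 does - the surrounding context, the
commit record write and the error handling have all been stripped out,
so treat it as illustrative rather than the literal hunk:

	spin_lock(&log->l_icloglock);
	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
		spin_unlock(&log->l_icloglock);
		goto out_abort;
	}
	list_add_tail(&ctx->iclog_entry, &commit_iclog->ic_callbacks);
	/*
	 * From this point a shutdown can run the callbacks and free
	 * the ctx the moment we drop the icloglock, so the ctx must
	 * not be referenced once the lock has been dropped.
	 */

i.e. the shutdown check, the callback attachment and everything that
dereferences the ctx all happen atomically under the icloglock.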

With that problem solved, we can then fix the call to
xlog_wait_on_iclog() in the CIL push code by ensuring that it only
waits on older iclogs via LSN checks. As the wait drops the icloglock and
potentially re-opens us to the above UAF on shutdown, we have to be
careful not to reference the CIL context after the wait returns.
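
The LSN check that the last patch adds boils down to something like
the sketch below - again simplified from the actual patch 4 hunk, with
the pre-flush flag setting omitted:

	xfs_lsn_t	plsn;

	plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn);
	if (plsn && XFS_LSN_CMP(plsn, commit_lsn) < 0) {
		/* ic_prev is a genuinely older iclog still in IO - wait. */
		xlog_wait_on_iclog(commit_iclog->ic_prev);
		spin_lock(&log->l_icloglock);
	}

A zero LSN means the previous iclog has already completed and been
reactivated, and an LSN that is not older than the commit LSN belongs
to a future iclog, so in neither case do we need to wait.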

Hence the patches don't really fix the underlying cause of the
shutdown UAF here - they are intended as a low impact, easily
backportable solution to the problem. Work to fix the underlying
shutdown brokenness - removing the need to hold the icloglock from
callback attachment through to xlog_state_release_iclog() - is
underway, and needs to land before we can apply the start record
ordering fixes and re-introduce the CIL pipelining fixes and the
rest of the CIL scalability work....

Cheers,

Dave.




* [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock
  2021-06-22  4:06 [PATCH 0/4] xfs: fix CIL shutdown UAF and shutdown hang Dave Chinner
@ 2021-06-22  4:06 ` Dave Chinner
  2021-06-22 12:38   ` Brian Foster
  2021-06-25 20:52   ` Darrick J. Wong
  2021-06-22  4:06 ` [PATCH 2/4] xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks Dave Chinner
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2021-06-22  4:06 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

It's completely unnecessary because callbacks are added to iclogs
without holding the icloglock, hence no amount of ordering between
the icloglock and ic_callback_lock will order the removal of
callbacks from the iclog.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index e93cac6b5378..bb4390942275 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2773,11 +2773,8 @@ static void
 xlog_state_do_iclog_callbacks(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog)
-		__releases(&log->l_icloglock)
-		__acquires(&log->l_icloglock)
 {
 	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
-	spin_unlock(&log->l_icloglock);
 	spin_lock(&iclog->ic_callback_lock);
 	while (!list_empty(&iclog->ic_callbacks)) {
 		LIST_HEAD(tmp);
@@ -2789,12 +2786,6 @@ xlog_state_do_iclog_callbacks(
 		spin_lock(&iclog->ic_callback_lock);
 	}
 
-	/*
-	 * Pick up the icloglock while still holding the callback lock so we
-	 * serialise against anyone trying to add more callbacks to this iclog
-	 * now we've finished processing.
-	 */
-	spin_lock(&log->l_icloglock);
 	spin_unlock(&iclog->ic_callback_lock);
 	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
 }
@@ -2836,13 +2827,12 @@ xlog_state_do_callback(
 				iclog = iclog->ic_next;
 				continue;
 			}
+			spin_unlock(&log->l_icloglock);
 
-			/*
-			 * Running callbacks will drop the icloglock which means
-			 * we'll have to run at least one more complete loop.
-			 */
-			cycled_icloglock = true;
 			xlog_state_do_iclog_callbacks(log, iclog);
+			cycled_icloglock = true;
+
+			spin_lock(&log->l_icloglock);
 			if (XLOG_FORCED_SHUTDOWN(log))
 				wake_up_all(&iclog->ic_force_wait);
 			else
-- 
2.31.1



* [PATCH 2/4] xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks
  2021-06-22  4:06 [PATCH 0/4] xfs: fix CIL shutdown UAF and shutdown hang Dave Chinner
  2021-06-22  4:06 ` [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock Dave Chinner
@ 2021-06-22  4:06 ` Dave Chinner
  2021-06-22 12:39   ` Brian Foster
  2021-06-25 20:57   ` Darrick J. Wong
  2021-06-22  4:06 ` [PATCH 3/4] xfs: Fix a CIL UAF by getting rid of the iclog callback lock Dave Chinner
  2021-06-22  4:06 ` [PATCH 4/4] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
  3 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2021-06-22  4:06 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

If we are processing callbacks on an iclog, nothing can be
concurrently adding callbacks to the loop. We only add callbacks to
the iclog when they are in ACTIVE or WANT_SYNC state, and we
explicitly do not add callbacks if the iclog is already in IOERROR
state.

The only way to have a dequeue racing with an enqueue is to be
processing a shutdown without a direct reference to an iclog in
ACTIVE or WANT_SYNC state. As the enqueue avoids this race
condition, we only ever need a single dequeue operation in
xlog_state_do_iclog_callbacks(). Hence we can remove the loop.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index bb4390942275..05b00fa4d661 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2774,19 +2774,15 @@ xlog_state_do_iclog_callbacks(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog)
 {
-	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
-	spin_lock(&iclog->ic_callback_lock);
-	while (!list_empty(&iclog->ic_callbacks)) {
-		LIST_HEAD(tmp);
+	LIST_HEAD(tmp);
 
-		list_splice_init(&iclog->ic_callbacks, &tmp);
-
-		spin_unlock(&iclog->ic_callback_lock);
-		xlog_cil_process_committed(&tmp);
-		spin_lock(&iclog->ic_callback_lock);
-	}
+	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
 
+	spin_lock(&iclog->ic_callback_lock);
+	list_splice_init(&iclog->ic_callbacks, &tmp);
 	spin_unlock(&iclog->ic_callback_lock);
+
+	xlog_cil_process_committed(&tmp);
 	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
 }
 
-- 
2.31.1



* [PATCH 3/4] xfs: Fix a CIL UAF by getting rid of the iclog callback lock
  2021-06-22  4:06 [PATCH 0/4] xfs: fix CIL shutdown UAF and shutdown hang Dave Chinner
  2021-06-22  4:06 ` [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock Dave Chinner
  2021-06-22  4:06 ` [PATCH 2/4] xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks Dave Chinner
@ 2021-06-22  4:06 ` Dave Chinner
  2021-06-22 12:41   ` Brian Foster
  2021-06-25 21:02   ` Darrick J. Wong
  2021-06-22  4:06 ` [PATCH 4/4] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
  3 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2021-06-22  4:06 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The iclog callback chain has its own lock. That was added way back
in 2008 by myself to alleviate severe lock contention on the
icloglock in commit 114d23aae512 ("[XFS] Per iclog callback chain
lock"). This was long before delayed logging took the icloglock out
of the hot transaction commit path and removed all contention on it.
Hence the separate ic_callback_lock doesn't serve any scalability
purpose anymore, and hasn't for close on a decade.

Further, we only attach callbacks to iclogs in one place where we
are already taking the icloglock soon after attaching the callbacks.
We also have to drop the icloglock to run callbacks and grab it
immediately afterwards again. So given that the icloglock is no
longer hot, making it cover callbacks again doesn't really change
the locking patterns very much at all.

We also need to extend the icloglock to cover callback addition to
fix a zero-day UAF in the CIL push code. This occurs when shutdown
races with xlog_cil_push_work() and the shutdown runs the callbacks
before the push releases the iclog. This results in the CIL context
structure attached to the iclog being freed by the callback before
the CIL push has finished referencing it, leading to UAF bugs.

Hence, to avoid this UAF, we need the callback attachment to be
atomic with post processing of the commit iclog and references to
the structures being attached to the iclog. This requires holding
the icloglock as that's the only way to serialise iclog state
against a shutdown in progress.

The result is we need to be using the icloglock to protect the
callback list addition and removal and serialise them with shutdown.
That makes the ic_callback_lock redundant and so it can be removed.

Fixes: 71e330b59390 ("xfs: Introduce delayed logging core code")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 34 ++++++----------------------------
 fs/xfs/xfs_log_cil.c  | 16 ++++++++++++----
 fs/xfs/xfs_log_priv.h |  3 ---
 3 files changed, 18 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 05b00fa4d661..c896c9041b8e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1484,7 +1484,6 @@ xlog_alloc_log(
 		iclog->ic_state = XLOG_STATE_ACTIVE;
 		iclog->ic_log = log;
 		atomic_set(&iclog->ic_refcnt, 0);
-		spin_lock_init(&iclog->ic_callback_lock);
 		INIT_LIST_HEAD(&iclog->ic_callbacks);
 		iclog->ic_datap = (char *)iclog->ic_data + log->l_iclog_hsize;
 
@@ -2760,32 +2759,6 @@ xlog_state_iodone_process_iclog(
 	}
 }
 
-/*
- * Keep processing entries in the iclog callback list until we come around and
- * it is empty.  We need to atomically see that the list is empty and change the
- * state to DIRTY so that we don't miss any more callbacks being added.
- *
- * This function is called with the icloglock held and returns with it held. We
- * drop it while running callbacks, however, as holding it over thousands of
- * callbacks is unnecessary and causes excessive contention if we do.
- */
-static void
-xlog_state_do_iclog_callbacks(
-	struct xlog		*log,
-	struct xlog_in_core	*iclog)
-{
-	LIST_HEAD(tmp);
-
-	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
-
-	spin_lock(&iclog->ic_callback_lock);
-	list_splice_init(&iclog->ic_callbacks, &tmp);
-	spin_unlock(&iclog->ic_callback_lock);
-
-	xlog_cil_process_committed(&tmp);
-	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
-}
-
 STATIC void
 xlog_state_do_callback(
 	struct xlog		*log)
@@ -2814,6 +2787,8 @@ xlog_state_do_callback(
 		repeats++;
 
 		do {
+			LIST_HEAD(cb_list);
+
 			if (xlog_state_iodone_process_iclog(log, iclog,
 							&ioerror))
 				break;
@@ -2823,9 +2798,12 @@ xlog_state_do_callback(
 				iclog = iclog->ic_next;
 				continue;
 			}
+			list_splice_init(&iclog->ic_callbacks, &cb_list);
 			spin_unlock(&log->l_icloglock);
 
-			xlog_state_do_iclog_callbacks(log, iclog);
+			trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
+			xlog_cil_process_committed(&cb_list);
+			trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
 			cycled_icloglock = true;
 
 			spin_lock(&log->l_icloglock);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 3c2b1205944d..27bed1d9cf29 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -873,15 +873,21 @@ xlog_cil_push_work(
 
 	xfs_log_ticket_ungrant(log, tic);
 
-	spin_lock(&commit_iclog->ic_callback_lock);
+	/*
+	 * Once we attach the ctx to the iclog, a shutdown can process the
+	 * iclog, run the callbacks and free the ctx. The only thing preventing
+	 * this potential UAF situation here is that we are holding the
+	 * icloglock. Hence we cannot access the ctx after we have attached the
+	 * callbacks and dropped the icloglock.
+	 */
+	spin_lock(&log->l_icloglock);
 	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
-		spin_unlock(&commit_iclog->ic_callback_lock);
+		spin_unlock(&log->l_icloglock);
 		goto out_abort;
 	}
 	ASSERT_ALWAYS(commit_iclog->ic_state == XLOG_STATE_ACTIVE ||
 		      commit_iclog->ic_state == XLOG_STATE_WANT_SYNC);
 	list_add_tail(&ctx->iclog_entry, &commit_iclog->ic_callbacks);
-	spin_unlock(&commit_iclog->ic_callback_lock);
 
 	/*
 	 * now the checkpoint commit is complete and we've attached the
@@ -898,8 +904,10 @@ xlog_cil_push_work(
 	 * iclogs to complete before we submit the commit_iclog. In this case,
 	 * the commit_iclog write needs to issue a pre-flush so that the
 	 * ordering is correctly preserved down to stable storage.
+	 *
+	 * NOTE: It is not safe reference the ctx after this check as we drop
+	 * the icloglock if we have to wait for completion of other iclogs.
 	 */
-	spin_lock(&log->l_icloglock);
 	if (ctx->start_lsn != commit_lsn) {
 		xlog_wait_on_iclog(commit_iclog->ic_prev);
 		spin_lock(&log->l_icloglock);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 293d82b1fc0d..4c41bbfa33b0 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -216,9 +216,6 @@ typedef struct xlog_in_core {
 	enum xlog_iclog_state	ic_state;
 	unsigned int		ic_flags;
 	char			*ic_datap;	/* pointer to iclog data */
-
-	/* Callback structures need their own cacheline */
-	spinlock_t		ic_callback_lock ____cacheline_aligned_in_smp;
 	struct list_head	ic_callbacks;
 
 	/* reference counts need their own cacheline */
-- 
2.31.1



* [PATCH 4/4] xfs: don't wait on future iclogs when pushing the CIL
  2021-06-22  4:06 [PATCH 0/4] xfs: fix CIL shutdown UAF and shutdown hang Dave Chinner
                   ` (2 preceding siblings ...)
  2021-06-22  4:06 ` [PATCH 3/4] xfs: Fix a CIL UAF by getting rid of the iclog callback lock Dave Chinner
@ 2021-06-22  4:06 ` Dave Chinner
  2021-06-22 12:41   ` Brian Foster
  2021-06-25 21:02   ` Darrick J. Wong
  3 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2021-06-22  4:06 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The iclogbuf ring attached to the struct xlog is circular, hence the
first and last iclogs in the ring can only be determined by
comparing them against the log->l_iclog pointer.

In xfs_cil_push_work(), we want to wait on previous iclogs that were
issued so that we can flush them to stable storage with the commit
record write, and it simply waits on the previous iclog in the ring.
This, however, leads to CIL push hangs in generic/019 like so:

task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
Workqueue: xfs-cil/pmem1 xlog_cil_push_work
Call Trace:
 __schedule+0x30b/0x9f0
 schedule+0x68/0xe0
 xlog_wait_on_iclog+0x121/0x190
 ? wake_up_q+0xa0/0xa0
 xlog_cil_push_work+0x994/0xa10
 ? _raw_spin_lock+0x15/0x20
 ? xfs_swap_extents+0x920/0x920
 process_one_work+0x1ab/0x390
 worker_thread+0x56/0x3d0
 ? rescuer_thread+0x3c0/0x3c0
 kthread+0x14d/0x170
 ? __kthread_bind_mask+0x70/0x70
 ret_from_fork+0x1f/0x30

With other threads blocking in either xlog_state_get_iclog_space()
waiting for iclog space or xlog_grant_head_wait() waiting for log
reservation space.

The problem here is that the previous iclog on the ring might
actually be a future iclog. That is, if log->l_iclog points at
commit_iclog, commit_iclog is the first (oldest) iclog in the ring
and there are no previous iclogs pending as they have all completed
their IO and been activated again. IOWs, commit_iclog->ic_prev
points to an iclog that will be written in the future, not one that
has been written in the past.

Hence, in this case, waiting on the ->ic_prev iclog is incorrect
behaviour, and depending on the state of the future iclog, we can
end up with a circular ABA wait cycle and we hang.

The fix is made more complex by the fact that many iclog states
cannot be used to determine if the iclog is a past or future iclog.
Hence we have to determine past iclogs by checking the LSN of the
iclog rather than their state. A past ACTIVE iclog will have a LSN
of zero, while a future ACTIVE iclog will have a LSN greater than
the current iclog. We don't wait on either of these cases.

Similarly, a future iclog that hasn't completed IO will have an LSN
greater than the current iclog and so we don't wait on them. A past
iclog that is still undergoing IO completion will have a LSN less
than the current iclog and those are the only iclogs that we need to
wait on.

Hence we can use the iclog LSN to determine what iclogs we need to
wait on here.

Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 27bed1d9cf29..83a932878177 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -877,7 +877,7 @@ xlog_cil_push_work(
 	 * Once we attach the ctx to the iclog, a shutdown can process the
 	 * iclog, run the callbacks and free the ctx. The only thing preventing
 	 * this potential UAF situation here is that we are holding the
-	 * icloglock. Hence we cannot access the ctx after we have attached the
+	 * icloglock. Hence we cannot access the ctx once we have attached the
 	 * callbacks and dropped the icloglock.
 	 */
 	spin_lock(&log->l_icloglock);
@@ -900,17 +900,38 @@ xlog_cil_push_work(
 	spin_unlock(&cil->xc_push_lock);
 
 	/*
-	 * If the checkpoint spans multiple iclogs, wait for all previous
-	 * iclogs to complete before we submit the commit_iclog. In this case,
-	 * the commit_iclog write needs to issue a pre-flush so that the
-	 * ordering is correctly preserved down to stable storage.
+	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
+	 * to complete before we submit the commit_iclog. We can't use state
+	 * checks for this - ACTIVE can be either a past completed iclog or a
+	 * future iclog being filled, while WANT_SYNC through SYNC_DONE can be a
+	 * past or future iclog awaiting IO or ordered IO completion to be run.
+	 * In the latter case, if it's a future iclog and we wait on it, the we
+	 * will hang because it won't get processed through to ic_force_wait
+	 * wakeup until this commit_iclog is written to disk.  Hence we use the
+	 * iclog header lsn and compare it to the commit lsn to determine if we
+	 * need to wait on iclogs or not.
 	 *
 	 * NOTE: It is not safe reference the ctx after this check as we drop
 	 * the icloglock if we have to wait for completion of other iclogs.
 	 */
 	if (ctx->start_lsn != commit_lsn) {
-		xlog_wait_on_iclog(commit_iclog->ic_prev);
-		spin_lock(&log->l_icloglock);
+		xfs_lsn_t	plsn;
+
+		plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn);
+		if (plsn && XFS_LSN_CMP(plsn, commit_lsn) < 0) {
+			/*
+			 * Waiting on ic_force_wait orders the completion of
+			 * iclogs older than ic_prev. Hence we only need to wait
+			 * on the most recent older iclog here.
+			 */
+			xlog_wait_on_iclog(commit_iclog->ic_prev);
+			spin_lock(&log->l_icloglock);
+		}
+
+		/*
+		 * We need to issue a pre-flush so that the ordering for this
+		 * checkpoint is correctly preserved down to stable storage.
+		 */
 		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
 	}
 
-- 
2.31.1



* Re: [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock
  2021-06-22  4:06 ` [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock Dave Chinner
@ 2021-06-22 12:38   ` Brian Foster
  2021-06-22 22:42     ` Dave Chinner
  2021-06-25 20:52   ` Darrick J. Wong
  1 sibling, 1 reply; 16+ messages in thread
From: Brian Foster @ 2021-06-22 12:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 22, 2021 at 02:06:01PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's completely unnecessary because callbacks are added to iclogs
> without holding the icloglock, hence no amount of ordering between
> the icloglock and ic_callback_lock will order the removal of
> callbacks from the iclog.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 18 ++++--------------
>  1 file changed, 4 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index e93cac6b5378..bb4390942275 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2773,11 +2773,8 @@ static void
>  xlog_state_do_iclog_callbacks(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog)
> -		__releases(&log->l_icloglock)
> -		__acquires(&log->l_icloglock)
>  {
>  	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
> -	spin_unlock(&log->l_icloglock);
>  	spin_lock(&iclog->ic_callback_lock);
>  	while (!list_empty(&iclog->ic_callbacks)) {
>  		LIST_HEAD(tmp);
> @@ -2789,12 +2786,6 @@ xlog_state_do_iclog_callbacks(
>  		spin_lock(&iclog->ic_callback_lock);
>  	}
>  
> -	/*
> -	 * Pick up the icloglock while still holding the callback lock so we
> -	 * serialise against anyone trying to add more callbacks to this iclog
> -	 * now we've finished processing.
> -	 */

This makes sense wrt the current locking, but I'd like to better
understand what's being removed. When would we add callbacks to an iclog
that's made it to this stage (i.e., already completed I/O)? Is this some
historical case or attempt at defensive logic?

Brian

> -	spin_lock(&log->l_icloglock);
>  	spin_unlock(&iclog->ic_callback_lock);
>  	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
>  }
> @@ -2836,13 +2827,12 @@ xlog_state_do_callback(
>  				iclog = iclog->ic_next;
>  				continue;
>  			}
> +			spin_unlock(&log->l_icloglock);
>  
> -			/*
> -			 * Running callbacks will drop the icloglock which means
> -			 * we'll have to run at least one more complete loop.
> -			 */
> -			cycled_icloglock = true;
>  			xlog_state_do_iclog_callbacks(log, iclog);
> +			cycled_icloglock = true;
> +
> +			spin_lock(&log->l_icloglock);
>  			if (XLOG_FORCED_SHUTDOWN(log))
>  				wake_up_all(&iclog->ic_force_wait);
>  			else
> -- 
> 2.31.1
> 



* Re: [PATCH 2/4] xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks
  2021-06-22  4:06 ` [PATCH 2/4] xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks Dave Chinner
@ 2021-06-22 12:39   ` Brian Foster
  2021-06-22 22:56     ` Dave Chinner
  2021-06-25 20:57   ` Darrick J. Wong
  1 sibling, 1 reply; 16+ messages in thread
From: Brian Foster @ 2021-06-22 12:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 22, 2021 at 02:06:02PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we are processing callbacks on an iclog, nothing can be
> concurrently adding callbacks to the loop. We only add callbacks to
> the iclog when they are in ACTIVE or WANT_SYNC state, and we
> explicitly do not add callbacks if the iclog is already in IOERROR
> state.
> 
> The only way to have a dequeue racing with an enqueue is to be
> processing a shutdown without a direct reference to an iclog in
> ACTIVE or WANT_SYNC state. As the enqueue avoids this race
> condition, we only ever need a single dequeue operation in
> xlog_state_do_iclog_callbacks(). Hence we can remove the loop.
> 

This sort of relates to my question on the previous patch..

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_log.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index bb4390942275..05b00fa4d661 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2774,19 +2774,15 @@ xlog_state_do_iclog_callbacks(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog)
>  {
> -	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
> -	spin_lock(&iclog->ic_callback_lock);
> -	while (!list_empty(&iclog->ic_callbacks)) {
> -		LIST_HEAD(tmp);
> +	LIST_HEAD(tmp);
>  
> -		list_splice_init(&iclog->ic_callbacks, &tmp);
> -
> -		spin_unlock(&iclog->ic_callback_lock);
> -		xlog_cil_process_committed(&tmp);
> -		spin_lock(&iclog->ic_callback_lock);
> -	}
> +	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
>  
> +	spin_lock(&iclog->ic_callback_lock);
> +	list_splice_init(&iclog->ic_callbacks, &tmp);
>  	spin_unlock(&iclog->ic_callback_lock);
> +
> +	xlog_cil_process_committed(&tmp);
>  	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
>  }
>  
> -- 
> 2.31.1
> 



* Re: [PATCH 3/4] xfs: Fix a CIL UAF by getting rid of the iclog callback lock
  2021-06-22  4:06 ` [PATCH 3/4] xfs: Fix a CIL UAF by getting rid of the iclog callback lock Dave Chinner
@ 2021-06-22 12:41   ` Brian Foster
  2021-06-25 21:02   ` Darrick J. Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Brian Foster @ 2021-06-22 12:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 22, 2021 at 02:06:03PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The iclog callback chain has its own lock. That was added way back
> in 2008 by myself to alleviate severe lock contention on the
> icloglock in commit 114d23aae512 ("[XFS] Per iclog callback chain
> lock"). This was long before delayed logging took the icloglock out
> of the hot transaction commit path and removed all contention on it.
> Hence the separate ic_callback_lock doesn't serve any scalability
> purpose anymore, and hasn't for close on a decade.
> 
> Further, we only attach callbacks to iclogs in one place where we
> are already taking the icloglock soon after attaching the callbacks.
> We also have to drop the icloglock to run callbacks and grab it
> immediately afterwards again. So given that the icloglock is no
> longer hot, making it cover callbacks again doesn't really change
> the locking patterns very much at all.
> 
> We also need to extend the icloglock to cover callback addition to
> fix a zero-day UAF in the CIL push code. This occurs when shutdown
> races with xlog_cil_push_work() and the shutdown runs the callbacks
> before the push releases the iclog. This results in the CIL context
> structure attached to the iclog being freed by the callback before
> the CIL push has finished referencing it, leading to UAF bugs.
> 
> Hence, to avoid this UAF, we need the callback attachment to be
> atomic with post processing of the commit iclog and references to
> the structures being attached to the iclog. This requires holding
> the icloglock as that's the only way to serialise iclog state
> against a shutdown in progress.
> 
> The result is we need to be using the icloglock to protect the
> callback list addition and removal and serialise them with shutdown.
> That makes the ic_callback_lock redundant and so it can be removed.
> 
> Fixes: 71e330b59390 ("xfs: Introduce delayed logging core code")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 34 ++++++----------------------------
>  fs/xfs/xfs_log_cil.c  | 16 ++++++++++++----
>  fs/xfs/xfs_log_priv.h |  3 ---
>  3 files changed, 18 insertions(+), 35 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 3c2b1205944d..27bed1d9cf29 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
...
> @@ -898,8 +904,10 @@ xlog_cil_push_work(
>  	 * iclogs to complete before we submit the commit_iclog. In this case,
>  	 * the commit_iclog write needs to issue a pre-flush so that the
>  	 * ordering is correctly preserved down to stable storage.
> +	 *
> +	 * NOTE: It is not safe reference the ctx after this check as we drop

			   safe to reference

> +	 * the icloglock if we have to wait for completion of other iclogs.
>  	 */

Also, it's probably more clear to just say it's not safe to access the
ctx once we drop the lock since the conditional lock cycle is obvious
from the code. Otherwise:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> -	spin_lock(&log->l_icloglock);
>  	if (ctx->start_lsn != commit_lsn) {
>  		xlog_wait_on_iclog(commit_iclog->ic_prev);
>  		spin_lock(&log->l_icloglock);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 293d82b1fc0d..4c41bbfa33b0 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -216,9 +216,6 @@ typedef struct xlog_in_core {
>  	enum xlog_iclog_state	ic_state;
>  	unsigned int		ic_flags;
>  	char			*ic_datap;	/* pointer to iclog data */
> -
> -	/* Callback structures need their own cacheline */
> -	spinlock_t		ic_callback_lock ____cacheline_aligned_in_smp;
>  	struct list_head	ic_callbacks;
>  
>  	/* reference counts need their own cacheline */
> -- 
> 2.31.1
> 



* Re: [PATCH 4/4] xfs: don't wait on future iclogs when pushing the CIL
  2021-06-22  4:06 ` [PATCH 4/4] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
@ 2021-06-22 12:41   ` Brian Foster
  2021-06-25 21:02   ` Darrick J. Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Brian Foster @ 2021-06-22 12:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 22, 2021 at 02:06:04PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The iclogbuf ring attached to the struct xlog is circular, hence the
> first and last iclogs in the ring can only be determined by
> comparing them against the log->l_iclog pointer.
> 
> In xfs_cil_push_work(), we want to wait on previous iclogs that were
> issued so that we can flush them to stable storage with the commit
> record write, and it simply waits on the previous iclog in the ring.
> This, however, leads to CIL push hangs in generic/019 like so:
> 
> task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
> Workqueue: xfs-cil/pmem1 xlog_cil_push_work
> Call Trace:
>  __schedule+0x30b/0x9f0
>  schedule+0x68/0xe0
>  xlog_wait_on_iclog+0x121/0x190
>  ? wake_up_q+0xa0/0xa0
>  xlog_cil_push_work+0x994/0xa10
>  ? _raw_spin_lock+0x15/0x20
>  ? xfs_swap_extents+0x920/0x920
>  process_one_work+0x1ab/0x390
>  worker_thread+0x56/0x3d0
>  ? rescuer_thread+0x3c0/0x3c0
>  kthread+0x14d/0x170
>  ? __kthread_bind_mask+0x70/0x70
>  ret_from_fork+0x1f/0x30
> 
> With other threads blocking in either xlog_state_get_iclog_space()
> waiting for iclog space or xlog_grant_head_wait() waiting for log
> reservation space.
> 
> The problem here is that the previous iclog on the ring might
> actually be a future iclog. That is, if log->l_iclog points at
> commit_iclog, commit_iclog is the first (oldest) iclog in the ring
> and there are no previous iclogs pending as they have all completed
> their IO and been activated again. IOWs, commit_iclog->ic_prev
> points to an iclog that will be written in the future, not one that
> has been written in the past.
> 
> Hence, in this case, waiting on the ->ic_prev iclog is incorrect
> behaviour, and depending on the state of the future iclog, we can
> end up with a circular ABA wait cycle and we hang.
> 
> The fix is made more complex by the fact that many iclog states
> cannot be used to determine if the iclog is a past or future iclog.
> Hence we have to determine past iclogs by checking the LSN of the
> iclog rather than their state. A past ACTIVE iclog will have a LSN
> of zero, while a future ACTIVE iclog will have a LSN greater than
> the current iclog. We don't wait on either of these cases.
> 
> Similarly, a future iclog that hasn't completed IO will have an LSN
> greater than the current iclog and so we don't wait on them. A past
> iclog that is still undergoing IO completion will have a LSN less
> than the current iclog and those are the only iclogs that we need to
> wait on.
> 
> Hence we can use the iclog LSN to determine what iclogs we need to
> wait on here.
> 
> Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_log_cil.c | 35 ++++++++++++++++++++++++++++-------
>  1 file changed, 28 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 27bed1d9cf29..83a932878177 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -877,7 +877,7 @@ xlog_cil_push_work(
>  	 * Once we attach the ctx to the iclog, a shutdown can process the
>  	 * iclog, run the callbacks and free the ctx. The only thing preventing
>  	 * this potential UAF situation here is that we are holding the
> -	 * icloglock. Hence we cannot access the ctx after we have attached the
> +	 * icloglock. Hence we cannot access the ctx once we have attached the
>  	 * callbacks and dropped the icloglock.
>  	 */
>  	spin_lock(&log->l_icloglock);
> @@ -900,17 +900,38 @@ xlog_cil_push_work(
>  	spin_unlock(&cil->xc_push_lock);
>  
>  	/*
> -	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog. In this case,
> -	 * the commit_iclog write needs to issue a pre-flush so that the
> -	 * ordering is correctly preserved down to stable storage.
> +	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
> +	 * to complete before we submit the commit_iclog. We can't use state
> +	 * checks for this - ACTIVE can be either a past completed iclog or a
> +	 * future iclog being filled, while WANT_SYNC through SYNC_DONE can be a
> +	 * past or future iclog awaiting IO or ordered IO completion to be run.
> +	 * In the latter case, if it's a future iclog and we wait on it, the we
> +	 * will hang because it won't get processed through to ic_force_wait
> +	 * wakeup until this commit_iclog is written to disk.  Hence we use the
> +	 * iclog header lsn and compare it to the commit lsn to determine if we
> +	 * need to wait on iclogs or not.
>  	 *
>  	 * NOTE: It is not safe reference the ctx after this check as we drop
>  	 * the icloglock if we have to wait for completion of other iclogs.
>  	 */
>  	if (ctx->start_lsn != commit_lsn) {
> -		xlog_wait_on_iclog(commit_iclog->ic_prev);
> -		spin_lock(&log->l_icloglock);
> +		xfs_lsn_t	plsn;
> +
> +		plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn);
> +		if (plsn && XFS_LSN_CMP(plsn, commit_lsn) < 0) {
> +			/*
> +			 * Waiting on ic_force_wait orders the completion of
> +			 * iclogs older than ic_prev. Hence we only need to wait
> +			 * on the most recent older iclog here.
> +			 */
> +			xlog_wait_on_iclog(commit_iclog->ic_prev);
> +			spin_lock(&log->l_icloglock);
> +		}
> +
> +		/*
> +		 * We need to issue a pre-flush so that the ordering for this
> +		 * checkpoint is correctly preserved down to stable storage.
> +		 */
>  		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>  	}
>  
> -- 
> 2.31.1
> 



* Re: [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock
  2021-06-22 12:38   ` Brian Foster
@ 2021-06-22 22:42     ` Dave Chinner
  2021-06-23 10:18       ` Brian Foster
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2021-06-22 22:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 22, 2021 at 08:38:56AM -0400, Brian Foster wrote:
> On Tue, Jun 22, 2021 at 02:06:01PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > It's completely unnecessary because callbacks are added to iclogs
> > without holding the icloglock, hence no amount of ordering between
> > the icloglock and ic_callback_lock will order the removal of
> > callbacks from the iclog.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c | 18 ++++--------------
> >  1 file changed, 4 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index e93cac6b5378..bb4390942275 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -2773,11 +2773,8 @@ static void
> >  xlog_state_do_iclog_callbacks(
> >  	struct xlog		*log,
> >  	struct xlog_in_core	*iclog)
> > -		__releases(&log->l_icloglock)
> > -		__acquires(&log->l_icloglock)
> >  {
> >  	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
> > -	spin_unlock(&log->l_icloglock);
> >  	spin_lock(&iclog->ic_callback_lock);
> >  	while (!list_empty(&iclog->ic_callbacks)) {
> >  		LIST_HEAD(tmp);
> > @@ -2789,12 +2786,6 @@ xlog_state_do_iclog_callbacks(
> >  		spin_lock(&iclog->ic_callback_lock);
> >  	}
> >  
> > -	/*
> > -	 * Pick up the icloglock while still holding the callback lock so we
> > -	 * serialise against anyone trying to add more callbacks to this iclog
> > -	 * now we've finished processing.
> > -	 */
> 
> This makes sense wrt to the current locking, but I'd like to better
> understand what's being removed. When would we add callbacks to an iclog
> that's made it to this stage (i.e., already completed I/O)? Is this some
> historical case or attempt at defensive logic?

This was done in 2008. It's very likely that, at the time, nobody
(including me) understood the iclog state machine well enough to
determine if we could race with callbacks being added at this point.
Maybe they did race and this was a bandaid over, say, a shutdown
race condition.
But, more likely, it was just defensive to try to prevent callbacks
from being added before the iclog was marked ACTIVE again...

Really, though, nobody is going to be able to tell you why the code
was written like this in the first place because even the author
doesn't remember...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/4] xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks
  2021-06-22 12:39   ` Brian Foster
@ 2021-06-22 22:56     ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2021-06-22 22:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 22, 2021 at 08:39:07AM -0400, Brian Foster wrote:
> On Tue, Jun 22, 2021 at 02:06:02PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > If we are processing callbacks on an iclog, nothing can be
> > concurrently adding callbacks to the loop. We only add callbacks to
> > the iclog when they are in ACTIVE or WANT_SYNC state, and we
> > explicitly do not add callbacks if the iclog is already in IOERROR
> > state.
> > 
> > The only way to have a dequeue racing with an enqueue is to be
> > processing a shutdown without a direct reference to an iclog in
> > ACTIVE or WANT_SYNC state. As the enqueue avoids this race
> > condition, we only ever need a single dequeue operation in
> > xlog_state_do_iclog_callbacks(). Hence we can remove the loop.
> > 
> 
> This sort of relates to my question on the previous patch..

Been that way since 1995:

commit fdae46676ab5d359d02d955c989b20b18e2a97f8
Author: Adam Sweeney <ajs@sgi.com>
Date:   Thu May 4 20:54:43 1995 +0000

    275579 - - Fix timing bug in the log callback code.  Callbacks
    must be queued until the incore log buffer goes to the dirty
    state.

......
               /*
+                * Keep processing entries in the callback list
+                * until we come around and it is empty.  We need
+                * to atomically see that the list is empty and change
+                * the state to DIRTY so that we don't miss any more
+                * callbacks being added.
+                */
                spl = LOG_LOCK(log);
+               cb = iclog->ic_callback;
+               while (cb != 0) {
+                       iclog->ic_callback_tail = &(iclog->ic_callback);
+                       iclog->ic_callback = 0;
+                       LOG_UNLOCK(log, spl);
+
+                       /* perform callbacks in the order given */
+                       for (; cb != 0; cb = cb_next) {
+                               cb_next = cb->cb_next;
+                               cb->cb_func(cb->cb_arg);
+                       }
+                       spl = LOG_LOCK(log);
+                       cb = iclog->ic_callback;
+               }
+
+               ASSERT(iclog->ic_callback == 0);

That's likely also what the locking I removed in the previous patch
was attempting to retain - atomic transition to DIRTY state -
without really understanding if it is necessary or not.

The only way I can see this happening now is racing with shutdown
state being set and callbacks being run at the same time a commit
record is being processed. But now we check for shutdown before we
add callbacks to the iclog, and hence we can't be adding callbacks
while shutdown state callbacks are being run. And that's made even
more impossible by this patch set that is serialising all the
shutdown state changes and callback add/remove under the same
lock....

So, yeah, this is largely behaviour that is historic and the
situation that it avoided is unknown and almost certainly doesn't
exist anymore...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock
  2021-06-22 22:42     ` Dave Chinner
@ 2021-06-23 10:18       ` Brian Foster
  0 siblings, 0 replies; 16+ messages in thread
From: Brian Foster @ 2021-06-23 10:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 23, 2021 at 08:42:47AM +1000, Dave Chinner wrote:
> On Tue, Jun 22, 2021 at 08:38:56AM -0400, Brian Foster wrote:
> > On Tue, Jun 22, 2021 at 02:06:01PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > It's completely unnecessary because callbacks are added to iclogs
> > > without holding the icloglock, hence no amount of ordering between
> > > the icloglock and ic_callback_lock will order the removal of
> > > callbacks from the iclog.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log.c | 18 ++++--------------
> > >  1 file changed, 4 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index e93cac6b5378..bb4390942275 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -2773,11 +2773,8 @@ static void
> > >  xlog_state_do_iclog_callbacks(
> > >  	struct xlog		*log,
> > >  	struct xlog_in_core	*iclog)
> > > -		__releases(&log->l_icloglock)
> > > -		__acquires(&log->l_icloglock)
> > >  {
> > >  	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
> > > -	spin_unlock(&log->l_icloglock);
> > >  	spin_lock(&iclog->ic_callback_lock);
> > >  	while (!list_empty(&iclog->ic_callbacks)) {
> > >  		LIST_HEAD(tmp);
> > > @@ -2789,12 +2786,6 @@ xlog_state_do_iclog_callbacks(
> > >  		spin_lock(&iclog->ic_callback_lock);
> > >  	}
> > >  
> > > -	/*
> > > -	 * Pick up the icloglock while still holding the callback lock so we
> > > -	 * serialise against anyone trying to add more callbacks to this iclog
> > > -	 * now we've finished processing.
> > > -	 */
> > 
> > This makes sense wrt to the current locking, but I'd like to better
> > understand what's being removed. When would we add callbacks to an iclog
> > that's made it to this stage (i.e., already completed I/O)? Is this some
> > historical case or attempt at defensive logic?
> 
> This was done in 2008. It's very likely that, at the time, nobody
> (including me) understood the iclog state machine well enough to
> determine if we could race with adding iclogs at this time. Maybe
> they did race and this was a bandaid over, say, a shutdown race condition.
> But, more likely, it was just defensive to try to prevent callbacks
> from being added before the iclog was marked ACTIVE again...
> 
> Really, though, nobody is going to be able to tell you why the code
> was written like this in the first place because even the author
> doesn't remember...
> 

Ok, just wanted to be sure there wasn't some context I was missing. The
patch seems fine to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock
  2021-06-22  4:06 ` [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock Dave Chinner
  2021-06-22 12:38   ` Brian Foster
@ 2021-06-25 20:52   ` Darrick J. Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2021-06-25 20:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 22, 2021 at 02:06:01PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's completely unnecessary because callbacks are added to iclogs
> without holding the icloglock, hence no amount of ordering between
> the icloglock and ic_callback_lock will order the removal of
> callbacks from the iclog.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Seems reasonable so far -- we don't tangle the icloglock with the
callback lock anywhere like this.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c | 18 ++++--------------
>  1 file changed, 4 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index e93cac6b5378..bb4390942275 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2773,11 +2773,8 @@ static void
>  xlog_state_do_iclog_callbacks(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog)
> -		__releases(&log->l_icloglock)
> -		__acquires(&log->l_icloglock)
>  {
>  	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
> -	spin_unlock(&log->l_icloglock);
>  	spin_lock(&iclog->ic_callback_lock);
>  	while (!list_empty(&iclog->ic_callbacks)) {
>  		LIST_HEAD(tmp);
> @@ -2789,12 +2786,6 @@ xlog_state_do_iclog_callbacks(
>  		spin_lock(&iclog->ic_callback_lock);
>  	}
>  
> -	/*
> -	 * Pick up the icloglock while still holding the callback lock so we
> -	 * serialise against anyone trying to add more callbacks to this iclog
> -	 * now we've finished processing.
> -	 */
> -	spin_lock(&log->l_icloglock);
>  	spin_unlock(&iclog->ic_callback_lock);
>  	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
>  }
> @@ -2836,13 +2827,12 @@ xlog_state_do_callback(
>  				iclog = iclog->ic_next;
>  				continue;
>  			}
> +			spin_unlock(&log->l_icloglock);
>  
> -			/*
> -			 * Running callbacks will drop the icloglock which means
> -			 * we'll have to run at least one more complete loop.
> -			 */
> -			cycled_icloglock = true;
>  			xlog_state_do_iclog_callbacks(log, iclog);
> +			cycled_icloglock = true;
> +
> +			spin_lock(&log->l_icloglock);
>  			if (XLOG_FORCED_SHUTDOWN(log))
>  				wake_up_all(&iclog->ic_force_wait);
>  			else
> -- 
> 2.31.1
> 


* Re: [PATCH 2/4] xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks
  2021-06-22  4:06 ` [PATCH 2/4] xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks Dave Chinner
  2021-06-22 12:39   ` Brian Foster
@ 2021-06-25 20:57   ` Darrick J. Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2021-06-25 20:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 22, 2021 at 02:06:02PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we are processing callbacks on an iclog, nothing can be
> concurrently adding callbacks to the loop. We only add callbacks to
> the iclog when they are in ACTIVE or WANT_SYNC state, and we
> explicitly do not add callbacks if the iclog is already in IOERROR
> state.
> 
> The only way to have a dequeue racing with an enqueue is to be
> processing a shutdown without a direct reference to an iclog in
> ACTIVE or WANT_SYNC state. As the enqueue avoids this race
> condition, we only ever need a single dequeue operation in
> xlog_state_do_iclog_callbacks(). Hence we can remove the loop.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Makes sense...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index bb4390942275..05b00fa4d661 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2774,19 +2774,15 @@ xlog_state_do_iclog_callbacks(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog)
>  {
> -	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
> -	spin_lock(&iclog->ic_callback_lock);
> -	while (!list_empty(&iclog->ic_callbacks)) {
> -		LIST_HEAD(tmp);
> +	LIST_HEAD(tmp);
>  
> -		list_splice_init(&iclog->ic_callbacks, &tmp);
> -
> -		spin_unlock(&iclog->ic_callback_lock);
> -		xlog_cil_process_committed(&tmp);
> -		spin_lock(&iclog->ic_callback_lock);
> -	}
> +	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
>  
> +	spin_lock(&iclog->ic_callback_lock);
> +	list_splice_init(&iclog->ic_callbacks, &tmp);
>  	spin_unlock(&iclog->ic_callback_lock);
> +
> +	xlog_cil_process_committed(&tmp);
>  	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
>  }
>  
> -- 
> 2.31.1
> 


* Re: [PATCH 3/4] xfs: Fix a CIL UAF by getting rid of the iclog callback lock
  2021-06-22  4:06 ` [PATCH 3/4] xfs: Fix a CIL UAF by getting rid of the iclog callback lock Dave Chinner
  2021-06-22 12:41   ` Brian Foster
@ 2021-06-25 21:02   ` Darrick J. Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2021-06-25 21:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 22, 2021 at 02:06:03PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The iclog callback chain has its own lock. That was added way back
> in 2008 by myself to alleviate severe lock contention on the
> icloglock in commit 114d23aae512 ("[XFS] Per iclog callback chain
> lock"). This was long before delayed logging took the icloglock out
> of the hot transaction commit path and removed all contention on it.
> Hence the separate ic_callback_lock doesn't serve any scalability
> purpose anymore, and hasn't for close on a decade.
> 
> Further, we only attach callbacks to iclogs in one place where we
> are already taking the icloglock soon after attaching the callbacks.
> We also have to drop the icloglock to run callbacks and grab it
> immediately afterwards again. So given that the icloglock is no
> longer hot, making it cover callbacks again doesn't really change
> the locking patterns very much at all.
> 
> We also need to extend the icloglock to cover callback addition to
> fix a zero-day UAF in the CIL push code. This occurs when shutdown
> races with xlog_cil_push_work() and the shutdown runs the callbacks
> before the push releases the iclog. This results in the CIL context
> structure attached to the iclog being freed by the callback before
> the CIL push has finished referencing it, leading to UAF bugs.
> 
> Hence, to avoid this UAF, we need the callback attachment to be
> atomic with post processing of the commit iclog and references to
> the structures being attached to the iclog. This requires holding
> the icloglock as that's the only way to serialise iclog state
> against a shutdown in progress.
> 
> The result is we need to be using the icloglock to protect the
> callback list addition and removal and serialise them with shutdown.
> That makes the ic_callback_lock redundant and so it can be removed.
> 
> Fixes: 71e330b59390 ("xfs: Introduce delayed logging core code")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 34 ++++++----------------------------
>  fs/xfs/xfs_log_cil.c  | 16 ++++++++++++----
>  fs/xfs/xfs_log_priv.h |  3 ---
>  3 files changed, 18 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 05b00fa4d661..c896c9041b8e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1484,7 +1484,6 @@ xlog_alloc_log(
>  		iclog->ic_state = XLOG_STATE_ACTIVE;
>  		iclog->ic_log = log;
>  		atomic_set(&iclog->ic_refcnt, 0);
> -		spin_lock_init(&iclog->ic_callback_lock);
>  		INIT_LIST_HEAD(&iclog->ic_callbacks);
>  		iclog->ic_datap = (char *)iclog->ic_data + log->l_iclog_hsize;
>  
> @@ -2760,32 +2759,6 @@ xlog_state_iodone_process_iclog(
>  	}
>  }
>  
> -/*
> - * Keep processing entries in the iclog callback list until we come around and
> - * it is empty.  We need to atomically see that the list is empty and change the
> - * state to DIRTY so that we don't miss any more callbacks being added.
> - *
> - * This function is called with the icloglock held and returns with it held. We
> - * drop it while running callbacks, however, as holding it over thousands of
> - * callbacks is unnecessary and causes excessive contention if we do.
> - */
> -static void
> -xlog_state_do_iclog_callbacks(
> -	struct xlog		*log,
> -	struct xlog_in_core	*iclog)
> -{
> -	LIST_HEAD(tmp);
> -
> -	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
> -
> -	spin_lock(&iclog->ic_callback_lock);
> -	list_splice_init(&iclog->ic_callbacks, &tmp);
> -	spin_unlock(&iclog->ic_callback_lock);
> -
> -	xlog_cil_process_committed(&tmp);
> -	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
> -}
> -
>  STATIC void
>  xlog_state_do_callback(
>  	struct xlog		*log)
> @@ -2814,6 +2787,8 @@ xlog_state_do_callback(
>  		repeats++;
>  
>  		do {
> +			LIST_HEAD(cb_list);
> +
>  			if (xlog_state_iodone_process_iclog(log, iclog,
>  							&ioerror))
>  				break;
> @@ -2823,9 +2798,12 @@ xlog_state_do_callback(
>  				iclog = iclog->ic_next;
>  				continue;
>  			}
> +			list_splice_init(&iclog->ic_callbacks, &cb_list);
>  			spin_unlock(&log->l_icloglock);
>  
> -			xlog_state_do_iclog_callbacks(log, iclog);
> +			trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
> +			xlog_cil_process_committed(&cb_list);
> +			trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
>  			cycled_icloglock = true;
>  
>  			spin_lock(&log->l_icloglock);
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 3c2b1205944d..27bed1d9cf29 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -873,15 +873,21 @@ xlog_cil_push_work(
>  
>  	xfs_log_ticket_ungrant(log, tic);
>  
> -	spin_lock(&commit_iclog->ic_callback_lock);
> +	/*
> +	 * Once we attach the ctx to the iclog, a shutdown can process the
> +	 * iclog, run the callbacks and free the ctx. The only thing preventing
> +	 * this potential UAF situation here is that we are holding the
> +	 * icloglock. Hence we cannot access the ctx after we have attached the
> +	 * callbacks and dropped the icloglock.
> +	 */
> +	spin_lock(&log->l_icloglock);
>  	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
> -		spin_unlock(&commit_iclog->ic_callback_lock);
> +		spin_unlock(&log->l_icloglock);
>  		goto out_abort;
>  	}
>  	ASSERT_ALWAYS(commit_iclog->ic_state == XLOG_STATE_ACTIVE ||
>  		      commit_iclog->ic_state == XLOG_STATE_WANT_SYNC);
>  	list_add_tail(&ctx->iclog_entry, &commit_iclog->ic_callbacks);
> -	spin_unlock(&commit_iclog->ic_callback_lock);
>  
>  	/*
>  	 * now the checkpoint commit is complete and we've attached the
> @@ -898,8 +904,10 @@ xlog_cil_push_work(
>  	 * iclogs to complete before we submit the commit_iclog. In this case,
>  	 * the commit_iclog write needs to issue a pre-flush so that the
>  	 * ordering is correctly preserved down to stable storage.
> +	 *
> +	 * NOTE: It is not safe reference the ctx after this check as we drop

"...not safe to reference..."

Hm, we're extending the icloglock to protect the callback list of an
iclog, which (afaict) looks safe since we require callers to hold the
icloglock to mess with any iclog's state.
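
For illustration, here is a minimal user-space model of that pattern:
callbacks are attached to and spliced off the iclog only while a single
lock is held, and are only run after that lock has been dropped. All
names below are hypothetical stand-ins for the sketch, not the actual
xfs code:

/*
 * Sketch only: fake_iclog.lock stands in for log->l_icloglock and
 * fake_iclog.callbacks for iclog->ic_callbacks.
 */
#include <pthread.h>
#include <stddef.h>

struct callback {
	struct callback	*next;
	void		(*fn)(void *arg);
	void		*arg;
};

struct fake_iclog {
	pthread_mutex_t	lock;		/* stands in for l_icloglock */
	struct callback	*callbacks;	/* stands in for ic_callbacks */
	int		ioerror;	/* stands in for XLOG_STATE_IOERROR */
};

/* Commit side: attach a callback, erroring out if a shutdown beat us. */
static int attach_callback(struct fake_iclog *ic, struct callback *cb)
{
	pthread_mutex_lock(&ic->lock);
	if (ic->ioerror) {
		pthread_mutex_unlock(&ic->lock);
		return -1;		/* caller aborts, as in the patch */
	}
	cb->next = ic->callbacks;
	ic->callbacks = cb;
	pthread_mutex_unlock(&ic->lock);

	/*
	 * Once the lock is dropped, the completion side may run and free
	 * the context behind cb->arg, so the attaching thread must not
	 * touch it again - the same rule the patch documents for the ctx.
	 */
	return 0;
}

/* Completion side: splice the list off under the lock, run it unlocked. */
static void run_callbacks(struct fake_iclog *ic)
{
	struct callback	*list, *next;

	pthread_mutex_lock(&ic->lock);
	list = ic->callbacks;
	ic->callbacks = NULL;
	pthread_mutex_unlock(&ic->lock);

	for (; list; list = next) {
		next = list->next;
		list->fn(list->arg);	/* may free the context behind arg */
	}
}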

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> +	 * the icloglock if we have to wait for completion of other iclogs.
>  	 */
> -	spin_lock(&log->l_icloglock);
>  	if (ctx->start_lsn != commit_lsn) {
>  		xlog_wait_on_iclog(commit_iclog->ic_prev);
>  		spin_lock(&log->l_icloglock);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 293d82b1fc0d..4c41bbfa33b0 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -216,9 +216,6 @@ typedef struct xlog_in_core {
>  	enum xlog_iclog_state	ic_state;
>  	unsigned int		ic_flags;
>  	char			*ic_datap;	/* pointer to iclog data */
> -
> -	/* Callback structures need their own cacheline */
> -	spinlock_t		ic_callback_lock ____cacheline_aligned_in_smp;
>  	struct list_head	ic_callbacks;
>  
>  	/* reference counts need their own cacheline */
> -- 
> 2.31.1
> 


* Re: [PATCH 4/4] xfs: don't wait on future iclogs when pushing the CIL
  2021-06-22  4:06 ` [PATCH 4/4] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
  2021-06-22 12:41   ` Brian Foster
@ 2021-06-25 21:02   ` Darrick J. Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2021-06-25 21:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 22, 2021 at 02:06:04PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The iclogbuf ring attached to the struct xlog is circular, hence the
> first and last iclogs in the ring can only be determined by
> comparing them against the log->l_iclog pointer.
> 
> In xlog_cil_push_work(), we want to wait on previous iclogs that were
> issued so that we can flush them to stable storage with the commit
> record write, and the current code simply waits on the previous iclog
> in the ring.
> This, however, leads to CIL push hangs in generic/019 like so:
> 
> task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
> Workqueue: xfs-cil/pmem1 xlog_cil_push_work
> Call Trace:
>  __schedule+0x30b/0x9f0
>  schedule+0x68/0xe0
>  xlog_wait_on_iclog+0x121/0x190
>  ? wake_up_q+0xa0/0xa0
>  xlog_cil_push_work+0x994/0xa10
>  ? _raw_spin_lock+0x15/0x20
>  ? xfs_swap_extents+0x920/0x920
>  process_one_work+0x1ab/0x390
>  worker_thread+0x56/0x3d0
>  ? rescuer_thread+0x3c0/0x3c0
>  kthread+0x14d/0x170
>  ? __kthread_bind_mask+0x70/0x70
>  ret_from_fork+0x1f/0x30
> 
> With other threads blocking in either xlog_state_get_iclog_space()
> waiting for iclog space or xlog_grant_head_wait() waiting for log
> reservation space.
> 
> The problem here is that the previous iclog on the ring might
> actually be a future iclog. That is, if log->l_iclog points at
> commit_iclog, commit_iclog is the first (oldest) iclog in the ring
> and there are no previous iclogs pending as they have all completed
> their IO and been activated again. IOWs, commit_iclog->ic_prev
> points to an iclog that will be written in the future, not one that
> has been written in the past.
> 
> Hence, in this case, waiting on the ->ic_prev iclog is incorrect
> behaviour, and depending on the state of the future iclog, we can
> end up with a circular ABA wait cycle and we hang.
> 
> The fix is made more complex by the fact that many iclog states
> cannot be used to determine if the iclog is a past or future iclog.
> Hence we have to determine past iclogs by checking the LSN of the
> iclog rather than its state. A past ACTIVE iclog will have an LSN
> of zero, while a future ACTIVE iclog will have an LSN greater than
> that of the current iclog. We don't wait on either of these cases.
> 
> Similarly, a future iclog that hasn't completed IO will have an LSN
> greater than that of the current iclog, so we don't wait on it. A
> past iclog that is still undergoing IO completion will have an LSN
> less than that of the current iclog, and those are the only iclogs
> that we need to wait on.
> 
> Hence we can use the iclog LSN to determine what iclogs we need to
> wait on here.
> 
> Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Ok, looks good to me.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log_cil.c | 35 ++++++++++++++++++++++++++++-------
>  1 file changed, 28 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 27bed1d9cf29..83a932878177 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -877,7 +877,7 @@ xlog_cil_push_work(
>  	 * Once we attach the ctx to the iclog, a shutdown can process the
>  	 * iclog, run the callbacks and free the ctx. The only thing preventing
>  	 * this potential UAF situation here is that we are holding the
> -	 * icloglock. Hence we cannot access the ctx after we have attached the
> +	 * icloglock. Hence we cannot access the ctx once we have attached the
>  	 * callbacks and dropped the icloglock.
>  	 */
>  	spin_lock(&log->l_icloglock);
> @@ -900,17 +900,38 @@ xlog_cil_push_work(
>  	spin_unlock(&cil->xc_push_lock);
>  
>  	/*
> -	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog. In this case,
> -	 * the commit_iclog write needs to issue a pre-flush so that the
> -	 * ordering is correctly preserved down to stable storage.
> +	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
> +	 * to complete before we submit the commit_iclog. We can't use state
> +	 * checks for this - ACTIVE can be either a past completed iclog or a
> +	 * future iclog being filled, while WANT_SYNC through SYNC_DONE can be a
> +	 * past or future iclog awaiting IO or ordered IO completion to be run.
> +	 * past or future iclog awaiting IO or ordered IO completion to be run.
> +	 * In the latter case, if it's a future iclog and we wait on it, then we
> +	 * will hang because it won't get processed through to ic_force_wait
> +	 * wakeup until this commit_iclog is written to disk.  Hence we use the
> +	 * iclog header lsn and compare it to the commit lsn to determine if we
> +	 * need to wait on iclogs or not.
>  	 *
>  	 * NOTE: It is not safe reference the ctx after this check as we drop
>  	 * the icloglock if we have to wait for completion of other iclogs.
>  	 */
>  	if (ctx->start_lsn != commit_lsn) {
> -		xlog_wait_on_iclog(commit_iclog->ic_prev);
> -		spin_lock(&log->l_icloglock);
> +		xfs_lsn_t	plsn;
> +
> +		plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn);
> +		if (plsn && XFS_LSN_CMP(plsn, commit_lsn) < 0) {
> +			/*
> +			 * Waiting on ic_force_wait orders the completion of
> +			 * iclogs older than ic_prev. Hence we only need to wait
> +			 * on the most recent older iclog here.
> +			 */
> +			xlog_wait_on_iclog(commit_iclog->ic_prev);
> +			spin_lock(&log->l_icloglock);
> +		}
> +
> +		/*
> +		 * We need to issue a pre-flush so that the ordering for this
> +		 * checkpoint is correctly preserved down to stable storage.
> +		 */
>  		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>  	}
>  
> -- 
> 2.31.1
> 
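
To make the past-vs-future iclog distinction in the commit message
concrete, here is a small stand-alone sketch of the decision the patch
makes. It assumes the usual xfs LSN layout of a cycle number in the
high 32 bits and a block number in the low 32 bits, compared cycle-first
as XFS_LSN_CMP() does; the helper names are made up for the sketch:

#include <stdint.h>
#include <stdbool.h>

typedef int64_t xfs_lsn_t;

/* Compare two LSNs by cycle number first, then by block number. */
static int lsn_cmp(xfs_lsn_t a, xfs_lsn_t b)
{
	uint32_t a_cycle = (uint32_t)(a >> 32), b_cycle = (uint32_t)(b >> 32);
	uint32_t a_block = (uint32_t)a, b_block = (uint32_t)b;

	if (a_cycle != b_cycle)
		return a_cycle < b_cycle ? -1 : 1;
	if (a_block != b_block)
		return a_block < b_block ? -1 : 1;
	return 0;
}

/*
 * Decide whether ic_prev is a past iclog still undergoing IO completion.
 * Only that case is worth waiting for: a zero header LSN means a past,
 * already-completed iclog, and an LSN at or beyond the commit record's
 * LSN means a future iclog that waiting on would deadlock.
 */
static bool must_wait_on_prev(xfs_lsn_t prev_hdr_lsn, xfs_lsn_t commit_lsn)
{
	if (prev_hdr_lsn == 0)
		return false;
	if (lsn_cmp(prev_hdr_lsn, commit_lsn) >= 0)
		return false;
	return true;
}

In the patch itself this corresponds to the
"plsn && XFS_LSN_CMP(plsn, commit_lsn) < 0" test on
commit_iclog->ic_prev->ic_header.h_lsn before calling
xlog_wait_on_iclog().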


end of thread, other threads:[~2021-06-25 21:02 UTC | newest]

Thread overview: 16+ messages
2021-06-22  4:06 [PATCH 0/4] xfs: fix CIL shutdown UAF and shutdown hang Dave Chinner
2021-06-22  4:06 ` [PATCH 1/4] xfs: don't nest icloglock inside ic_callback_lock Dave Chinner
2021-06-22 12:38   ` Brian Foster
2021-06-22 22:42     ` Dave Chinner
2021-06-23 10:18       ` Brian Foster
2021-06-25 20:52   ` Darrick J. Wong
2021-06-22  4:06 ` [PATCH 2/4] xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks Dave Chinner
2021-06-22 12:39   ` Brian Foster
2021-06-22 22:56     ` Dave Chinner
2021-06-25 20:57   ` Darrick J. Wong
2021-06-22  4:06 ` [PATCH 3/4] xfs: Fix a CIL UAF by getting get rid of the iclog callback lock Dave Chinner
2021-06-22 12:41   ` Brian Foster
2021-06-25 21:02   ` Darrick J. Wong
2021-06-22  4:06 ` [PATCH 4/4] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
2021-06-22 12:41   ` Brian Foster
2021-06-25 21:02   ` Darrick J. Wong
