linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5 v3] xfs: strictly order log start records
@ 2021-08-10  5:21 Dave Chinner
  2021-08-10  5:21 ` [PATCH 1/5] xfs: move xlog_commit_record to xfs_log_cil.c Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Dave Chinner @ 2021-08-10  5:21 UTC (permalink / raw)
  To: linux-xfs

We recently found a zero-day log recovery issue where overlapping
log transactions used the wrong LSN for recovery operations despite
being replayed in the correct commit record order. The issue is that
the log recovery code uses the LSN of the start record of the log
transaction for ordering and metadata stamping, while the actual
order of transaction replay is determined by the commit record.
Hence if we pipeline CIL commits we can end up with overlapping
transactions in the log like:

Start A .. Start C .. Start B .... Commit A .. Commit B .. Commit C

The issue is that the "start B" lsn is later than the "start C" lsn.
When the same metadata block is modified in both transaction B and
C, writeback from "commit B" will correctly stamp "start B" into the
metadata.

However, when "commit C" runs, it will see the LSN in that metadata
block is "start B", which is *more recent than "Start C" and so
will, incorrectly, fail to recover that change into the metadata
block. This results in silent metadata corruption, which can then be
exposed by future recovery operations failing, runtime
inconsistencies causing shutdowns and/or xfs_scrub/xfs_repair check
failures.

We could fix log recovery to avoid this problem, but there's a
runtime problem as well: the AIL is ordered by start record LSN. We
cannot order the AIL by commit LSN as we cannot allow the tail of
the log to overwrite -any- of the log transaction until the entire
transaction has been written. As the lowest LSN of the items in the
AIL defines the current log tail, the same metadata writeback
ordering issues apply as with log recovery.

In this case, we run the callbacks for commit B first, which place
all the items at the head of the log at "start B". Then we run
callbacks for "commit C", which then do not insert at the head -
they get inserted before "start B". If the item was modified in both
B and C, then it moves *backwards* in the AIL and this screws up all
manner of things that assume relogging can only move objects
forwards in the log. One of these things it can screw up is the tail
lsn of the log. Nothing good comes from this...

Because we have both runtime and journal-based ordering requirements
for the start_lsn, we have multiple places where there is an
implicit assumption that transaction start records are strictly
ordered. Rather than play whack-a-mole with such assumptions, and to
avoid the eternal "are you running a fixed kernel" question, it's
better just to strictly order the start records in the same way we
strictly order the commit records.

This patch series takes the mechanisms of the strict commit record
ordering and utilises them for strict start record ordering. It
builds upon the shutdown rework patchset to guarantee that the CIL
context structure will not get freed from under it by a racing
shutdown, and so moves the LSN recording for ordering up into a
callback from xlog_write() once we have a guaranteed iclog write
location. This means we have one mechanism for both start and commit
record ordering, and they both work in exactly the same way.

Version 3:
- rebase on 5.14-rc4 + for-next + "xfs: shutdown is a racy mess"
- fixed typos in subject line

Version 2:
- https://lore.kernel.org/linux-xfs/20210714033656.2621741-1-david@fromorbit.com/
- rebase on 5.14-rc1 + "xfs: shutdown is a racy mess"
- fixed typos in commit messages

Version 1:
- https://lore.kernel.org/linux-xfs/20210630072108.1752073-1-david@fromorbit.com/


^ permalink raw reply	[flat|nested] 10+ messages in thread
* [PATCH 0/5 v2] xfs: strictly order log start records
@ 2021-07-14  3:36 Dave Chinner
  2021-07-14  3:36 ` [PATCH 3/5] xfs: factor out log write ordering from xlog_cil_push_work() Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-07-14  3:36 UTC (permalink / raw)
  To: linux-xfs


We recently found a zero-day log recovery issue where overlapping
log transactions used the wrong LSN for recovery operations despite
being replayed in the correct commit record order. The issue is that
the log recovery code uses the LSN of the start record of the log
transaction for ordering and metadata stamping, while the actual
order of transaction replay is determined by the commit record.
Hence if we pipeline CIL commits we can end up with overlapping
transactions in the log like:

Start A .. Start C .. Start B .... Commit A .. Commit B .. Commit C

The issue is that the "start B" lsn is later than the "start C" lsn.
When the same metadata block is modified in both transaction B and
C, writeback from "commit B" will correctly stamp "start B" into the
metadata.

However, when "commit C" runs, it will see the LSN in that metadata
block is "start B", which is *more recent than "Start C" and so
will, incorrectly, fail to recover that change into the metadata
block. This results in silent metadata corruption, which can then be
exposed by future recovery operations failing, runtime
inconsistencies causing shutdowns and/or xfs_scrub/xfs_repair check
failures.

We could fix log recovery to avoid this problem, but there's a
runtime problem as well: the AIL is ordered by start record LSN. We
cannot order the AIL by commit LSN as we cannot allow the tail of
the log to overwrite -any- of the log transaction until the entire
transaction has been written. As the lowest LSN of the items in the
AIL defines the current log tail, the same metadata writeback
ordering issues apply as with log recovery.

In this case, we run the callbacks for commit B first, which place
all the items at the head of the log at "start B". Then we run
callbacks for "commit C", which then do not insert at the head -
they get inserted before "start B". If the item was modified in both
B and C, then it moves *backwards* in the AIL and this screws up all
manner of things that assume relogging can only move objects
forwards in the log. One of these things it can screw up is the tail
lsn of the log. Nothing good comes from this...

Because we have both runtime and journal-based ordering requirements
for the start_lsn, we have multiple places where there is an
implicit assumption that transaction start records are strictly
ordered. Rather than play whack-a-mole with such assumptions, and to
avoid the eternal "are you running a fixed kernel" question, it's
better just to strictly order the start records in the same way we
strictly order the commit records.

This patch series takes the mechanisms of the strict commit record
ordering and utilises them for strict start record ordering. It
builds upon the shutdown rework patchset to guarantee that the CIL
context structure will not get freed from under it by a racing
shutdown, and so moves the LSN recording for ordering up into a
callback from xlog_write() once we have a guaranteed iclog write
location. This means we have one mechanism for both start and commit
record ordering, and they both work in exactly the same way.


Version 2:
- rebase on 5.14-rc1 + "xfs: shutdown is a racy mess"
- fixed typos in commit messages

Version 1:
- https://lore.kernel.org/linux-xfs/20210630072108.1752073-1-david@fromorbit.com/


^ permalink raw reply	[flat|nested] 10+ messages in thread
* [PATCH 0/5] xfs: strictly order log start records
@ 2021-06-30  7:21 Dave Chinner
  2021-06-30  7:21 ` [PATCH 3/5] xfs: factor out log write ordering from xlog_cil_push_work() Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-06-30  7:21 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

We recently found a zero-day log recovery issue where overlapping
log transactions used the wrong LSN for recovery operations despite
being replayed in the correct commit record order. The issue is that
the log recovery code uses the LSN of the start record of the log
transaction for ordering and metadata stamping, while the actual
order of transaction replay is determined by the commit record.
Hence if we pipeline CIL commits we can end up with overlapping
transactions in the log like:

Start A .. Start C .. Start B .... Commit A .. Commit B .. Commit C

The issue is that the "start B" lsn is later than the "start C" lsn.
When the same metadata block is modified in both transaction B and
C, writeback from "commit B" will correctly stamp "start B" into the
metadata.

However, when "commit C" runs, it will see the LSN in that metadata
block is "start B", which is *more recent than "Start C" and so
will, incorrectly, fail to recover that change into the metadata
block. This results in silent metadata corruption, which can then be
exposed by future recovery operations failing, runtime
inconsistencies causing shutdowns and/or xfs_scrub/xfs_repair check
failures.

We could fix log recovery to avoid this problem, but
there's a runtime problem as well: the AIL is ordered by start
record LSN. We cannot order the AIL by commit LSN as we cannot allow
the tail of the log to overwrite -any- of the log transaction until
the entire transaction has been written. As the lowest LSN of the
items in the AIL defines the current log tail, the same metadata
writeback ordering issues apply as with log recovery.

In this case, we run the callbacks for commit B first, which place
all the items at the head of the log at "start B". Then we run
callbacks for "commit C", which then do not insert at the head -
they get inserted before "start B". If the item was modified in
both B and C, then it moves *backwards* in the AIL and this screws
up all manner of things that assume relogging can only move objects
forwards in the log. One of these things it can screw up is the tail
lsn of the log. Nothing good comes from this...

Because we have both runtime and journal-based ordering requirements
for the start_lsn, we have multiple places where there is an
implicit assumption that transaction start records are strictly
ordered. Rather than play whack-a-mole with such assumptions, and to
avoid the eternal "are you running a fixed kernel" question, it's
better just to strictly order the start records in the same way we
strictly order the commit records.

This patch series takes the mechanisms of the strict commit record
ordering and utilises them for strict start record ordering. It
builds upon the shutdown rework patchset to guarantee that the CIL
context structure will not get freed from under it by a racing
shutdown, and so moves the LSN recording for ordering up into a
callback from xlog_write() once we have a guaranteed iclog write
location. This means we have one mechanism for both start and commit
record ordering, and they both work in exactly the same way.

Cheers,

Dave.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-08-12  7:50 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-10  5:21 [PATCH 0/5 v3] xfs: strictly order log start records Dave Chinner
2021-08-10  5:21 ` [PATCH 1/5] xfs: move xlog_commit_record to xfs_log_cil.c Dave Chinner
2021-08-10  5:21 ` [PATCH 2/5] xfs: pass a CIL context to xlog_write() Dave Chinner
2021-08-10  5:21 ` [PATCH 3/5] xfs: factor out log write ordering from xlog_cil_push_work() Dave Chinner
2021-08-10  5:21 ` [PATCH 4/5] xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state() Dave Chinner
2021-08-10  5:21 ` [PATCH 5/5] xfs: order CIL checkpoint start records Dave Chinner
2021-08-12  7:49   ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2021-07-14  3:36 [PATCH 0/5 v2] xfs: strictly order log " Dave Chinner
2021-07-14  3:36 ` [PATCH 3/5] xfs: factor out log write ordering from xlog_cil_push_work() Dave Chinner
2021-06-30  7:21 [PATCH 0/5] xfs: strictly order log start records Dave Chinner
2021-06-30  7:21 ` [PATCH 3/5] xfs: factor out log write ordering from xlog_cil_push_work() Dave Chinner
2021-07-02  9:00   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).