* XFS journal write ordering constraints?
@ 2017-06-08 15:42 Sweet Tea Dorminy
  2017-06-09 12:38 ` Brian Foster
  2017-06-09 23:44 ` Dave Chinner
  0 siblings, 2 replies; 14+ messages in thread
From: Sweet Tea Dorminy @ 2017-06-08 15:42 UTC (permalink / raw)
  To: linux-xfs

Greetings;

When using XFS with a 1k block size atop our device, we regularly get
"log record CRC mismatch"es when mounting XFS after a crash, and we
are attempting to understand why. We are using RHEL7.3 with its kernel
3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.

Tracing indicates the following situation occurs:
       Some pair of consecutive locations contains data A1 and B1, respectively.
       The XFS journal issues new writes to those locations,
containing data A2 and B2.
       The write of B2 finishes, but A2 is still outstanding at the
time of the crash.
       Crash occurs. The data on disk is A1 and B2, respectively.
       XFS fails to mount, complaining that the checksum mismatches.

Does XFS expect sequentially issued journal IO to be committed to disk
in the order of issuance due to the use of FUA?

Thanks!

Sweet Tea Dorminy
Permabit Technology Corporation
Cambridge, MA

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-08 15:42 XFS journal write ordering constraints? Sweet Tea Dorminy
@ 2017-06-09 12:38 ` Brian Foster
  2017-06-09 17:30   ` Brian Foster
  2017-06-09 23:44 ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Foster @ 2017-06-09 12:38 UTC (permalink / raw)
  To: Sweet Tea Dorminy; +Cc: linux-xfs

On Thu, Jun 08, 2017 at 11:42:11AM -0400, Sweet Tea Dorminy wrote:
> Greetings;
> 
> When using XFS with a 1k block size atop our device, we regularly get
> "log record CRC mismatch"es when mounting XFS after a crash, and we
> are attempting to understand why. We are using RHEL7.3 with its kernel
> 3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.
> 
> Tracing indicates the following situation occurs:
>        Some pair of consecutive locations contains data A1 and B1, respectively.
>        The XFS journal issues new writes to those locations,
> containing data A2 and B2.
>        The write of B2 finishes, but A2 is still outstanding at the
> time of the crash.
>        Crash occurs. The data on disk is A1 and B2, respectively.
>        XFS fails to mount, complaining that the checksum mismatches.
> 
> Does XFS expect sequentially issued journal IO to be committed to disk
> in the order of issuance due to the use of FUA?
> 

Hmm, I don't believe there is any such sequential I/O ordering
constraint, but the log is complex and I could be missing something. We
do have higher level ordering rules in various places. For example,
commit records are written to the in-core logs in order. It also looks
like in-core log I/O completion takes explicit measures to process
callbacks in order in the event that the associated I/Os do not complete
in order. That tends to imply there is no explicit log I/O submission
ordering in place.
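
As a purely illustrative userspace sketch of that last point (not the
actual XFS iclog callback code; all names here are invented), callbacks
can be forced to run in issue order even when the device completes the
I/Os out of order by stopping at the first buffer whose I/O is still
outstanding:

/* Toy userspace model: completion callbacks run in issue order even if
 * the I/Os themselves complete out of order.  Names are invented for
 * the example and do not correspond to the real XFS code. */
#include <stdbool.h>
#include <stdio.h>

#define NBUFS 4

struct fake_iclog {
    bool io_done;   /* device signalled completion */
    bool cb_run;    /* callback already processed  */
};

static struct fake_iclog iclogs[NBUFS];

/* Walk the buffers in issue order and stop at the first one whose I/O
 * has not completed yet, even if later ones have. */
static void process_callbacks(void)
{
    for (int i = 0; i < NBUFS; i++) {
        if (!iclogs[i].io_done)
            break;
        if (!iclogs[i].cb_run) {
            printf("running callback for iclog %d\n", i);
            iclogs[i].cb_run = true;
        }
    }
}

int main(void)
{
    /* Device completes 2, then 0, then 1, then 3. */
    int completion_order[NBUFS] = { 2, 0, 1, 3 };

    for (int i = 0; i < NBUFS; i++) {
        iclogs[completion_order[i]].io_done = true;
        process_callbacks();
    }
    /* Output: callbacks still run as 0, 1, 2, 3. */
    return 0;
}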

Of course, that also implies that log recovery should be able to handle
this situation just the same. I'm not quite sure what the expected log
recovery behavior is off the top of my head, but my initial guess would
be that the log LSN stamping could help us identify the valid part of
the log during head/tail discovery.

Anyways, I think more information is required to try and understand what
is happening in your situation. What is the xfs_info for this
filesystem? What granularity are these A and B regions (sectors or
larger)? Are you running on some kind of special block device that
reproduces this? Do you have a consistent reproducer and/or have you
reproduced on an upstream kernel? Could you provide an xfs_metadump
image of the filesystem that fails log recovery with CRC errors?

Brian

> Thanks!
> 
> Sweet Tea Dorminy
> Permabit Technology Corporation
> Cambridge, MA
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-09 12:38 ` Brian Foster
@ 2017-06-09 17:30   ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2017-06-09 17:30 UTC (permalink / raw)
  To: Sweet Tea Dorminy; +Cc: linux-xfs

On Fri, Jun 09, 2017 at 08:38:32AM -0400, Brian Foster wrote:
> On Thu, Jun 08, 2017 at 11:42:11AM -0400, Sweet Tea Dorminy wrote:
> > Greetings;
> > 
> > When using XFS with a 1k block size atop our device, we regularly get
> > "log record CRC mismatch"es when mounting XFS after a crash, and we
> > are attempting to understand why. We are using RHEL7.3 with its kernel
> > 3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.
> > 
> > Tracing indicates the following situation occurs:
> >        Some pair of consecutive locations contains data A1 and B1, respectively.
> >        The XFS journal issues new writes to those locations,
> > containing data A2 and B2.
> >        The write of B2 finishes, but A2 is still outstanding at the
> > time of the crash.
> >        Crash occurs. The data on disk is A1 and B2, respectively.
> >        XFS fails to mount, complaining that the checksum mismatches.
> > 
> > Does XFS expect sequentially issued journal IO to be committed to disk
> > in the order of issuance due to the use of FUA?
> > 
> 
> Hmm, I don't believe there is any such sequential I/O ordering
> constraint, but the log is complex and I could be missing something. We
> do have higher level ordering rules in various places. For example,
> commit records are written to the in-core logs in order. It also looks
> like in-core log I/O completion takes explicit measures to process
> callbacks in order in the event that the associated I/Os do not complete
> in order. That tends to imply there is no explicit log I/O submission
> ordering in place.
> 
> Of course, that also implies that log recovery should be able to handle
> this situation just the same. I'm not quite sure what the expected log
> recovery behavior is off the top of my head, but my initial guess would
> be that the log LSN stamping could help us identify the valid part of
> the log during head/tail discovery.
> 

After digging a bit more into the log recovery code, this does actually
appear to be the case. The process of finding the head of the log at
mount time starts with a rough approximation of the head location based
on cycle numbers which are stamped into the first bytes of every sector
written to the log. From there, it searches a previous number of blocks
based on the maximum log buffer concurrency allowed by the fs to
determine whether any such "holes" exist in that range. If so, the head
is walked back to the first instance of such a "hole," effectively
working around out of order buffer completion at the time of a
filesystem crash.
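
To make that walk-back concrete, here is a toy model of the idea (not
the actual head-search code in xfs_log_recover.c; the window size and
log layout are invented for the example):

/* Toy model of walking the log head back over a "hole" left by an
 * out-of-order log write. */
#include <stdio.h>

#define LOG_SECTORS   32
#define SEARCH_WINDOW 8   /* stand-in for max in-flight log buffer span */

int main(void)
{
    /* Cycle number stamped in the first bytes of each sector.  Cycle 2
     * is the current cycle; the write covering sector 20 never made it
     * to disk (still cycle 1) even though later sectors did. */
    int cycle[LOG_SECTORS];
    for (int i = 0; i < LOG_SECTORS; i++)
        cycle[i] = (i < 24) ? 2 : 1;
    cycle[20] = 1;                       /* the hole */

    /* Rough head: one past the last sector carrying the current cycle
     * (the real code approximates this and can land beyond a hole). */
    int head = 0;
    for (int i = 0; i < LOG_SECTORS; i++)
        if (cycle[i] == 2)
            head = i + 1;                /* -> 24 */

    /* Search back over the window; if an older-cycle sector is found,
     * pull the head back to it so the partially written records beyond
     * it are not treated as part of the active log. */
    int start = head > SEARCH_WINDOW ? head - SEARCH_WINDOW : 0;
    for (int i = start; i < head; i++) {
        if (cycle[i] != 2) {
            head = i;
            break;
        }
    }

    printf("head of the log: sector %d\n", head);   /* prints 20 */
    return 0;
}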

This basically means that such ranges are not part of the active log to
be recovered and thus should not lead to CRC errors. So if the
granularity of the ranges noted above is something like the size of a
log buffer and resides towards the end of the active log, it seems more
likely this could be expected behavior and not the source of the
problem. If the granularity is something smaller (i.e., a sector) it
seems more likely something is wrong beneath the filesystem, or if the
range is larger but much farther behind the head, then the problem could
be something else entirely.

(When looking through some of this, I also noticed that log recovery
leaks memory for partial transactions. Thanks! :P).

Brian

> Anyways, I think more information is required to try and understand what
> is happening in your situation. What is the xfs_info for this
> filesystem? What granularity are these A and B regions (sectors or
> larger)? Are you running on some kind of special block device that
> reproduces this? Do you have a consistent reproducer and/or have you
> reproduced on an upstream kernel? Could you provide an xfs_metadump
> image of the filesystem that fails log recovery with CRC errors?
> 
> Brian
> 
> > Thanks!
> > 
> > Sweet Tea Dorminy
> > Permabit Technology Corporation
> > Cambridge, MA
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-08 15:42 XFS journal write ordering constraints? Sweet Tea Dorminy
  2017-06-09 12:38 ` Brian Foster
@ 2017-06-09 23:44 ` Dave Chinner
  2017-06-10  2:06   ` Sweet Tea Dorminy
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2017-06-09 23:44 UTC (permalink / raw)
  To: Sweet Tea Dorminy; +Cc: linux-xfs

On Thu, Jun 08, 2017 at 11:42:11AM -0400, Sweet Tea Dorminy wrote:
> Greetings;
> 
> When using XFS with a 1k block size atop our device, we regularly get
> "log record CRC mismatch"es when mounting XFS after a crash, and we
> are attempting to understand why. We are using RHEL7.3 with its kernel
> 3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.
> 
> Tracing indicates the following situation occurs:
>        Some pair of consecutive locations contains data A1 and B1, respectively.
>        The XFS journal issues new writes to those locations,
> containing data A2 and B2.
>        The write of B2 finishes, but A2 is still outstanding at the
> time of the crash.
>        Crash occurs. The data on disk is A1 and B2, respectively.
>        XFS fails to mount, complaining that the checksum mismatches.
> 
> Does XFS expect sequentially issued journal IO to be committed to disk
> in the order of issuance due to the use of FUA?

Journal IO is not sequentially issued. It's an async process. At
runtime, ordering is handled by journal IO completion processing
being queued and run in order, so IOs can both be issued and
physically complete out of order.

Log recovery is supposed to handle this. It searches and finds the
latest contiguous journal entry and does not replay past holes that
may arise from out of order journal writes.

CRC errors like this in recovery imply that journal writes are being
torn or not completed fully, which may mean that your storage does
not correctly implement flush/FUA ordering semantics....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-09 23:44 ` Dave Chinner
@ 2017-06-10  2:06   ` Sweet Tea Dorminy
  2017-06-12 14:55     ` Brian Foster
  2017-06-12 23:50     ` Dave Chinner
  0 siblings, 2 replies; 14+ messages in thread
From: Sweet Tea Dorminy @ 2017-06-10  2:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

>What is the xfs_info for this filesystem?
       meta-data=/dev/mapper/tracer-vdo0 isize=256    agcount=4, agsize=5242880 blks
                 =                       sectsz=512   attr=2, projid32bit=0
       data     =                       bsize=1024   blocks=20971520, imaxpct=25
                 =                       sunit=0      swidth=0 blks
       naming   =version 2              bsize=4096   ascii-ci=0
       log      =internal               bsize=1024   blocks=10240, version=2
                 =                       sectsz=512   sunit=0 blks, lazy-count=1
       realtime =none                   extsz=4096   blocks=0, rtextents=0

> What granularity are these A and B regions (sectors or larger)?
A is 1k, B is 3k.

>Are you running on some kind of special block device that reproduces this?
It's an asynchronous device we are developing, which we believe obeys
FLUSH and FUA correctly but may have missed some case; we encountered
this issue when testing an XFS filesystem on it, and other filesystems
appear to work fine (although obviously we could have merely gotten
lucky). Currently, when a flush returns from the device, we guarantee
that the data from all bios completed before the flush was issued is
stably on disk; when a write+FUA bio returns from the device, the data
in that bio (only) is guaranteed to be stable on disk. The device may,
however, commit sequentially issued write+FUA bios to disk in an
arbitrary order.
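
For what it's worth, a tiny userspace model of the semantics described
above (completed-before-the-flush writes become stable on flush; FUA
stabilises only its own bio; issue order alone orders nothing) shows how
the A2/B2 state from the first mail can arise. This is only a sketch of
the stated behaviour, not of the real device:

/* Toy model of the stated cache semantics. */
#include <stdbool.h>
#include <stdio.h>

#define NBLKS 2              /* block 0 holds A, block 1 holds B */

static bool stable[NBLKS];    /* on non-volatile media           */
static bool completed[NBLKS]; /* completion seen by the caller   */

static void complete_fua_write(int blk)
{
    stable[blk] = true;       /* FUA: this write only */
    completed[blk] = true;
}

static void flush(void)
{
    /* Everything completed before the flush must now be stable. */
    for (int i = 0; i < NBLKS; i++)
        if (completed[i])
            stable[i] = true;
}

int main(void)
{
    /* FUA writes for A2 (block 0) and B2 (block 1) are issued back to
     * back; the device finishes B2 first and power is lost before A2
     * lands or any flush is issued. */
    complete_fua_write(1);

    printf("A2 stable: %d, B2 stable: %d\n", stable[0], stable[1]);
    (void)flush;              /* no flush happens in this run */
    return 0;
}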

> Do you have a consistent reproducer and/or have you
> reproduced on an upstream kernel?
Our reproducer fails about 20% of the time. We have not tried on an
upstream kernel.

>Could you provide an xfs_metadump image of the filesystem that fails log recovery with CRC errors?
I can capture such on Monday.
For now, just the journal (gathered with xfs_logprint -C fsLog) can be
found at (10M)
https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/fsLog
.
A log of the journal writes can be found at (17M)
https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/log_writes_only.blkparse.
It is in a blkparse-like format. For each 512-byte sector of a bio,
either starting or finishing, the data hash is recorded; the sector is
recorded; and the index of this sector and the number of sectors
within the current bio is recorded. Bios recorded as "FAILED" indicate
that the device has crashed / become disconnected and the bio has
returned with an error.

>From there, it searches a previous number of blocks
> based on the maximum log buffer concurrency allowed by the fs to
> determine whether any such "holes" exist in that range. If so, the head
> is walked back to the first instance of such a "hole," effectively
> working around out of order buffer completion at the time of a
> filesystem crash.

In the case logged and linked above, there are 256k of outstanding log
write bytes at once; 187k of these fail and 69k succeed. The 69k that
succeed are always the first 1k of the 4k block to which they
belong. Is this within the permitted amount of outstanding log buffers?

Thanks!

Sweet Tea

On Fri, Jun 9, 2017 at 7:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Jun 08, 2017 at 11:42:11AM -0400, Sweet Tea Dorminy wrote:
>> Greetings;
>>
>> When using XFS with a 1k block size atop our device, we regularly get
>> "log record CRC mismatch"es when mounting XFS after a crash, and we
>> are attempting to understand why. We are using RHEL7.3 with its kernel
>> 3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.
>>
>> Tracing indicates the following situation occurs:
>>        Some pair of consecutive locations contains data A1 and B1, respectively.
>>        The XFS journal issues new writes to those locations,
>> containing data A2 and B2.
>>        The write of B2 finishes, but A2 is still outstanding at the
>> time of the crash.
>>        Crash occurs. The data on disk is A1 and B2, respectively.
>>        XFS fails to mount, complaining that the checksum mismatches.
>>
>> Does XFS expect sequentially issued journal IO to be committed to disk
>> in the order of issuance due to the use of FUA?
>
> Journal IO is not sequentially issued. It's an async process. At
> runtime, ordering is handled by journal IO completion processing
> being queued and run in order, so IOs can both be issued and
> physically complete out of order.
>
> Log recovery is supposed to handle this. It searches and finds the
> latest contiguous journal entry and does not replay past holes that
> may arise from out of order journal writes.
>
> CRC errors like this in recovery imply that journal writes are being
> torn or not completed fully, which may mean that your storage does
> not correctly implement flush/FUA ordering semantics....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-10  2:06   ` Sweet Tea Dorminy
@ 2017-06-12 14:55     ` Brian Foster
  2017-06-12 16:18       ` Brian Foster
  2017-06-12 23:50     ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Foster @ 2017-06-12 14:55 UTC (permalink / raw)
  To: Sweet Tea Dorminy; +Cc: Dave Chinner, linux-xfs

It looks like you've replied to one mail and copy/pasted the text from
another. FWIW, it's usually best to reply to each mail independently to
preserve threading and whatnot.

On Fri, Jun 09, 2017 at 10:06:26PM -0400, Sweet Tea Dorminy wrote:
> >What is the xfs_info for this filesystem?
>        meta-data=/dev/mapper/tracer-vdo0 isize=256    agcount=4,
>        agsize=5242880 blks
>                 =                       sectsz=512   attr=2, projid32bit=0
>        data     =                       bsize=1024   blocks=20971520,
>        imaxpct=25
>                 =                       sunit=0      swidth=0 blks
>        naming   =version 2              bsize=4096   ascii-ci=0
>        log      =internal               bsize=1024   blocks=10240, version=2
>                 =                       sectsz=512   sunit=0 blks,
>        lazy-count=1
>        realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> > What granularity are these A and B regions (sectors or larger)?
> A is 1k, B is 3k.
> 
> >Are you running on some kind of special block device that reproduces this?
> It's a device we are developing,
> asynchronous, which we believe obeys FLUSH and FUA correctly but may
> have missed some case; we
> encountered this issue when testing an XFS filesystem on it, and other
> filesystems appear to work fine (although obviously we could have
> merely gotten lucky). Currently, when a flush returns from the device,
> we guarantee the data from all bios completed before the flush was
> issued is stably on disk; when a write+FUA bio returns from the
> device, the data in that bio (only) is guaranteed to be stable on disk. The
> device may, however, commit sequentially issued write+fua bios to disk in an
> arbitrary order.
> 
> > Do you have a consistent reproducer and/or have you
> reproduced on an upstream kernel
> Our reproducer fails about 20% of the time. We have not tried on an
> upstream kernel.
> 

Please do try on something upstream if possible. Also, what exactly
triggers the failure that ultimately leads to log recovery on the next
mount? Are you doing an XFS shutdown or shutting down the block device
underneath the fs?

> >Could you provide an xfs_metadump image of the filesystem that fails log recovery with CRC errors?
> I can capture such on Monday.
> For now, just the journal (gathered with xfs_logprint -C fsLog) can be
> found at (10M)
> https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/fsLog
> .

Thanks. This certainly looks a bit strange. Up through sector 19692, the
log is populated with cycle 2 LSNs. After that appears to be the
previous cycle 1, which suggests that is the head of the log. The tail
then points to 1:20105, but from that point forward a strange mix of
cycle 1 and cycle 2 LSNs are seen.

It's not totally clear to me from the physical log state whether this
was the in-core state at the time of the crash, but a first guess may be
that at least this is what log recovery thinks and thus complains about
the tail of the log being borked.

> A log of the journal writes can be found at (17M)
> https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/log_writes_only.blkparse.
> It is in a blkparse-like format. For each 512-byte sector of a bio,
> either starting or finishing, the data hash is recorded; the sector is
> recorded; and the index of this sector and the number of sectors
> within the current bio is recorded. Bios recorded as "FAILED" indicate
> that the device has crashed / become disconnected and the bio has
> returned with an error.
> 
> >From there, it searches a previous number of blocks
> based on the maximum log buffer concurrency allowed by the fs to
> determine whether any such "holes" exist in that range. If so, the head
> is walked back to the first instance of such a "hole," effectively
> working around out of order buffer completion at the time of a
> filesystem crash.
> 
> In the case logged and linked above, there are 256k of outstanding log
> write bytes at once; 187k of these fail and 69k succeed. Of the 69k which
> succeed, they are always the first 1k of the 4k block to which they
> belong. Is this within the permitted amount of outstanding log buffers?
> 

Generally, yes. Log recovery uses the maximum in-core log buffer count
(8) and the maximum supported log buffer size (256k) to determine how
far to look back (2M). The default log buffer size is 32k, so I suspect
256k means every log buffer was in-flight at this time.
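
Just to spell that arithmetic out (values taken from the paragraph
above):

/* Look-back distance used by log recovery, per the values above. */
#include <stdio.h>

int main(void)
{
    int max_logbufs  = 8;              /* maximum in-core log buffers */
    int max_logbsize = 256 * 1024;     /* maximum log buffer size     */
    int def_logbsize = 32 * 1024;      /* default log buffer size     */

    int lookback = max_logbufs * max_logbsize;
    printf("recovery look-back: %d bytes (%d KiB, %d sectors)\n",
           lookback, lookback / 1024, lookback / 512);

    /* 8 default-sized buffers all in flight matches the observed
     * 256k of outstanding log writes. */
    printf("8 x 32k in flight:  %d bytes (%d KiB)\n",
           8 * def_logbsize, 8 * def_logbsize / 1024);
    return 0;
}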

When you reproduce this and presumably provide a metadump, could you
also collect the /sys/fs/xfs/<dev>/log/* state information at the time
of the crash of that particular metadump? Note that this directory
disappears after unmount so you'll have to collect it before.

Brian

> Thanks!
> 
> Sweet Tea
> 
> On Fri, Jun 9, 2017 at 7:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Jun 08, 2017 at 11:42:11AM -0400, Sweet Tea Dorminy wrote:
> >> Greetings;
> >>
> >> When using XFS with a 1k block size atop our device, we regularly get
> >> "log record CRC mismatch"es when mounting XFS after a crash, and we
> >> are attempting to understand why. We are using RHEL7.3 with its kernel
> >> 3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.
> >>
> >> Tracing indicates the following situation occurs:
> >>        Some pair of consecutive locations contains data A1 and B1, respectively.
> >>        The XFS journal issues new writes to those locations,
> >> containing data A2 and B2.
> >>        The write of B2 finishes, but A2 is still outstanding at the
> >> time of the crash.
> >>        Crash occurs. The data on disk is A1 and B2, respectively.
> >>        XFS fails to mount, complaining that the checksum mismatches.
> >>
> >> Does XFS expect sequentially issued journal IO to be committed to disk
> >> in the order of issuance due to the use of FUA?
> >
> > Journal IO is not sequentially issued. It's an async process. At
> > runtime, ordering is handled by journal IO completion processing
> > being queued and run in order, so IOs can both be issued and
> > physically complete out of order.
> >
> > Log recovery is supposed to handle this. It searches and finds the
> > latest contiguous journal entry and does not replay past holes that
> > may arise from out of order journal writes.
> >
> > CRC errors like this in recovery imply that journal writes are being
> > torn or not completed fully, which may mean that your storage does
> > not correctly implement flush/FUA ordering semantics....
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-12 14:55     ` Brian Foster
@ 2017-06-12 16:18       ` Brian Foster
  2017-06-15 22:28         ` Sweet Tea Dorminy
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2017-06-12 16:18 UTC (permalink / raw)
  To: Sweet Tea Dorminy; +Cc: Dave Chinner, linux-xfs

On Mon, Jun 12, 2017 at 10:55:26AM -0400, Brian Foster wrote:
> It looks like you've replied to one mail and copy/pasted the text from
> another. FWIW, it's usually best to reply to each mail independently to
> preserve threading and whatnot.
> 
> On Fri, Jun 09, 2017 at 10:06:26PM -0400, Sweet Tea Dorminy wrote:
> > >What is the xfs_info for this filesystem?
> >        meta-data=/dev/mapper/tracer-vdo0 isize=256    agcount=4,
> >        agsize=5242880 blks
> >                 =                       sectsz=512   attr=2, projid32bit=0
> >        data     =                       bsize=1024   blocks=20971520,
> >        imaxpct=25
> >                 =                       sunit=0      swidth=0 blks
> >        naming   =version 2              bsize=4096   ascii-ci=0
> >        log      =internal               bsize=1024   blocks=10240, version=2
> >                 =                       sectsz=512   sunit=0 blks,
> >        lazy-count=1
> >        realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > > What granularity are these A and B regions (sectors or larger)?
> > A is 1k, B is 3k.
> > 
> > >Are you running on some kind of special block device that reproduces this?
> > It's a device we are developing,
> > asynchronous, which we believe obeys FLUSH and FUA correctly but may
> > have missed some case; we
> > encountered this issue when testing an XFS filesystem on it, and other
> > filesystems appear to work fine (although obviously we could have
> > merely gotten lucky). Currently, when a flush returns from the device,
> > we guarantee the data from all bios completed before the flush was
> > issued is stably on disk; when a write+FUA bio returns from the
> > device, the data in that bio (only) is guaranteed to be stable on disk. The
> > device may, however, commit sequentially issued write+fua bios to disk in an
> > arbitrary order.
> > 
> > > Do you have a consistent reproducer and/or have you
> > reproduced on an upstream kernel
> > Our reproducer fails about 20% of the time. We have not tried on an
> > upstream kernel.
> > 
> 
> Please do try on something upstream if possible. Also, what exactly
> triggers the failure that ultimately leads to log recovery on the next
> mount? Are you doing an XFS shutdown or shutting down the block device
> underneath the fs?
> 
> > >Could you provide an xfs_metadump image of the filesystem that fails log recovery with CRC errors?
> > I can capture such on Monday.
> > For now, just the journal (gathered with xfs_logprint -C fsLog) can be
> > found at (10M)
> > https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/fsLog
> > .
> 
> Thanks. This certainly looks a bit strange. Up through sectors 19692 of
> the log are populated with a cycle 2 LSN. After that appears to be the
> previous cycle 1, which suggests that is the head of the log. The tail
> then points to 1:20105, but from that point forward a strange mix of
> cycle 1 and cycle 2 LSNs are seen.
> 
> It's not totally clear to me from the physical log state whether this
> was the in-core state at the time of the crash, but a first guess may be
> that at least this is what log recovery thinks and thus complains about
> the tail of the log being borked.
> 

Another observation and possible theory... the last two cycle 2 headers
in the log are:

 20116 HEADER Cycle 2 tail 2:010284 len  32256 ops 288
 ...
 20180 HEADER Cycle 2 tail 2:010284 len  32256 ops 261

... which appear sanely spaced apart. They are mixed with cycle 1/2
changes, which seems to correspond with your observation that parts of
the 32k log buffers are succeeding and parts are failing. IOW, these log
records are effectively incomplete on disk.

That accounts for 2 of 8 log buffers. If we assume these are the "last"
two log buffers, each log buffer is full and also spaced 64 sectors
apart, then the first log buffer would be targeting log sector (20116 -
6 * 64) = 19732. That refers to the head of the on-disk log noted above:

 ...
 19628 HEADER Cycle 2 tail 1:020105 len  32256 ops 327
 19692 HEADER Cycle 2 tail 1:020105 len  19968 ops 89
 19732 HEADER Cycle 1 tail 1:020105 len      0 ops 0
[00000 - 19732] Cycle 0x00000002 New Cycle 0x00000001
 19733 HEADER Cycle 1 tail 1:020105 len      0 ops 0
 ...

So I'm wondering whether it's possible that all of the log buffers
were dirty at the time of the crash, the last two were the only ones
that made it to disk, and they only made it partially at that. Further,
in partially writing those last two bufs at 20116 and beyond, we've
begun to overwrite the tail pointed to by the last record on disk before
the head (tail 1:020105 -> cycle 1 sector 20105), corrupting that
reference as well.
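
Spelling out the spacing assumption above (full 32k buffers are 64
sectors apart, with the last landing at 20180):

/* Expected start sectors for 8 back-to-back 32k log buffer writes,
 * assuming the last one landed at sector 20180 (per the text above). */
#include <stdio.h>

int main(void)
{
    int sectors_per_buf = (32 * 1024) / 512;   /* 64 */
    int last_start = 20180;

    for (int i = 0; i < 8; i++)
        printf("log buffer %d -> sector %d\n",
               i, last_start - (7 - i) * sectors_per_buf);
    /* buffer 0 -> 19732 (the on-disk head), buffer 6 -> 20116,
     * buffer 7 -> 20180 */
    return 0;
}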

There are also a bunch of the len == 0 headers at 19733 and beyond,
which I'm not quite clear on. I'm wondering if log recovery might
actually do that after it settles on a log head. That could mean there
actually was more cycle 2 data from other log bufs in that range but it
was all detected as invalid and cleared out during log recovery.

When you grab the metadump to provide, can you make sure you do so
_before_ you attempt log recovery on the filesystem for the first time?
E.g., reproduce crash, umount & create metadump, mount and test for
corruption, if reproduce -> send metadump, else toss metadump & repeat.
This is because the first attempt at log recovery may modify the
physical log and cause some information loss if we take the metadump
after you've already attempted recovery.

Brian

P.S., I've also been assuming based on the log state that you haven't
used the logbufs/logbsize mount options to change from the defaults, but
please do confirm that you aren't changing either of those defaults.
Thanks.

> > A log of the journal writes can be found at (17M)
> > https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/log_writes_only.blkparse.
> > It is in a blkparse-like format. For each 512-byte sector of a bio,
> > either starting or finishing, the data hash is recorded; the sector is
> > recorded; and the index of this sector and the number of sectors
> > within the current bio is recorded. Bios recorded as "FAILED" indicate
> > that the device has crashed / become disconnected and the bio has
> > returned with an error.
> > 
> > >From there, it searches a previous number of blocks
> > based on the maximum log buffer concurrency allowed by the fs to
> > determine whether any such "holes" exist in that range. If so, the head
> > is walked back to the first instance of such a "hole," effectively
> > working around out of order buffer completion at the time of a
> > filesystem crash.
> > 
> > In the case logged and linked above, there are 256k of outstanding log
> > write bytes at once; 187k of these fail and 69k succeed. Of the 69k which
> > succeed, they are always the first 1k of the 4k block to which they
> > belong. Is this within the permitted amount of outstanding log buffers?
> > 
> 
> Generally, yes. Log recovery uses the maximum in-core log buffer count
> (8) and the maximum supported log buffer size (256k) to determine how
> far to look back (2M). The default log buffer size is 32k, so I suspect
> 256k means every log buffer was in-flight at this time.
> 
> When you reproduce this and presumably provide a metadump, could you
> also collect the /sys/fs/xfs/<dev>/log/* state information at the time
> of the crash of that particular metadump? Note that this directory
> disappears after unmount so you'll have to collect it before.
> 
> Brian
> 
> > Thanks!
> > 
> > Sweet Tea
> > 
> > On Fri, Jun 9, 2017 at 7:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Thu, Jun 08, 2017 at 11:42:11AM -0400, Sweet Tea Dorminy wrote:
> > >> Greetings;
> > >>
> > >> When using XFS with a 1k block size atop our device, we regularly get
> > >> "log record CRC mismatch"es when mounting XFS after a crash, and we
> > >> are attempting to understand why. We are using RHEL7.3 with its kernel
> > >> 3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.
> > >>
> > >> Tracing indicates the following situation occurs:
> > >>        Some pair of consecutive locations contains data A1 and B1, respectively.
> > >>        The XFS journal issues new writes to those locations,
> > >> containing data A2 and B2.
> > >>        The write of B2 finishes, but A2 is still outstanding at the
> > >> time of the crash.
> > >>        Crash occurs. The data on disk is A1 and B2, respectively.
> > >>        XFS fails to mount, complaining that the checksum mismatches.
> > >>
> > >> Does XFS expect sequentially issued journal IO to be committed to disk
> > >> in the order of issuance due to the use of FUA?
> > >
> > > Journal IO is not sequentially issued. It's an async process. At
> > > runtime, ordering is handled by journal IO completion processing
> > > being queued and run in order, so IOs can both be issued and
> > > physically complete out of order.
> > >
> > > Log recovery is supposed to handle this. It searches and finds the
> > > latest contiguous journal entry and does not replay past holes that
> > > may arise from out of order journal writes.
> > >
> > > CRC errors like this in recovery imply that journal writes are being
> > > torn or not completed fully, which may mean that your storage does
> > > not correctly implement flush/FUA ordering semantics....
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-10  2:06   ` Sweet Tea Dorminy
  2017-06-12 14:55     ` Brian Foster
@ 2017-06-12 23:50     ` Dave Chinner
  2017-06-13 14:14       ` Sweet Tea Dorminy
  2017-06-13 16:29       ` Brian Foster
  1 sibling, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2017-06-12 23:50 UTC (permalink / raw)
  To: Sweet Tea Dorminy; +Cc: linux-xfs

On Fri, Jun 09, 2017 at 10:06:26PM -0400, Sweet Tea Dorminy wrote:
> >What is the xfs_info for this filesystem?
>        meta-data=/dev/mapper/tracer-vdo0 isize=256    agcount=4,
>        agsize=5242880 blks
>                 =                       sectsz=512   attr=2, projid32bit=0
>        data     =                       bsize=1024   blocks=20971520,
>        imaxpct=25
>                 =                       sunit=0      swidth=0 blks
>        naming   =version 2              bsize=4096   ascii-ci=0
>        log      =internal               bsize=1024   blocks=10240, version=2
>                 =                       sectsz=512   sunit=0 blks,
>        lazy-count=1
>        realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> > What granularity are these A and B regions (sectors or larger)?
> A is 1k, B is 3k.
> 
> >Are you running on some kind of special block device that reproduces this?
> It's a device we are developing,
> asynchronous, which we believe obeys FLUSH and FUA correctly but may
> have missed some case;

So Occam's Razor applies here....

> we
> encountered this issue when testing an XFS filesystem on it, and other
> filesystems appear to work fine (although obviously we could have
> merely gotten lucky).

XFS has quite sophisticated async IO dispatch and ordering
mechanisms compared to other filesystems and so frequently exposes
problems in the underlying storage layers that other filesystems
don't exercise.

> Currently, when a flush returns from the device,
> we guarantee the data from all bios completed before the flush was
> issued is stably on disk;

Yup, that's according to
Documentation/block/writeback_cache_control.txt, however....

> when a write+FUA bio returns from the
> device, the data in that bio (only) is guaranteed to be stable on disk. The
> device may, however, commit sequentially issued write+fua bios to disk in an
> arbitrary order.

.... XFS issues log writes with REQ_PREFLUSH|REQ_FUA. This means
sequentially issued log writes have clearly specified ordering
constraints. i.e. the preflush completion order requirements means
that the block device must commit preflush+write+fua bios to stable
storage in the exact order they were issued by the filesystem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-12 23:50     ` Dave Chinner
@ 2017-06-13 14:14       ` Sweet Tea Dorminy
  2017-06-13 22:16         ` Dave Chinner
  2017-06-13 16:29       ` Brian Foster
  1 sibling, 1 reply; 14+ messages in thread
From: Sweet Tea Dorminy @ 2017-06-13 14:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Thank you! I'm glad that we've established it's a mismatch between our
device's implementation and XFS expectations.

>.... XFS issues log writes with REQ_PREFLUSH|REQ_FUA. This means
> sequentially issued log writes have clearly specified ordering
> constraints. i.e. the preflush completion order requirements means
> that the block device must commit preflush+write+fua bios to stable
> storage in the exact order they were issued by the filesystem....

That is certainly what REQ_BARRIER did back in the day. But when
REQ_BARRIER was replaced with separate REQ_FUA and REQ_FLUSH
flags, and barrier.txt got replaced with writeback_cache_control.txt,
the documentation seemed to imply the ordering requirement on *issued*
IO had gone away (but maybe I'm missing something).


Quoth writeback_cache_control.txt about REQ_PREFLUSH:
> will make sure the volatile cache of the storage device
>has been flushed before the actual I/O operation is started.
> This explicitly guarantees that previously completed write requests are on non-volatile
> storage before the flagged bio starts.

And about REQ_FUA:
> I/O completion for this request is only
> signaled after the data has been committed to non-volatile storage.

I am perhaps overlooking where REQ_PREFLUSH guarantees all previously
issued write requests with FLUSH|FUA are stable, not just all
previously completed ones. Is this documented somewhere?
Nevertheless, if XFS is expecting this guarantee, that would certainly
be the source of this corruption.

Thanks again!

On Mon, Jun 12, 2017 at 7:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Jun 09, 2017 at 10:06:26PM -0400, Sweet Tea Dorminy wrote:
>> >What is the xfs_info for this filesystem?
>>        meta-data=/dev/mapper/tracer-vdo0 isize=256    agcount=4,
>>        agsize=5242880 blks
>>                 =                       sectsz=512   attr=2, projid32bit=0
>>        data     =                       bsize=1024   blocks=20971520,
>>        imaxpct=25
>>                 =                       sunit=0      swidth=0 blks
>>        naming   =version 2              bsize=4096   ascii-ci=0
>>        log      =internal               bsize=1024   blocks=10240, version=2
>>                 =                       sectsz=512   sunit=0 blks,
>>        lazy-count=1
>>        realtime =none                   extsz=4096   blocks=0, rtextents=0
>>
>> > What granularity are these A and B regions (sectors or larger)?
>> A is 1k, B is 3k.
>>
>> >Are you running on some kind of special block device that reproduces this?
>> It's a device we are developing,
>> asynchronous, which we believe obeys FLUSH and FUA correctly but may
>> have missed some case;
>
> So Occam's Razor applies here....
>
>> we
>> encountered this issue when testing an XFS filesystem on it, and other
>> filesystems appear to work fine (although obviously we could have
>> merely gotten lucky).
>
> XFS has quite sophisticated async IO dispatch and ordering
> mechanisms compared to other filesystems and so frequently exposes
> problems in the underlying storage layers that other filesystems
> don't exercise.
>
>> Currently, when a flush returns from the device,
>> we guarantee the data from all bios completed before the flush was
>> issued is stably on disk;
>
> Yup, that's according to
> Documentation/block/writeback_cache_control.txt, however....
>
>> when a write+FUA bio returns from the
>> device, the data in that bio (only) is guaranteed to be stable on disk. The
>> device may, however, commit sequentially issued write+fua bios to disk in an
>> arbitrary order.
>
> .... XFS issues log writes with REQ_PREFLUSH|REQ_FUA. This means
> sequentially issued log writes have clearly specified ordering
> constraints. i.e. the preflush completion order requirements means
> that the block device must commit preflush+write+fua bios to stable
> storage in the exact order they were issued by the filesystem....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-12 23:50     ` Dave Chinner
  2017-06-13 14:14       ` Sweet Tea Dorminy
@ 2017-06-13 16:29       ` Brian Foster
  1 sibling, 0 replies; 14+ messages in thread
From: Brian Foster @ 2017-06-13 16:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Sweet Tea Dorminy, linux-xfs

On Tue, Jun 13, 2017 at 09:50:02AM +1000, Dave Chinner wrote:
> On Fri, Jun 09, 2017 at 10:06:26PM -0400, Sweet Tea Dorminy wrote:
> > >What is the xfs_info for this filesystem?
> >        meta-data=/dev/mapper/tracer-vdo0 isize=256    agcount=4,
> >        agsize=5242880 blks
> >                 =                       sectsz=512   attr=2, projid32bit=0
> >        data     =                       bsize=1024   blocks=20971520,
> >        imaxpct=25
> >                 =                       sunit=0      swidth=0 blks
> >        naming   =version 2              bsize=4096   ascii-ci=0
> >        log      =internal               bsize=1024   blocks=10240, version=2
> >                 =                       sectsz=512   sunit=0 blks,
> >        lazy-count=1
> >        realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > > What granularity are these A and B regions (sectors or larger)?
> > A is 1k, B is 3k.
> > 
> > >Are you running on some kind of special block device that reproduces this?
> > It's a device we are developing,
> > asynchronous, which we believe obeys FLUSH and FUA correctly but may
> > have missed some case;
> 
> So Occam's Razor applies here....
> 

This was my inclination as well. It may very well be that this block
device is broken in some way, but at the same time some (hacky)
experimentation suggests we are susceptible to the problem outlined
previously:

- Log tail is pinned and remains so long enough to push the head behind
  the tail.
- Once the log fills, suppose the gap from the head to tail is smaller
  than (logbsize * logbufs). Since the tail is pinned, the tail_lsn
  referenced by the last record successfully written to the log points
  into this gap.
- The tail unpins and several async iclog flushes occur. These all
  partially fail and happen to splatter garbage into the log (similar to
  the failure characteristic described in this thread). As a result, the
  filesystem shuts down on log I/O completion.
- Log recovery runs on the subsequent mount, correctly identifies the
  range of the log that contains garbage and walks back to the last
  successfully written record in the log.

Under normal circumstances, the log tail is not far behind the head, so
this "previous tail" lsn points to log data that is still valid and
everything just works. Because the log was pinned, however, the tail_lsn
of that last record points into the area that the failed log flush
splattered garbage data. This essentially results in log record checksum
mismatches and/or worse.

I won't say I've been able to manufacture this exactly, but I think I
can emulate it well enough to demonstrate the problem by pinning the
log, intentionally dumping bad data to the log immediately followed by
shutting down the fs.

Note that this is not reproducible on default 4k fsb configs because log
reservations seem large enough to consistently leave a head->tail gap
larger than the 256k used by the log buffers. This can be reproduced on
such fs' if the logbsize is increased, however. This is also
reproducible on 1k fsb fs' with default logbsize because log
reservations are small enough to leave a sub-200k gap when the head
pushes behind the tail. In summary, this probably requires a combination
of a heavily loaded fs to push the log head behind the tail, a
non-default fsb and/or logbsize configuration, fairly strange failure
characteristics of the underlying device (to write parts of the log bufs
successfully and not others), and perhaps just some bad luck for N log
buffers to be flushed at the same time and fail before the fs ultimately
shuts down. Other than that, I think it's possible. ;)

ISTM that we need something that prevents an overwrite of the tail lsn
last written to the log. That might be a bit tricky because we can't
just push the ail before the associated overwrite, but rather must do
something like make sure the log buffer write prior to the one that
overwrites the tail is serialized against an ail push that moves the
tail lsn forward. :/ Thoughts?

Brian

> > we
> > encountered this issue when testing an XFS filesystem on it, and other
> > filesystems appear to work fine (although obviously we could have
> > merely gotten lucky).
> 
> XFS has quite sophisticated async IO dispatch and ordering
> mechanisms compared to other filesystems and so frequently exposes
> problems in the underlying storage layers that other filesystems
> don't exercise.
> 
> > Currently, when a flush returns from the device,
> > we guarantee the data from all bios completed before the flush was
> > issued is stably on disk;
> 
> Yup, that's according to
> Documentation/block/writeback_cache_control.txt, however....
> 
> > when a write+FUA bio returns from the
> > device, the data in that bio (only) is guaranteed to be stable on disk. The
> > device may, however, commit sequentially issued write+fua bios to disk in an
> > arbitrary order.
> 
> .... XFS issues log writes with REQ_PREFLUSH|REQ_FUA. This means
> sequentially issued log writes have clearly specified ordering
> constraints. i.e. the preflush completion order requirements means
> that the block device must commit preflush+write+fua bios to stable
> storage in the exact order they were issued by the filesystem....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-13 14:14       ` Sweet Tea Dorminy
@ 2017-06-13 22:16         ` Dave Chinner
  2017-06-14  6:46           ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2017-06-13 22:16 UTC (permalink / raw)
  To: Sweet Tea Dorminy; +Cc: linux-xfs

On Tue, Jun 13, 2017 at 10:14:10AM -0400, Sweet Tea Dorminy wrote:
> Thank you! I'm glad that we've established it's a mismatch between our
> device's implementation and XFS expectations.
> 
> >.... XFS issues log writes with REQ_PREFLUSH|REQ_FUA. This means
> >sequentially issued log writes have clearly specified ordering
> >constraints. i.e. the preflush completion order requirements means
> >that the block device must commit preflush+write+fua bios to stable
> >storage in the exact order they were issued by the filesystem....
> 
> That is certainly what REQ_BARRIER did back in the day. But when
> REQ_BARRIER was replaced with separate REQ_FUA and REQ_FLUSH
> flags, and barrier.txt got replaced with writeback_cache_control.txt,
> the documentation seemed to imply the ordering requirement on *issued*
> IO had gone away (but maybe I'm missing something).

Yes, that's my understanding, too, but I also thought that multiple
outstanding flushes are ordered by the block layer. i.e. flushes can
be reordered against other operations, but not other flushes.  I
could very well be wrong, but flush-to-flush ordering was what I
thought the ordered pending flush list for PREFLUSH requests in
blk_flush_complete_seq() did.

Like I said, Christoph is the expert here - he'll correct me if I'm
wrong.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-13 22:16         ` Dave Chinner
@ 2017-06-14  6:46           ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2017-06-14  6:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Sweet Tea Dorminy, linux-xfs

On Wed, Jun 14, 2017 at 08:16:47AM +1000, Dave Chinner wrote:
> Yes, that's my understanding, too, but I also thought that multiple
> outstanding flushes are ordered by the block layer. i.e. flushes can
> be reordered against other operations, but not other flushes.

There is no such guarantee.  For request based drivers some ordering
happens in practice, but that's an implementation artifact and not
a guarantee, and for bio based drivers all bets are off.
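
Given that, one conservative way for a caller to order two durable
writes is simply to wait for the first to complete before issuing the
second at all. A userspace analogue of that pattern, using pwrite() plus
fdatasync() (the file name and offsets are made up for the example):

/* If write B must not become durable before write A, wait for A to be
 * durable before issuing B, rather than relying on issue order. */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char rec_a[4096], rec_b[4096];
    int fd = open("journal.img", O_WRONLY | O_CREAT, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(rec_a, 'A', sizeof(rec_a));
    memset(rec_b, 'B', sizeof(rec_b));

    /* Record A, then make it durable before B is even issued. */
    if (pwrite(fd, rec_a, sizeof(rec_a), 0) != (ssize_t)sizeof(rec_a) ||
        fdatasync(fd) != 0) {
        perror("write/sync A");
        return 1;
    }
    /* Only now is it safe to issue the dependent record B. */
    if (pwrite(fd, rec_b, sizeof(rec_b), 4096) != (ssize_t)sizeof(rec_b) ||
        fdatasync(fd) != 0) {
        perror("write/sync B");
        return 1;
    }
    close(fd);
    return 0;
}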

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-12 16:18       ` Brian Foster
@ 2017-06-15 22:28         ` Sweet Tea Dorminy
  2017-06-16 13:42           ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Sweet Tea Dorminy @ 2017-06-15 22:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

Apologies for the time required to reproduce, which was much greater
than expected. Here are full metadumps, pre- and post-recovery:
       https://s3.amazonaws.com/permabit-development/20170615-xfsUpload/metadump-post-recovery
       https://s3.amazonaws.com/permabit-development/20170615-xfsUpload/metadump-pre-recovery

I also hit a different, less common problem, which may or may not have
the same origin; I'm hopeful it comes from the same root cause and
helps with diagnosis. The symptom is that the xfs mount fails with
"Structure needs cleaning".
The backtrace in the kernel log is of the form

Nov 12 02:43:03 localhost kernel: [  112.897334] XFS (dm-4): Internal
error xlog_valid_rec_header(1) at line 4104 of file
fs/xfs/xfs_log_recover.c.  Caller xlog_do_recovery_pass
+0x4fc/0x5a0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897339] CPU: 0 PID: 2435
Comm: mount Tainted: P           OE  ------------
3.10.0-327.36.2.el7.x86_64 #1
Nov 12 02:43:03 localhost kernel: [  112.897342]  ffff880067cdc000
00000000fa6da3c8 ffff88003c9fbb20 ffffffff81636401
Nov 12 02:43:03 localhost kernel: [  112.897347]  ffff88003c9fbb38
ffffffffa0360e6b ffffffffa03891fc ffff88003c9fbb48
Nov 12 02:43:03 localhost kernel: [  112.897350]  ffffffffa0385458
ffff88003c9fbc50 ffffffffa03891fc ffffffff81cae082
Nov 12 02:43:03 localhost kernel: [  112.897353] Call Trace:
Nov 12 02:43:03 localhost kernel: [  112.897366]  [<ffffffff81636401>]
dump_stack+0x19/0x1b
Nov 12 02:43:03 localhost kernel: [  112.897387]  [<ffffffffa0360e6b>]
xfs_error_report+0x3b/0x40 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897409]  [<ffffffffa03891fc>]
? xlog_do_recovery_pass+0x4fc/0x5a0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897430]  [<ffffffffa0385458>]
xlog_valid_rec_header.isra.12+0xd8/0xe0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897451]  [<ffffffffa03891fc>]
xlog_do_recovery_pass+0x4fc/0x5a0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897458]  [<ffffffff8107e719>]
? vprintk_default+0x29/0x40
Nov 12 02:43:03 localhost kernel: [  112.897462]  [<ffffffff8162fe31>]
? printk+0x5e/0x75
Nov 12 02:43:03 localhost kernel: [  112.897467]  [<ffffffff811c25d3>]
? __kmalloc+0x1f3/0x230
Nov 12 02:43:03 localhost kernel: [  112.897488]  [<ffffffffa037bf57>]
? kmem_alloc+0x77/0xf0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897510]  [<ffffffffa0389309>]
xlog_do_log_recovery+0x69/0xd0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897530]  [<ffffffffa0389386>]
xlog_do_recover+0x16/0xe0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897549]  [<ffffffffa038a19f>]
xlog_recover+0x9f/0x130 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897569]  [<ffffffffa037e01c>]
xfs_log_mount+0x24c/0x2c0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897589]  [<ffffffffa037577c>]
xfs_mountfs+0x42c/0x750 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897609]  [<ffffffffa03789a5>]
xfs_fs_fill_super+0x355/0x3d0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897615]  [<ffffffff811e26c0>]
mount_bdev+0x1b0/0x1f0
Nov 12 02:43:03 localhost kernel: [  112.897634]  [<ffffffffa0378650>]
? xfs_parseargs+0xbf0/0xbf0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897653]  [<ffffffffa03768c5>]
xfs_fs_mount+0x15/0x20 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897657]  [<ffffffff811e2e59>]
mount_fs+0x39/0x1b0

The full dump of _this_ failure is at:
       https://s3.amazonaws.com/permabit-development/20170613-xfsUpload/metadump-pre-recovery
       https://s3.amazonaws.com/permabit-development/20170613-xfsUpload/metadump-post-recovery


>P.S., I've also been assuming based on the log state that you haven't
> used the logbufs/logbsize mount options to change from the defaults, but
> please do confirm that you aren't changing either of those defaults.
> Thanks.

Correct. The mkfs invocation is 'mkfs -t xfs -f -q -K -b size=1024
$device' and the mount invocation is "mount -t xfs -o discard $device
$mountpoint"

Thanks again!

On Mon, Jun 12, 2017 at 12:18 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Mon, Jun 12, 2017 at 10:55:26AM -0400, Brian Foster wrote:
>> It looks like you've replied to one mail and copy/pasted the text from
>> another. FWIW, it's usually best to reply to each mail independently to
>> preserve threading and whatnot.
>>
>> On Fri, Jun 09, 2017 at 10:06:26PM -0400, Sweet Tea Dorminy wrote:
>> > >What is the xfs_info for this filesystem?
>> >        meta-data=/dev/mapper/tracer-vdo0 isize=256    agcount=4,
>> >        agsize=5242880 blks
>> >                 =                       sectsz=512   attr=2, projid32bit=0
>> >        data     =                       bsize=1024   blocks=20971520,
>> >        imaxpct=25
>> >                 =                       sunit=0      swidth=0 blks
>> >        naming   =version 2              bsize=4096   ascii-ci=0
>> >        log      =internal               bsize=1024   blocks=10240, version=2
>> >                 =                       sectsz=512   sunit=0 blks,
>> >        lazy-count=1
>> >        realtime =none                   extsz=4096   blocks=0, rtextents=0
>> >
>> > > What granularity are these A and B regions (sectors or larger)?
>> > A is 1k, B is 3k.
>> >
>> > >Are you running on some kind of special block device that reproduces this?
>> > It's a device we are developing,
>> > asynchronous, which we believe obeys FLUSH and FUA correctly but may
>> > have missed some case; we
>> > encountered this issue when testing an XFS filesystem on it, and other
>> > filesystems appear to work fine (although obviously we could have
>> > merely gotten lucky). Currently, when a flush returns from the device,
>> > we guarantee the data from all bios completed before the flush was
>> > issued is stably on disk; when a write+FUA bio returns from the
>> > device, the data in that bio (only) is guaranteed to be stable on disk. The
>> > device may, however, commit sequentially issued write+fua bios to disk in an
>> > arbitrary order.
>> >
>> > > Do you have a consistent reproducer and/or have you
>> > reproduced on an upstream kernel
>> > Our reproducer fails about 20% of the time. We have not tried on an
>> > upstream kernel.
>> >
>>
>> Please do try on something upstream if possible. Also, what exactly
>> triggers the failure that ultimately leads to log recovery on the next
>> mount? Are you doing an XFS shutdown or shutting down the block device
>> underneath the fs?
>>
>> > >Could you provide an xfs_metadump image of the filesystem that fails log recovery with CRC errors?
>> > I can capture such on Monday.
>> > For now, just the journal (gathered with xfs_logprint -C fsLog) can be
>> > found at (10M)
>> > https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/fsLog
>> > .
>>
>> Thanks. This certainly looks a bit strange. Up through sectors 19692 of
>> the log are populated with a cycle 2 LSN. After that appears to be the
>> previous cycle 1, which suggests that is the head of the log. The tail
>> then points to 1:20105, but from that point forward a strange mix of
>> cycle 1 and cycle 2 LSNs are seen.
>>
>> It's not totally clear to me from the physical log state whether this
>> was the in-core state at the time of the crash, but a first guess may be
>> that at least this is what log recovery thinks and thus complains about
>> the tail of the log being borked.
>>
>
> Another observation and possible theory... the last two cycle 2 headers
> in the log are:
>
>  20116 HEADER Cycle 2 tail 2:010284 len  32256 ops 288
>  ...
>  20180 HEADER Cycle 2 tail 2:010284 len  32256 ops 261
>
> ... which appear sanely spaced apart. They are mixed with cycle 1/2
> changes, which seems to correspond with your observation that parts of
> the 32k log buffers are succeeding and parts are failing. IOW, these log
> records are effectively incomplete on disk.
>
> That accounts for 2 of 8 log buffers. If we assume these are the "last"
> two log buffers, each log buffer is full and also spaced 64 sectors
> apart, then the first log buffer would be targeting log sector (20116 -
> 6 * 64) = 19732. That refers to the head of the on-disk log noted above:
>
>  ...
>  19628 HEADER Cycle 2 tail 1:020105 len  32256 ops 327
>  19692 HEADER Cycle 2 tail 1:020105 len  19968 ops 89
>  19732 HEADER Cycle 1 tail 1:020105 len      0 ops 0
> [00000 - 19732] Cycle 0x00000002 New Cycle 0x00000001
>  19733 HEADER Cycle 1 tail 1:020105 len      0 ops 0
>  ...
>
> So I'm wondering whether it's possible that all of the log buffers
> were dirty at the time of the crash, the last two were the only ones
> that made it to disk, and they only made it partially at that. Further,
> in partially writing those last two bufs at 20116 and beyond, we've
> begun to overwrite the tail pointed to by the last record on disk before
> the head (tail 1:020105 -> cycle 1 sector 20105), corrupting that
> reference as well.
>
> There are also a bunch of the len == 0 headers at 19733 and beyond,
> which I'm not quite clear on. I'm wondering if log recovery might
> actually do that after it settles on a log head. That could mean there
> actually was more cycle 2 data from other log bufs in that range but it
> was all detected as invalid and cleared out during log recovery.
>
> When you grab the metadump to provide, can you make sure you do so
> _before_ you attempt log recovery on the filesystem for the first time?
> E.g., reproduce crash, umount & create metadump, mount and test for
> corruption, if reproduce -> send metadump, else toss metadump & repeat.
> This is because the first attempt at log recovery may modify the
> physical log and cause some information loss if we take the metadump
> after you've already attempted recovery.
>
> Brian
>
> P.S., I've also been assuming based on the log state that you haven't
> used the logbufs/logbsize mount options to change from the defaults, but
> please do confirm that you aren't changing either of those defaults.
> Thanks.
>
>> > A log of the journal writes can be found at (17M)
>> > https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/log_writes_only.blkparse.
>> > It is in a blkparse-like format. For each 512-byte sector of a bio,
>> > either starting or finishing, the data hash is recorded; the sector is
>> > recorded; and the index of this sector and the number of sectors
>> > within the current bio is recorded. Bios recorded as "FAILED" indicate
>> > that the device has crashed / become disconnected and the bio has
>> > returned with an error.
>> >
>> > From there, it searches a previous number of blocks
>> > based on the maximum log buffer concurrency allowed by the fs to
>> > determine whether any such "holes" exist in that range. If so, the head
>> > is walked back to the first instance of such a "hole," effectively
>> > working around out of order buffer completion at the time of a
>> > filesystem crash.
>> >
>> > In the case logged and linked above, there are 256k of outstanding log
>> > write bytes at once; 187k of these fail and 69k succeed. Of the 69k which
>> > succeed, they are always the first 1k of the 4k block to which they
>> > belong. Is this within the permitted amount of outstanding log buffers?
>> >
>>
>> Generally, yes. Log recovery uses the maximum in-core log buffer count
>> (8) and the maximum supported log buffer size (256k) to determine how
>> far to look back (2M). The default log buffer size is 32k, so I suspect
>> 256k means every log buffer was in-flight at this time.
>>
>> When you reproduce this and presumably provide a metadump, could you
>> also collect the /sys/fs/xfs/<dev>/log/* state information at the time
>> of the crash of that particular metadump? Note that this directory
>> disappears after unmount so you'll have to collect it before.
>>
>> Brian
>>
>> > Thanks!
>> >
>> > Sweet Tea
>> >
>> > On Fri, Jun 9, 2017 at 7:44 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > > On Thu, Jun 08, 2017 at 11:42:11AM -0400, Sweet Tea Dorminy wrote:
>> > >> Greetings;
>> > >>
>> > >> When using XFS with a 1k block size atop our device, we regularly get
>> > >> "log record CRC mismatch"es when mounting XFS after a crash, and we
>> > >> are attempting to understand why. We are using RHEL7.3 with its kernel
>> > >> 3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.
>> > >>
>> > >> Tracing indicates the following situation occurs:
>> > >>        Some pair of consecutive locations contains data A1 and B1, respectively.
>> > >>        The XFS journal issues new writes to those locations,
>> > >> containing data A2 and B2.
>> > >>        The write of B' finishes, but A' is still outstanding at the
>> > >> time of the crash.
>> > >>        Crash occurs. The data on disk is A1 and B2, respectively.
>> > >>        XFS fails to mount, complaining that the checksum mismatches.
>> > >>
>> > >> Does XFS expect sequentially issued journal IO to be committed to disk
>> > >> in the order of issuance due to the use of FUA?
>> > >
>> > > Journal IO is not sequentially issued. It's an async process. At
>> > > runtime, ordering is handled by journal IO completion processing
>> > > being queued and run in order, so IOs can both be issued and
>> > > physically complete out of order.
>> > >
>> > > Log recovery is supposed to handle this. It searches and finds the
>> > > latest contiguous journal entry and does not replay past holes that
>> > > may arise from out of order journal writes.
>> > >
>> > > CRC errors like this in recovery imply that journal writes are being
>> > > torn or not completed fully, which may mean that your storage does
>> > > not correctly implement flush/FUA ordering semantics....
>> > >
>> > > Cheers,
>> > >
>> > > Dave.
>> > > --
>> > > Dave Chinner
>> > > david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS journal write ordering constraints?
  2017-06-15 22:28         ` Sweet Tea Dorminy
@ 2017-06-16 13:42           ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2017-06-16 13:42 UTC (permalink / raw)
  To: Sweet Tea Dorminy; +Cc: Dave Chinner, linux-xfs

On Thu, Jun 15, 2017 at 06:28:05PM -0400, Sweet Tea Dorminy wrote:
> Apologies for the time required to reproduce, much greater than
> expected. Here is a full metadump pre and post recovery:
>        https://s3.amazonaws.com/permabit-development/20170615-xfsUpload/metadump-post-recovery
>        https://s3.amazonaws.com/permabit-development/20170615-xfsUpload/metadump-pre-recovery
> 

Thanks for the data. The pre/post images do in fact show that some
modification occurs during log recovery. Taking a look at 'xfs_logprint
-d' output from the post image, we can see that the head of the log is
here:

...
 19991 HEADER Cycle 2 tail 1:020377 len   3584 ops 29      
 19999 HEADER Cycle 1 tail 1:020377 len      0 ops 0       
[00031 - 19999] Cycle 0x00000002 New Cycle 0x00000001      
...

... because that is the last cycle 2 write in the physical log. The pre
image shows that sectors 19999 and beyond contained cycle 1 data that
was zapped by log recovery because it was beyond the head.
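
As an aside, a rough way to eyeball this on future reproductions is a
quick shell pipeline over the pre-recovery device or restored metadump;
the last record of the newest cycle (cycle 2 here) approximates the head.
This is purely illustrative and not what log recovery actually does:

  xfs_logprint -d $device | grep 'HEADER Cycle 2' | tail -n 1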

The start of the tail up through the end of the physical log actually
looks sane:

...
 20377 HEADER Cycle 1 tail 1:007684 len  32256 ops 331
 20441 HEADER Cycle 1 tail 1:007684 len  32256 ops 317

... but as the tail wraps to the beginning of the physical log, we run
into this:

[00000 - 00000] Cycle 0xffffffff New Cycle 0x00000003
[00000 - 00007] Cycle 0x00000003 New Cycle 0x00000002
[00007 - 00014] Cycle 0x00000002 New Cycle 0x00000003
[00014 - 00015] Cycle 0x00000003 New Cycle 0x00000002
[00015 - 00022] Cycle 0x00000002 New Cycle 0x00000003
[00022 - 00023] Cycle 0x00000003 New Cycle 0x00000002
    25 HEADER Cycle 2 tail 1:007684 len  32256 ops 238
[00023 - 00030] Cycle 0x00000002 New Cycle 0x00000003
[00030 - 00031] Cycle 0x00000003 New Cycle 0x00000002
...

... which tells us that this was the beginning of the 3rd cycle in the
log, but obviously this log data is junk. For log recovery from the tail
to the head to work correctly, we need to have the valid cycle 2 data in
place up to the head.

If we consider that you have 8 32k log buffers and the in-core head at
the time was 2:19999, that means the maximum "reach" of those 8 log bufs
is (8 * 64s) = 512 sectors. Walking forward from the head and wrapping
around the max log sector (which bumps the cycle from 2 to 3) gives us
19999 + 512 - 20480 = 31, which corresponds to the end of the garbage
output above.
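
Spelling that arithmetic out as a quick shell sketch, using the log
geometry from your xfs_info (purely illustrative, not something recovery
runs):

  head_sector=19999                # last good cycle 2 record (2:19999)
  log_sectors=20480                # 10240 1k log blocks = 20480 sectors
  reach=$(( 8 * (32768 / 512) ))   # 8 in-core bufs * 64 sectors = 512
  echo $(( (head_sector + reach) % log_sectors ))   # prints 31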

In summary, I think the problem you are hitting is overwriting part of
the last good tail in the log, as suspected in the previous mail. What
happens at log recovery time is that it correctly identifies the garbage
region of the log, walks the head back to 2:19999 and attempts to
recover from the tail of 1:020377 to the head. Once recovery gets to the
start of cycle 2, however, it runs into the garbage data and falls over.

It's highly likely that this particular data is actually not needed to
recover filesystem consistency, because the fs can only overwrite log
data once that data has been written back to the final location on disk.
The problem here is that log recovery does not know that. It considers
the log corrupted and stops. This prevents recovery from moving on the
subsequent records that may very well need to be recovered to the fs and
thus leaves the fs in an inconsistent state.

> I also obtained a different, less common, problem, which may or may
> not have the same origin; I'm hopeful it comes from the same root
> cause and helps diagnose. The symptom is: xfs mount fails with
> "Structure needs cleaning".
> The backtrace in the kernel log is of the form
> 
> Nov 12 02:43:03 localhost kernel: [  112.897334] XFS (dm-4): Internal
> error xlog_valid_rec_header(1) at line 4104 of file
> fs/xfs/xfs_log_recover.c.  Caller xlog_do_recovery_pass
> +0x4fc/0x5a0 [xfs]
...
> 
> The full dump of _this_ failure is at:
>        https://s3.amazonaws.com/permabit-development/20170613-xfsUpload/metadump-pre-recovery
>        https://s3.amazonaws.com/permabit-development/20170613-xfsUpload/metadump-post-recovery
> 

Thanks. This looks to me like the same fundamental problem, just in a
different spot in the log. By the looks of it, the tail is completely
annihilated in this case and doesn't even represent a valid record. I
suppose this could lead to slightly different error behavior at log
recovery time, but the root cause appears the same to me.

> 
> >P.S., I've also been assuming based on the log state that you haven't
> used the logbufs/logbsize mount options to change from the defaults, but
> please do confirm that you aren't changing either of those defaults.
> Thanks.
> 
> Correct. The mkfs invocation is 'mkfs -t xfs -f -q -K -b size=1024
> $device' and the mount invocation is 'mount -t xfs -o discard $device
> $mountpoint'.
> 

Ah, so the discard mount option could be the source of latency that
causes the log to fill up in the first place. This is a synchronous
operation (no longer the case upstream, actually) and I believe is known
to lead to stalls. My understanding is that running a background fstrim
process at regular intervals is preferred to the discard mount option.
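
A minimal sketch of that alternative, assuming the util-linux fstrim
utility and leaving the scheduling mechanism (cron, systemd timer, etc.)
up to you:

  # mount without the discard option...
  mount -t xfs $device $mountpoint
  # ...and trim unused blocks periodically in the background instead
  fstrim -v $mountpoint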

With regard to fixing the problem, it's probably not going to be a
trivial change and the fix is not yet clear to me (though I am planning
to post a fairly reliable reproducer test case shortly). It's also not
clear to me whether this is a production configuration for you or you
are just testing your block device on various configurations. If you
require a workaround, the first option may be to avoid the discard mount
option noted above. That is still non-deterministic because other
latencies could cause the log to fill.

In practice, I think that either increasing the block size or reducing
the log buffer size (or count) should prevent this problem from
occurring at all. For example, cutting log buffer space in half from 32k
log buffers to logbsize=16k (or logbufs=8 to logbufs=4 with 32k bufs)
should be enough to prevent the problem with -b size=1k. Also, I don't
think this will reproduce with 4k blocks and otherwise default settings.

If you wanted to provide additional verification that this is the
problem, I suspect bumping up to logbsize=256k (with discard enabled)
should be enough to improve the reproducibility of this for any
supported block size (if you do attempt to confirm that, please do post
results).
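
For concreteness, the mount invocations for the suggestions above would
look something like this (illustrative values; logbufs and logbsize are
the standard XFS mount options discussed here):

  # workaround: halve in-core log buffer space, either way
  mount -t xfs -o discard,logbsize=16384 $device $mountpoint
  mount -t xfs -o discard,logbufs=4 $device $mountpoint

  # verification: max out the log buffer size to widen the window
  mount -t xfs -o discard,logbsize=262144 $device $mountpoint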

Brian

> Thanks again!
> 
> On Mon, Jun 12, 2017 at 12:18 PM, Brian Foster <bfoster@redhat.com> wrote:
> > On Mon, Jun 12, 2017 at 10:55:26AM -0400, Brian Foster wrote:
> >> It looks like you've replied to one mail and copy/pasted the text from
> >> another. FWIW, it's usually best to reply to each mail independently to
> >> preserve threading and whatnot.
> >>
> >> On Fri, Jun 09, 2017 at 10:06:26PM -0400, Sweet Tea Dorminy wrote:
> >> > >What is the xfs_info for this filesystem?
> >> >        meta-data=/dev/mapper/tracer-vdo0 isize=256    agcount=4,
> >> >        agsize=5242880 blks
> >> >                 =                       sectsz=512   attr=2, projid32bit=0
> >> >        data     =                       bsize=1024   blocks=20971520,
> >> >        imaxpct=25
> >> >                 =                       sunit=0      swidth=0 blks
> >> >        naming   =version 2              bsize=4096   ascii-ci=0
> >> >        log      =internal               bsize=1024   blocks=10240, version=2
> >> >                 =                       sectsz=512   sunit=0 blks,
> >> >        lazy-count=1
> >> >        realtime =none                   extsz=4096   blocks=0, rtextents=0
> >> >
> >> > > What granularity are these A and B regions (sectors or larger)?
> >> > A is 1k, B is 3k.
> >> >
> >> > >Are you running on some kind of special block device that reproduces this?
> >> > It's an asynchronous device we are developing, which we believe obeys
> >> > FLUSH and FUA correctly but may have missed some case; we
> >> > encountered this issue when testing an XFS filesystem on it, and other
> >> > filesystems appear to work fine (although obviously we could have
> >> > merely gotten lucky). Currently, when a flush returns from the device,
> >> > we guarantee the data from all bios completed before the flush was
> >> > issued is stably on disk; when a write+FUA bio returns from the
> >> > device, the data in that bio (only) is guaranteed to be stable on disk. The
> >> > device may, however, commit sequentially issued write+fua bios to disk in an
> >> > arbitrary order.
> >> >
> >> > > Do you have a consistent reproducer and/or have you
> >> > reproduced on an upstream kernel
> >> > Our reproducer fails about 20% of the time. We have not tried on an
> >> > upstream kernel.
> >> >
> >>
> >> Please do try on something upstream if possible. Also, what exactly
> >> triggers the failure that ultimately leads to log recovery on the next
> >> mount? Are you doing an XFS shutdown or shutting down the block device
> >> underneath the fs?
> >>
> >> > >Could you provide an xfs_metadump image of the filesystem that fails log recovery with CRC errors?
> >> > I can capture such on Monday.
> >> > For now, just the journal (gathered with xfs_logprint -C fsLog) can be
> >> > found at (10M)
> >> > https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/fsLog
> >> > .
> >>
> >> Thanks. This certainly looks a bit strange. Up through sector 19692, the
> >> log is populated with cycle 2 LSNs. After that the data appears to be
> >> from the previous cycle 1, which suggests that is the head of the log.
> >> The tail then points to 1:20105, but from that point forward a strange
> >> mix of cycle 1 and cycle 2 LSNs is seen.
> >>
> >> It's not totally clear to me from the physical log state whether this
> >> was the in-core state at the time of the crash, but a first guess may be
> >> that at least this is what log recovery thinks and thus complains about
> >> the tail of the log being borked.
> >>
> >
> > Another observation and possible theory... the last two cycle 2 headers
> > in the log are:
> >
> >  20116 HEADER Cycle 2 tail 2:010284 len  32256 ops 288
> >  ...
> >  20180 HEADER Cycle 2 tail 2:010284 len  32256 ops 261
> >
> > ... which appear sanely spaced apart. They are mixed with cycle 1/2
> > changes, which seems to correspond with your observation that parts of
> > the 32k log buffers are succeeding and parts are failing. IOW, these log
> > records are effectively incomplete on disk.
> >
> > That accounts for 2 of 8 log buffers. If we assume these are the "last"
> > two log buffers, each log buffer is full and also spaced 64 sectors
> > apart, then the first log buffer would be targeting log sector (20116 -
> > 6 * 64) = 19732. That refers to the head of the on-disk log noted above:
> >
> >  ...
> >  19628 HEADER Cycle 2 tail 1:020105 len  32256 ops 327
> >  19692 HEADER Cycle 2 tail 1:020105 len  19968 ops 89
> >  19732 HEADER Cycle 1 tail 1:020105 len      0 ops 0
> > [00000 - 19732] Cycle 0x00000002 New Cycle 0x00000001
> >  19733 HEADER Cycle 1 tail 1:020105 len      0 ops 0
> >  ...
> >
> > So I'm wondering whether it's possible that all of the log buffers
> > were dirty at the time of the crash, the last two were the only ones
> > that made it to disk, and they only made it partially at that. Further,
> > in partially writing those last two bufs at 20116 and beyond, we've
> > begun to overwrite the tail pointed to by the last record on disk before
> > the head (tail 1:020105 -> cycle 1 sector 20105), corrupting that
> > reference as well.
> >
> > There are also a bunch of the len == 0 headers at 19733 and beyond,
> > which I'm not quite clear on. I'm wondering if log recovery might
> > actually do that after it settles on a log head. That could mean there
> > actually was more cycle 2 data from other log bufs in that range but it
> > was all detected as invalid and cleared out during log recovery.
> >
> > When you grab the metadump to provide, can you make sure you do so
> > _before_ you attempt log recovery on the filesystem for the first time?
> > E.g., reproduce crash, umount & create metadump, mount and test for
> > corruption, if reproduce -> send metadump, else toss metadump & repeat.
> > This is because the first attempt at log recovery may modify the
> > physical log and cause some information loss if we take the metadump
> > after you've already attempted recovery.
> >
> > Brian
> >
> > P.S., I've also been assuming based on the log state that you haven't
> > used the logbufs/logbsize mount options to change from the defaults, but
> > please do confirm that you aren't changing either of those defaults.
> > Thanks.
> >
> >> > A log of the journal writes can be found at (17M)
> >> > https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/log_writes_only.blkparse.
> >> > It is in a blkparse-like format. For each 512-byte sector of a bio,
> >> > either starting or finishing, the data hash is recorded; the sector is
> >> > recorded; and the index of this sector and the number of sectors
> >> > within the current bio is recorded. Bios recorded as "FAILED" indicate
> >> > that the device has crashed / become disconnected and the bio has
> >> > returned with an error.
> >> >
> >> > From there, it searches a previous number of blocks
> >> > based on the maximum log buffer concurrency allowed by the fs to
> >> > determine whether any such "holes" exist in that range. If so, the head
> >> > is walked back to the first instance of such a "hole," effectively
> >> > working around out of order buffer completion at the time of a
> >> > filesystem crash.
> >> >
> >> > In the case logged and linked above, there are 256k of outstanding log
> >> > write bytes at once; 187k of these fail and 69k succeed. Of the 69k which
> >> > succeed, they are always the first 1k of the 4k block to which they
> >> > belong. Is this within the permitted amount of outstanding log buffers?
> >> >
> >>
> >> Generally, yes. Log recovery uses the maximum in-core log buffer count
> >> (8) and the maximum supported log buffer size (256k) to determine how
> >> far to look back (2M). The default log buffer size is 32k, so I suspect
> >> 256k means every log buffer was in-flight at this time.
> >>
> >> When you reproduce this and presumably provide a metadump, could you
> >> also collect the /sys/fs/xfs/<dev>/log/* state information at the time
> >> of the crash of that particular metadump? Note that this directory
> >> disappears after unmount so you'll have to collect it before.
> >>
> >> Brian
> >>
> >> > Thanks!
> >> >
> >> > Sweet Tea
> >> >
> >> > On Fri, Jun 9, 2017 at 7:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > > On Thu, Jun 08, 2017 at 11:42:11AM -0400, Sweet Tea Dorminy wrote:
> >> > >> Greetings;
> >> > >>
> >> > >> When using XFS with a 1k block size atop our device, we regularly get
> >> > >> "log record CRC mismatch"es when mounting XFS after a crash, and we
> >> > >> are attempting to understand why. We are using RHEL7.3 with its kernel
> >> > >> 3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.
> >> > >>
> >> > >> Tracing indicates the following situation occurs:
> >> > >>        Some pair of consecutive locations contains data A1 and B1, respectively.
> >> > >>        The XFS journal issues new writes to those locations,
> >> > >> containing data A2 and B2.
> >> > >>        The write of B' finishes, but A' is still outstanding at the
> >> > >> time of the crash.
> >> > >>        Crash occurs. The data on disk is A1 and B2, respectively.
> >> > >>        XFS fails to mount, complaining that the checksum mismatches.
> >> > >>
> >> > >> Does XFS expect sequentially issued journal IO to be committed to disk
> >> > >> in the order of issuance due to the use of FUA?
> >> > >
> >> > > Journal IO is not sequentially issued. It's an async process. At
> >> > > runtime, ordering is handled by journal IO completion processing
> >> > > being queued and run in order, so IOs can both be issued and
> >> > > physically complete out of order.
> >> > >
> >> > > Log recovery is supposed to handle this. It searches and finds the
> >> > > latest contiguous journal entry and does not replay past holes that
> >> > > may arise from out of order journal writes.
> >> > >
> >> > > CRC errors like this in recovery imply that journal writes are being
> >> > > torn or not completed fully, which may mean that your storage does
> >> > > not correctly implement flush/FUA ordering semantics....
> >> > >
> >> > > Cheers,
> >> > >
> >> > > Dave.
> >> > > --
> >> > > Dave Chinner
> >> > > david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-06-16 13:42 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-08 15:42 XFS journal write ordering constraints? Sweet Tea Dorminy
2017-06-09 12:38 ` Brian Foster
2017-06-09 17:30   ` Brian Foster
2017-06-09 23:44 ` Dave Chinner
2017-06-10  2:06   ` Sweet Tea Dorminy
2017-06-12 14:55     ` Brian Foster
2017-06-12 16:18       ` Brian Foster
2017-06-15 22:28         ` Sweet Tea Dorminy
2017-06-16 13:42           ` Brian Foster
2017-06-12 23:50     ` Dave Chinner
2017-06-13 14:14       ` Sweet Tea Dorminy
2017-06-13 22:16         ` Dave Chinner
2017-06-14  6:46           ` Christoph Hellwig
2017-06-13 16:29       ` Brian Foster
