* Questions about XFS discard and xfs_free_extent() code (newbie)
@ 2013-12-18 18:37 Alex Lyakas
  2013-12-18 23:06 ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2013-12-18 18:37 UTC (permalink / raw)
  To: xfs

Greetings XFS developers & community,

I am studying the XFS code, primarily focusing for now on the free-space
allocation and deallocation parts.

I learned that freeing an extent happens like this:
- xfs_free_extent() calls xfs_free_ag_extent(), which attempts to merge the
freed extent with its left and right neighbours in the by-bno btree. The
by-size btree is then updated accordingly.
- xfs_free_extent() marks the original (un-merged) extent as "busy" via
xfs_extent_busy_insert(). This prevents the original extent from being
allocated. (Except that for metadata allocations such an extent, or part of
it, can be "unbusied" as long as it has not yet been marked for discard with
XFS_EXTENT_BUSY_DISCARDED.)
- Once the appropriate part of the log is committed, xlog_cil_committed()
calls xfs_discard_extents(). This discards the extents using the synchronous
blkdev_issue_discard() API, and only then "unbusies" them. This makes
sense, because we cannot allow these extents to be allocated until the
discard has completed.
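
To check that I understand the ordering, here is a tiny userspace toy model
of the sequence above (this is not XFS code; the struct and toy_* names are
mine, purely for illustration):

#include <stdbool.h>
#include <stdio.h>

/* Toy model of one freed extent going through the flow described above. */
struct toy_extent {
        unsigned int bno;
        unsigned int len;
        bool free;      /* present in the free-space "btrees" */
        bool busy;      /* tracked as a busy extent */
};

static bool toy_can_allocate(const struct toy_extent *e)
{
        /* a free-space lookup would find it, but busy extents are skipped */
        return e->free && !e->busy;
}

static void toy_free_extent(struct toy_extent *e)
{
        e->free = true; /* xfs_free_ag_extent(): insert/merge into free-space btrees */
        e->busy = true; /* xfs_extent_busy_insert(): block reuse until commit */
}

static void toy_log_committed(struct toy_extent *e)
{
        /* the real code calls blkdev_issue_discard() here, then unbusies */
        printf("discard blocks [%u, %u)\n", e->bno, e->bno + e->len);
        e->busy = false;
}

int main(void)
{
        struct toy_extent e = { .bno = 100, .len = 8 };

        toy_free_extent(&e);
        printf("allocatable right after free? %s\n",
               toy_can_allocate(&e) ? "yes" : "no");
        toy_log_committed(&e);
        printf("allocatable after commit + discard? %s\n",
               toy_can_allocate(&e) ? "yes" : "no");
        return 0;
}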

WRT this flow, I have some questions:

- xfs_free_extent() first inserts the extent into the free-space btrees, and
only then marks it as busy. How come there is no race window here? Can
somebody allocate the freed extent before it is marked as busy? Or are the
free-space btrees somehow locked at this point? The code says "validate
the extent size is legal now we have the agf locked". I more or less see
that xfs_alloc_fix_freelist() locks *something*, but I don't see
xfs_free_extent() unlocking anything.

- If xfs_extent_busy_insert() fails to allocate an xfs_extent_busy structure,
such an extent cannot be discarded, correct?

- xfs_discard_extents() doesn't check the discard granularity of the
underlying block device, the way xfs_ioc_trim() does. So it may send a small
discard request that cannot be handled. Had it checked the granularity, it
could have avoided sending small requests. But the thing is
that the busy extent might have been merged in the free-space btree into a
larger extent, which is now suitable for discard.

I want to attempt the following logic in xfs_discard_extents():
# search the "by-bno" free-space btree for a larger extent that fully
encapsulates the busy extent (which we want to discard)
# if found, check whether some other part of the larger extent is still busy
(other than the current busy extent we want to discard)
# if not, send a discard for the larger extent
Does this make sense? And I think that we need to hold the larger extent
locked somehow until the discard completes, to prevent allocation from the
discarded range.

Can anybody please comment on these questions?

Thanks!
Alex. 


* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2013-12-18 18:37 Questions about XFS discard and xfs_free_extent() code (newbie) Alex Lyakas
@ 2013-12-18 23:06 ` Dave Chinner
  2013-12-19  9:24   ` Alex Lyakas
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2013-12-18 23:06 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Wed, Dec 18, 2013 at 08:37:29PM +0200, Alex Lyakas wrote:
> Greetings XFS developers & community,
> 
> I am studying the XFS code, primarily focusing now at the free-space
> allocation and deallocation parts.
> 
> I learned that freeing an extent happens like this:
> - xfs_free_extent() calls xfs_free_ag_extent(), which attempts to merge the
> freed extents from left and from right in the by-bno btree. Then the by-size
> btree is updated accordingly.
> - xfs_free_extent marks the original (un-merged) extent as "busy" by
> xfs_extent_busy_insert(). This prevents this original extent from being
> allocated. (Except that for metadata allocations such extent or part of it
> can be "unbusied", while it is still not marked for discard with
> XFS_EXTENT_BUSY_DISCARDED).
> - Once the appropriate part of the log is committed, xlog_cil_committed
> calls xfs_discard_extents. This discards the extents using the synchronous
> blkdev_issue_discard() API, and only then "unbusies" the extents. This makes
> sense, because we cannot allow allocating these extents until discarding
> completed.
> 
> WRT to this flow, I have some questions:
> 
> - xfs_free_extent first inserts the extent into the free-space btrees, and
> only then marks it as busy. How come there is no race window here?

Because the AGF is locked exclusively at this point, meaning only
one process can be modifying the free space tree at this point in
time.

> Can
> somebody allocate the freed extent before it is marked as busy? Or the
> free-space btrees somehow are locked at this point? The code says "validate
> the extent size is legal now we have the agf locked". I more or less see
> that xfs_alloc_fix_freelist() locks *something*, but I don't see
> xfs_free_extent() unlocking anything.

The AGF remains locked until the transaction is committed. The
transaction commit code unlocks items modified in the transaction
via the ->iop_unlock log item callback....

> - If xfs_extent_busy_insert() fails to alloc a xfs_extent_busy structure,
> such extent cannot be discarded, correct?

Correct.

> - xfs_discard_extents() doesn't check the discard granularity of the
> underlying block device, like xfs_ioc_trim() does. So it may send a small
> discard request, which cannot be handled.

Discard is an "advisory" operation - it is never guaranteed to do
anything.

> If it would have checked the
> granularity, it could have avoided sending small requests. But the thing is
> that the busy extent might have been merged in the free-space btree into a
> larger extent, which is now suitable for discard.

Sure, but the busy extent tree tracks extents across multiple
transaction contexts, and we cannot merge extents that are in
different contexts.

> I want to attempt the following logic in xfs_discard_extents():
> # search the "by-bno" free-space btree for a larger extent that fully
> encapsulates the busy extent (which we want to discard)
> # if found, check whether some other part of the larger extent is still busy
> (except for the current busy extent we want to discard)
> # if no, send discard for the larger extent
> Does this make sense? And I think that we need to hold the larger
> extent locked somehow until the
> discard completes, to prevent allocation from the discarded range.

You can't search the freespace btrees in log IO completion context -
that will cause deadlocks because we can be holding the locks
searching the freespace trees when we issue a log force and block
waiting for log IO completion to occur. e.g. in
xfs_extent_busy_reuse()....

Also, walking the free space btrees can be an IO bound operation,
overhead/latency we absolutely do not want to add to log IO completion.

Further, walking the free space btrees can be a memory intensive
operation (buffers are demand paged from disk) and log IO completion
may be necessary for memory reclaim to make progress in low memory
situations. So adding unbound memory demand to log IO completion
will cause low memory deadlocks, too.

IOWs, adding freespace tree processing to xfs_discard_extents() just
won't work.

What we really need is a smarter block layer implementation of the
discard operation - it needs to be asynchronous, and it needs to
support merging of adjacent discard requests. Now that SATA 3.1
devices are appearing on the market, queued trim operations are
possible. Dispatching discard operations as synchronous operations
prevents us from taking advantage of these capabilities. Further,
because it's synchronous, the block layer can't merge adjacent
discards, nor batch multiple discard ranges up into a single TRIM
command.

IOWs, what we really need is for the block layer discard code to be
brought up to the capabilities of the hardware on the market first.
Then we will be in a position to be able to optimise the XFS code to
use async dispatch and new IO completion handlers to finish the log
IO completion processing, and at that point we shouldn't need to
care anymore. Note that XFS already dispatches discards in ascending
block order, so if we issue adjacent discards the block layer will
be able to merge them appropriately. Hence we don't need to add that
complexity to XFS....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2013-12-18 23:06 ` Dave Chinner
@ 2013-12-19  9:24   ` Alex Lyakas
  2013-12-19 10:55     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2013-12-19  9:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,
Thank you for your comments.
I realize now that what I proposed cannot be done; I need to understand more
deeply how XFS transactions work (unfortunately, the awesome "XFS Filesystem
Structure" doc has a TODO in the "Journaling Log" section).

Can you please comment on one more question:
Let's say we had such a fully asynchronous "fire-and-forget" discard operation
(I can implement one myself for my block device via a custom IOCTL). What is
wrong if we trigger such an operation in xfs_free_ag_extent(), right after we
have merged the freed extent into a bigger one? I understand that the
extent-free intent is not yet committed to the log at this point. But from
the user's point of view, the extent has been deleted, no? So if the
underlying block device discards the merged extent right away, before
committing to the log, what issues can this cause?

Thanks,
Alex.



* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2013-12-19  9:24   ` Alex Lyakas
@ 2013-12-19 10:55     ` Dave Chinner
  2013-12-19 19:24       ` Alex Lyakas
  2013-12-24 18:21       ` Alex Lyakas
  0 siblings, 2 replies; 47+ messages in thread
From: Dave Chinner @ 2013-12-19 10:55 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Thu, Dec 19, 2013 at 11:24:15AM +0200, Alex Lyakas wrote:
> Hi Dave,
> Thank you for your comments.
> I realize now that what I proposed cannot be done; I need to
> understand deeper how XFS transactions work (unfortunately, the
> awesome "XFS Filesystem Structure" doc has a TODO in the "Journaling
> Log" section).
> 
> Can you please comment on one more question:
> Let's say we had such fully asynchronous "fire-and-forget" discard
> operation (I can implement one myself for my block-device via a
> custom IOCTL). What is wrong if we trigger such operation in
> xfs_free_ag_extent(), right after we have merged the freed extent
> into a bigger one? I understand that the extent-free-intent is not
> yet committed to the log at this point. But from the user's point of
> view, the extent has been deleted, no? So if the underlying block
> device discards the merged extent right away, before committing to
> the log, what issues this can cause?

Think of what happens when a crash occurs immediately after the
discard completes. The freeing of the extent never made it to the
log, so after recovery, the file still exists and the user can
access it. Except that its contents are now all different to
before the crash occurred.

IOWs, issuing the discard before the transaction that frees the
extent is on stable storage means we are discarding user data or
metadata before we've guaranteed that the extent free transaction
is permanent and that means we violate certain guarantees with
respect to crash recovery...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2013-12-19 10:55     ` Dave Chinner
@ 2013-12-19 19:24       ` Alex Lyakas
  2013-12-21 17:03         ` Chris Murphy
  2013-12-24 18:21       ` Alex Lyakas
  1 sibling, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2013-12-19 19:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,
It makes sense. I agree it might break some guarantees. Although if the user
deleted some blocks in the file, or the whole file, maybe it's ok not to have
a clear promise about what he sees after the crash. But I agree, the
semantics are not clear.

Thanks for the comments,
Alex.



* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2013-12-19 19:24       ` Alex Lyakas
@ 2013-12-21 17:03         ` Chris Murphy
  0 siblings, 0 replies; 47+ messages in thread
From: Chris Murphy @ 2013-12-21 17:03 UTC (permalink / raw)
  To: xfs


On Dec 19, 2013, at 12:24 PM, Alex Lyakas <alex@zadarastorage.com> wrote:

> Hi Dave,
> It makes sense. I agree it might break some guarantees. Although if the user deleted some blocks in the file or the whole file, maybe it's ok to not have a clear promise what he sees after the crash. 

User perspective: I disagree. Sounds like a possible zombie file invasion, with no clear way of reversion. The file either needs to be gone, as in not accessible in user space, or it needs to be present and intact.  There isn't a reasonable expectation for a file to be resurrected from the dead that's also corrupted.

If the file name isn't also corrupted, the problem is worse. It looks like a legitimate file, yet it's useless. The zombie files will be subject to backup and restore, just like their valid predecessors. All I need is to stumble upon a handful of these files, which I won't necessarily remember were deleted files, to start assuming I have some sort of weird file system corruption, at which point at best I'll become really confused, not knowing what to do next. At worst, I may start throwing hammers that end up causing worse problems.


Chris Murphy

* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2013-12-19 10:55     ` Dave Chinner
  2013-12-19 19:24       ` Alex Lyakas
@ 2013-12-24 18:21       ` Alex Lyakas
  2013-12-26 23:00         ` Dave Chinner
  1 sibling, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2013-12-24 18:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,
Reading through the code some more, I see that the extent that is freed 
through xfs_free_extent() can be an XFS metadata extent as well.
For example, xfs_inobt_free_block() frees a block of the AG's free-inode 
btree. Also, xfs_bmbt_free_block() frees a generic btree block by putting it 
onto the cursor's "to-be-freed" list, which will be dropped into the 
free-space btree (by xfs_free_extent) in xfs_bmap_finish(). If we discard 
such metadata block before the transaction is committed to the log and we 
crash, we might not be able to properly mount after reboot, is that right? I 
mean it's not that some file's data block will show 0s to the user instead 
of before-delete data, but some XFS btree node (for example) will be wiped 
in such case. Can this happen?

Thanks,
Alex.



* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2013-12-24 18:21       ` Alex Lyakas
@ 2013-12-26 23:00         ` Dave Chinner
  2014-01-08 18:13           ` Alex Lyakas
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2013-12-26 23:00 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Tue, Dec 24, 2013 at 08:21:50PM +0200, Alex Lyakas wrote:
> Hi Dave,
> Reading through the code some more, I see that the extent that is
> freed through xfs_free_extent() can be an XFS metadata extent as
> well.
> For example, xfs_inobt_free_block() frees a block of the AG's
> free-inode btree. Also, xfs_bmbt_free_block() frees a generic btree
> block by putting it onto the cursor's "to-be-freed" list, which will
> be dropped into the free-space btree (by xfs_free_extent) in
> xfs_bmap_finish(). If we discard such metadata block before the
> transaction is committed to the log and we crash, we might not be
> able to properly mount after reboot, is that right?

Yes. The log stores a delta of the transactional changes, and so
requires the previous version of the block to be intact for recovery
to take place.

> I mean it's not
> that some file's data block will show 0s to the user instead of
> before-delete data, but some XFS btree node (for example) will be
> wiped in such case. Can this happen?

Yes, it could. That's what I meant by:

[snip]

> > IOWs, issuing the discard before the transaction that frees the
> > extent is on stable storage means we are discarding user data or
                                                                  ^^
> > metadata before we've guaranteed that the extent free transaction
    ^^^^^^^^
> > is permanent and that means we violate certain guarantees with
> > respect to crash recovery...

The "or metadata" part of the above sentence.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2013-12-26 23:00         ` Dave Chinner
@ 2014-01-08 18:13           ` Alex Lyakas
  2014-01-13  3:02             ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2014-01-08 18:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hello Dave,

Currently I am working on the following approach:

Basic idea: make each xfs_extent_busy struct carry potential information 
about a larger extent-for-discard, i.e., one that is a multiple of 
discard_granularity.

In detail, this looks like this:
# xfs_free_ag_extent() attempts to merge the freed extent into a larger
free-extent in the by-bno btree. When this function completes its work, we
have [nbno, nlen], which is potentially a larger free-extent. At this point
we also know that the AGF is locked.
# We trim [nbno, nlen] to be a multiple-of and aligned-by
discard_granularity (if possible), and we receive [dbno, dlen], which is a
very nice extent to discard (see the small sketch below).
# When calling xfs_extent_busy_insert(), we add these two values to the
xfs_extent_busy struct.
# When the extent-free operation is committed for this busy extent, we know
that we can discard this [dbno, dlen] area, unless somebody has allocated
an extent which overlaps this area.
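
The trimming from the second bullet above is just alignment arithmetic;
something like this userspace sketch (types simplified, granularity expressed
in blocks, and the function name is mine):

#include <stdbool.h>

/* Trim [nbno, nbno+nlen) to the largest sub-range [dbno, dbno+dlen) that is
 * both aligned to and a multiple of 'gran'. Returns false if no such
 * sub-range exists. */
static bool trim_to_granularity(unsigned long long nbno, unsigned long long nlen,
                                unsigned long long gran,
                                unsigned long long *dbno, unsigned long long *dlen)
{
        unsigned long long start = (nbno + gran - 1) / gran * gran; /* round up */
        unsigned long long end = (nbno + nlen) / gran * gran;       /* round down */

        if (end <= start)
                return false;   /* nothing big and aligned enough to discard */
        *dbno = start;
        *dlen = end - start;
        return true;
}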

To address that, at the end of xfs_alloc_fixup_trees() we do the following:
# We know that we are going to allocate [rbno, rlen] from the appropriate AG.
So at this point, we search the busy extents tree to check if there is a busy
extent that holds [dbno, dlen] (this is a multiple-of and aligned-by discard
granularity), which overlaps [rbno, rlen] fully or partially. If found, we
shrink the [dbno, dlen] area so that it is still a multiple-of and aligned-by
discard-granularity, if possible (see the second sketch below). So we have a
new smaller [dbno, dlen] that we can still discard, attached to the same busy
extent. Or we discover that the new area is too small to discard, so we
forget about it.
# The allocation flow searches the busy extents tree anyway, so we should
be ok WRT locking order, but we are adding some extra work.

This way, we basically track larger chunks, which are nice to discard.
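
The shrinking mentioned above would then look roughly like this (again a
userspace sketch that builds on trim_to_granularity() from the previous one;
when the allocation splits the discard range, it simply keeps the larger of
the two aligned leftovers - that is just my arbitrary choice here):

/* An allocation [rbno, rbno+rlen) overlaps the tracked discard range
 * [*dbno, *dbno+*dlen). Shrink the discard range, or return false if
 * nothing discardable is left. */
static bool shrink_discard_range(unsigned long long rbno, unsigned long long rlen,
                                 unsigned long long gran,
                                 unsigned long long *dbno, unsigned long long *dlen)
{
        unsigned long long dend = *dbno + *dlen, rend = rbno + rlen;
        unsigned long long lbno = 0, llen = 0, hbno = 0, hlen = 0;
        bool have_l, have_h;

        if (rend <= *dbno || rbno >= dend)
                return true;    /* no overlap, keep the range as is */

        /* aligned leftovers below and above the allocated range */
        have_l = rbno > *dbno &&
                 trim_to_granularity(*dbno, rbno - *dbno, gran, &lbno, &llen);
        have_h = dend > rend &&
                 trim_to_granularity(rend, dend - rend, gran, &hbno, &hlen);

        if (have_l && (!have_h || llen >= hlen)) {
                *dbno = lbno;
                *dlen = llen;
        } else if (have_h) {
                *dbno = hbno;
                *dlen = hlen;
        } else {
                return false;   /* too small to discard, forget about it */
        }
        return true;
}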

I am aware that I need to handle additional issues like:
# A busy extent can be "unbusied" or shrunk by
xfs_extent_busy_update_extent(). We need to update [dbno, dlen] accordingly,
or delete it entirely.
# To be able to search for [dbno, dlen], we probably need another rbtree
(under the same pag->pagb_lock), which tracks large extents for discard.
xfs_extent_busy needs an additional rb_node.
# If during xfs_alloc_fixup_trees() we discover that the extent is already
being discarded, we need to wait. Assuming we have asynchronous discard, this
wait will be short - we only need the block device to queue the discard
request, and then we are good to allocate from that area again.

One thing I am unsure about is a scenario like this:
# assume discard-granularity=1MB
# we have a 1MB range that is almost free, except for two 4K blocks, somewhere
in the free space
# Transaction t1 comes and frees 4K block A, but the 1MB extent is not fully 
free yet, so nothing to discard
# Transaction t2 frees the second 4K block B, now 1MB is free and we attach 
a [dbno, dlen] to the second busy extent

However, I think there is no guarantee that t1 will commit before t2; is
that right? But we cannot discard the 1MB extent before both transactions
commit. (One approach to solve this is to give a sequence number to each
xfs_extent_busy extent, and have a background thread that does delayed
discards once all the needed busy extents are committed. The delayed discards
are also considered in the check that xfs_alloc_fixup_trees() does.)

What do you think overall about this approach? Is there something 
fundamental that prevents it from working?

Also (if you are still reading:), can you kindly comment on this question
that I have:
# xfs_free_ag_extent() has an "isfl" parameter. If it is "true", then this
extent is added as usual to the free-space btrees, but the caller doesn't
add it as a busy extent. This means that such an extent is suitable for
allocation right away, without waiting for the log commit?

Thank you for helping,
Alex.




* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2014-01-08 18:13           ` Alex Lyakas
@ 2014-01-13  3:02             ` Dave Chinner
  2014-01-13 17:44               ` Alex Lyakas
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2014-01-13  3:02 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Wed, Jan 08, 2014 at 08:13:38PM +0200, Alex Lyakas wrote:
> Hello Dave,
> 
> Currently I am working on the following approach:
> 
> Basic idea: make each xfs_extent_busy struct carry potential
> information about a larger extent-for-discard, i.e., one that is a
> multiple of discard_granularity.

You're making a big assumption here that discard_granularity is
useful for aggregating large extents. Discard granularity on modern
SSDs is a single sector:

$ cat /sys/block/sda/queue/discard_granularity 
512
$

I've just checked 4 different types of SSDs I have here (from
3-year-old no-name Sandforce SSDs to a brand new Samsung 840 EVO),
and they all give the same result.

IOWs, in most cases other than thin provisioning it will not be
useful for optimising discards into larger, aligned extents.

> In detail this looks like this:
> # xfs_free_ag_extent attempts to merge the freed extent into a
> larger free-extent in the by-bno btree.  When this function
> completes its work, we have [nbno, nlen], which is potentially a
> larger free-extent. At this point we also know that AGF is locked.
> # We trim [nbno, nlen] to be a multiple-of and aligned-by
> discard_granularity (if possible), and we receive [dbno, dlen],
> which is a very nice extent to discard.
> # When calling xfs_extent_busy_insert(), we add these two values to
> the xfs_extent_busy struct.
> # When the extent-free operation is committed for this busy extent,
> we know that we can discard this [dbno, dlen] area, unless somebody
> have allocated an extent, which overlaps this area.

This strikes me as optimisation at the wrong time, i.e. optimising
discard ranges at extent free time ignores the fact that the extent
can be immediately reallocated, and so all the optimisation work is
wasted.

> To address that, at the end of  xfs_alloc_fixup_trees() we do the following:
> # We know that we are going to allocate [rbno, rlen] from
> appropriate AG. So at this point, we search the busy extents tree to
> check if there is a busy extent that holds [dbno, dlen] (this is a
> multiple-of and aligned-by discard granularity), which overlaps

How do you find that? The busy extent tree is indexed on individual
extents that have been freed, not the discard granularity ranges
they track.

> [rbno, rlen] fully or partially. If found, we shrink the [dbno,
> dlen] area to be still a multiple-of and aligned by
> discard-granularity, if possible. So we have a new smaller [dbno,
> dlen] that we still can discard, attached to the same busy extent.
> Or we discover that the new area is too small to discard, so we
> forget about it.

Forget about the discard range, right? We can't ignore the busy
extent that covers the range being freed - it must be tracked all
the way through to transaction commit completion.

> # The allocation flow anyways searches the busy extents tree, so we
> should be ok WRT to locking order, but adding some extra work.

There are locking issues other than order to be concerned about....

> I am aware that I need to handle additional issues like:
> # A busy extent can be "unbusyied" or shrunk by
> xfs_extent_busy_update_extent(). We need to update [dbno, dlen]
> accordingly or delete it fully
> # To be able to search for [dbno, dlen], we probably need another
> rbtree (under the same pag->pagb_lock), which tracks large extents
> for discard. xfs_extent_busy needs additional rbnode.
> # If during xfs_alloc_fixup_trees() we discover that extent is
> already being discarded, we need to wait. Assuming we have
> asynchronous discard, this wait will be short - we only need the
> block device to queue the discard request, and then we are good to
> allocate from that area again.

* multiple busy extents can be found in the one "discard range".
Hence there is an n:1 relationship between the busy extents and the
"discard extent" that might be related to them. Hence if we
end up with:

	busy1	busy2	busy3
	+-------+-------+------+
	+----------------------+
	     discard1

and then we reuse busy2, we have to delete discard1 and update
busy1 and busy3 to no longer point at discard1. Indeed, depending on
the discard granularity, it might end up as:

	busy1		busy3
	+-------+       +------+
	+-------+       +------+
	discard1	discard2

And so the act of having to track optimal "discard ranges" becomes
very, very complex.

I really don't see any advantage in tracking discard ranges like
this, because we can do these optimisations of merging and trimming
just before issuing the discards. And realistically, merging and
trimming is something the block layer should be doing for us
already.
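
To be clear about how trivial that coalescing is when done at the point of
issue: it's just a sort and a single merge pass over the ranges about to be
discarded, along the lines of this userspace sketch (struct and function
names invented for the example):

#include <stdlib.h>

struct drange {
        unsigned long long bno;
        unsigned long long len;
};

static int drange_cmp(const void *a, const void *b)
{
        const struct drange *x = a, *y = b;

        return (x->bno > y->bno) - (x->bno < y->bno);
}

/* Sort ranges by start block and merge the ones that touch or overlap.
 * Returns the new number of ranges. */
static int merge_discard_ranges(struct drange *r, int n)
{
        int i, out = 0;

        if (n == 0)
                return 0;
        qsort(r, n, sizeof(*r), drange_cmp);
        for (i = 1; i < n; i++) {
                if (r[i].bno <= r[out].bno + r[out].len) {
                        /* adjacent or overlapping: extend the current range */
                        unsigned long long end = r[i].bno + r[i].len;

                        if (end > r[out].bno + r[out].len)
                                r[out].len = end - r[out].bno;
                } else {
                        r[++out] = r[i];
                }
        }
        return out + 1;
}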

> 
> One thing I am unsure about, is a scenario like this:
> # assume discard-granularity=1MB
> # we have a 1MB almost free, except two 4K blocks, somewhere in the
> free space
> # Transaction t1 comes and frees 4K block A, but the 1MB extent is
> not fully free yet, so nothing to discard
> # Transaction t2 frees the second 4K block B, now 1MB is free and we
> attach a [dbno, dlen] to the second busy extent
> 
> However, I think there is no guarantee that t1 will commit before
> t2; is that right?

Correct.

> But we cannot discard the 1MB extent, before both
> transactions commit. (One approach to solve this, is to give a
> sequence number for each xfs_extent_busy extent, and have a
> background thread that does delayed discards, once all needed busy
> extents are committed. The delayed discards are also considered in
> the check that xfs_alloc_fixup_trees()  does).

We used to have a log sequence number in the busy extent to prevent
reuse of a busy extent - it would trigger a log force up to the
given LSN before allowing the extent to be reused. It caused
significant scalability problems for the busy extent tracking code,
and so it was removed and replaced with the non-blocking searches we
do now. See:

ed3b4d6 xfs: Improve scalability of busy extent tracking
e26f050 xfs: do not immediately reuse busy extent ranges
97d3ac7 xfs: exact busy extent tracking

i.e. blocking waiting for discard or log IO completion while holding
the AGF locked is a major problem for allocation latency
determinism. With discards potentially taking seconds, waiting for
them to complete while holding the AGF locked will effectively stall
parts of the filesystem for long periods of time. That blocking is
what the above commits prevent, and by doing this they allow us to use
the busy extent tree for issuing discards on ranges that have been
freed....

> What do you think overall about this approach? Is there something
> fundamental that prevents it from working?

I'm not convinced that re-introducing busy extent commit
sequence tracking and blocking to optimise discard operations is a
particularly good idea given the above.

> Also (if you are still reading:),  can you kindly comment this
> question that I have:
> # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then
> this extent is added as usual to the free-space btrees, but the
> caller doesn't add it as a busy extent. This means that such extent
> is suitable for allocation right away, without waiting for the log
> commit?

It means the extent is being moved from the AGFL to the free space
btree. Blocks on the AGFL have already gone through free space
accounting and busy extent tracking to get to the AGFL, and so there
is no need to repeat it when moving it to the free space btrees.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2014-01-13  3:02             ` Dave Chinner
@ 2014-01-13 17:44               ` Alex Lyakas
  2014-01-13 20:43                 ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2014-01-13 17:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,
Thank you for your comments, and for pointing me at the commits.

On Mon, Jan 13, 2014 at 5:02 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Jan 08, 2014 at 08:13:38PM +0200, Alex Lyakas wrote:
>> Hello Dave,
>>
>> Currently I am working on the following approach:
>>
>> Basic idea: make each xfs_extent_busy struct carry potential
>> information about a larger extent-for-discard, i.e., one that is a
>> multiple of discard_granularity.
>
> You're making a big assumption here that discard_granularity is
> useful for aggregating large extents. Discard granularity on modern
> SSDs is a single sector:
>
> $ cat /sys/block/sda/queue/discard_granularity
> 512
> $
>
> I've just checked 4 different types of SSDs I have here (from 3
> years old no-name sandforce SSDs to a brand new samsung 840 EVO),
> and htey all give the same result.
>
> IOWs, in most cases other than thin provisioning it will not be
> useful for optimising discards into larger, aligned extents.
Agree. The use case I am primarily looking into is, indeed, thin
provisioning. In my case, discard granularity can be, for example,
256KB or even higher. I realize this is not the case with SSDs.

>
>> In detail this looks like this:
>> # xfs_free_ag_extent attempts to merge the freed extent into a
>> larger free-extent in the by-bno btree.  When this function
>> completes its work, we have [nbno, nlen], which is potentially a
>> larger free-extent. At this point we also know that AGF is locked.
>> # We trim [nbno, nlen] to be a multiple-of and aligned-by
>> discard_granularity (if possible), and we receive [dbno, dlen],
>> which is a very nice extent to discard.
>> # When calling xfs_extent_busy_insert(), we add these two values to
>> the xfs_extent_busy struct.
>> # When the extent-free operation is committed for this busy extent,
>> we know that we can discard this [dbno, dlen] area, unless somebody
>> have allocated an extent, which overlaps this area.
>
> This strikes me as optimisation at the wrong time i.e. optimising
> discard ranges at extent free time ignores the fact that the extent
> can be immediately reallocated, and so all the optimisation work is
> wasted.
Agree it can be reallocated. Although, based on the code, only a
metadata extent can be reallocated immediately, while user-data busy
extents cannot be "unbusied" until their freeing is committed to the
log. I agree it can happen that a non-busy part of a discard range
can be allocated, and then the whole range cannot be discarded.

>
>> To address that, at the end of  xfs_alloc_fixup_trees() we do the following:
>> # We know that we are going to allocate [rbno, rlen] from
>> appropriate AG. So at this point, we search the busy extents tree to
>> check if there is a busy extent that holds [dbno, dlen] (this is a
>> multiple-of and aligned-by discard granularity), which overlaps
>
> How do you find that? The busy extent tree is indexed on individual
> extents that have been freed, not the discard granularity ranges
> they track.
Agreed; that's why I mentioned that another rbtree is needed, one that
indexes discard ranges.

>
>> [rbno, rlen] fully or partially. If found, we shrink the [dbno,
>> dlen] area to be still a multiple-of and aligned by
>> discard-granularity, if possible. So we have a new smaller [dbno,
>> dlen] that we still can discard, attached to the same busy extent.
>> Or we discover that the new area is too small to discard, so we
>> forget about it.
>
> Forget about the discard range, right? We can't ignore the busy
> extent that covers the range being freed - it must be tracked all
> the way through to transaction commit completion.
Agreed that the busy extent cannot be forgotten about. I meant that we
forget only about the discard range that is related to this busy
extent. So when the transaction commit completes, we will "unbusy" the
busy extent as usual, but we will not discard anything, because there
will be no discard-range connected to this busy extent.

>
>> # The allocation flow anyways searches the busy extents tree, so we
>> should be ok WRT to locking order, but adding some extra work.
>
> There are other locking issues to be concerned about than order....
>
>> I am aware that I need to handle additional issues like:
>> # A busy extent can be "unbusyied" or shrunk by
>> xfs_extent_busy_update_extent(). We need to update [dbno, dlen]
>> accordingly or delete it fully
>> # To be able to search for [dbno, dlen], we probably need another
>> rbtree (under the same pag->pagb_lock), which tracks large extents
>> for discard. xfs_extent_busy needs additional rbnode.
>> # If during xfs_alloc_fixup_trees() we discover that extent is
>> already being discarded, we need to wait. Assuming we have
>> asynchronous discard, this wait will be short - we only need the
>> block device to queue the discard request, and then we are good to
>> allocate from that area again.
>
> * multiple busy extents can be found in the one "discard range".
> Hence there is a n:1 relationship between the busy extents and the
> related "discard extent" that might be related to it. Hence if we
> end up with:
>
>         busy1   busy2   busy3
>         +-------+-------+------+
>         +----------------------+
>              discard1
>
> and then we reuse busy2, then we have to delete discard1 and update
> busy1 and busy3 not to point at discard1. Indeed, depending on
> the discard granularity, it might ned up:
>
>         busy1           busy3
>         +-------+       +------+
>         +-------+       +------+
>         discard1        discard2
>
> And so the act of having to track optimal "discard ranges" becomes
> very, very complex.
I agree with the examples you have given. That's why I realized that
the numbering of busy extents is needed. In the first example, the
extent that was freed *last* out of busy1/busy2/busy3 will be the one
pointing at the discard range. Assume it was busy3. Continuing with
your example, even if we fully reuse busy2, we delete it from
pagb_tree, but we *never* delete it from the transaction's t_busy
list. So we can split discard1 into discard1/2 and keep them connected
to busy3. When busy3 commits, we check if busy1 and busy2 have already
committed, using the busy extents numbering. If yes, we discard. If
not, we "unbusy" busy3 as usual, and we leave discard1/2 in the second
rbtree (they have the sequence number of busy3 to know when we can
discard them).
Something similar would happen if discard1 was originally connected to
busy2. Even after full reuse of busy2, busy2 is not deleted until it
commits (it is only erased from the pagb_tree).

>
> I really don't see any advantage in tracking discard ranges like
> this, because we can do these optimisations of merging and trimming
> just before issuing the discards. And realistically, merging and
> trimming is something the block layer should be doing for us
> already.
>
Yes, I realize that the XFS mindset is to do pure filesystem work, i.e.,
arrange blocks of data in files and map them to disk. The rest should
be handled by the application above and by the storage system below. In
your awesome AU2012 talk, you also confirm that mindset. The trouble is
that the block layer cannot really merge small discard requests
without the information that you have in the AGF btrees.

>>
>> One thing I am unsure about, is a scenario like this:
>> # assume discard-granularity=1MB
>> # we have a 1MB almost free, except two 4K blocks, somewhere in the
>> free space
>> # Transaction t1 comes and frees 4K block A, but the 1MB extent is
>> not fully free yet, so nothing to discard
>> # Transaction t2 frees the second 4K block B, now 1MB is free and we
>> attach a [dbno, dlen] to the second busy extent
>>
>> However, I think there is no guarantee that t1 will commit before
>> t2; is that right?
>
> Correct.
>
>> But we cannot discard the 1MB extent, before both
>> transactions commit. (One approach to solve this, is to give a
>> sequence number for each xfs_extent_busy extent, and have a
>> background thread that does delayed discards, once all needed busy
>> extents are committed. The delayed discards are also considered in
>> the check that xfs_alloc_fixup_trees()  does).
>
> We used to have a log sequence number in the busy extent to prevent
> reuse of a busy extent - it would trigger a log force up to the
> given LSN before allowing the extent to be reused. It caused
> significant scalability problems for the busy extent tracking code,
> and so it was removed and replaced with the non-blocking searches we
> do now. See:
>
> ed3b4d6 xfs: Improve scalability of busy extent tracking
> e26f050 xfs: do not immediately reuse busy extent ranges
> 97d3ac7 xfs: exact busy extent tracking
>
> i.e. blocking waiting for discard or log IO completion while holding
> the AGF locked is a major problem for allocation latency
> determinism. With discards potentially taking seconds, waiting for
> them to complete while holding the AGF locked will effectively stall
> parts of the filesystem for long periods of time. That blocking is
> what the above commits prevent, and by doing this allow us to use
> the busy extent tree for issuing discard on ranges that have been
> freed....
>
>> What do you think overall about this approach? Is there something
>> fundamental that prevents it from working?
>
> I'm not convinced that re-introducing busy extent commit
> sequence tracking and blocking to optimise discard operations is a
> particularly good idea given the above.
I am not sure I am suggesting blocking or locking anything (perhaps I am,
without realizing it). By and large, I suggest having another data
structure, an rbtree, that tracks discard ranges. This rbtree is
loosely connected to the busy extent rbtree. And I suggest three
things to do with this new rbtree:
- Whenever a busy extent is added, maybe add a discard range to the
second rbtree, and attach it to the busy extent (if we got a nice
discard range)
- For each new allocation, check if something needs to be
removed/changed in this rbtree. Yes, I realize this is additional
work.
- When a busy extent commits, by all means we "unbusy" the extent as
usual. But we also check in the second rbtree whether we can issue a
discard for some discard range. Perhaps we can. Or we cannot, because
of other busy extents that have not committed yet (the numbering is
used to determine that - see the toy sketch below). In that case, we will
discard later, when all the needed busy extents commit. Unless a new
allocation has removed/changed this discard range already. But we are not
delaying the "unbusying" of the busy extent, and we are not keeping the AGF
locked (I think). Also, we are issuing discards in the same place and
context where XFS does it today.
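
To be concrete about the numbering, the commit-time decision per discard
range reduces to something like this toy userspace check (all names are made
up; this is a conservative variant that waits until every busy extent
numbered up to the range's sequence has committed):

#include <stdbool.h>

/* A discard range remembers the highest sequence number of any busy extent
 * that contributed to it; it may only be issued once all busy extents up to
 * that sequence have committed. */
struct toy_discard_range {
        unsigned long long dbno;
        unsigned long long dlen;
        unsigned long long wait_seq;    /* highest covering busy-extent sequence */
};

static bool toy_can_discard(const struct toy_discard_range *d,
                            unsigned long long lowest_uncommitted_seq)
{
        /* safe once every busy extent numbered <= wait_seq has committed */
        return d->wait_seq < lowest_uncommitted_seq;
}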

>
>> Also (if you are still reading:),  can you kindly comment this
>> question that I have:
>> # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then
>> this extent is added as usual to the free-space btrees, but the
>> caller doesn't add it as a busy extent. This means that such extent
>> is suitable for allocation right away, without waiting for the log
>> commit?
>
> It means the extent is being moved from the AGFL to the free space
> btree. blocks on the AGFL have already gone through free space
> accounting and busy extent tracking to get to the AGFL, and so there
> is no need to repeat it when moving it to the free space btrees.
Ok, I realize now that this block has already gone through the busy
extent tracking via xfs_allocbt_free_block().

Thanks,
Alex.

P.S.: Just watched your AU2014 talk. Interesting.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2014-01-13 17:44               ` Alex Lyakas
@ 2014-01-13 20:43                 ` Dave Chinner
  2014-01-14 13:48                   ` Alex Lyakas
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2014-01-13 20:43 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Mon, Jan 13, 2014 at 07:44:13PM +0200, Alex Lyakas wrote:
> On Mon, Jan 13, 2014 at 5:02 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Jan 08, 2014 at 08:13:38PM +0200, Alex Lyakas wrote:
> Hi Dave,
> Thank you for your comments, and for pointing me at the commits.

[snip stuff we both understand]

> > I really don't see any advantage in tracking discard ranges like
> > this, because we can do these optimisations of merging and trimming
> > just before issuing the discards. And realistically, merging and
> > trimming is something the block layer should be doing for us
> > already.
> >
> Yes, I realize that XFS mindset is to do pure filesystem work, i.e.,
> arrange blocks of data in files and map them to disk. The rest should
> be handled by application above and by the storage system below. In
> your awesome AU2012 talk, you also confirm that mindset. Trouble is
> that the block layer cannot really merge small discard requests,
> without the information that you have in AGF btrees.

Ok, I see what you are getting at here - mounting -o discard does
not alleviate the need for occasionally running fstrim to issue
discards on large free extents that may have formed from lots of
small, disjoint extents being freed. IOWs, we cannot perfectly
optimise fine grained discards without some form of help from
tracking "overlaps" with already freed, discarded space.

[snip]

> >> What do you think overall about this approach? Is there something
> >> fundamental that prevents it from working?
> >
> > I'm not convinced that re-introducing busy extent commit
> > sequence tracking and blocking to optimise discard operations is a
> > particularly good idea given the above.
> I am not sure I am suggesting to block or lock anything (perhaps I am,
> without realizing that). By and large, I suggest to have another data
> structure, an rbtree, that tracks discard ranges.

Right, I understand this. Go back to this comment you had about
allocating a range that a discard is currently being issued on:

| If during xfs_alloc_fixup_trees() we discover that extent is already
| being discarded, we need to wait. Assuming we have asynchronous
| discard, this wait will be short - we only need the block device to
| queue the discard request, and then we are good to allocate from
| that area again

That will be blocking with the AGF held, regardless of whether we
have asynchronous discard or not. Essentially, background discard
can be considered "asynchronous" when viewed from the context of
allocation.

I'd forgotten that we effectively do that blocking right now in
xfs_extent_busy_update_extent(), when we trip over an extent being
discarded, so this shouldn't be a blocker for a different discard
tracking implementation. :)

> This rbtree is
> loosely connected to the busy extent rbtree. And I suggest three
> things to do with this new rbtree:

Yes, but let's improve that "loose connection" by making them
almost not connected at all.

> - Whenever a busy extent is added, maybe add a discard range to the
> second rbtree, and attach it to the busy extent (if we got a nice
> discard range)
> - For each new allocation, check if something needs to be
> removed/changed in this rbtree. Yes, I realize this is additional
> work.

It's not a huge amount of extra work compared to the rest of the
allocation path, so I don't see this as a major issue.

> - When a busy extent commits, by all means we "unbusy" the extent as
> usual. But we also check in the second rbtree, whether we can issue a
> discard for some discard range. Perhaps we can. Or we cannot because
> of other busy extents, that have not committed yet (the numbering is
> used to determine that). In that case, we will discard later, when all
> the needed busy extents commit. Unless new allocation removed/changed
> this discard range already. But we are not delaying the "unbusying" of
> the busy extent, and we are not keeping the AGF locked (I think).
> Also, we are issuing discards in the same place and context where XFS
> does it today.

This is where I think the issues lie. We don't want to have to do
anything when a busy extent is removed at transaction commit -
that's the reason online discard sucks right now. And we want to
avoid having to care about transactions and ordering when it comes
to tracking discard ranges and issuing them.

The way I see it is that if we have a worker thread that
periodically walks the discard tree to issue discards, we simply
need to do a busy extent tree lookup on the range of each discard
being tracked. If there are busy extents that span the discard
range, then the free space isn't yet stable and so we can't issue
the discard on that range. If there are no busy extents over the
discard range then the free space is stable and we can issue the
discard.

i.e. if we completely dissociate the discard and busy extent
tracking and just replace it with a busy extent lookup at discard
time then we don't need any sort of reference counting or log
sequence tracking on busy extents or discard ranges.
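
To make that concrete, the check at discard time could be as small as
the sketch below (completely untested; struct xfs_discard_range and its
fields are names I'm making up, while xfs_extent_busy_search() is the
overlap lookup we already have):

struct xfs_discard_range {
        struct rb_node          dr_node;        /* per-AG discard tree */
        xfs_agblock_t           dr_bno;
        xfs_extlen_t            dr_len;
        unsigned int            dr_flags;       /* e.g. "discard in flight" */
};

/*
 * A tracked range is only safe to discard once no busy extent overlaps
 * it, i.e. every transaction that freed space inside the range has
 * committed. xfs_extent_busy_search() returns non-zero if any busy
 * extent overlaps [bno, bno + len).
 */
static bool
xfs_discard_range_stable(
        struct xfs_mount        *mp,
        xfs_agnumber_t          agno,
        struct xfs_discard_range *dr)
{
        return xfs_extent_busy_search(mp, agno, dr->dr_bno, dr->dr_len) == 0;
}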

FWIW, if we do this then we can change fstrim xfs_trim_extents() to
queue up all the work to be done in the background simply by
populating the discard tree with all the free space ranges we wish
to discard.  This will significantly reduce the impact of fstrim on
filesystem runtime performance as the AGF will only be held locked
long enough to populate the discard tree.  And if we do the work
per-ag, then we are also parallelising it by allowing discards on
multiple AGs to be issued at once and hence it will be significantly
faster on devices that can queue TRIM commands (SATA 3.1, SAS and
NVMe devices).....

The fact that this track and background issue mechanism would allow
us to optimise both forms of discard the filesystem supports makes
me optimistic that we are on the right path. :)

> >> question that I have:
> >> # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then
> >> this extent is added as usual to the free-space btrees, but the
> >> caller doesn't add it as a busy extent. This means that such extent
> >> is suitable for allocation right away, without waiting for the log
> >> commit?
> >
> > It means the extent is being moved from the AGFL to the free space
> > btree. blocks on the AGFL have already gone through free space
> > accounting and busy extent tracking to get to the AGFL, and so there
> > is no need to repeat it when moving it to the free space btrees.
> Ok, I realize now that this block has already gone through the busy
> extent tracking via xfs_allocbt_free_block().

Right, and note that blocks going through that path aren't discarded
due to the XFS_EXTENT_BUSY_SKIP_DISCARD flag. This is due to the
fact they are being freed to the AGFL and as such are likely to be
reused immediately. ;)

> P.S.: Just watched your AU2014 talk. Interesting.

It was a little bit different. And fun. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2014-01-13 20:43                 ` Dave Chinner
@ 2014-01-14 13:48                   ` Alex Lyakas
  2014-01-15  1:45                     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2014-01-14 13:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,

[snip]

> Right, I understand this. Go back to this comment you had about
> allocating a range that a discard is currently being issued on:
>
> | If during xfs_alloc_fixup_trees() we discover that extent is already
> | being discarded, we need to wait. Assuming we have asynchronous
> | discard, this wait will be short - we only need the block device to
> | queue the discard request, and then we are good to allocate from
> | that area again
>
> That will be blocking with the AGF held, regardless of whether we
> have asynchronous discard or not. Essentially, background discard
> can be considered "asynchronous" when viewed from the context of
> allocation.
>
> I'd forgotten that we effectively do that blocking right now
> xfs_extent_busy_update_extent(), when we trip over an extent being
> discarded, so this shouldn't be a blocker for a different discard
> tracking implementation. :)
Exactly, you already do that anyways. And similarly, we will do:
lock()
lookup overlapping discard-range
if range->flags & XFS_EXTENT_BUSY_DISCARDED => unlock and sleep and retry...
...
and when discarding a discard-range, we mark it with this flag (under
lock) and leave it in the tree until discarded.
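
In code it would be roughly this (untested sketch: pagd_lock and
pagd_tree would be a new lock and rbtree on the xfs_perag, parallel to
pagb_lock/pagb_tree, and the lookup/trim helpers and the
XFS_DISCARD_INFLIGHT flag are names I'm making up; the retry loop just
mirrors what xfs_extent_busy_update_extent() already does):

/*
 * Called with the AGF locked, before xfs_alloc_fixup_trees(). If a
 * discard has already been submitted for this range, wait for it to
 * finish; otherwise trim or remove the tracked range so the blocks we
 * are about to allocate are no longer a discard candidate.
 */
static void
xfs_discard_reuse_range(
        struct xfs_perag        *pag,
        xfs_agblock_t           bno,
        xfs_extlen_t            len)
{
        struct xfs_discard_range *dr;

restart:
        spin_lock(&pag->pagd_lock);
        dr = xfs_discard_range_lookup(pag, bno, len);
        if (dr && (dr->dr_flags & XFS_DISCARD_INFLIGHT)) {
                /* discard already sent to the device - wait and retry */
                spin_unlock(&pag->pagd_lock);
                delay(1);
                goto restart;
        }
        if (dr)
                xfs_discard_range_trim(pag, dr, bno, len);
        spin_unlock(&pag->pagd_lock);
}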

[snip]

>> - When a busy extent commits, by all means we "unbusy" the extent as
>> usual. But we also check in the second rbtree, whether we can issue a
>> discard for some discard range. Perhaps we can. Or we cannot because
>> of other busy extents, that have not committed yet (the numbering is
>> used to determine that). In that case, we will discard later, when all
>> the needed busy extents commit. Unless new allocation removed/changed
>> this discard range already. But we are not delaying the "unbusying" of
>> the busy extent, and we are not keeping the AGF locked (I think).
>> Also, we are issuing discards in the same place and context where XFS
>> does it today.
>
> This is where I think the issues lie. We don't want to have to do
> anything when a busy extent is removed at transaction commit -
> that's the reason online discard sucks right now. And we want to
> avoid having to care about transactions and ordering when it comes
> to tracking discard ranges and issuing them.
>
> The way I see it is that if we have a worker thread that
> periodically walks the discard tree to issue discards, we simply
> need to do a busy extent tree lookup on the range of each discard
> being tracked. If there are busy extents that span the discard
> range, then the free space isn't yet stable and so we can't issue
> the discard on that range. If there are no busy extents over the
> discard range then the free space is stable and we can issue the
> discard.
>
> i.e. if we completely dissociate the discard and busy extent
> tracking and just replace it with a busy extent lookup at discard
> time then we don't need any sort of reference counting or log
> sequence tracking on busy extents or discard ranges.
Nice. I like the idea of doing the busy extent lookup instead of
numbering busy extents and tracking the order of commits.
So first of all, this idea can also be applied to what I suggest,
i.e., doing the discard at its current place. But instead of tracking
busy extent numbers, we:
- when a busy extent commits and it has a discard range attached to
it, we lookup in the busy extents tree to check for other busy extents
overlapping the discard range. Anyways the original code locks the
pagb_lock in that context, so we might as well do the search.
- if we find an overlapping busy extent, we detach the discard-range
from our busy extent and attach it to the overlapping extent. When
this overlapping busy extent commits, we will retry the search.

WRT that I have questions: in xfs_extent_busy_update_extent() we can
"unbusy" part of the extent, or even rb_erase() the busy extent from
the busy extent tree (it still remains in t_busy list and will be
tracked).
Q1: why it is ok to do so? why it is ok for "metadata" to reuse part
(or all) of the busy extent before its extent-free-intent is
committed?
Q2: assume we have two busy extents on the same discard range:
+--busy1--+ +--busy2--+
+----------discard1--------+
Assume that xfs_extent_busy_update_extent() fully unbusies busy1. Now
busy2 commits, searches for overlapping busy extent, does not find one
and discards discard1. I assume it is fine, because:
xfs_extent_busy_update_extent() is called before
xfs_alloc_fixup_trees() where I intend to check for overlapping
discard range. So if we manage to discard before
xfs_alloc_fixup_trees(), it is fine, because XFS has not yet really
allocated this space. Otherwise, xfs_alloc_fixup_trees() will knock
off discard1 and we will not discard. Works?

WRT to the worker thread: we need some good strategy when to awake it.
Like use a workqueue and a work item, that tells exactly which discard
range is now a candidate for discard and needs to be checked for
overlapping busy extents?

>
> FWIW, if we do this then we can change fstrim xfs_trim_extents() to
> queue up all the work to be done in the background simply by
> populating the discard tree with all the free space ranges we wish
> to discard.  This will significantly reduce the impact of fstrim on
> filesystem runtime performance as the AGF will only be held locked
> long enough to populate the discard tree.  And if we do the work
> per-ag, then we are also parallelising it by allowing discards on
> multiple AGs to be issued at once and hence it will be significantly
> faster on devices that can queue TRIM commands (SATA 3.1, SAS and
> NVMe devices).....
If we have a huge filesystem with a lot of ranges to discard, won't
this require an unbounded amount of memory to populate this tree?


>
> The fact that this track and background issue mechanism would allow
> us to optimise both forms of discard the filesystem supports makes
> me optimistic that we are on the right path. :)
>
>> >> question that I have:
>> >> # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then
>> >> this extent is added as usual to the free-space btrees, but the
>> >> caller doesn't add it as a busy extent. This means that such extent
>> >> is suitable for allocation right away, without waiting for the log
>> >> commit?
>> >
>> > It means the extent is being moved from the AGFL to the free space
>> > btree. blocks on the AGFL have already gone through free space
>> > accounting and busy extent tracking to get to the AGFL, and so there
>> > is no need to repeat it when moving it to the free space btrees.
>> Ok, I realize now that this block has already gone through the busy
>> extent tracking via xfs_allocbt_free_block().
>
> Right, and note that blocks going through that path aren't discarded
> due to the XFS_EXTENT_BUSY_SKIP_DISCARD flag. This is due to the
> fact they are being freed to the AGFL and as such are likely to be
> reused immediately. ;)
Yes, and WRT that: is it true to say that the following holds:
if we have busy extent with this flag, then we know appropriate range
is not in the free-space btrees. Because when we insert such busy
extent, we don't drop it into the free-space btrees. As a result, we
should never have a discard range that overlaps a busy extent with
XFS_EXTENT_BUSY_SKIP_DISCARD. Because all our discard ranges are also
free in the free-space btrees. Therefore, busy extents with this flag
do not require any special treatment; we can ignore them fully or
simply ignore the fact that they have this special flag - they will
never have a discard range attached anyways.

Thanks! You are very responsive.
Alex.


>
>> P.S.: Just watched your AU2014 talk. Interesting.
>
> It was a little bit different. And fun. ;)
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2014-01-14 13:48                   ` Alex Lyakas
@ 2014-01-15  1:45                     ` Dave Chinner
  2014-01-19  9:38                       ` Alex Lyakas
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2014-01-15  1:45 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Tue, Jan 14, 2014 at 03:48:46PM +0200, Alex Lyakas wrote:
> >> - When a busy extent commits, by all means we "unbusy" the extent as
> >> usual. But we also check in the second rbtree, whether we can issue a
> >> discard for some discard range. Perhaps we can. Or we cannot because
> >> of other busy extents, that have not committed yet (the numbering is
> >> used to determine that). In that case, we will discard later, when all
> >> the needed busy extents commit. Unless new allocation removed/changed
> >> this discard range already. But we are not delaying the "unbusying" of
> >> the busy extent, and we are not keeping the AGF locked (I think).
> >> Also, we are issuing discards in the same place and context where XFS
> >> does it today.
> >
> > This is where I think the issues lie. We don't want to have to do
> > anything when a busy extent is removed at transaction commit -
> > that's the reason online discard sucks right now. And we want to
> > avoid having to care about transactions and ordering when it comes
> > to tracking discard ranges and issuing them.
> >
> > The way I see it is that if we have a worker thread that
> > periodically walks the discard tree to issue discards, we simply
> > need to do a busy extent tree lookup on the range of each discard
> > being tracked. If there are busy extents that span the discard
> > range, then the free space isn't yet stable and so we can't issue
> > the discard on that range. If there are no busy extents over the
> > discard range then the free space is stable and we can issue the
> > discard.
> >
> > i.e. if we completely dissociate the discard and busy extent
> > tracking and just replace it with a busy extent lookup at discard
> > time then we don't need any sort of reference counting or log
> > sequence tracking on busy extents or discard ranges.
> Nice. I like the idea of doing the busy extent lookup instead of
> numbering busy extents and tracking the order of commits.
> So first of all, this idea can also be applied to what I suggest,
> i.e., doing the discard at its current place. But instead of tracking
> busy extent numbers, we:
> - when a busy extent commits and it has a discard range attached to
> it, we lookup in the busy extents tree to check for other busy extents
> overlapping the discard range. Anyways the original code locks the
> pagb_lock in that context, so we might as well do the search.
> - if we find an overlapping busy extent, we detach the discard-range
> from our busy extent and attach it to the overlapping extent. When
> this overlapping busy extent commits, we will retry the search.

I don't think that busy extents should ever have a pointer to the
discard range. It's simply not necessary if we are looking up busy
extents at discard time. Hence everything to do with discards can be
removed from the busy extent tree and moved into the discard tree...
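
i.e. the only new per-AG state would be something like this (made-up
field names), with nothing at all added to struct xfs_extent_busy:

        /* hypothetical additions to struct xfs_perag */
        spinlock_t              pagd_lock;      /* protects pagd_tree (or a mutex) */
        struct rb_root          pagd_tree;      /* rbtree of xfs_discard_range */
        struct delayed_work     pagd_work;      /* background discard worker */

The busy extent tree keeps doing exactly what it does now; the discard
side only ever does read-only lookups on it.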

> WRT that I have questions: in xfs_extent_busy_update_extent() we can
> "unbusy" part of the extent, or even rb_erase() the busy extent from
> the busy extent tree (it still remains in t_busy list and will be
> tracked).
> Q1: why it is ok to do so? why it is ok for "metadata" to reuse part
> (or all) of the busy extent before its extent-free-intent is
> committed?

The metadata changes are logged, and crash recovery allows correct
ordering of the free, realloc and modify process for metadata. Hence
it doesn't matter that we overwrite the contents of the block before
the free transaction is on disk - the correct contents will always
be present after recovery.

We can't do that for user data because we don't log user data.
Therefore if we allow user data to overwrite the block while it is
still busy, crash recovery may not result in the block having the
correct contents (i.e. the transaction that freed the block never
reaches the journal) and we hence expose some other user's data or
metadata in the file.

> Q2: assume we have two busy extents on the same discard range:
> +--busy1--+ +--busy2--+
> +----------discard1--------+
> Assume that xfs_extent_busy_update_extent() fully unbusies busy1. Now
> busy2 commits, searches for overlapping busy extent, does not find one
> and discards discard1. I assume it is fine, because:

I don't think that is correct. If we unbusy busy1 due to
reallocation, we cannot issue a discard across that range. It's in
use by the filesystem, and discarding that range will result in data
or metadata corruption.

> xfs_extent_busy_update_extent() is called before
> xfs_alloc_fixup_trees() where I intend to check for overlapping
> discard range. So if we manage to discard before
> xfs_alloc_fixup_trees(), it is fine, because XFS has not yet really
> allocated this space. Otherwise, xfs_alloc_fixup_trees() will knock
> off discard1 and we will not discard. Works?

You need to update the discard ranges at the same place that the
busy extents are updated. That is the point that the extent is freed
or allocated, and that's the point where the information about the
free space the extent was allocated from is available. Hence the
discard tree should be updated in the same spot from the same
information.

That is, on freeing via xfs_free_extent(), the
xfs_extent_busy_insert() call needs to be moved inside
xfs_free_ag_extent() to where it knows the entire range of the free
extent that spans the extent being freed (i.e. the extent after
merging). This gives you the ability to round the discard range
outwards to the discard granularity. At this point, insert the
extent being freed into the busy tree, and the discard range into
the discard tree. The busy extents don't merge on insert, the
discard ranges can merge.  Note that this will capture blocks moved
from the AGFL to the free space trees, which we don't currently
capture now for discard.

We have to insert the busy extent first, though, because ordering
matters when it comes to discard range tree walks - a busy range
needs to be added first so that the walk doesn't find a discard
range before its corresponding busy extent is added to the tree.

When we are allocating, we need to remove discard ranges at the
same places where we call xfs_extent_busy_reuse() and *after* a
successful call to xfs_alloc_fixup_trees(). i.e. the allocation is
not complete until after all the free space information has been
updated. The range for discard needs to be trimmed only if the
allocation succeeds, similarly, we only need to block on a discard
in progress if the allocation succeeds....
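
As a rough sketch of the freeing side, the tail of xfs_free_ag_extent()
would end up doing something like this (xfs_discard_range_insert() is
made up; nbno/nlen are the merged free extent and bno/len the extent
being freed, as in the existing code):

        /*
         * Busy extent first, so a discard tree walk can never find a
         * discard range whose corresponding busy extent isn't visible
         * yet. AGFL frees (isfl) were already marked busy when they
         * went onto the free list.
         */
        if (!isfl)
                xfs_extent_busy_insert(tp, agno, bno, len, 0);

        /*
         * Track the whole merged free extent, not just the extent being
         * freed, so that lots of small frees can still add up to a
         * discard that meets the device granularity. Overlapping and
         * adjacent ranges merge inside the discard tree.
         */
        xfs_discard_range_insert(mp, agno, nbno, nlen);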

> WRT to the worker thread: we need some good strategy when to awake it.
> Like use a workqueue and a work item, that tells exactly which discard
> range is now a candidate for discard and needs to be checked for
> overlapping busy extents?

Any extent in the discard range tree is a candidate for discard:
grab the first extent, check it has no busy extents in its range,
mark it as being discarded and issue the discard (I'm assuming the
tree lock is a mutex here). For the current synchronous discard
implementation, on completion we can simply remove the object from
the tree and free it. Drop the lock, relax, start again.

Once we've walked the entire tree, set up the workqueue to run again
some time in the future if there is still more work to be done. If
it's empty, just return. We'll start it again when we queue up the
first new discard range being inserted into the tree. (same way we
run periodic inode reclaim workqueues)
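
In sketch form the worker would be something like this (untested;
pagd_lock/pagd_tree/pagd_work, xfs_discard_wq and
xfs_discard_first_range() - a lookup of the first tracked range at or
beyond a given block - are made-up names, the rest are the normal
kernel/XFS interfaces; error handling and stats are omitted):

static void
xfs_discard_worker(
        struct work_struct      *work)
{
        struct xfs_perag        *pag = container_of(to_delayed_work(work),
                                                struct xfs_perag, pagd_work);
        struct xfs_mount        *mp = pag->pag_mount;
        struct xfs_discard_range *dr;
        xfs_agblock_t           cursor = 0;
        bool                    requeue = false;

        mutex_lock(&pag->pagd_lock);
        /* walk in block order; the cursor survives dropping the lock */
        while ((dr = xfs_discard_first_range(pag, cursor))) {
                cursor = dr->dr_bno + dr->dr_len;

                /* free space not yet stable? leave it for the next pass */
                if (xfs_extent_busy_search(mp, pag->pag_agno,
                                           dr->dr_bno, dr->dr_len)) {
                        requeue = true;
                        continue;
                }

                /* synchronous discard, issued with the mutex held */
                blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
                        XFS_AGB_TO_DADDR(mp, pag->pag_agno, dr->dr_bno),
                        XFS_FSB_TO_BB(mp, dr->dr_len), GFP_NOFS, 0);

                rb_erase(&dr->dr_node, &pag->pagd_tree);
                kmem_free(dr);

                /* drop the lock, relax, start again */
                mutex_unlock(&pag->pagd_lock);
                cond_resched();
                mutex_lock(&pag->pagd_lock);
        }
        if (requeue)
                queue_delayed_work(xfs_discard_wq, &pag->pagd_work,
                                   msecs_to_jiffies(5000));
        mutex_unlock(&pag->pagd_lock);
}

With an asynchronous discard implementation we'd instead mark the range
as being discarded and drop the lock across the I/O.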

> > FWIW, if we do this then we can change fstrim xfs_trim_extents() to
> > queue up all the work to be done in the background simply by
> > populating the discard tree with all the free space ranges we wish
> > to discard.  This will significantly reduce the impact of fstrim on
> > filesystem runtime performance as the AGF will only be held locked
> > long enough to populate the discard tree.  And if we do the work
> > per-ag, then we are also parallelising it by allowing discards on
> > multiple AGs to be issued at once and hence it will be significantly
> > faster on devices that can queue TRIM commands (SATA 3.1, SAS and
> > NVMe devices).....
> If we have a huge filesystem with a lot of ranges to discard, won't
> this require an unbounded amount of memory to populate this tree?

In theory, but we're talking about a fairly frequent discard issue
here (e.g. every 5s) so the buildup is effectively bounded by time.
If it's really a problem, we can bound it by count, too.

> > The fact that this track and background issue mechanism would allow
> > us to optimise both forms of discard the filesystem supports makes
> > me optimistic that we are on the right path. :)
> >
> >> >> question that I have:
> >> >> # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then
> >> >> this extent is added as usual to the free-space btrees, but the
> >> >> caller doesn't add it as a busy extent. This means that such extent
> >> >> is suitable for allocation right away, without waiting for the log
> >> >> commit?
> >> >
> >> > It means the extent is being moved from the AGFL to the free space
> >> > btree. blocks on the AGFL have already gone through free space
> >> > accounting and busy extent tracking to get to the AGFL, and so there
> >> > is no need to repeat it when moving it to the free space btrees.
> >> Ok, I realize now that this block has already gone through the busy
> >> extent tracking via xfs_allocbt_free_block().
> >
> > Right, and note that blocks going through that path aren't discarded
> > due to the XFS_EXTENT_BUSY_SKIP_DISCARD flag. This is due to the
> > fact they are being freed to the AGFL and as such are likely to be
> > reused immediately. ;)
> Yes, and WRT that: is it true to say that the following holds:
> if we have busy extent with this flag, then we know appropriate range
> is not in the free-space btrees.

Not necessarily true, because while the extent was put on the free
list at the time it was marked busy (hence the skip discard), it
doesn't mean that it hasn't been migrated back to the free space
btree since then.

> Because when we insert such busy
> extent, we don't drop it into the free-space btrees. As a result, we
> should never have a discard range that overlaps a busy extent with
> XFS_EXTENT_BUSY_SKIP_DISCARD.

Right, but a later call to xfs_alloc_fixup_trees() can move it to
the free space tree if the free list was longer than needed for the
current transaction. Hence my comment above about us missing
discards in that case. ;)

> Because all our discard ranges are also
> free in the free-space btrees. Therefore, busy extents with this flag
> do not require any special treatment; we can ignore them fully or
> simply ignore the fact that they have this special flag - they will
> never have a discard range attached anyways.

Pretty much -  if we move discard range updates directly into the
btree manipulation functions, then we can remove all knowledge of
discards from the busy extent tree as discard ranges consider the
free list to be "allocated space"....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2014-01-15  1:45                     ` Dave Chinner
@ 2014-01-19  9:38                       ` Alex Lyakas
  2014-01-19 23:17                         ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2014-01-19  9:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,
I believe I understand your intent of having the discard tree almost
fully disconnected from the busy extents tree. I planned to use the
same pagb_lock for the discard tree as well, which solves the
atomicity and ordering issues you have described. If we use a
different lock (mutex), then some additional care is needed.
One thing I like less about such approach, is that we need to be
careful of not growing the discard tree too large. If, for some
reason, the worker thread is not able to keep up discarding extents
while they are queued for discard, we may end up with too many discard
ranges in the tree. Also at the same time user can trigger an offline
discard, which will add more discard ranges to the tree. While the
online discard that you have, kind of throttles itself. So if the
underlying block device is slow on discarding, the whole system will
be slowed down accordingly.

[snip]

I have one additional question regarding your comment on metadata

>> Q1: why it is ok to do so? why it is ok for "metadata" to reuse part
>> (or all) of the busy extent before its extent-free-intent is
>> committed?
>
> The metadata changes are logged, and crash recovery allows correct
> ordering of the free, realloc and modify process for metadata. Hence
> it doesn't matter that we overwrite the contents of the block before
> the free transaction is on disk - the correct contents will always
> be present after recovery.
>
> We can't do that for user data because we don't log user data.
> Therefore if we allow user data to overwrite the block while it is
> still busy, crash recovery may not result in the block having the
> correct contents (i.e. the transaction that freed the block never
> reaches the journal) and we hence expose some other user's data or
> metadata in the file.
If that is the case, why cannot we just issue an async discard before
the busy extent is committed? I understand that if we crashed, we
might have knocked off some of the user data (or changed it to some
new data). But can XFS get corrupted (unmountable) this way? You said
earlier that it can, but now you are saying that reusing a metadata
block, before its busy extent commits, is fine.

Thanks,
Alex.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
  2014-01-19  9:38                       ` Alex Lyakas
@ 2014-01-19 23:17                         ` Dave Chinner
  2014-07-01 15:06                           ` xfs_growfs_data_private memory leak Alex Lyakas
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2014-01-19 23:17 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Sun, Jan 19, 2014 at 11:38:22AM +0200, Alex Lyakas wrote:
> Hi Dave,
> I believe I understand your intent of having the discard tree almost
> fully disconnected from the busy extents tree. I planned to use the
> same pagb_lock for the discard tree as well, which solves the
> atomicity and ordering issues you have described. If we use a
> different lock (mutex), then some additional care is needed.

You can probably just change the pagb_lock to a mutex. In the
uncontended case they are almost as fast as spinlocks, and I don't
think there's much contention on this lock right now...

> One thing I like less about such approach, is that we need to be
> careful of not growing the discard tree too large. If, for some
> reason, the worker thread is not able to keep up discarding extents
> while they are queued for discard, we may end up with too many discard
> ranges in the tree.

If that happens, then we'll burn CPU managing the discard tree on
every allocation and waiting for discards to complete. It should
self throttle.

> Also at the same time user can trigger an offline
> discard, which will add more discard ranges to the tree. While the
> online discard that you have, kind of throttles itself. So if the
> underlying block device is slow on discarding, the whole system will
> be slowed down accordingly.

Discard already slows the whole system down badly. Moving it to the
background won't prevent that - but it will reduce the impact, because
discards issued in the transaction commit completion path completely
stall the entire filesystem for the duration of the discards.

This is the primary reason why mounting with -o discard is so slow -
discards are done in a path of global serialisation. Moving it to
the background avoids this problem, so regardless of how much
discard work we queue up, the filesystem is still going to be faster
than the current code because we've removed the global serialisation
point.

In the case of fstrim, this makes fstrim run very quickly because it
doesn't issue synchronous discards with the AGF locked.  We still do
the same amount of discard work in the same amount of time in the
background, but we don't block AGs or userspace while we wait for
the discards to be done. That's also a pretty major win.

IOWs, I think the first step is to push all the discard into the
background via its own rbtree as you've been doing. If we add
tracepoints and stats to the code, we can track the discard queue
depth easily enough with a bit of perf or trace-cmd output
scripting, and then determine whether throttling is necessary.

We can use the loop device to run tests as it implements discard via
hole punching the backing file, and that will tell us pretty
quickly if we need additional throttling by looking at the queue
depths and the amount of CPU being spent manipulating the discard
rbtree (i.e. via perf profiling). Throttling a workqueue isn't hard
to do, but there's no point in doing it if it isn't necessary...

> [snip]
> 
> I have one additional question regarding your comment on metadata
> 
> >> Q1: why it is ok to do so? why it is ok for "metadata" to reuse part
> >> (or all) of the busy extent before its extent-free-intent is
> >> committed?
> >
> > The metadata changes are logged, and crash recovery allows correct
> > ordering of the free, realloc and modify process for metadata. Hence
> > it doesn't matter that we overwrite the contents of the block before
> > the free transaction is on disk - the correct contents will always
> > be present after recovery.
> >
> > We can't do that for user data because we don't log user data.
> > Therefore if we allow user data to overwrite the block while it is
> > still busy, crash recovery may not result in the block having the
> > correct contents (i.e. the transaction that freed the block never
> > reaches the journal) and we hence expose some other user's data or
> > metadata in the file.
> If that is the case, why cannot we just issue an async discard before
> the busy extent is committed? I understand that if we crashed, we
> might have knocked off some of the user data (or changed it to some
> new data).

Because a user data extent is not free for us to make arbitrary
changes to it until the transaction that frees it commits. i.e.  If
we crash then the file must either contain the original data or be
completely gone. Go google "null files after crash" and you'll see
just how important people consider files being intact after a
crash....

> But can XFS get corrupted (unmountable) this way? You said
> earlier that it can, but now you are saying that reusing a metadata
> block, before its busy extent commits, is fine.

No, not exactly. There are 4 different cases to take into account:

1. metadata -> free -> alloc as metadata is the situation where
there are no problems reusing the busy extent because all the
modifications are in the log. Hence crash recovery always ends up
with the correct result no matter where the crash occurs.

2. user data -> free -> alloc as metadata is safe because the
metadata can't be written in place until the free transaction and
the metadata changes are fully committed in the journal. Hence on a
crash the user will either see intact data in their file, or the
file will have had the extent removed successfully and the
filesystem has safely reused it for metadata.

3. metadata -> free -> alloc as user data and
4. user data -> free -> alloc as user data are the problematic cases
when it comes to IO ordering - if we allow reallocation of busy
extents, then the new userdata can be written to disk before the
free/alloc transactions are committed to the journal. If we crash
at this point, then after recovery the new data will be pointed to
by the old user.

If the old user is user data, then we've corrupted
their file by exposing some other user's data to them.

If the old user is metadata, then we've corrupted the filesystem
because the metadata has been overwritten by user data before the
journal recovery has read and freed the metadata block. That will
cause recovery (and hence mount) to fail.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* xfs_growfs_data_private memory leak
  2014-01-19 23:17                         ` Dave Chinner
@ 2014-07-01 15:06                           ` Alex Lyakas
  2014-07-01 21:56                             ` Dave Chinner
  2014-08-04 11:00                             ` use-after-free on log replay failure Alex Lyakas
  0 siblings, 2 replies; 47+ messages in thread
From: Alex Lyakas @ 2014-07-01 15:06 UTC (permalink / raw)
  To: xfs

Greetings,

It appears that if xfs_growfs_data_private fails during the "new AG headers" 
loop, it does not free all the per-AG structures for the new AGs. When XFS 
is unmounted later, they are not freed as well, because 
xfs_growfs_data_private did not update the "sb_agcount" field, so 
xfs_free_perag will not free them. This happens on 3.8.13, but looking at 
the latest master branch, it seems to have the same issue.

Code like [1] in xfs_growfs_data, seems to fix the issue.

A follow-up question: if xfs_growfs_data_private fails during the loop that 
updates all the secondary superblocks, what is the consequence? (I am aware 
that in the latest master branch, the loop is not broken on first error, but 
attempts to initialize whatever possible). When these secondary superblocks 
will get updated? Is there a way to force-update them? Otherwise, what can 
be the consequence of leaving them not updated?

Thanks,
Alex.

[1]
    /*
     * If we had an error, we might have allocated
     * PAGs, which are >=sb_agcount. We need to free
     * those, because they will not get freed in
     * xfs_free_perag().
     */
    if (error) {
        unsigned int n_pags = 0;
        xfs_perag_t* pags[16] = {0};
        xfs_agnumber_t start_agno = mp->m_sb.sb_agcount;

        do {
            unsigned int pag_idx = 0;

            spin_lock(&mp->m_perag_lock);
            n_pags = radix_tree_gang_lookup(&mp->m_perag_tree, (void**)pags,
                                            start_agno, ARRAY_SIZE(pags));
            for (pag_idx = 0; pag_idx < n_pags; ++pag_idx) {
                xfs_perag_t *deleted = NULL;

                /* for next lookup */
                start_agno = pags[pag_idx]->pag_agno + 1;

                /* nobody should really be touching these AGs...*/
                if (WARN_ON(atomic_read(&pags[pag_idx]->pag_ref) > 0)) {
                    pags[pag_idx] = NULL;
                    continue;
                }

                deleted = radix_tree_delete(&mp->m_perag_tree,
                                            pags[pag_idx]->pag_agno);
                ASSERT(deleted == pags[pag_idx]);
            }
            spin_unlock(&mp->m_perag_lock);

            /* now delete all those still marked for deletion */
            for (pag_idx = 0; pag_idx < n_pags; ++pag_idx) {
                if (pags[pag_idx])
                    call_rcu(&pags[pag_idx]->rcu_head,
                             xfs_free_perag_rcu_cb);
            }
        } while (n_pags > 0);
    }

xfs_free_perag_rcu_cb is similar to __xfs_free_perag, but can be called from 
other files.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: xfs_growfs_data_private memory leak
  2014-07-01 15:06                           ` xfs_growfs_data_private memory leak Alex Lyakas
@ 2014-07-01 21:56                             ` Dave Chinner
  2014-07-02 12:27                               ` Alex Lyakas
  2014-08-04 11:00                             ` use-after-free on log replay failure Alex Lyakas
  1 sibling, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2014-07-01 21:56 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Tue, Jul 01, 2014 at 06:06:38PM +0300, Alex Lyakas wrote:
> Greetings,
> 
> It appears that if xfs_growfs_data_private fails during the "new AG
> headers" loop, it does not free all the per-AG structures for the
> new AGs. When XFS is unmounted later, they are not freed as well,
> because xfs_growfs_data_private did not update the "sb_agcount"
> field, so xfs_free_perag will not free them. This happens on 3.8.13,
> but looking at the latest master branch, it seems to have the same
> issue.
> 
> Code like [1] in xfs_growfs_data, seems to fix the issue.

Why not just do this in the appropriate error stack, like is
done inside xfs_initialize_perag() on error?

        for (i = oagcount; i < nagcount; i++) {
                pag = radix_tree_delete(&mp->m_perag_tree, i);
                kmem_free(pag);
        }

(though it might need RCU freeing)

When you have a fix, can you send a proper patch with a sign-off on
it?

> A follow-up question: if xfs_growfs_data_private fails during the
> loop that updates all the secondary superblocks, what is the
> consequence? (I am aware that in the latest master branch, the loop
> is not broken on first error, but attempts to initialize whatever
> possible). When these secondary superblocks will get updated? Is
> there a way to force-update them? Otherwise, what can be the
> consequence of leaving them not updated?

The consequence is documented in mainline tree - if we don't update
them all, then repair will do the wrong thing.  Repair requires a
majority of identical secondaries to determine if the primary is
correct or out of date. The old behaviour of not updating after the
first error meant that the majority were old superblocks and so at
some time in the future repair could decide your filesystem is
smaller than it really is and hence truncate away the grown section
of the filesystem. i.e. trigger catastrophic, unrecoverable data
loss.

Hence it's far better to write every secondary we can than to leave
a majority in a bad state....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: xfs_growfs_data_private memory leak
  2014-07-01 21:56                             ` Dave Chinner
@ 2014-07-02 12:27                               ` Alex Lyakas
  2014-08-04 18:15                                 ` Eric Sandeen
  0 siblings, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2014-07-02 12:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,
Thank you for your comments.

I realize that secondary superblocks are needed mostly for repairing a 
broken filesystem.  However, I don't see that they get updated regularly, 
i.e., during normal operation they don't seem to get updated at all. I put a 
print in xfs_sb_write_verify, and it gets called only with: 
bp->b_bn==XFS_SB_DADDR.

So do I understand correctly (also from comments in 
xfs_growfs_data_private), that it is safe to operate a filesystem while 
having broken secondary superblocks? For me, it appears to mount properly, 
and all the data seems to be there, but xfs_check complains like:
bad sb magic # 0xc2a4baf2 in ag 6144
bad sb version # 0x4b5d in ag 6144
blocks 6144/65536..2192631388 out of range
blocks 6144/65536..2192631388 claimed by block 6144/0
bad sb magic # 0xb20f3079 in ag 6145
bad sb version # 0x6505 in ag 6145
blocks 6145/65536..3530010017 out of range
blocks 6145/65536..3530010017 claimed by block 6145/0
...

Also, if secondary superblocks do not get updated regularly, and there is no 
way to ask an operational XFS to update them, then during repair we may not 
find a good secondary superblock.

As for the patch, I cannot post a patch against the upstream kernel, because 
I am running an older kernel. Unfortunately, I cannot qualify an upstream 
patch properly in a reasonable time. Is there a value in posting a patch 
against 3.8.13? Otherwise, it's fine by me if somebody else posts it and 
takes the credit.

Thanks,
Alex.



-----Original Message----- 
From: Dave Chinner
Sent: 02 July, 2014 12:56 AM
To: Alex Lyakas
Cc: xfs@oss.sgi.com
Subject: Re: xfs_growfs_data_private memory leak

On Tue, Jul 01, 2014 at 06:06:38PM +0300, Alex Lyakas wrote:
> Greetings,
>
> It appears that if xfs_growfs_data_private fails during the "new AG
> headers" loop, it does not free all the per-AG structures for the
> new AGs. When XFS is unmounted later, they are not freed as well,
> because xfs_growfs_data_private did not update the "sb_agcount"
> field, so xfs_free_perag will not free them. This happens on 3.8.13,
> but looking at the latest master branch, it seems to have the same
> issue.
>
> Code like [1] in xfs_growfs_data, seems to fix the issue.

Why not just do this in the appropriate error stack, like is
done inside xfs_initialize_perag() on error?

        for (i = oagcount; i < nagcount; i++) {
                pag = radix_tree_delete(&mp->m_perag_tree, i);
                kmem_free(pag);
        }

(though it might need RCU freeing)

When you have a fix, can you send a proper patch with a sign-off on
it?

> A follow-up question: if xfs_growfs_data_private fails during the
> loop that updates all the secondary superblocks, what is the
> consequence? (I am aware that in the latest master branch, the loop
> is not broken on first error, but attempts to initialize whatever
> possible). When these secondary superblocks will get updated? Is
> there a way to force-update them? Otherwise, what can be the
> consequence of leaving them not updated?

The consequence is documented in mainline tree - if we don't update
them all, then repair will do the wrong thing.  Repair requires a
majority of identical secondaries to determine if the primary is
correct or out of date. The old behaviour of not updating after the
first error meant that the majority were old superblocks and so at
some time in the future repair could decide your filesystem is
smaller than it really is and hence truncate away the grown section
of the filesystem. i.e. trigger catastrophic, unrecoverable data
loss.

Hence it's far better to write every secondary we can than to leave
a majority in a bad state....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* use-after-free on log replay failure
  2014-07-01 15:06                           ` xfs_growfs_data_private memory leak Alex Lyakas
  2014-07-01 21:56                             ` Dave Chinner
@ 2014-08-04 11:00                             ` Alex Lyakas
  2014-08-04 14:12                               ` Brian Foster
  2014-08-04 23:07                               ` Dave Chinner
  1 sibling, 2 replies; 47+ messages in thread
From: Alex Lyakas @ 2014-08-04 11:00 UTC (permalink / raw)
  To: xfs

Greetings,

we had a log replay failure due to some errors that the underlying block 
device returned:
[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 
("xlog_recover_iodone") error 28 numblks 16
[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
[49133.802644] XFS (dm-95): log mount failed

and then kernel panicked [1].

Looking at the code, when xfs_mountfs() fails, xfs_fs_fill_super() goes and 
cleans up and eventually frees the "xfs_mount" structure. But then 
xfs_buf_iodone_work() can still be delivered through "xfslogd_workqueue", 
which is static and not per-XFS. But this callback has a pointer to 
"xfs_mount", and may try to access it as in [1]. Does this analysis sound 
correct? Kernel is 3.8.13, but looking at the XFS master branch, it might 
have the same issue.

Should we flush this static workqueue before unmounting?
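
i.e. something along these lines in the xfs_fs_fill_super() failure
path, just to illustrate the idea (assuming the workqueue were made
visible outside xfs_buf.c) - I have not verified that this actually
closes the race:

        /*
         * Drain any pending b_iodone work that may still reference the
         * xfs_mount before we tear it down after a failed mount or
         * failed log recovery.
         */
        flush_workqueue(xfslogd_workqueue);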

Thanks,
Alex.


[1]
[49133.804546] general protection fault: 0000 [#1] SMP
[49133.808033]  xcbc rmd160 crypto_null af_key xfrm_algo scsi_dh cirrus 
psmouse ttm drm_kms_helper serio_raw drm i2c_piix4 sysimgblt virtio_balloon 
sysfillrect syscopyarea nfsd(OF) kvm nfs_acl auth_rpcgss nfs fscache 
microcode lockd mac_hid sunrpc lp parport floppy ixgbevf(OF)
[49133.808033] CPU 2
[49133.808033] Pid: 2907, comm: kworker/2:1H Tainted: GF       W  O 
3.8.13-030813-generic #201305111843 Bochs Bochs
[49133.808033] RIP: 0010:[<ffffffff813582fb>]  [<ffffffff813582fb>] 
strnlen+0xb/0x30
[49133.808033] RSP: 0018:ffff8801e31c5b08  EFLAGS: 00010086
[49133.808033] RAX: 0000000000000000 RBX: ffffffff81e4e527 RCX: 
0000000000000000
[49133.808033] RDX: 640000450008cf9d RSI: ffffffffffffffff RDI: 
640000450008cf9d
[49133.808033] RBP: ffff8801e31c5b08 R08: 000000000000ffff R09: 
000000000000ffff
[49133.808033] R10: 0000000000000000 R11: 0000000000000ffe R12: 
640000450008cf9d
[49133.808033] R13: ffffffff81e4e900 R14: 0000000000000000 R15: 
000000000000ffff
[49133.808033] FS:  0000000000000000(0000) GS:ffff88021fd00000(0000) 
knlGS:0000000000000000
[49133.808033] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[49133.808033] CR2: 00007fa4a91abd80 CR3: 000000020e783000 CR4: 
00000000000006e0
[49133.808033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[49133.808033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[49133.808033] Process kworker/2:1H (pid: 2907, threadinfo ffff8801e31c4000, 
task ffff8802124f45c0)
[49133.808033] Stack:
[49133.808033]  ffff8801e31c5b48 ffffffff81359d8e ffff8801e31c5b28 
ffffffff81e4e527
[49133.808033]  ffffffffa0636a1e ffff8801e31c5c80 ffffffffa0636a1e 
ffffffff81e4e900
[49133.808033]  ffff8801e31c5bc8 ffffffff8135af89 ffff8801e31c5bc8 
ffffffff8105a4e7
[49133.808033] Call Trace:
[49133.808033]  [<ffffffff81359d8e>] string.isra.4+0x3e/0xd0
[49133.808033]  [<ffffffff8135af89>] vsnprintf+0x219/0x640
[49133.808033]  [<ffffffff8105a4e7>] ? msg_print_text+0xb7/0x1b0
[49133.808033]  [<ffffffff8135b471>] vscnprintf+0x11/0x30
[49133.808033]  [<ffffffff8105b3b1>] vprintk_emit+0xc1/0x490
[49133.808033]  [<ffffffff8105b460>] ? vprintk_emit+0x170/0x490
[49133.808033]  [<ffffffff816d5848>] printk+0x61/0x63
[49133.808033]  [<ffffffffa05ba261>] __xfs_printk+0x31/0x50 [xfs]
[49133.808033]  [<ffffffffa05ba4b3>] xfs_notice+0x53/0x60 [xfs]
[49133.808033]  [<ffffffffa05b14a5>] xfs_do_force_shutdown+0xf5/0x180 [xfs]
[49133.808033]  [<ffffffffa05f6c38>] ? xlog_recover_iodone+0x48/0x70 [xfs]
[49133.808033]  [<ffffffffa05f6c38>] xlog_recover_iodone+0x48/0x70 [xfs]
[49133.808033]  [<ffffffffa05a749d>] xfs_buf_iodone_work+0x4d/0xa0 [xfs]
[49133.808033]  [<ffffffff81078b81>] process_one_work+0x141/0x490
[49133.808033]  [<ffffffff81079b48>] worker_thread+0x168/0x400
[49133.808033]  [<ffffffff810799e0>] ? manage_workers+0x120/0x120
[49133.808033]  [<ffffffff8107f050>] kthread+0xc0/0xd0
[49133.808033]  [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
[49133.808033]  [<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0
[49133.808033]  [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-04 11:00                             ` use-after-free on log replay failure Alex Lyakas
@ 2014-08-04 14:12                               ` Brian Foster
  2014-08-04 23:07                               ` Dave Chinner
  1 sibling, 0 replies; 47+ messages in thread
From: Brian Foster @ 2014-08-04 14:12 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
> Greetings,
> 
> we had a log replay failure due to some errors that the underlying block
> device returned:
> [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
> ("xlog_recover_iodone") error 28 numblks 16
> [49133.802495] XFS (dm-95): log mount/recovery failed: error 28
> [49133.802644] XFS (dm-95): log mount failed
> 
> and then kernel panicked [1].
> 
> Looking at the code, when xfs_mountfs() fails, xfs_fs_fill_super() goes and
> cleans up and eventually frees the "xfs_mount" structure. But then
> xfs_buf_iodone_work() can still be delivered through "xfslogd_workqueue",
> which is static and not per-XFS. But this callback has a pointer to
> "xfs_mount", and may try to access it as in [1]. Does this analysis sound
> correct? Kernel is 3.8.13, but looking at the XFS master branch, it might
> have the same issue.
> 

Seems possible... we call xfs_buf_delwri_submit() via log recovery which
does an xfs_buf_iowait() on each buffer for synchronous I/O.
xfs_buf_iowait() doesn't wait on b_iowait if b_error is set, however. In
the callback side, xfs_buf_ioerror() is called before the
_xfs_buf_ioend() sequence that calls into the workqueue. Perhaps the
error can be detected by the iowait before the xfslogd_workqueue job can
run..?

> Should we flush this static workqueue before unmounting?
> 

It's not totally clear that fixes the problem since we set b_error on
the buffer before we even schedule on the workqueue. Perhaps it depends
on the nature of the race. Another option could be to not use the wq on
error, but that may or may not be ideal for other contexts besides log
recovery.

It looks like the bio code always returns error via the callback and the
xlog_recover_iodone() handler does call back into ioend, presumably to
do the io completion. It might not be appropriate to remove the b_error
bypass in xfs_buf_iowait() as we have ioerror() callers on the buffer
that can occur before I/O submission, but I wonder if it could be made
conditional for this particular case. That would be an interesting
experiment at least to see if it fixes this problem.
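
Roughly something like this, based on the 3.8-era xfs_buf_iowait();
XBF_RECOVERY_WAIT is a made-up flag that only the log recovery buffer
submission path would set:

int
xfs_buf_iowait(
        xfs_buf_t               *bp)
{
        trace_xfs_buf_iowait(bp, _RET_IP_);

        /*
         * Log recovery needs to wait for the completion even when
         * b_error is already set, otherwise the iodone workqueue
         * callback can run after the mount has been torn down. Other
         * callers may set b_error before the I/O is ever submitted,
         * so keep the early return for them.
         */
        if (!bp->b_error || (bp->b_flags & XBF_RECOVERY_WAIT))
                wait_for_completion(&bp->b_iowait);

        trace_xfs_buf_iowait_done(bp, _RET_IP_);
        return bp->b_error;
}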

Brian

> Thanks,
> Alex.
> 
> 
> [1]
> [49133.804546] general protection fault: 0000 [#1] SMP
> [49133.808033]  xcbc rmd160 crypto_null af_key xfrm_algo scsi_dh cirrus
> psmouse ttm drm_kms_helper serio_raw drm i2c_piix4 sysimgblt virtio_balloon
> sysfillrect syscopyarea nfsd(OF) kvm nfs_acl auth_rpcgss nfs fscache
> microcode lockd mac_hid sunrpc lp parport floppy ixgbevf(OF)
> [49133.808033] CPU 2
> [49133.808033] Pid: 2907, comm: kworker/2:1H Tainted: GF       W  O
> 3.8.13-030813-generic #201305111843 Bochs Bochs
> [49133.808033] RIP: 0010:[<ffffffff813582fb>]  [<ffffffff813582fb>]
> strnlen+0xb/0x30
> [49133.808033] RSP: 0018:ffff8801e31c5b08  EFLAGS: 00010086
> [49133.808033] RAX: 0000000000000000 RBX: ffffffff81e4e527 RCX:
> 0000000000000000
> [49133.808033] RDX: 640000450008cf9d RSI: ffffffffffffffff RDI:
> 640000450008cf9d
> [49133.808033] RBP: ffff8801e31c5b08 R08: 000000000000ffff R09:
> 000000000000ffff
> [49133.808033] R10: 0000000000000000 R11: 0000000000000ffe R12:
> 640000450008cf9d
> [49133.808033] R13: ffffffff81e4e900 R14: 0000000000000000 R15:
> 000000000000ffff
> [49133.808033] FS:  0000000000000000(0000) GS:ffff88021fd00000(0000)
> knlGS:0000000000000000
> [49133.808033] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [49133.808033] CR2: 00007fa4a91abd80 CR3: 000000020e783000 CR4:
> 00000000000006e0
> [49133.808033] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [49133.808033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [49133.808033] Process kworker/2:1H (pid: 2907, threadinfo ffff8801e31c4000,
> task ffff8802124f45c0)
> [49133.808033] Stack:
> [49133.808033]  ffff8801e31c5b48 ffffffff81359d8e ffff8801e31c5b28
> ffffffff81e4e527
> [49133.808033]  ffffffffa0636a1e ffff8801e31c5c80 ffffffffa0636a1e
> ffffffff81e4e900
> [49133.808033]  ffff8801e31c5bc8 ffffffff8135af89 ffff8801e31c5bc8
> ffffffff8105a4e7
> [49133.808033] Call Trace:
> [49133.808033]  [<ffffffff81359d8e>] string.isra.4+0x3e/0xd0
> [49133.808033]  [<ffffffff8135af89>] vsnprintf+0x219/0x640
> [49133.808033]  [<ffffffff8105a4e7>] ? msg_print_text+0xb7/0x1b0
> [49133.808033]  [<ffffffff8135b471>] vscnprintf+0x11/0x30
> [49133.808033]  [<ffffffff8105b3b1>] vprintk_emit+0xc1/0x490
> [49133.808033]  [<ffffffff8105b460>] ? vprintk_emit+0x170/0x490
> [49133.808033]  [<ffffffff816d5848>] printk+0x61/0x63
> [49133.808033]  [<ffffffffa05ba261>] __xfs_printk+0x31/0x50 [xfs]
> [49133.808033]  [<ffffffffa05ba4b3>] xfs_notice+0x53/0x60 [xfs]
> [49133.808033]  [<ffffffffa05b14a5>] xfs_do_force_shutdown+0xf5/0x180 [xfs]
> [49133.808033]  [<ffffffffa05f6c38>] ? xlog_recover_iodone+0x48/0x70 [xfs]
> [49133.808033]  [<ffffffffa05f6c38>] xlog_recover_iodone+0x48/0x70 [xfs]
> [49133.808033]  [<ffffffffa05a749d>] xfs_buf_iodone_work+0x4d/0xa0 [xfs]
> [49133.808033]  [<ffffffff81078b81>] process_one_work+0x141/0x490
> [49133.808033]  [<ffffffff81079b48>] worker_thread+0x168/0x400
> [49133.808033]  [<ffffffff810799e0>] ? manage_workers+0x120/0x120
> [49133.808033]  [<ffffffff8107f050>] kthread+0xc0/0xd0
> [49133.808033]  [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
> [49133.808033]  [<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0
> [49133.808033]  [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: xfs_growfs_data_private memory leak
  2014-07-02 12:27                               ` Alex Lyakas
@ 2014-08-04 18:15                                 ` Eric Sandeen
  2014-08-06  8:56                                   ` Alex Lyakas
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Sandeen @ 2014-08-04 18:15 UTC (permalink / raw)
  To: Alex Lyakas, Dave Chinner; +Cc: xfs

On 7/2/14, 7:27 AM, Alex Lyakas wrote:
> Hi Dave,
> Thank you for your comments.
> 
> I realize that secondary superblocks are needed mostly for repairing

s/mostly/only/

> a broken filesystem. However, I don't see that they get updated
> regularly, i.e., during normal operation they don't seem to get
> updated at all. I put a print in xfs_sb_write_verify, and it gets
> called only with: bp->b_bn==XFS_SB_DADDR.

See the comments above verify_sb(): not everything is validated, so
not everything needs to be constantly updated.  It's just the basic
fs geometry, not counters, etc.

/*
 * verify a superblock -- does not verify root inode #
 *      can only check that geometry info is internally
 *      consistent.  because of growfs, that's no guarantee
 *      of correctness (e.g. geometry may have changed)
 *
 * fields verified or consistency checked:
 *
 *                      sb_magicnum
 *
 *                      sb_versionnum
 *
 *                      sb_inprogress
 *
 *                      sb_blocksize    (as a group)
 *                      sb_blocklog
 *
 * geometry info -      sb_dblocks      (as a group)
 *                      sb_agcount
 *                      sb_agblocks
 *                      sb_agblklog
 *
 * inode info -         sb_inodesize    (x-checked with geo info)
 *                      sb_inopblock
 *
 * sector size info -
 *                      sb_sectsize
 *                      sb_sectlog
 *                      sb_logsectsize
 *                      sb_logsectlog
 *
 * not checked here -
 *                      sb_rootino
 *                      sb_fname
 *                      sb_fpack
 *                      sb_logstart
 *                      sb_uuid
 *
 *                      ALL real-time fields
 *                      final 4 summary counters
 */


> So do I understand correctly (also from comments in
> xfs_growfs_data_private), that it is safe to operate a filesystem
> while having broken secondary superblocks? For me, it appears to
> mount properly, and all the data seems to be there, but xfs_check
> complains like:

> bad sb magic # 0xc2a4baf2 in ag 6144
> bad sb version # 0x4b5d in ag 6144
> blocks 6144/65536..2192631388 out of range
> blocks 6144/65536..2192631388 claimed by block 6144/0
> bad sb magic # 0xb20f3079 in ag 6145
> bad sb version # 0x6505 in ag 6145
> blocks 6145/65536..3530010017 out of range
> blocks 6145/65536..3530010017 claimed by block 6145/0
> ...

some of that looks more serious than "just" bad backup sb's.
But the bad secondaries shouldn't cause runtime problems AFAIK.

> Also, if secondary superblocks do not get updated regularly, and
> there is no way to ask an operational XFS to update them, then during
> repair we may not find a good secondary superblock.

You seem to have 6144 (!) allocation groups; one would hope that a
majority of those supers would be "good" and the others will be
properly corrected by an xfs_repair.

> As for the patch, I cannot post a patch against the upstream kernel,
> because I am running an older kernel. Unfortunately, I cannot qualify
> an upstream patch properly in a reasonable time. Is there a value in
> posting a patch against 3.8.13? Otherwise, it's fine by me if
> somebody else posts it and takes the credit.

If the patch applies cleanly to both kernels, probably fine to
go ahead and post it, with that caveat.

-Eric

> Thanks,
> Alex.
> 
> 
> 
> -----Original Message----- From: Dave Chinner
> Sent: 02 July, 2014 12:56 AM
> To: Alex Lyakas
> Cc: xfs@oss.sgi.com
> Subject: Re: xfs_growfs_data_private memory leak
> 
> On Tue, Jul 01, 2014 at 06:06:38PM +0300, Alex Lyakas wrote:
>> Greetings,
>>
>> It appears that if xfs_growfs_data_private fails during the "new AG
>> headers" loop, it does not free all the per-AG structures for the
>> new AGs. When XFS is unmounted later, they are not freed as well,
>> because xfs_growfs_data_private did not update the "sb_agcount"
>> field, so xfs_free_perag will not free them. This happens on 3.8.13,
>> but looking at the latest master branch, it seems to have the same
>> issue.
>>
>> Code like [1] in xfs_growfs_data, seems to fix the issue.
> 
> Why not just do this in the appropriate error stack, like is
> done inside xfs_initialize_perag() on error?
> 
>        for (i = oagcount; i < nagcount; i++) {
>                pag = radix_tree_delete(&mp->m_perag_tree, index);
>                kmem_free(pag);
>        }
> 
> (though it might need RCU freeing)
> 
> When you have a fix, can you send a proper patch with a sign-off on
> it?
> 
> >> A follow-up question: if xfs_growfs_data_private fails during the
>> loop that updates all the secondary superblocks, what is the
>> consequence? (I am aware that in the latest master branch, the loop
>> is not broken on first error, but attempts to initialize whatever
>> possible). When these secondary superblocks will get updated? Is
>> there a way to force-update them? Otherwise, what can be the
>> consequence of leaving them not updated?
> 
> The consequence is documented in mainline tree - if we don't update
> them all, then repair will do the wrong thing.  Repair requires a
> majority of identical secondaries to determine if the primary is
> correct or out of date. The old behaviour of not updating after the
> first error meant that the majority were old superblocks and so at
> some time in the future repair could decide your filesystem is
> smaller than it really is and hence truncate away the grown section
> of the filesystem. i.e. trigger catastrophic, unrecoverable data
> loss.
> 
> Hence it's far better to write every secondary we can than to leave
> a majority in a bad state....
> 
> Cheers,
> 
> Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-04 11:00                             ` use-after-free on log replay failure Alex Lyakas
  2014-08-04 14:12                               ` Brian Foster
@ 2014-08-04 23:07                               ` Dave Chinner
  2014-08-06 10:05                                 ` Alex Lyakas
  2014-08-06 12:52                                 ` Alex Lyakas
  1 sibling, 2 replies; 47+ messages in thread
From: Dave Chinner @ 2014-08-04 23:07 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
> Greetings,
> 
> we had a log replay failure due to some errors that the underlying
> block device returned:
> [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
> ("xlog_recover_iodone") error 28 numblks 16
> [49133.802495] XFS (dm-95): log mount/recovery failed: error 28
> [49133.802644] XFS (dm-95): log mount failed

#define ENOSPC          28      /* No space left on device */

You're getting an ENOSPC as a metadata IO error during log recovery?
Thin provisioning problem, perhaps, and the error is occurring on
submission rather than completion? If so:

8d6c121 xfs: fix buffer use after free on IO error

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: xfs_growfs_data_private memory leak
  2014-08-04 18:15                                 ` Eric Sandeen
@ 2014-08-06  8:56                                   ` Alex Lyakas
  0 siblings, 0 replies; 47+ messages in thread
From: Alex Lyakas @ 2014-08-06  8:56 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

Hi Eric,
What I was trying to achieve is to allow aborting XFS resize in the middle. 
I am not sure whether a typical user needs this feature, but for my use case 
it is needed.

So XFS resize is performed in three steps:
1) Initialize the newly-added AGs (AGFL, BNO btree, CNT btree etc)
2) Commit this by updating the primary superblock
3) Update all the existing secondary superblocks, and initialize all the 
newly-added secondary superblocks

So I added a custom IOCTL, which can tell this sequence to abort gracefully 
(just checking a flag on each loop iteration).
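
To illustrate what I mean by "checking a flag on each loop iteration" -- this
is only a rough sketch with made-up names (the flag and the helper below are
hypothetical, not the real xfs_growfs_data_private code):

#include <linux/atomic.h>
#include <linux/errno.h>

/* Set by my custom IOCTL (hypothetical name); polled by the grow loop. */
static atomic_t growfs_abort_requested;

/* Sketch of step 1: initialize each new AG, bailing out if an abort was requested. */
static int init_new_ags(unsigned int oagcount, unsigned int nagcount)
{
	unsigned int agno;

	for (agno = oagcount; agno < nagcount; agno++) {
		if (atomic_read(&growfs_abort_requested))
			return -EINTR;	/* step 2 has not run, so nothing is committed */
		/* ... write the AGF, AGFL, AGI and free-space btree roots for this AG ... */
	}
	return 0;
}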

Aborting during step 1 works perfectly (apart from the memory leak that I 
discovered and fixed). Since the primary superblock has not been updated, 
XFS doesn't know about the newly-added AGs, so it doesn't care what's 
written there. Resize can be restarted, and XFS will redo this step from the 
beginning.
Aborting during step 3 can leave some of the secondary superblocks totally 
uninitialized. However, from the XFS standpoint, the resize has already been 
committed. That's where my question came from. I realize that the kernel can 
crash during step 3, leading to the same results as with my code that aborts 
this step.

Anyway, since I was not sure about these secondary superblocks, I did not 
add an exit point at this step. (Again, note that we can crash during 
this step anyway.)

Thanks,
Alex.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-04 23:07                               ` Dave Chinner
@ 2014-08-06 10:05                                 ` Alex Lyakas
  2014-08-06 12:32                                   ` Dave Chinner
  2014-08-06 12:52                                 ` Alex Lyakas
  1 sibling, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2014-08-06 10:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,

On Tue, Aug 5, 2014 at 2:07 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
>> Greetings,
>>
>> we had a log replay failure due to some errors that the underlying
>> block device returned:
>> [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
>> ("xlog_recover_iodone") error 28 numblks 16
>> [49133.802495] XFS (dm-95): log mount/recovery failed: error 28
>> [49133.802644] XFS (dm-95): log mount failed
>
> #define ENOSPC          28      /* No space left on device */
>
> You're getting an ENOSPC as a metadata IO error during log recovery?
> Thin provisioning problem, perhaps,
Yes, it is a thin provisioning problem (which I already know the cause for).

> and the error is occurring on
> submission rather than completion? If so:
>
> 8d6c121 xfs: fix buffer use after free on IO error
I am not sure what you mean by "submission rather than completion".
Do you mean that xfs_buf_ioapply_map() returns without submitting any
bios? In that case, no, bios are submitted to the block device, and it
fails them through a different context with ENOSPC error. I will still
try the patch you mentioned, because it also looks relevant to another
question I addressed to you earlier in:
http://oss.sgi.com/archives/xfs/2013-11/msg00648.html

Thanks,
Alex.


>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-06 10:05                                 ` Alex Lyakas
@ 2014-08-06 12:32                                   ` Dave Chinner
  2014-08-06 14:43                                     ` Alex Lyakas
  2014-08-10 16:26                                     ` Alex Lyakas
  0 siblings, 2 replies; 47+ messages in thread
From: Dave Chinner @ 2014-08-06 12:32 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Wed, Aug 06, 2014 at 01:05:34PM +0300, Alex Lyakas wrote:
> Hi Dave,
> 
> On Tue, Aug 5, 2014 at 2:07 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
> >> Greetings,
> >>
> >> we had a log replay failure due to some errors that the underlying
> >> block device returned:
> >> [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
> >> ("xlog_recover_iodone") error 28 numblks 16
> >> [49133.802495] XFS (dm-95): log mount/recovery failed: error 28
> >> [49133.802644] XFS (dm-95): log mount failed
> >
> > #define ENOSPC          28      /* No space left on device */
> >
> > You're getting an ENOSPC as a metadata IO error during log recovery?
> > Thin provisioning problem, perhaps,
> Yes, it is a thin provisioning problem (which I already know the cause for).
> 
> > and the error is occurring on
> > submission rather than completion? If so:
> >
> > 8d6c121 xfs: fix buffer use after free on IO error
> I am not sure what you mean by "submission rather than completion".
> Do you mean that xfs_buf_ioapply_map() returns without submitting any
> bios?

No, that the bio submission results in immediate failure (e.g. the
device goes away, so submission results in ENODEV). Hence when
_xfs_buf_ioapply() releases its IO reference, it is the only
remaining reference to the buffer and so completion processing is
run immediately, i.e. inline from the submission path.

Normally IO errors are reported through the bio in IO completion
interrupt context, i.e. the IO is completed by the hardware and the
error status is attached to the bio, which is then completed and we get
into XFS that way. The IO submission context is long gone at this
point....
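
To make that concrete, here is a rough model of the reference counting
involved (simplified, hypothetical code -- not the actual xfs_buf
implementation):

#include <linux/atomic.h>

struct buf {
	atomic_t	io_remaining;	/* outstanding bios + the submitter's reference */
	int		error;
};

static void buf_ioend(struct buf *bp)
{
	/* completion processing: runs in whatever context drops the last reference */
}

static void buf_bio_end_io(struct buf *bp, int error)	/* per-bio end_io */
{
	if (error)
		bp->error = error;
	if (atomic_dec_and_test(&bp->io_remaining))
		buf_ioend(bp);
}

static void submit_buf(struct buf *bp)
{
	atomic_set(&bp->io_remaining, 1);	/* the submitter's IO reference */
	/* ... build and submit bios, bumping io_remaining for each one ... */
	if (atomic_dec_and_test(&bp->io_remaining))
		buf_ioend(bp);	/* every bio already failed: completion runs inline */
}

If the bios fail asynchronously instead, buf_bio_end_io() drops the last
reference from the block layer's completion context, and submit_buf() has
returned long ago.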

> In that case, no, bios are submitted to the block device, and it
> fails them through a different context with ENOSPC error. I will still
> try the patch you mentioned, because it also looks relevant to another
> question I addressed to you earlier in:
> http://oss.sgi.com/archives/xfs/2013-11/msg00648.html

No, that's a different problem.

9c23ecc xfs: unmount does not wait for shutdown during unmount

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-04 23:07                               ` Dave Chinner
  2014-08-06 10:05                                 ` Alex Lyakas
@ 2014-08-06 12:52                                 ` Alex Lyakas
  2014-08-06 15:20                                   ` Brian Foster
  1 sibling, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2014-08-06 12:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bfoster, xfs

Hello Dave and Brian,

Dave, I tried the patch you suggested, but it does not fix the issue. I did 
some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be 
called from xfs_buf_iorequest(), which the patch fixes, but also from 
xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty 
easily. The flow that I have is like this:
- xlog_recover() calls xlog_find_tail(). This works alright.
- Now I add a small sleep before calling xlog_do_recover(), and meanwhile I 
instruct my block device to return ENOSPC for any WRITE from now on.
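
(Concretely, the "small sleep" is nothing more than something like this in my
tree -- a debugging hack to open a window for flipping the fail-writes switch,
not a patch I am proposing:)

	/* my hack only; msleep() is from <linux/delay.h> */
	msleep(10 * 1000);	/* window to enable write failures */
	error = xlog_do_recover(log, head_blk, tail_blk);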

What seems to happen is that several WRITE bios are submitted and they all 
fail. When they do, they reach xfs_buf_ioend() through a stack like this:

Aug  6 15:23:07 dev kernel: [  304.410528] [56]xfs*[xfs_buf_ioend:1056] 
XFS(dm-19): Scheduling xfs_buf_iodone_work on error
Aug  6 15:23:07 dev kernel: [  304.410534] Pid: 56, comm: kworker/u:1 
Tainted: G        W  O 3.8.13-557-generic #1382000791
Aug  6 15:23:07 dev kernel: [  304.410537] Call Trace:
Aug  6 15:23:07 dev kernel: [  304.410587]  [<ffffffffa04d6654>] 
xfs_buf_ioend+0x1a4/0x1b0 [xfs]
Aug  6 15:23:07 dev kernel: [  304.410621]  [<ffffffffa04d6685>] 
_xfs_buf_ioend+0x25/0x30 [xfs]
Aug  6 15:23:07 dev kernel: [  304.410643]  [<ffffffffa04d6b3d>] 
xfs_buf_bio_end_io+0x3d/0x50 [xfs]
Aug  6 15:23:07 dev kernel: [  304.410652]  [<ffffffff811c3d8d>] 
bio_endio+0x1d/0x40
...

At this point, they are scheduled to run xlog_recover_iodone through 
xfslogd_workqueue.
The first callback that runs calls xfs_do_force_shutdown in a stack 
like this:

Aug  6 15:23:07 dev kernel: [  304.411791] XFS (dm-19): metadata I/O error: 
block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
Aug  6 15:23:07 dev kernel: [  304.413493] XFS (dm-19): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0526848
Aug  6 15:23:07 dev kernel: [  304.413837]  [<ffffffffa04e0b60>] 
xfs_do_force_shutdown+0x40/0x180 [xfs]
Aug  6 15:23:07 dev kernel: [  304.413870]  [<ffffffffa0526848>] ? 
xlog_recover_iodone+0x48/0x70 [xfs]
Aug  6 15:23:07 dev kernel: [  304.413902]  [<ffffffffa0526848>] 
xlog_recover_iodone+0x48/0x70 [xfs]
Aug  6 15:23:07 dev kernel: [  304.413923]  [<ffffffffa04d645d>] 
xfs_buf_iodone_work+0x4d/0xa0 [xfs]
Aug  6 15:23:07 dev kernel: [  304.413930]  [<ffffffff81077a11>] 
process_one_work+0x141/0x4a0
Aug  6 15:23:07 dev kernel: [  304.413937]  [<ffffffff810789e8>] 
worker_thread+0x168/0x410
Aug  6 15:23:07 dev kernel: [  304.413943]  [<ffffffff81078880>] ? 
manage_workers+0x120/0x120
Aug  6 15:23:07 dev kernel: [  304.413949]  [<ffffffff8107df10>] 
kthread+0xc0/0xd0
Aug  6 15:23:07 dev kernel: [  304.413954]  [<ffffffff8107de50>] ? 
flush_kthread_worker+0xb0/0xb0
Aug  6 15:23:07 dev kernel: [  304.413976]  [<ffffffff816ab86c>] 
ret_from_fork+0x7c/0xb0
Aug  6 15:23:07 dev kernel: [  304.413986]  [<ffffffff8107de50>] ? 
flush_kthread_worker+0xb0/0xb0
Aug  6 15:23:07 dev kernel: [  304.413990] ---[ end trace 
988d698520e1fa81 ]---
Aug  6 15:23:07 dev kernel: [  304.414012] XFS (dm-19): I/O Error Detected. 
Shutting down filesystem
Aug  6 15:23:07 dev kernel: [  304.415936] XFS (dm-19): Please umount the 
filesystem and rectify the problem(s)

But the rest of the callbacks also arrive:
Aug  6 15:23:07 dev kernel: [  304.417812] XFS (dm-19): metadata I/O error: 
block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
Aug  6 15:23:07 dev kernel: [  304.420420] XFS (dm-19): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0526848
Aug  6 15:23:07 dev kernel: [  304.420427] XFS (dm-19): metadata I/O error: 
block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
Aug  6 15:23:07 dev kernel: [  304.422708] XFS (dm-19): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0526848
Aug  6 15:23:07 dev kernel: [  304.422738] XFS (dm-19): metadata I/O error: 
block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8

The mount sequence fails and goes back to the caller:
Aug  6 15:23:07 dev kernel: [  304.423438] XFS (dm-19): log mount/recovery 
failed: error 28
Aug  6 15:23:07 dev kernel: [  304.423757] XFS (dm-19): log mount failed

But there are still additional callbacks to deliver, which the mount 
sequence did not wait for!
Aug  6 15:23:07 dev kernel: [  304.425717] XFS (\x10@dR): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0526848
Aug  6 15:23:07 dev kernel: [  304.425723] XFS (\x10@dR): metadata I/O error: 
block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
Aug  6 15:23:07 dev kernel: [  304.428239] XFS (\x10@dR): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0526848
Aug  6 15:23:07 dev kernel: [  304.428246] XFS (\x10@dR): metadata I/O error: 
block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16

Notice the junk that they are printing! Naturally, because the xfs_mount 
structure has been kfreed.

Finally the kernel crashes (instead of printing junk), because the xfs_mount 
structure is gone, but the callback tries to access it (printing the name):

Aug  6 15:23:07 dev kernel: [  304.430796] general protection fault: 0000 
[#1] SMP
Aug  6 15:23:07 dev kernel: [  304.432035] Modules linked in: xfrm_user 
xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O) 
scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c 
raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq 
async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic 
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 
twofish_common camellia_generic serpent_generic blowfish_generic 
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc 
rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat 
fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw 
parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache 
lockd sunrpc lp parport floppy
Aug  6 15:23:07 dev kernel: [  304.432035] CPU 1
Aug  6 15:23:07 dev kernel: [  304.432035] Pid: 133, comm: kworker/1:1H 
Tainted: G        W  O 3.8.13-557-generic #1382000791 Bochs Bochs
Aug  6 15:23:07 dev kernel: [  304.432035] RIP: 0010:[<ffffffff8133c2cb>] 
[<ffffffff8133c2cb>] strnlen+0xb/0x30
Aug  6 15:23:07 dev kernel: [  304.432035] RSP: 0018:ffff880035461b08 
EFLAGS: 00010086
Aug  6 15:23:07 dev kernel: [  304.432035] RAX: 0000000000000000 RBX: 
ffffffff81e6a4e7 RCX: 0000000000000000
Aug  6 15:23:07 dev kernel: [  304.432035] RDX: e4e8390a265c0000 RSI: 
ffffffffffffffff RDI: e4e8390a265c0000
Aug  6 15:23:07 dev kernel: [  304.432035] RBP: ffff880035461b08 R08: 
000000000000ffff R09: 000000000000ffff
Aug  6 15:23:07 dev kernel: [  304.432035] R10: 0000000000000000 R11: 
00000000000004cd R12: e4e8390a265c0000
Aug  6 15:23:07 dev kernel: [  304.432035] R13: ffffffff81e6a8c0 R14: 
0000000000000000 R15: 000000000000ffff
Aug  6 15:23:07 dev kernel: [  304.432035] FS:  0000000000000000(0000) 
GS:ffff88007fc80000(0000) knlGS:0000000000000000
Aug  6 15:23:07 dev kernel: [  304.432035] CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Aug  6 15:23:07 dev kernel: [  304.432035] CR2: 00007fc902ffbfd8 CR3: 
000000007702a000 CR4: 00000000000006e0
Aug  6 15:23:07 dev kernel: [  304.432035] DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
Aug  6 15:23:07 dev kernel: [  304.432035] DR3: 0000000000000000 DR6: 
00000000ffff0ff0 DR7: 0000000000000400
Aug  6 15:23:07 dev kernel: [  304.432035] Process kworker/1:1H (pid: 133, 
threadinfo ffff880035460000, task ffff880035412e00)
Aug  6 15:23:07 dev kernel: [  304.432035] Stack:
Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461b48 
ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7
Aug  6 15:23:07 dev kernel: [  304.432035]  ffffffffa0566cba 
ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0
Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461bc8 
ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040
Aug  6 15:23:07 dev kernel: [  304.432035] Call Trace:
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133dd5e>] 
string.isra.4+0x3e/0xd0
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133ef59>] 
vsnprintf+0x219/0x640
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133f441>] 
vscnprintf+0x11/0x30
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105a971>] 
vprintk_emit+0xc1/0x490
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105aa20>] ? 
vprintk_emit+0x170/0x490
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8168b992>] 
printk+0x61/0x63
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9bf1>] 
__xfs_printk+0x31/0x50 [xfs]
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9e43>] 
xfs_notice+0x53/0x60 [xfs]
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e0c15>] 
xfs_do_force_shutdown+0xf5/0x180 [xfs]
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>] ? 
xlog_recover_iodone+0x48/0x70 [xfs]
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>] 
xlog_recover_iodone+0x48/0x70 [xfs]
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04d645d>] 
xfs_buf_iodone_work+0x4d/0xa0 [xfs]
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81077a11>] 
process_one_work+0x141/0x4a0
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff810789e8>] 
worker_thread+0x168/0x410
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81078880>] ? 
manage_workers+0x120/0x120
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107df10>] 
kthread+0xc0/0xd0
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ? 
flush_kthread_worker+0xb0/0xb0
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff816ab86c>] 
ret_from_fork+0x7c/0xb0
Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ? 
flush_kthread_worker+0xb0/0xb0
Aug  6 15:23:07 dev kernel: [  304.432035] Code: 31 c0 80 3f 00 55 48 89 e5 
74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31 
c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee 
01 80 38 00 74
Aug  6 15:23:07 dev kernel: [  304.432035] RIP  [<ffffffff8133c2cb>] 
strnlen+0xb/0x30
Aug  6 15:23:07 dev kernel: [  304.432035]  RSP <ffff880035461b08>


So previously you said: "So, something is corrupting memory and stamping all 
over the XFS structures." and also "given you have a bunch of out of tree 
modules loaded (and some which are experiemental) suggests that you have a 
problem with your storage...".

But I believe my analysis shows that during the mount sequence XFS does not 
properly wait for all the bios to complete before failing the mount 
sequence back to the caller.
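
In other words, before the mount error path returns (and the xfs_mount is
freed), something has to act as a barrier for the buffers that were already
submitted. Conceptually it would need to be something like this (sketch only,
with a made-up buffer list; I do not know yet where the right place in the
XFS code is):

	struct xfs_buf		*bp;

	/*
	 * Hypothetical barrier: wait for every buffer submitted during log
	 * recovery to run its completion callback -- even the failed ones --
	 * before the error is propagated and the xfs_mount is torn down.
	 */
	list_for_each_entry(bp, &recovery_buffers, b_list)
		wait_for_completion(&bp->b_iowait);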

Thanks,
Alex.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-06 12:32                                   ` Dave Chinner
@ 2014-08-06 14:43                                     ` Alex Lyakas
  2014-08-10 16:26                                     ` Alex Lyakas
  1 sibling, 0 replies; 47+ messages in thread
From: Alex Lyakas @ 2014-08-06 14:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hello Dave,
I applied this patch manually to 3.8.13, but I am still hitting the exact 
same issue with my reproduction.
I still have xlog_recover_iodone callbacks delivered through 
xfs_buf_bio_end_io, which schedules them through xfslogd_workqueue.

Some of the callbacks arrive before mount completes:
Aug  6 17:31:33 dev kernel: [ 3258.774970] XFS (dm-19): metadata I/O error: 
block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
Aug  6 17:31:33 dev kernel: [ 3258.776687] XFS (dm-19): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0743848
Aug  6 17:31:33 dev kernel: [ 3258.777262] XFS (dm-19): I/O Error Detected. 
Shutting down filesystem
Aug  6 17:31:33 dev kernel: [ 3258.778369] XFS (dm-19): Please umount the 
filesystem and rectify the problem(s)
Aug  6 17:31:33 dev kernel: [ 3258.779634] XFS (dm-19): metadata I/O error: 
block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
Aug  6 17:31:33 dev kernel: [ 3258.781929] XFS (dm-19): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0743848
Aug  6 17:31:33 dev kernel: [ 3258.781939] XFS (dm-19): metadata I/O error: 
block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
Aug  6 17:31:33 dev kernel: [ 3258.784235] XFS (dm-19): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0743848
Aug  6 17:31:33 dev kernel: [ 3258.784260] XFS (dm-19): metadata I/O error: 
block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8
Aug  6 17:31:33 dev kernel: [ 3258.784389] XFS (dm-19): log mount/recovery 
failed: error 28
Aug  6 17:31:33 dev kernel: [ 3258.784549] XFS (dm-19): log mount failed

And some arrive afterwards, and print garbage.
Aug  6 17:31:33 dev kernel: [ 3258.786398] XFS (ˆm&_): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0743848
Aug  6 17:31:33 dev kernel: [ 3258.786404] XFS (ˆm&_): metadata I/O error: 
block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
Aug  6 17:31:33 dev kernel: [ 3258.788575] XFS (ˆm&_): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0743848
Aug  6 17:31:33 dev kernel: [ 3258.788614] XFS (ˆm&_): metadata I/O error: 
block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16
Aug  6 17:31:33 dev kernel: [ 3258.790849] XFS (ˆm&_): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa0743848

Maybe some additional patch is needed for my kernel?

Thanks,
Alex.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-06 12:52                                 ` Alex Lyakas
@ 2014-08-06 15:20                                   ` Brian Foster
  2014-08-06 15:28                                     ` Alex Lyakas
  2014-08-10 12:20                                     ` Alex Lyakas
  0 siblings, 2 replies; 47+ messages in thread
From: Brian Foster @ 2014-08-06 15:20 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
> Hello Dave and Brian,
> 
> Dave, I tried the patch you suggested, but it does not fix the issue. I did
> some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be
> called from xfs_buf_iorequest(), which the patch fixes, but also from
> xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty
> easily. The flow that I have is like this:
> - xlog_recover() calls xlog_find_tail(). This works alright.

What's the purpose of a sleep here?

> - Now I add a small sleep before calling xlog_do_recover(), and meanwhile I
> instruct my block device to return ENOSPC for any WRITE from now on.
> 
> What seems to happen is that several WRITE bios are submitted and they all
> fail. When they do, they reach xfs_buf_ioend() through a stack like this:
> 
> Aug  6 15:23:07 dev kernel: [  304.410528] [56]xfs*[xfs_buf_ioend:1056]
> XFS(dm-19): Scheduling xfs_buf_iodone_work on error
> Aug  6 15:23:07 dev kernel: [  304.410534] Pid: 56, comm: kworker/u:1
> Tainted: G        W  O 3.8.13-557-generic #1382000791
> Aug  6 15:23:07 dev kernel: [  304.410537] Call Trace:
> Aug  6 15:23:07 dev kernel: [  304.410587]  [<ffffffffa04d6654>]
> xfs_buf_ioend+0x1a4/0x1b0 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.410621]  [<ffffffffa04d6685>]
> _xfs_buf_ioend+0x25/0x30 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.410643]  [<ffffffffa04d6b3d>]
> xfs_buf_bio_end_io+0x3d/0x50 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.410652]  [<ffffffff811c3d8d>]
> bio_endio+0x1d/0x40
> ...
> 
> At this point, they are scheduled to run xlog_recover_iodone through
> xfslogd_workqueue.
> The first callback that gets called, calls xfs_do_force_shutdown in stack
> like this:
> 
> Aug  6 15:23:07 dev kernel: [  304.411791] XFS (dm-19): metadata I/O error:
> block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
> Aug  6 15:23:07 dev kernel: [  304.413493] XFS (dm-19):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> 0xffffffffa0526848
> Aug  6 15:23:07 dev kernel: [  304.413837]  [<ffffffffa04e0b60>]
> xfs_do_force_shutdown+0x40/0x180 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.413870]  [<ffffffffa0526848>] ?
> xlog_recover_iodone+0x48/0x70 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.413902]  [<ffffffffa0526848>]
> xlog_recover_iodone+0x48/0x70 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.413923]  [<ffffffffa04d645d>]
> xfs_buf_iodone_work+0x4d/0xa0 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.413930]  [<ffffffff81077a11>]
> process_one_work+0x141/0x4a0
> Aug  6 15:23:07 dev kernel: [  304.413937]  [<ffffffff810789e8>]
> worker_thread+0x168/0x410
> Aug  6 15:23:07 dev kernel: [  304.413943]  [<ffffffff81078880>] ?
> manage_workers+0x120/0x120
> Aug  6 15:23:07 dev kernel: [  304.413949]  [<ffffffff8107df10>]
> kthread+0xc0/0xd0
> Aug  6 15:23:07 dev kernel: [  304.413954]  [<ffffffff8107de50>] ?
> flush_kthread_worker+0xb0/0xb0
> Aug  6 15:23:07 dev kernel: [  304.413976]  [<ffffffff816ab86c>]
> ret_from_fork+0x7c/0xb0
> Aug  6 15:23:07 dev kernel: [  304.413986]  [<ffffffff8107de50>] ?
> flush_kthread_worker+0xb0/0xb0
> Aug  6 15:23:07 dev kernel: [  304.413990] ---[ end trace 988d698520e1fa81
> ]---
> Aug  6 15:23:07 dev kernel: [  304.414012] XFS (dm-19): I/O Error Detected.
> Shutting down filesystem
> Aug  6 15:23:07 dev kernel: [  304.415936] XFS (dm-19): Please umount the
> filesystem and rectify the problem(s)
> 
> But the rest of the callbacks also arrive:
> Aug  6 15:23:07 dev kernel: [  304.417812] XFS (dm-19): metadata I/O error:
> block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
> Aug  6 15:23:07 dev kernel: [  304.420420] XFS (dm-19):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> 0xffffffffa0526848
> Aug  6 15:23:07 dev kernel: [  304.420427] XFS (dm-19): metadata I/O error:
> block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
> Aug  6 15:23:07 dev kernel: [  304.422708] XFS (dm-19):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> 0xffffffffa0526848
> Aug  6 15:23:07 dev kernel: [  304.422738] XFS (dm-19): metadata I/O error:
> block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8
> 
> The mount sequence fails and goes back to the caller:
> Aug  6 15:23:07 dev kernel: [  304.423438] XFS (dm-19): log mount/recovery
> failed: error 28
> Aug  6 15:23:07 dev kernel: [  304.423757] XFS (dm-19): log mount failed
> 
> But there are still additional callbacks to deliver, which the mount
> sequence did not wait for!
> Aug  6 15:23:07 dev kernel: [  304.425717] XFS (\x10@dR):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> 0xffffffffa0526848
> Aug  6 15:23:07 dev kernel: [  304.425723] XFS (\x10@dR): metadata I/O error:
> block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
> Aug  6 15:23:07 dev kernel: [  304.428239] XFS (\x10@dR):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> 0xffffffffa0526848
> Aug  6 15:23:07 dev kernel: [  304.428246] XFS (\x10@dR): metadata I/O error:
> block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16
> 
> Notice the junk that they are printing! Naturally, because xfs_mount
> structure has been kfreed.
> 
> Finally the kernel crashes (instead of printing junk), because the xfs_mount
> structure is gone, but the callback tries to access it (printing the name):
> 
> Aug  6 15:23:07 dev kernel: [  304.430796] general protection fault: 0000
> [#1] SMP
> Aug  6 15:23:07 dev kernel: [  304.432035] Modules linked in: xfrm_user
> xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O)
> scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c
> raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
> async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic
> twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
> twofish_common camellia_generic serpent_generic blowfish_generic
> blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc
> rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat
> fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw
> parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache
> lockd sunrpc lp parport floppy
> Aug  6 15:23:07 dev kernel: [  304.432035] CPU 1
> Aug  6 15:23:07 dev kernel: [  304.432035] Pid: 133, comm: kworker/1:1H
> Tainted: G        W  O 3.8.13-557-generic #1382000791 Bochs Bochs
> Aug  6 15:23:07 dev kernel: [  304.432035] RIP: 0010:[<ffffffff8133c2cb>]
> [<ffffffff8133c2cb>] strnlen+0xb/0x30
> Aug  6 15:23:07 dev kernel: [  304.432035] RSP: 0018:ffff880035461b08
> EFLAGS: 00010086
> Aug  6 15:23:07 dev kernel: [  304.432035] RAX: 0000000000000000 RBX:
> ffffffff81e6a4e7 RCX: 0000000000000000
> Aug  6 15:23:07 dev kernel: [  304.432035] RDX: e4e8390a265c0000 RSI:
> ffffffffffffffff RDI: e4e8390a265c0000
> Aug  6 15:23:07 dev kernel: [  304.432035] RBP: ffff880035461b08 R08:
> 000000000000ffff R09: 000000000000ffff
> Aug  6 15:23:07 dev kernel: [  304.432035] R10: 0000000000000000 R11:
> 00000000000004cd R12: e4e8390a265c0000
> Aug  6 15:23:07 dev kernel: [  304.432035] R13: ffffffff81e6a8c0 R14:
> 0000000000000000 R15: 000000000000ffff
> Aug  6 15:23:07 dev kernel: [  304.432035] FS:  0000000000000000(0000)
> GS:ffff88007fc80000(0000) knlGS:0000000000000000
> Aug  6 15:23:07 dev kernel: [  304.432035] CS:  0010 DS: 0000 ES: 0000 CR0:
> 000000008005003b
> Aug  6 15:23:07 dev kernel: [  304.432035] CR2: 00007fc902ffbfd8 CR3:
> 000000007702a000 CR4: 00000000000006e0
> Aug  6 15:23:07 dev kernel: [  304.432035] DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000
> Aug  6 15:23:07 dev kernel: [  304.432035] DR3: 0000000000000000 DR6:
> 00000000ffff0ff0 DR7: 0000000000000400
> Aug  6 15:23:07 dev kernel: [  304.432035] Process kworker/1:1H (pid: 133,
> threadinfo ffff880035460000, task ffff880035412e00)
> Aug  6 15:23:07 dev kernel: [  304.432035] Stack:
> Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461b48
> ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7
> Aug  6 15:23:07 dev kernel: [  304.432035]  ffffffffa0566cba
> ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0
> Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461bc8
> ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040
> Aug  6 15:23:07 dev kernel: [  304.432035] Call Trace:
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133dd5e>]
> string.isra.4+0x3e/0xd0
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133ef59>]
> vsnprintf+0x219/0x640
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133f441>]
> vscnprintf+0x11/0x30
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105a971>]
> vprintk_emit+0xc1/0x490
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105aa20>] ?
> vprintk_emit+0x170/0x490
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8168b992>]
> printk+0x61/0x63
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9bf1>]
> __xfs_printk+0x31/0x50 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9e43>]
> xfs_notice+0x53/0x60 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e0c15>]
> xfs_do_force_shutdown+0xf5/0x180 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>] ?
> xlog_recover_iodone+0x48/0x70 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>]
> xlog_recover_iodone+0x48/0x70 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04d645d>]
> xfs_buf_iodone_work+0x4d/0xa0 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81077a11>]
> process_one_work+0x141/0x4a0
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff810789e8>]
> worker_thread+0x168/0x410
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81078880>] ?
> manage_workers+0x120/0x120
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107df10>]
> kthread+0xc0/0xd0
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ?
> flush_kthread_worker+0xb0/0xb0
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff816ab86c>]
> ret_from_fork+0x7c/0xb0
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ?
> flush_kthread_worker+0xb0/0xb0
> Aug  6 15:23:07 dev kernel: [  304.432035] Code: 31 c0 80 3f 00 55 48 89 e5
> 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31
> c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee
> 01 80 38 00 74
> Aug  6 15:23:07 dev kernel: [  304.432035] RIP  [<ffffffff8133c2cb>]
> strnlen+0xb/0x30
> Aug  6 15:23:07 dev kernel: [  304.432035]  RSP <ffff880035461b08>
> 
> 
> So previously you said: "So, something is corrupting memory and stamping all
> over the XFS structures." and also "given you have a bunch of out of tree
> modules loaded (and some which are experiemental) suggests that you have a
> problem with your storage...".
> 
> But I believe, my analysis shows that during the mount sequence XFS does not
> wait properly for all the bios to complete, before failing the mount
> sequence back to the caller.
> 

As an experiment, what about the following? Compile tested only and not
safe for general use.

What might help more is to see if you can create a reproducer on a
recent, clean kernel. Perhaps a metadump of your reproducer fs combined
with whatever block device ENOSPC hack you're using would do it.

Brian

---8<---

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..fbcf524 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
  * case nothing will ever complete.  It returns the I/O error code, if any, or
  * 0 if there was no error.
  */
-int
-xfs_buf_iowait(
-	xfs_buf_t		*bp)
+static int
+__xfs_buf_iowait(
+	struct xfs_buf		*bp,
+	bool			skip_error)
 {
 	trace_xfs_buf_iowait(bp, _RET_IP_);
 
-	if (!bp->b_error)
+	if (skip_error || !bp->b_error)
 		wait_for_completion(&bp->b_iowait);
 
 	trace_xfs_buf_iowait_done(bp, _RET_IP_);
 	return bp->b_error;
 }
 
+int
+xfs_buf_iowait(
+	struct xfs_buf		*bp)
+{
+	return __xfs_buf_iowait(bp, false);
+}
+
 xfs_caddr_t
 xfs_buf_offset(
 	xfs_buf_t		*bp,
@@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
 		bp = list_first_entry(&io_list, struct xfs_buf, b_list);
 
 		list_del_init(&bp->b_list);
-		error2 = xfs_buf_iowait(bp);
+		error2 = __xfs_buf_iowait(bp, true);
 		xfs_buf_relse(bp);
 		if (!error)
 			error = error2;

---

> Thanks,
> Alex.
> 
> 
> 
> -----Original Message----- From: Dave Chinner
> Sent: 05 August, 2014 2:07 AM
> To: Alex Lyakas
> Cc: xfs@oss.sgi.com
> Subject: Re: use-after-free on log replay failure
> 
> On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
> >Greetings,
> >
> >we had a log replay failure due to some errors that the underlying
> >block device returned:
> >[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
> >("xlog_recover_iodone") error 28 numblks 16
> >[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
> >[49133.802644] XFS (dm-95): log mount failed
> 
> #define ENOSPC          28      /* No space left on device */
> 
> You're getting an ENOSPC as a metadata IO error during log recovery?
> Thin provisioning problem, perhaps, and the error is occurring on
> submission rather than completion? If so:
> 
> 8d6c121 xfs: fix buffer use after free on IO error
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-06 15:20                                   ` Brian Foster
@ 2014-08-06 15:28                                     ` Alex Lyakas
  2014-08-10 12:20                                     ` Alex Lyakas
  1 sibling, 0 replies; 47+ messages in thread
From: Alex Lyakas @ 2014-08-06 15:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

Hi Brian,
The purpose of the sleep is like this: I added a module parameter to my 
block device, which says "from now on fail all WRITEs with ENOSPC". If I set 
this parameter before mounting, then xlog_find_tail() fails, which does not 
reproduce my problem. So I want xlog_find_tail() to succeed, and only then 
start returning errors. The sleep gives me a window to set the parameter 
before actual log recovery is attempted :)
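
For reference, the block-device side of the hack is essentially just this
(simplified; the parameter name and the surrounding driver code are specific
to my out-of-tree module):

#include <linux/module.h>
#include <linux/bio.h>

static bool fail_writes;		/* flipped at runtime via the module parameter */
module_param(fail_writes, bool, 0644);

/* called for every bio the device receives */
static void my_dev_handle_bio(struct bio *bio)
{
	if (fail_writes && bio_data_dir(bio) == WRITE) {
		/* complete the bio with ENOSPC instead of doing the write */
		bio_endio(bio, -ENOSPC);	/* 3.8-era two-argument bio_endio() */
		return;
	}
	/* ... normal processing ... */
}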

Will try what you suggest.

Thanks,
Alex.


-----Original Message----- 
From: Brian Foster
Sent: 06 August, 2014 6:20 PM
To: Alex Lyakas
Cc: Dave Chinner ; xfs@oss.sgi.com
Subject: Re: use-after-free on log replay failure

On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
> Hello Dave and Brian,
>
> Dave, I tried the patch you suggested, but it does not fix the issue. I 
> did
> some further digging, and it appears that _xfs_buf_ioend(schedule=1) can 
> be
> called from xfs_buf_iorequest(), which the patch fixes, but also from
> xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty
> easily. The flow that I have is like this:
> - xlog_recover() calls xlog_find_tail(). This works alright.

What's the purpose of a sleep here?

> - Now I add a small sleep before calling xlog_do_recover(), and meanwhile 
> I
> instruct my block device to return ENOSPC for any WRITE from now on.
>
> What seems to happen is that several WRITE bios are submitted and they all
> fail. When they do, they reach xfs_buf_ioend() through a stack like this:
>
> Aug  6 15:23:07 dev kernel: [  304.410528] [56]xfs*[xfs_buf_ioend:1056]
> XFS(dm-19): Scheduling xfs_buf_iodone_work on error
> Aug  6 15:23:07 dev kernel: [  304.410534] Pid: 56, comm: kworker/u:1
> Tainted: G        W  O 3.8.13-557-generic #1382000791
> Aug  6 15:23:07 dev kernel: [  304.410537] Call Trace:
> Aug  6 15:23:07 dev kernel: [  304.410587]  [<ffffffffa04d6654>]
> xfs_buf_ioend+0x1a4/0x1b0 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.410621]  [<ffffffffa04d6685>]
> _xfs_buf_ioend+0x25/0x30 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.410643]  [<ffffffffa04d6b3d>]
> xfs_buf_bio_end_io+0x3d/0x50 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.410652]  [<ffffffff811c3d8d>]
> bio_endio+0x1d/0x40
> ...
>
> At this point, they are scheduled to run xlog_recover_iodone through
> xfslogd_workqueue.
> The first callback that gets called, calls xfs_do_force_shutdown in stack
> like this:
>
> Aug  6 15:23:07 dev kernel: [  304.411791] XFS (dm-19): metadata I/O 
> error:
> block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
> Aug  6 15:23:07 dev kernel: [  304.413493] XFS (dm-19):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> 0xffffffffa0526848
> Aug  6 15:23:07 dev kernel: [  304.413837]  [<ffffffffa04e0b60>]
> xfs_do_force_shutdown+0x40/0x180 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.413870]  [<ffffffffa0526848>] ?
> xlog_recover_iodone+0x48/0x70 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.413902]  [<ffffffffa0526848>]
> xlog_recover_iodone+0x48/0x70 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.413923]  [<ffffffffa04d645d>]
> xfs_buf_iodone_work+0x4d/0xa0 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.413930]  [<ffffffff81077a11>]
> process_one_work+0x141/0x4a0
> Aug  6 15:23:07 dev kernel: [  304.413937]  [<ffffffff810789e8>]
> worker_thread+0x168/0x410
> Aug  6 15:23:07 dev kernel: [  304.413943]  [<ffffffff81078880>] ?
> manage_workers+0x120/0x120
> Aug  6 15:23:07 dev kernel: [  304.413949]  [<ffffffff8107df10>]
> kthread+0xc0/0xd0
> Aug  6 15:23:07 dev kernel: [  304.413954]  [<ffffffff8107de50>] ?
> flush_kthread_worker+0xb0/0xb0
> Aug  6 15:23:07 dev kernel: [  304.413976]  [<ffffffff816ab86c>]
> ret_from_fork+0x7c/0xb0
> Aug  6 15:23:07 dev kernel: [  304.413986]  [<ffffffff8107de50>] ?
> flush_kthread_worker+0xb0/0xb0
> Aug  6 15:23:07 dev kernel: [  304.413990] ---[ end trace 988d698520e1fa81
> ]---
> Aug  6 15:23:07 dev kernel: [  304.414012] XFS (dm-19): I/O Error 
> Detected.
> Shutting down filesystem
> Aug  6 15:23:07 dev kernel: [  304.415936] XFS (dm-19): Please umount the
> filesystem and rectify the problem(s)
>
> But the rest of the callbacks also arrive:
> Aug  6 15:23:07 dev kernel: [  304.417812] XFS (dm-19): metadata I/O 
> error:
> block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
> Aug  6 15:23:07 dev kernel: [  304.420420] XFS (dm-19):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> 0xffffffffa0526848
> Aug  6 15:23:07 dev kernel: [  304.420427] XFS (dm-19): metadata I/O 
> error:
> block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
> Aug  6 15:23:07 dev kernel: [  304.422708] XFS (dm-19):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> 0xffffffffa0526848
> Aug  6 15:23:07 dev kernel: [  304.422738] XFS (dm-19): metadata I/O 
> error:
> block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8
>
> The mount sequence fails and goes back to the caller:
> Aug  6 15:23:07 dev kernel: [  304.423438] XFS (dm-19): log mount/recovery
> failed: error 28
> Aug  6 15:23:07 dev kernel: [  304.423757] XFS (dm-19): log mount failed
>
> But there are still additional callbacks to deliver, which the mount
> sequence did not wait for!
> Aug  6 15:23:07 dev kernel: [  304.425717] XFS (\x10@dR):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> 0xffffffffa0526848
> Aug  6 15:23:07 dev kernel: [  304.425723] XFS (\x10@dR): metadata I/O error:
> block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
> Aug  6 15:23:07 dev kernel: [  304.428239] XFS (\x10@dR):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> 0xffffffffa0526848
> Aug  6 15:23:07 dev kernel: [  304.428246] XFS (\x10@dR): metadata I/O error:
> block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16
>
> Notice the junk that they are printing! Naturally, because xfs_mount
> structure has been kfreed.
>
> Finally the kernel crashes (instead of printing junk), because the xfs_mount
> structure is gone, but the callback tries to access it (printing the name):
>
> Aug  6 15:23:07 dev kernel: [  304.430796] general protection fault: 0000
> [#1] SMP
> Aug  6 15:23:07 dev kernel: [  304.432035] Modules linked in: xfrm_user
> xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O)
> scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c
> raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
> async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic
> twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
> twofish_common camellia_generic serpent_generic blowfish_generic
> blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc
> rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat
> fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw
> parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache
> lockd sunrpc lp parport floppy
> Aug  6 15:23:07 dev kernel: [  304.432035] CPU 1
> Aug  6 15:23:07 dev kernel: [  304.432035] Pid: 133, comm: kworker/1:1H
> Tainted: G        W  O 3.8.13-557-generic #1382000791 Bochs Bochs
> Aug  6 15:23:07 dev kernel: [  304.432035] RIP: 0010:[<ffffffff8133c2cb>]
> [<ffffffff8133c2cb>] strnlen+0xb/0x30
> Aug  6 15:23:07 dev kernel: [  304.432035] RSP: 0018:ffff880035461b08
> EFLAGS: 00010086
> Aug  6 15:23:07 dev kernel: [  304.432035] RAX: 0000000000000000 RBX:
> ffffffff81e6a4e7 RCX: 0000000000000000
> Aug  6 15:23:07 dev kernel: [  304.432035] RDX: e4e8390a265c0000 RSI:
> ffffffffffffffff RDI: e4e8390a265c0000
> Aug  6 15:23:07 dev kernel: [  304.432035] RBP: ffff880035461b08 R08:
> 000000000000ffff R09: 000000000000ffff
> Aug  6 15:23:07 dev kernel: [  304.432035] R10: 0000000000000000 R11:
> 00000000000004cd R12: e4e8390a265c0000
> Aug  6 15:23:07 dev kernel: [  304.432035] R13: ffffffff81e6a8c0 R14:
> 0000000000000000 R15: 000000000000ffff
> Aug  6 15:23:07 dev kernel: [  304.432035] FS:  0000000000000000(0000)
> GS:ffff88007fc80000(0000) knlGS:0000000000000000
> Aug  6 15:23:07 dev kernel: [  304.432035] CS:  0010 DS: 0000 ES: 0000 CR0:
> 000000008005003b
> Aug  6 15:23:07 dev kernel: [  304.432035] CR2: 00007fc902ffbfd8 CR3:
> 000000007702a000 CR4: 00000000000006e0
> Aug  6 15:23:07 dev kernel: [  304.432035] DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000
> Aug  6 15:23:07 dev kernel: [  304.432035] DR3: 0000000000000000 DR6:
> 00000000ffff0ff0 DR7: 0000000000000400
> Aug  6 15:23:07 dev kernel: [  304.432035] Process kworker/1:1H (pid: 133,
> threadinfo ffff880035460000, task ffff880035412e00)
> Aug  6 15:23:07 dev kernel: [  304.432035] Stack:
> Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461b48
> ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7
> Aug  6 15:23:07 dev kernel: [  304.432035]  ffffffffa0566cba
> ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0
> Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461bc8
> ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040
> Aug  6 15:23:07 dev kernel: [  304.432035] Call Trace:
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133dd5e>]
> string.isra.4+0x3e/0xd0
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133ef59>]
> vsnprintf+0x219/0x640
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133f441>]
> vscnprintf+0x11/0x30
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105a971>]
> vprintk_emit+0xc1/0x490
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105aa20>] ?
> vprintk_emit+0x170/0x490
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8168b992>]
> printk+0x61/0x63
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9bf1>]
> __xfs_printk+0x31/0x50 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9e43>]
> xfs_notice+0x53/0x60 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e0c15>]
> xfs_do_force_shutdown+0xf5/0x180 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>] ?
> xlog_recover_iodone+0x48/0x70 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>]
> xlog_recover_iodone+0x48/0x70 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04d645d>]
> xfs_buf_iodone_work+0x4d/0xa0 [xfs]
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81077a11>]
> process_one_work+0x141/0x4a0
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff810789e8>]
> worker_thread+0x168/0x410
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81078880>] ?
> manage_workers+0x120/0x120
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107df10>]
> kthread+0xc0/0xd0
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ?
> flush_kthread_worker+0xb0/0xb0
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff816ab86c>]
> ret_from_fork+0x7c/0xb0
> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ?
> flush_kthread_worker+0xb0/0xb0
> Aug  6 15:23:07 dev kernel: [  304.432035] Code: 31 c0 80 3f 00 55 48 89 e5
> 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31
> c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee
> 01 80 38 00 74
> Aug  6 15:23:07 dev kernel: [  304.432035] RIP  [<ffffffff8133c2cb>]
> strnlen+0xb/0x30
> Aug  6 15:23:07 dev kernel: [  304.432035]  RSP <ffff880035461b08>
>
>
> So previously you said: "So, something is corrupting memory and stamping all
> over the XFS structures." and also "given you have a bunch of out of tree
> modules loaded (and some which are experiemental) suggests that you have a
> problem with your storage...".
>
> But I believe, my analysis shows that during the mount sequence XFS does not
> wait properly for all the bios to complete, before failing the mount
> sequence back to the caller.
>

As an experiment, what about the following? Compile tested only and not
safe for general use.

What might help more is to see if you can create a reproducer on a
recent, clean kernel. Perhaps a metadump of your reproducer fs combined
with whatever block device ENOSPC hack you're using would do it.

Brian

---8<---

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..fbcf524 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
  * case nothing will ever complete.  It returns the I/O error code, if any, or
  * 0 if there was no error.
  */
-int
-xfs_buf_iowait(
-       xfs_buf_t               *bp)
+static int
+__xfs_buf_iowait(
+       struct xfs_buf          *bp,
+       bool                    skip_error)
 {
        trace_xfs_buf_iowait(bp, _RET_IP_);

-       if (!bp->b_error)
+       if (skip_error || !bp->b_error)
                wait_for_completion(&bp->b_iowait);

        trace_xfs_buf_iowait_done(bp, _RET_IP_);
        return bp->b_error;
 }

+int
+xfs_buf_iowait(
+       struct xfs_buf          *bp)
+{
+       return __xfs_buf_iowait(bp, false);
+}
+
 xfs_caddr_t
 xfs_buf_offset(
        xfs_buf_t               *bp,
@@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
                bp = list_first_entry(&io_list, struct xfs_buf, b_list);

                list_del_init(&bp->b_list);
-               error2 = xfs_buf_iowait(bp);
+               error2 = __xfs_buf_iowait(bp, true);
                xfs_buf_relse(bp);
                if (!error)
                        error = error2;

---

> Thanks,
> Alex.
>
>
>
> -----Original Message----- From: Dave Chinner
> Sent: 05 August, 2014 2:07 AM
> To: Alex Lyakas
> Cc: xfs@oss.sgi.com
> Subject: Re: use-after-free on log replay failure
>
> On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
> >Greetings,
> >
> >we had a log replay failure due to some errors that the underlying
> >block device returned:
> >[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
> >("xlog_recover_iodone") error 28 numblks 16
> >[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
> >[49133.802644] XFS (dm-95): log mount failed
>
> #define ENOSPC          28      /* No space left on device */
>
> You're getting an ENOSPC as a metadata IO error during log recovery?
> Thin provisioning problem, perhaps, and the error is occurring on
> submission rather than completion? If so:
>
> 8d6c121 xfs: fix buffer use after free on IO error
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-06 15:20                                   ` Brian Foster
  2014-08-06 15:28                                     ` Alex Lyakas
@ 2014-08-10 12:20                                     ` Alex Lyakas
  2014-08-11 13:20                                       ` Brian Foster
  1 sibling, 1 reply; 47+ messages in thread
From: Alex Lyakas @ 2014-08-10 12:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

Hi Brian,

On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
>> Hello Dave and Brian,
>>
>> Dave, I tried the patch you suggested, but it does not fix the issue. I did
>> some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be
>> called from xfs_buf_iorequest(), which the patch fixes, but also from
>> xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty
>> easily. The flow that I have is like this:
>> - xlog_recover() calls xlog_find_tail(). This works alright.
>
> What's the purpose of a sleep here?
>
>> - Now I add a small sleep before calling xlog_do_recover(), and meanwhile I
>> instruct my block device to return ENOSPC for any WRITE from now on.
>>
>> What seems to happen is that several WRITE bios are submitted and they all
>> fail. When they do, they reach xfs_buf_ioend() through a stack like this:
>>
>> Aug  6 15:23:07 dev kernel: [  304.410528] [56]xfs*[xfs_buf_ioend:1056]
>> XFS(dm-19): Scheduling xfs_buf_iodone_work on error
>> Aug  6 15:23:07 dev kernel: [  304.410534] Pid: 56, comm: kworker/u:1
>> Tainted: G        W  O 3.8.13-557-generic #1382000791
>> Aug  6 15:23:07 dev kernel: [  304.410537] Call Trace:
>> Aug  6 15:23:07 dev kernel: [  304.410587]  [<ffffffffa04d6654>]
>> xfs_buf_ioend+0x1a4/0x1b0 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.410621]  [<ffffffffa04d6685>]
>> _xfs_buf_ioend+0x25/0x30 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.410643]  [<ffffffffa04d6b3d>]
>> xfs_buf_bio_end_io+0x3d/0x50 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.410652]  [<ffffffff811c3d8d>]
>> bio_endio+0x1d/0x40
>> ...
>>
>> At this point, they are scheduled to run xlog_recover_iodone through
>> xfslogd_workqueue.
>> The first callback that gets called, calls xfs_do_force_shutdown in stack
>> like this:
>>
>> Aug  6 15:23:07 dev kernel: [  304.411791] XFS (dm-19): metadata I/O error:
>> block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
>> Aug  6 15:23:07 dev kernel: [  304.413493] XFS (dm-19):
>> xfs_do_force_shutdown(0x1) called from line 377 of file
>> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
>> 0xffffffffa0526848
>> Aug  6 15:23:07 dev kernel: [  304.413837]  [<ffffffffa04e0b60>]
>> xfs_do_force_shutdown+0x40/0x180 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.413870]  [<ffffffffa0526848>] ?
>> xlog_recover_iodone+0x48/0x70 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.413902]  [<ffffffffa0526848>]
>> xlog_recover_iodone+0x48/0x70 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.413923]  [<ffffffffa04d645d>]
>> xfs_buf_iodone_work+0x4d/0xa0 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.413930]  [<ffffffff81077a11>]
>> process_one_work+0x141/0x4a0
>> Aug  6 15:23:07 dev kernel: [  304.413937]  [<ffffffff810789e8>]
>> worker_thread+0x168/0x410
>> Aug  6 15:23:07 dev kernel: [  304.413943]  [<ffffffff81078880>] ?
>> manage_workers+0x120/0x120
>> Aug  6 15:23:07 dev kernel: [  304.413949]  [<ffffffff8107df10>]
>> kthread+0xc0/0xd0
>> Aug  6 15:23:07 dev kernel: [  304.413954]  [<ffffffff8107de50>] ?
>> flush_kthread_worker+0xb0/0xb0
>> Aug  6 15:23:07 dev kernel: [  304.413976]  [<ffffffff816ab86c>]
>> ret_from_fork+0x7c/0xb0
>> Aug  6 15:23:07 dev kernel: [  304.413986]  [<ffffffff8107de50>] ?
>> flush_kthread_worker+0xb0/0xb0
>> Aug  6 15:23:07 dev kernel: [  304.413990] ---[ end trace 988d698520e1fa81
>> ]---
>> Aug  6 15:23:07 dev kernel: [  304.414012] XFS (dm-19): I/O Error Detected.
>> Shutting down filesystem
>> Aug  6 15:23:07 dev kernel: [  304.415936] XFS (dm-19): Please umount the
>> filesystem and rectify the problem(s)
>>
>> But the rest of the callbacks also arrive:
>> Aug  6 15:23:07 dev kernel: [  304.417812] XFS (dm-19): metadata I/O error:
>> block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
>> Aug  6 15:23:07 dev kernel: [  304.420420] XFS (dm-19):
>> xfs_do_force_shutdown(0x1) called from line 377 of file
>> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
>> 0xffffffffa0526848
>> Aug  6 15:23:07 dev kernel: [  304.420427] XFS (dm-19): metadata I/O error:
>> block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
>> Aug  6 15:23:07 dev kernel: [  304.422708] XFS (dm-19):
>> xfs_do_force_shutdown(0x1) called from line 377 of file
>> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
>> 0xffffffffa0526848
>> Aug  6 15:23:07 dev kernel: [  304.422738] XFS (dm-19): metadata I/O error:
>> block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8
>>
>> The mount sequence fails and goes back to the caller:
>> Aug  6 15:23:07 dev kernel: [  304.423438] XFS (dm-19): log mount/recovery
>> failed: error 28
>> Aug  6 15:23:07 dev kernel: [  304.423757] XFS (dm-19): log mount failed
>>
>> But there are still additional callbacks to deliver, which the mount
>> sequence did not wait for!
>> Aug  6 15:23:07 dev kernel: [  304.425717] XFS ( @dR):
>> xfs_do_force_shutdown(0x1) called from line 377 of file
>> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
>> 0xffffffffa0526848
>> Aug  6 15:23:07 dev kernel: [  304.425723] XFS ( @dR): metadata I/O error:
>> block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
>> Aug  6 15:23:07 dev kernel: [  304.428239] XFS ( @dR):
>> xfs_do_force_shutdown(0x1) called from line 377 of file
>> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
>> 0xffffffffa0526848
>> Aug  6 15:23:07 dev kernel: [  304.428246] XFS ( @dR): metadata I/O error:
>> block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16
>>
>> Notice the junk that they are printing! Naturally, because xfs_mount
>> structure has been kfreed.
>>
>> Finally the kernel crashes (instead of printing junk), because the xfs_mount
>> structure is gone, but the callback tries to access it (printing the name):
>>
>> Aug  6 15:23:07 dev kernel: [  304.430796] general protection fault: 0000
>> [#1] SMP
>> Aug  6 15:23:07 dev kernel: [  304.432035] Modules linked in: xfrm_user
>> xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O)
>> scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c
>> raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
>> async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic
>> twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
>> twofish_common camellia_generic serpent_generic blowfish_generic
>> blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc
>> rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat
>> fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw
>> parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache
>> lockd sunrpc lp parport floppy
>> Aug  6 15:23:07 dev kernel: [  304.432035] CPU 1
>> Aug  6 15:23:07 dev kernel: [  304.432035] Pid: 133, comm: kworker/1:1H
>> Tainted: G        W  O 3.8.13-557-generic #1382000791 Bochs Bochs
>> Aug  6 15:23:07 dev kernel: [  304.432035] RIP: 0010:[<ffffffff8133c2cb>]
>> [<ffffffff8133c2cb>] strnlen+0xb/0x30
>> Aug  6 15:23:07 dev kernel: [  304.432035] RSP: 0018:ffff880035461b08
>> EFLAGS: 00010086
>> Aug  6 15:23:07 dev kernel: [  304.432035] RAX: 0000000000000000 RBX:
>> ffffffff81e6a4e7 RCX: 0000000000000000
>> Aug  6 15:23:07 dev kernel: [  304.432035] RDX: e4e8390a265c0000 RSI:
>> ffffffffffffffff RDI: e4e8390a265c0000
>> Aug  6 15:23:07 dev kernel: [  304.432035] RBP: ffff880035461b08 R08:
>> 000000000000ffff R09: 000000000000ffff
>> Aug  6 15:23:07 dev kernel: [  304.432035] R10: 0000000000000000 R11:
>> 00000000000004cd R12: e4e8390a265c0000
>> Aug  6 15:23:07 dev kernel: [  304.432035] R13: ffffffff81e6a8c0 R14:
>> 0000000000000000 R15: 000000000000ffff
>> Aug  6 15:23:07 dev kernel: [  304.432035] FS:  0000000000000000(0000)
>> GS:ffff88007fc80000(0000) knlGS:0000000000000000
>> Aug  6 15:23:07 dev kernel: [  304.432035] CS:  0010 DS: 0000 ES: 0000 CR0:
>> 000000008005003b
>> Aug  6 15:23:07 dev kernel: [  304.432035] CR2: 00007fc902ffbfd8 CR3:
>> 000000007702a000 CR4: 00000000000006e0
>> Aug  6 15:23:07 dev kernel: [  304.432035] DR0: 0000000000000000 DR1:
>> 0000000000000000 DR2: 0000000000000000
>> Aug  6 15:23:07 dev kernel: [  304.432035] DR3: 0000000000000000 DR6:
>> 00000000ffff0ff0 DR7: 0000000000000400
>> Aug  6 15:23:07 dev kernel: [  304.432035] Process kworker/1:1H (pid: 133,
>> threadinfo ffff880035460000, task ffff880035412e00)
>> Aug  6 15:23:07 dev kernel: [  304.432035] Stack:
>> Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461b48
>> ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7
>> Aug  6 15:23:07 dev kernel: [  304.432035]  ffffffffa0566cba
>> ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0
>> Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461bc8
>> ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040
>> Aug  6 15:23:07 dev kernel: [  304.432035] Call Trace:
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133dd5e>]
>> string.isra.4+0x3e/0xd0
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133ef59>]
>> vsnprintf+0x219/0x640
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133f441>]
>> vscnprintf+0x11/0x30
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105a971>]
>> vprintk_emit+0xc1/0x490
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105aa20>] ?
>> vprintk_emit+0x170/0x490
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8168b992>]
>> printk+0x61/0x63
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9bf1>]
>> __xfs_printk+0x31/0x50 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9e43>]
>> xfs_notice+0x53/0x60 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e0c15>]
>> xfs_do_force_shutdown+0xf5/0x180 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>] ?
>> xlog_recover_iodone+0x48/0x70 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>]
>> xlog_recover_iodone+0x48/0x70 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04d645d>]
>> xfs_buf_iodone_work+0x4d/0xa0 [xfs]
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81077a11>]
>> process_one_work+0x141/0x4a0
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff810789e8>]
>> worker_thread+0x168/0x410
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81078880>] ?
>> manage_workers+0x120/0x120
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107df10>]
>> kthread+0xc0/0xd0
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ?
>> flush_kthread_worker+0xb0/0xb0
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff816ab86c>]
>> ret_from_fork+0x7c/0xb0
>> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ?
>> flush_kthread_worker+0xb0/0xb0
>> Aug  6 15:23:07 dev kernel: [  304.432035] Code: 31 c0 80 3f 00 55 48 89 e5
>> 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31
>> c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee
>> 01 80 38 00 74
>> Aug  6 15:23:07 dev kernel: [  304.432035] RIP  [<ffffffff8133c2cb>]
>> strnlen+0xb/0x30
>> Aug  6 15:23:07 dev kernel: [  304.432035]  RSP <ffff880035461b08>
>>
>>
>> So previously you said: "So, something is corrupting memory and stamping all
>> over the XFS structures." and also "given you have a bunch of out of tree
>> modules loaded (and some which are experiemental) suggests that you have a
>> problem with your storage...".
>>
>> But I believe, my analysis shows that during the mount sequence XFS does not
>> wait properly for all the bios to complete, before failing the mount
>> sequence back to the caller.
>>
>
> As an experiment, what about the following? Compile tested only and not
> safe for general use.
>
> What might help more is to see if you can create a reproducer on a
> recent, clean kernel. Perhaps a metadump of your reproducer fs combined
> with whatever block device ENOSPC hack you're using would do it.
>
> Brian
>
> ---8<---
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index cd7b8ca..fbcf524 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
>   * case nothing will ever complete.  It returns the I/O error code, if any, or
>   * 0 if there was no error.
>   */
> -int
> -xfs_buf_iowait(
> -       xfs_buf_t               *bp)
> +static int
> +__xfs_buf_iowait(
> +       struct xfs_buf          *bp,
> +       bool                    skip_error)
>  {
>         trace_xfs_buf_iowait(bp, _RET_IP_);
>
> -       if (!bp->b_error)
> +       if (skip_error || !bp->b_error)
>                 wait_for_completion(&bp->b_iowait);
>
>         trace_xfs_buf_iowait_done(bp, _RET_IP_);
>         return bp->b_error;
>  }
>
> +int
> +xfs_buf_iowait(
> +       struct xfs_buf          *bp)
> +{
> +       return __xfs_buf_iowait(bp, false);
> +}
> +
>  xfs_caddr_t
>  xfs_buf_offset(
>         xfs_buf_t               *bp,
> @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
>                 bp = list_first_entry(&io_list, struct xfs_buf, b_list);
>
>                 list_del_init(&bp->b_list);
> -               error2 = xfs_buf_iowait(bp);
> +               error2 = __xfs_buf_iowait(bp, true);
>                 xfs_buf_relse(bp);
>                 if (!error)
>                         error = error2;
>
> ---
I think that this patch fixes the problem. I tried reproducing it like
30 times, and it doesn't happen with this patch. Dropping this patch
reproduces the problem within 1 or 2 tries. Thanks!
What are next steps? How to make it "safe for general use"?

Thanks,
Alex.





>
>> Thanks,
>> Alex.
>>
>>
>>
>> -----Original Message----- From: Dave Chinner
>> Sent: 05 August, 2014 2:07 AM
>> To: Alex Lyakas
>> Cc: xfs@oss.sgi.com
>> Subject: Re: use-after-free on log replay failure
>>
>> On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
>> >Greetings,
>> >
>> >we had a log replay failure due to some errors that the underlying
>> >block device returned:
>> >[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
>> >("xlog_recover_iodone") error 28 numblks 16
>> >[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
>> >[49133.802644] XFS (dm-95): log mount failed
>>
>> #define ENOSPC          28      /* No space left on device */
>>
>> You're getting an ENOSPC as a metadata IO error during log recovery?
>> Thin provisioning problem, perhaps, and the error is occurring on
>> submission rather than completion? If so:
>>
>> 8d6c121 xfs: fix buffer use after free on IO error
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-06 12:32                                   ` Dave Chinner
  2014-08-06 14:43                                     ` Alex Lyakas
@ 2014-08-10 16:26                                     ` Alex Lyakas
  1 sibling, 0 replies; 47+ messages in thread
From: Alex Lyakas @ 2014-08-10 16:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hello Dave,

On Wed, Aug 6, 2014 at 3:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Aug 06, 2014 at 01:05:34PM +0300, Alex Lyakas wrote:
>> Hi Dave,
>>
>> On Tue, Aug 5, 2014 at 2:07 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
>> >> Greetings,
>> >>
>> >> we had a log replay failure due to some errors that the underlying
>> >> block device returned:
>> >> [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
>> >> ("xlog_recover_iodone") error 28 numblks 16
>> >> [49133.802495] XFS (dm-95): log mount/recovery failed: error 28
>> >> [49133.802644] XFS (dm-95): log mount failed
>> >
>> > #define ENOSPC          28      /* No space left on device */
>> >
>> > You're getting an ENOSPC as a metadata IO error during log recovery?
>> > Thin provisioning problem, perhaps,
>> Yes, it is a thin provisioning problem (which I already know the cause for).
>>
>> > and the error is occurring on
>> > submission rather than completion? If so:
>> >
>> > 8d6c121 xfs: fix buffer use after free on IO error
>> I am not sure what you mean by "submission rather than completion".
>> Do you mean that xfs_buf_ioapply_map() returns without submitting any
>> bios?
>
> No, that the bio submission results in immediate failure (e.g. the
> device goes away, so submission results in ENODEV). Hence when
> _xfs_buf_ioapply() releases its IO reference, it is the only
> remaining reference to the buffer and so completion processing is
> run immediately. i.e. inline from the submission path.
>
> Normally IO errors are reported through the bio in IO completion
> interrupt context. i.e the IO is completed by the hardware and the
> error status is attached to bio, which is then completed and we get
> into XFS that way. The IO submision context is long gone at this
> point....
>
>> In that case, no, bios are submitted to the block device, and it
>> fails them through a different context with ENOSPC error. I will still
>> try the patch you mentioned, because it also looks relevant to another
>> question I addressed to you earlier in:
>> http://oss.sgi.com/archives/xfs/2013-11/msg00648.html
>
> No, that's a different problem.
>
> 9c23ecc xfs: unmount does not wait for shutdown during unmount
Yes, this patch appears to fix the problem that I reported in the
past. XFS survives the unmount and kmemleak is also happy. Thanks! Is
this patch safe to apply to 3.8.13?

Thanks,
Alex.



>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-10 12:20                                     ` Alex Lyakas
@ 2014-08-11 13:20                                       ` Brian Foster
  2014-08-11 21:52                                         ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2014-08-11 13:20 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote:
> Hi Brian,
> 
> On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> wrote:
> > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
> >> Hello Dave and Brian,
> >>
> >> Dave, I tried the patch you suggested, but it does not fix the issue. I did
> >> some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be
> >> called from xfs_buf_iorequest(), which the patch fixes, but also from
> >> xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty
> >> easily. The flow that I have is like this:
> >> - xlog_recover() calls xlog_find_tail(). This works alright.
> >
> > What's the purpose of a sleep here?
> >
> >> - Now I add a small sleep before calling xlog_do_recover(), and meanwhile I
> >> instruct my block device to return ENOSPC for any WRITE from now on.
> >>
> >> What seems to happen is that several WRITE bios are submitted and they all
> >> fail. When they do, they reach xfs_buf_ioend() through a stack like this:
> >>
> >> Aug  6 15:23:07 dev kernel: [  304.410528] [56]xfs*[xfs_buf_ioend:1056]
> >> XFS(dm-19): Scheduling xfs_buf_iodone_work on error
> >> Aug  6 15:23:07 dev kernel: [  304.410534] Pid: 56, comm: kworker/u:1
> >> Tainted: G        W  O 3.8.13-557-generic #1382000791
> >> Aug  6 15:23:07 dev kernel: [  304.410537] Call Trace:
> >> Aug  6 15:23:07 dev kernel: [  304.410587]  [<ffffffffa04d6654>]
> >> xfs_buf_ioend+0x1a4/0x1b0 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.410621]  [<ffffffffa04d6685>]
> >> _xfs_buf_ioend+0x25/0x30 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.410643]  [<ffffffffa04d6b3d>]
> >> xfs_buf_bio_end_io+0x3d/0x50 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.410652]  [<ffffffff811c3d8d>]
> >> bio_endio+0x1d/0x40
> >> ...
> >>
> >> At this point, they are scheduled to run xlog_recover_iodone through
> >> xfslogd_workqueue.
> >> The first callback that gets called, calls xfs_do_force_shutdown in stack
> >> like this:
> >>
> >> Aug  6 15:23:07 dev kernel: [  304.411791] XFS (dm-19): metadata I/O error:
> >> block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
> >> Aug  6 15:23:07 dev kernel: [  304.413493] XFS (dm-19):
> >> xfs_do_force_shutdown(0x1) called from line 377 of file
> >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> >> 0xffffffffa0526848
> >> Aug  6 15:23:07 dev kernel: [  304.413837]  [<ffffffffa04e0b60>]
> >> xfs_do_force_shutdown+0x40/0x180 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.413870]  [<ffffffffa0526848>] ?
> >> xlog_recover_iodone+0x48/0x70 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.413902]  [<ffffffffa0526848>]
> >> xlog_recover_iodone+0x48/0x70 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.413923]  [<ffffffffa04d645d>]
> >> xfs_buf_iodone_work+0x4d/0xa0 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.413930]  [<ffffffff81077a11>]
> >> process_one_work+0x141/0x4a0
> >> Aug  6 15:23:07 dev kernel: [  304.413937]  [<ffffffff810789e8>]
> >> worker_thread+0x168/0x410
> >> Aug  6 15:23:07 dev kernel: [  304.413943]  [<ffffffff81078880>] ?
> >> manage_workers+0x120/0x120
> >> Aug  6 15:23:07 dev kernel: [  304.413949]  [<ffffffff8107df10>]
> >> kthread+0xc0/0xd0
> >> Aug  6 15:23:07 dev kernel: [  304.413954]  [<ffffffff8107de50>] ?
> >> flush_kthread_worker+0xb0/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.413976]  [<ffffffff816ab86c>]
> >> ret_from_fork+0x7c/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.413986]  [<ffffffff8107de50>] ?
> >> flush_kthread_worker+0xb0/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.413990] ---[ end trace 988d698520e1fa81
> >> ]---
> >> Aug  6 15:23:07 dev kernel: [  304.414012] XFS (dm-19): I/O Error Detected.
> >> Shutting down filesystem
> >> Aug  6 15:23:07 dev kernel: [  304.415936] XFS (dm-19): Please umount the
> >> filesystem and rectify the problem(s)
> >>
> >> But the rest of the callbacks also arrive:
> >> Aug  6 15:23:07 dev kernel: [  304.417812] XFS (dm-19): metadata I/O error:
> >> block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
> >> Aug  6 15:23:07 dev kernel: [  304.420420] XFS (dm-19):
> >> xfs_do_force_shutdown(0x1) called from line 377 of file
> >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> >> 0xffffffffa0526848
> >> Aug  6 15:23:07 dev kernel: [  304.420427] XFS (dm-19): metadata I/O error:
> >> block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
> >> Aug  6 15:23:07 dev kernel: [  304.422708] XFS (dm-19):
> >> xfs_do_force_shutdown(0x1) called from line 377 of file
> >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> >> 0xffffffffa0526848
> >> Aug  6 15:23:07 dev kernel: [  304.422738] XFS (dm-19): metadata I/O error:
> >> block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8
> >>
> >> The mount sequence fails and goes back to the caller:
> >> Aug  6 15:23:07 dev kernel: [  304.423438] XFS (dm-19): log mount/recovery
> >> failed: error 28
> >> Aug  6 15:23:07 dev kernel: [  304.423757] XFS (dm-19): log mount failed
> >>
> >> But there are still additional callbacks to deliver, which the mount
> >> sequence did not wait for!
> >> Aug  6 15:23:07 dev kernel: [  304.425717] XFS ( @dR):
> >> xfs_do_force_shutdown(0x1) called from line 377 of file
> >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> >> 0xffffffffa0526848
> >> Aug  6 15:23:07 dev kernel: [  304.425723] XFS ( @dR): metadata I/O error:
> >> block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
> >> Aug  6 15:23:07 dev kernel: [  304.428239] XFS ( @dR):
> >> xfs_do_force_shutdown(0x1) called from line 377 of file
> >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> >> 0xffffffffa0526848
> >> Aug  6 15:23:07 dev kernel: [  304.428246] XFS ( @dR): metadata I/O error:
> >> block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16
> >>
> >> Notice the junk that they are printing! Naturally, because xfs_mount
> >> structure has been kfreed.
> >>
> >> Finally the kernel crashes (instead of printing junk), because the xfs_mount
> >> structure is gone, but the callback tries to access it (printing the name):
> >>
> >> Aug  6 15:23:07 dev kernel: [  304.430796] general protection fault: 0000
> >> [#1] SMP
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Modules linked in: xfrm_user
> >> xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O)
> >> scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c
> >> raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
> >> async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic
> >> twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
> >> twofish_common camellia_generic serpent_generic blowfish_generic
> >> blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc
> >> rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat
> >> fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw
> >> parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache
> >> lockd sunrpc lp parport floppy
> >> Aug  6 15:23:07 dev kernel: [  304.432035] CPU 1
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Pid: 133, comm: kworker/1:1H
> >> Tainted: G        W  O 3.8.13-557-generic #1382000791 Bochs Bochs
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RIP: 0010:[<ffffffff8133c2cb>]
> >> [<ffffffff8133c2cb>] strnlen+0xb/0x30
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RSP: 0018:ffff880035461b08
> >> EFLAGS: 00010086
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RAX: 0000000000000000 RBX:
> >> ffffffff81e6a4e7 RCX: 0000000000000000
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RDX: e4e8390a265c0000 RSI:
> >> ffffffffffffffff RDI: e4e8390a265c0000
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RBP: ffff880035461b08 R08:
> >> 000000000000ffff R09: 000000000000ffff
> >> Aug  6 15:23:07 dev kernel: [  304.432035] R10: 0000000000000000 R11:
> >> 00000000000004cd R12: e4e8390a265c0000
> >> Aug  6 15:23:07 dev kernel: [  304.432035] R13: ffffffff81e6a8c0 R14:
> >> 0000000000000000 R15: 000000000000ffff
> >> Aug  6 15:23:07 dev kernel: [  304.432035] FS:  0000000000000000(0000)
> >> GS:ffff88007fc80000(0000) knlGS:0000000000000000
> >> Aug  6 15:23:07 dev kernel: [  304.432035] CS:  0010 DS: 0000 ES: 0000 CR0:
> >> 000000008005003b
> >> Aug  6 15:23:07 dev kernel: [  304.432035] CR2: 00007fc902ffbfd8 CR3:
> >> 000000007702a000 CR4: 00000000000006e0
> >> Aug  6 15:23:07 dev kernel: [  304.432035] DR0: 0000000000000000 DR1:
> >> 0000000000000000 DR2: 0000000000000000
> >> Aug  6 15:23:07 dev kernel: [  304.432035] DR3: 0000000000000000 DR6:
> >> 00000000ffff0ff0 DR7: 0000000000000400
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Process kworker/1:1H (pid: 133,
> >> threadinfo ffff880035460000, task ffff880035412e00)
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Stack:
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461b48
> >> ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  ffffffffa0566cba
> >> ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461bc8
> >> ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Call Trace:
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133dd5e>]
> >> string.isra.4+0x3e/0xd0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133ef59>]
> >> vsnprintf+0x219/0x640
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133f441>]
> >> vscnprintf+0x11/0x30
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105a971>]
> >> vprintk_emit+0xc1/0x490
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105aa20>] ?
> >> vprintk_emit+0x170/0x490
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8168b992>]
> >> printk+0x61/0x63
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9bf1>]
> >> __xfs_printk+0x31/0x50 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9e43>]
> >> xfs_notice+0x53/0x60 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e0c15>]
> >> xfs_do_force_shutdown+0xf5/0x180 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>] ?
> >> xlog_recover_iodone+0x48/0x70 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>]
> >> xlog_recover_iodone+0x48/0x70 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04d645d>]
> >> xfs_buf_iodone_work+0x4d/0xa0 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81077a11>]
> >> process_one_work+0x141/0x4a0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff810789e8>]
> >> worker_thread+0x168/0x410
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81078880>] ?
> >> manage_workers+0x120/0x120
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107df10>]
> >> kthread+0xc0/0xd0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ?
> >> flush_kthread_worker+0xb0/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff816ab86c>]
> >> ret_from_fork+0x7c/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ?
> >> flush_kthread_worker+0xb0/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Code: 31 c0 80 3f 00 55 48 89 e5
> >> 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31
> >> c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee
> >> 01 80 38 00 74
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RIP  [<ffffffff8133c2cb>]
> >> strnlen+0xb/0x30
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  RSP <ffff880035461b08>
> >>
> >>
> >> So previously you said: "So, something is corrupting memory and stamping all
> >> over the XFS structures." and also "given you have a bunch of out of tree
> >> modules loaded (and some which are experiemental) suggests that you have a
> >> problem with your storage...".
> >>
> >> But I believe, my analysis shows that during the mount sequence XFS does not
> >> wait properly for all the bios to complete, before failing the mount
> >> sequence back to the caller.
> >>
> >
> > As an experiment, what about the following? Compile tested only and not
> > safe for general use.
> >
> > What might help more is to see if you can create a reproducer on a
> > recent, clean kernel. Perhaps a metadump of your reproducer fs combined
> > with whatever block device ENOSPC hack you're using would do it.
> >
> > Brian
> >
> > ---8<---
> >
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index cd7b8ca..fbcf524 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> >   * case nothing will ever complete.  It returns the I/O error code, if any, or
> >   * 0 if there was no error.
> >   */
> > -int
> > -xfs_buf_iowait(
> > -       xfs_buf_t               *bp)
> > +static int
> > +__xfs_buf_iowait(
> > +       struct xfs_buf          *bp,
> > +       bool                    skip_error)
> >  {
> >         trace_xfs_buf_iowait(bp, _RET_IP_);
> >
> > -       if (!bp->b_error)
> > +       if (skip_error || !bp->b_error)
> >                 wait_for_completion(&bp->b_iowait);
> >
> >         trace_xfs_buf_iowait_done(bp, _RET_IP_);
> >         return bp->b_error;
> >  }
> >
> > +int
> > +xfs_buf_iowait(
> > +       struct xfs_buf          *bp)
> > +{
> > +       return __xfs_buf_iowait(bp, false);
> > +}
> > +
> >  xfs_caddr_t
> >  xfs_buf_offset(
> >         xfs_buf_t               *bp,
> > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> >                 bp = list_first_entry(&io_list, struct xfs_buf, b_list);
> >
> >                 list_del_init(&bp->b_list);
> > -               error2 = xfs_buf_iowait(bp);
> > +               error2 = __xfs_buf_iowait(bp, true);
> >                 xfs_buf_relse(bp);
> >                 if (!error)
> >                         error = error2;
> >
> > ---
> I think that this patch fixes the problem. I tried reproducing it like
> 30 times, and it doesn't happen with this patch. Dropping this patch
> reproduces the problem within 1 or 2 tries. Thanks!
> What are next steps? How to make it "safe for general use"?
> 

Ok, thanks for testing. I think that implicates the caller bypassing the
expected blocking with the right sequence of log recovery I/Os and
device failure. TBH, I'd still like to see the specifics, if possible.
Could you come up with a generic reproducer for this? I think a metadump
of the fs with the dirty log plus whatever device failure simulation
hack you're using would suffice.
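
(To illustrate what is being asked for -- these are not commands from the
thread, just a sketch with placeholder device/file names -- capturing and
restoring the dirty-log image would look roughly like:

    # on the failing box, with the fs unmounted; -g shows progress,
    # -o keeps names unobfuscated
    xfs_metadump -g -o /dev/dm-19 recovery-fail.metadump

    # on the test box, restore into a sparse image file
    xfs_mdrestore recovery-fail.metadump recovery-fail.img

The restored image preserves the dirty log, so mounting it will attempt the
same recovery sequence.)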

The ideal fix is not yet clear to me. Technically, we could always find
a way to customize this particular path to rely on b_iowait since that
appears safe, but that could just be a band aid over a larger problem.
I'll need to step back and stare at this code some more to try and
understand the layering better, then follow up with something when
things are more clear.

Brian

> Thanks,
> Alex.
> 
> 
> 
> 
> 
> >
> >> Thanks,
> >> Alex.
> >>
> >>
> >>
> >> -----Original Message----- From: Dave Chinner
> >> Sent: 05 August, 2014 2:07 AM
> >> To: Alex Lyakas
> >> Cc: xfs@oss.sgi.com
> >> Subject: Re: use-after-free on log replay failure
> >>
> >> On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
> >> >Greetings,
> >> >
> >> >we had a log replay failure due to some errors that the underlying
> >> >block device returned:
> >> >[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
> >> >("xlog_recover_iodone") error 28 numblks 16
> >> >[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
> >> >[49133.802644] XFS (dm-95): log mount failed
> >>
> >> #define ENOSPC          28      /* No space left on device */
> >>
> >> You're getting an ENOSPC as a metadata IO error during log recovery?
> >> Thin provisioning problem, perhaps, and the error is occurring on
> >> submission rather than completion? If so:
> >>
> >> 8d6c121 xfs: fix buffer use after free on IO error
> >>
> >> Cheers,
> >>
> >> Dave.
> >> --
> >> Dave Chinner
> >> david@fromorbit.com
> >>
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-11 13:20                                       ` Brian Foster
@ 2014-08-11 21:52                                         ` Dave Chinner
  2014-08-12 12:03                                           ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2014-08-11 21:52 UTC (permalink / raw)
  To: Brian Foster; +Cc: Alex Lyakas, xfs

On Mon, Aug 11, 2014 at 09:20:57AM -0400, Brian Foster wrote:
> On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote:
> > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> wrote:
> > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
.....
> > >> But I believe, my analysis shows that during the mount sequence XFS does not
> > >> wait properly for all the bios to complete, before failing the mount
> > >> sequence back to the caller.
> > >>
> > >
> > > As an experiment, what about the following? Compile tested only and not
> > > safe for general use.
...
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index cd7b8ca..fbcf524 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> > >   * case nothing will ever complete.  It returns the I/O error code, if any, or
> > >   * 0 if there was no error.
> > >   */
> > > -int
> > > -xfs_buf_iowait(
> > > -       xfs_buf_t               *bp)
> > > +static int
> > > +__xfs_buf_iowait(
> > > +       struct xfs_buf          *bp,
> > > +       bool                    skip_error)
> > >  {
> > >         trace_xfs_buf_iowait(bp, _RET_IP_);
> > >
> > > -       if (!bp->b_error)
> > > +       if (skip_error || !bp->b_error)
> > >                 wait_for_completion(&bp->b_iowait);
> > >
> > >         trace_xfs_buf_iowait_done(bp, _RET_IP_);
> > >         return bp->b_error;
> > >  }
> > >
> > > +int
> > > +xfs_buf_iowait(
> > > +       struct xfs_buf          *bp)
> > > +{
> > > +       return __xfs_buf_iowait(bp, false);
> > > +}
> > > +
> > >  xfs_caddr_t
> > >  xfs_buf_offset(
> > >         xfs_buf_t               *bp,
> > > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> > >                 bp = list_first_entry(&io_list, struct xfs_buf, b_list);
> > >
> > >                 list_del_init(&bp->b_list);
> > > -               error2 = xfs_buf_iowait(bp);
> > > +               error2 = __xfs_buf_iowait(bp, true);
> > >                 xfs_buf_relse(bp);
> > >                 if (!error)
> > >                         error = error2;

Not waiting here on buffer error should not matter. Any buffer that
is under IO and requires completion should be referenced, and that
means it should be caught and waited on by xfs_wait_buftarg() in the
mount failure path after log recovery fails.

> > > ---
> > I think that this patch fixes the problem. I tried reproducing it like
> > 30 times, and it doesn't happen with this patch. Dropping this patch
> > reproduces the problem within 1 or 2 tries. Thanks!
> > What are next steps? How to make it "safe for general use"?
> > 
> 
> Ok, thanks for testing. I think that implicates the caller bypassing the
> expected blocking with the right sequence of log recovery I/Os and
> device failure. TBH, I'd still like to see the specifics, if possible.
> Could you come up with a generic reproducer for this? I think a metadump
> of the fs with the dirty log plus whatever device failure simulation
> hack you're using would suffice.

The real issue is we don't know exactly what code is being tested
(it's 3.8 + random bug fix backports + custom code). Even if we have
a reproducer there's no guarantee it will reproduce on a current
kernel. IOWs, we are stumbling around in the dark bashing our heads
against everything in the room, and that just wastes everyone's
time.

We need a reproducer that triggers on a current, unmodified
kernel release. You can use dm-flakey to error out all writes just
like you are doing with your custom code. See
xfstests::tests/generic/321 and common/dmflakey for how to do this.
Ideally the reproducer is in a form that xfstests can use....
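
(For illustration only -- the real helpers live in common/dmflakey, and the
device name below is a placeholder -- a hand-rolled flakey setup looks
roughly like:

    DEV=/dev/vde
    SIZE=$(blockdev --getsz "$DEV")
    # up_interval=0, down_interval=180: the device errors all I/O while "down"
    dmsetup create flakey-test --table "0 $SIZE flakey $DEV 0 0 180"
    # or, to silently drop writes instead of erroring them:
    # dmsetup create flakey-test --table "0 $SIZE flakey $DEV 0 0 180 1 drop_writes"
    mount /dev/mapper/flakey-test /mnt/test

xfstests wraps this kind of table switching in its dmflakey helpers.)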

If you can't reproduce it on an upstream kernel, then git bisect is
your friend. It will find the commit that fixed the problem you are
seeing....

> The ideal fix is not yet clear to me.

We are not even that far along - the root cause of the bug is not at
all clear to me. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-11 21:52                                         ` Dave Chinner
@ 2014-08-12 12:03                                           ` Brian Foster
  2014-08-12 12:39                                             ` Alex Lyakas
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2014-08-12 12:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Alex Lyakas, xfs

On Tue, Aug 12, 2014 at 07:52:07AM +1000, Dave Chinner wrote:
> On Mon, Aug 11, 2014 at 09:20:57AM -0400, Brian Foster wrote:
> > On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote:
> > > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> wrote:
> > > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
> .....
> > > >> But I believe, my analysis shows that during the mount sequence XFS does not
> > > >> wait properly for all the bios to complete, before failing the mount
> > > >> sequence back to the caller.
> > > >>
> > > >
> > > > As an experiment, what about the following? Compile tested only and not
> > > > safe for general use.
> ...
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index cd7b8ca..fbcf524 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> > > >   * case nothing will ever complete.  It returns the I/O error code, if any, or
> > > >   * 0 if there was no error.
> > > >   */
> > > > -int
> > > > -xfs_buf_iowait(
> > > > -       xfs_buf_t               *bp)
> > > > +static int
> > > > +__xfs_buf_iowait(
> > > > +       struct xfs_buf          *bp,
> > > > +       bool                    skip_error)
> > > >  {
> > > >         trace_xfs_buf_iowait(bp, _RET_IP_);
> > > >
> > > > -       if (!bp->b_error)
> > > > +       if (skip_error || !bp->b_error)
> > > >                 wait_for_completion(&bp->b_iowait);
> > > >
> > > >         trace_xfs_buf_iowait_done(bp, _RET_IP_);
> > > >         return bp->b_error;
> > > >  }
> > > >
> > > > +int
> > > > +xfs_buf_iowait(
> > > > +       struct xfs_buf          *bp)
> > > > +{
> > > > +       return __xfs_buf_iowait(bp, false);
> > > > +}
> > > > +
> > > >  xfs_caddr_t
> > > >  xfs_buf_offset(
> > > >         xfs_buf_t               *bp,
> > > > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> > > >                 bp = list_first_entry(&io_list, struct xfs_buf, b_list);
> > > >
> > > >                 list_del_init(&bp->b_list);
> > > > -               error2 = xfs_buf_iowait(bp);
> > > > +               error2 = __xfs_buf_iowait(bp, true);
> > > >                 xfs_buf_relse(bp);
> > > >                 if (!error)
> > > >                         error = error2;
> 
> Not waiting here on buffer error should not matter. Any buffer that
> is under IO and requires completion should be referenced, and that
> means it should be caught and waited on by xfs_wait_buftarg() in the
> mount failure path after log recovery fails.
> 

I think that assumes the I/O is successful. Looking through
xlog_recover_buffer_pass2() as an example, we read the buffer which
should return with b_hold == 1. The delwri queue increments the hold and
we xfs_buf_relse() in the return path (i.e., buffer is now held by the
delwri queue awaiting submission).

Sometime later we delwri_submit()... xfs_buf_iorequest() does an
xfs_buf_hold() and xfs_buf_rele() within that single function. The
delwri_submit() releases its hold after xfs_buf_iowait(), at which point
I guess bp should go onto the LRU (b_hold back to 1 in
xfs_buf_rele()). Indeed, the caller has lost scope of the buffer at this
point.

So unless I'm missing something or have the lifecycle wrong here, which is
easily possible ;), this all hinges on xfs_buf_iowait(). That's where
the last hold forcing the buffer to stay around goes away.
xfs_buftarg_wait_rele() will dispose of the buffer if b_hold == 1. If
xfs_buf_iowait() is racy in the event of I/O errors via the bio
callback, I think this path is susceptible just the same.

> > > > ---
> > > I think that this patch fixes the problem. I tried reproducing it like
> > > 30 times, and it doesn't happen with this patch. Dropping this patch
> > > reproduces the problem within 1 or 2 tries. Thanks!
> > > What are next steps? How to make it "safe for general use"?
> > > 
> > 
> > Ok, thanks for testing. I think that implicates the caller bypassing the
> > expected blocking with the right sequence of log recovery I/Os and
> > device failure. TBH, I'd still like to see the specifics, if possible.
> > Could you come up with a generic reproducer for this? I think a metadump
> > of the fs with the dirty log plus whatever device failure simulation
> > hack you're using would suffice.
> 
> The real issue is we don't know exactly what code is being tested
> (it's 3.8 + random bug fix backports + custom code). Even if we have
> a reproducer there's no guarantee it will reproduce on a current
> kernel. IOWs, we are stumbling around in the dark bashing our heads
> against everything in the room, and that just wastes everyone's
> time.
> 
> We need a reproducer that triggers on a current, unmodified
> kernel release. You can use dm-faulty to error out all writes just
> like you are doing with your custom code. See
> xfstests::tests/generic/321 and common/dmflakey for to do this.
> Ideally the reproducer is in a form that xfstests can use....
> 
> If you can't reproduce it on an upstream kernel, then git bisect is
> your friend. It will find the commit that fixed the problem you are
> seeing....
> 

Ugh, yeah. The fact that this was customized as such apparently went
over my head. I agree completely. This needs to be genericized to a
pristine, preferably current kernel. The experiment patch could be
papering over something completely different.

> > The ideal fix is not yet clear to me.
> 
> We are not even that far along - the root cause of the bug is not at
> all clear to me. :/
> 

Yeah.. the above was just the theory that motivated the experiment in
the previously posted patch. It of course remains a theory until we can
see the race in action. I was referring to the potential fix for the
raciness of xfs_buf_iowait() with regard to bio errors and the wq iodone
handling, while still asking for a reproducer to confirm the actual
problem. FWIW, I'm not too high on changes in the buf management code,
even a smallish behavior change, without a real trace of some sort that
documents the problem and justifies the change.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-12 12:03                                           ` Brian Foster
@ 2014-08-12 12:39                                             ` Alex Lyakas
  2014-08-12 19:31                                               ` Brian Foster
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Alex Lyakas @ 2014-08-12 12:39 UTC (permalink / raw)
  To: Brian Foster, Dave Chinner; +Cc: xfs

[-- Attachment #1: Type: text/plain, Size: 13483 bytes --]

Hello Dave, Brian,
I will describe the generic reproduction that you asked for.

It was performed on pristine XFS code from 3.8.13, taken from here:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git

top commit being:
commit dbf932a9b316d5b29b3e220e5a30e7a165ad2992
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Sat May 11 13:57:46 2013 -0700

    Linux 3.8.13


I made a single (I swear!) code change in XFS:

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 96fcbb8..d756bf6 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3736,6 +3736,9 @@ xlog_recover(
        if ((error = xlog_find_tail(log, &head_blk, &tail_blk)))
                return error;

+       xfs_notice(log->l_mp, "Sleep 10s before xlog_do_recover");
+       msleep(10000);
+
        if (tail_blk != head_blk) {
                /* There used to be a comment here:
                 *

Fresh XFS was formatted on a 20 GB block device within a VM, using:
mkfs.xfs -f -K /dev/vde -p /etc/zadara/xfs.protofile
and:
root@vc-00-00-1383-dev:~# cat /etc/zadara/xfs.protofile
dummy                   : bootfilename, not used, backward compatibility
0 0                             : numbers of blocks and inodes, not used, backward compatibility
d--777 0 0              : set 777 perms for the root dir
$
$

I mounted XFS with the following options:
rw,sync,noatime,wsync,attr2,inode64,noquota 0 0

I started a couple of processes writing files sequentially onto this mount
point, and after a few seconds crashed the VM.
When the VM came up, I took the metadump file and placed it in:
https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing

Then I set up the following Device Mapper target onto /dev/vde:
dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
I am attaching the code (and Makefile) of the dm-linear-custom target. It is
an exact copy of dm-linear, except that it has a module parameter. With the
parameter set to 0, this is an identity mapping onto /dev/vde. If the
parameter is set to non-0, all WRITE bios are failed with ENOSPC. There is a
workqueue to fail them in a different context (not sure if really needed,
but that's what our "real" custom block device does).

Now I did:
mount -o noatime,sync /dev/mapper/VDE /mnt/xfs

The log recovery flow went into the sleep that I added, and then I did:
echo 1 > /sys/module/dm_linear_custom/parameters/fail_writes

Problem reproduced:
Aug 12 14:23:04 vc-00-00-1383-dev kernel: [  175.000657] XFS (dm-0): 
Mounting Filesystem
Aug 12 14:23:04 vc-00-00-1383-dev kernel: [  175.026991] XFS (dm-0): Sleep 
10s before xlog_do_recover
Aug 12 14:23:14 vc-00-00-1383-dev kernel: [  185.028113] XFS (dm-0): 
Starting recovery (logdev: internal)
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556622] XFS (dm-0): 
metadata I/O error: block 0x2 ("xlog_recover_iodone") error 28 numblks 1
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556675] XFS (dm-0): 
metadata I/O error: block 0x40 ("xlog_recover_iodone") error 28 numblks 16
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556680] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address 
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556683] XFS (dm-0): I/O 
Error Detected. Shutting down filesystem
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556684] XFS (dm-0): Please 
umount the filesystem and rectify the problem(s)
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556766] XFS (dm-0): 
metadata I/O error: block 0xa00002 ("xlog_recover_iodone") error 5 numblks 1
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556769] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address 
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556771] XFS (dm-0): 
metadata I/O error: block 0xa00008 ("xlog_recover_iodone") error 5 numblks 8
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556774] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address 
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556776] XFS (dm-0): 
metadata I/O error: block 0xa00010 ("xlog_recover_iodone") error 5 numblks 8
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556779] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address 
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556781] XFS (dm-0): 
metadata I/O error: block 0xa00018 ("xlog_recover_iodone") error 5 numblks 8
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556783] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address 
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556785] XFS (dm-0): 
metadata I/O error: block 0xa00040 ("xlog_recover_iodone") error 5 numblks 
16
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556788] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address 
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556790] XFS (dm-0): 
metadata I/O error: block 0xa00050 ("xlog_recover_iodone") error 5 numblks 
16
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556793] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address 
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556880] XFS (dm-0): 
metadata I/O error: block 0xa00001 ("xlog_recover_iodone") error 28 numblks 
1
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556884] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address 
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556977] XFS (dm-0): log 
mount/recovery failed: error 28
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.557215] XFS (dm-0): log 
mount failed
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.573194] XFS (): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address 
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.573214] XFS (): metadata 
I/O error: block 0x18 ("xlog_recover_iodone") error 28 numblks 8
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.574685] XFS (): 
xfs_do_force_shutdown(0x1) called from line 377 of file 
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address 
= 0xffffffffa0349f68

As you see, after the mount completes, IO callbacks are still arriving and
printing junk (an empty string in this case). Immediately after that the
kernel dies.

Is this description generic enough?

Thanks,
Alex.




-----Original Message----- 
From: Brian Foster
Sent: 12 August, 2014 3:03 PM
To: Dave Chinner
Cc: Alex Lyakas ; xfs@oss.sgi.com
Subject: Re: use-after-free on log replay failure

On Tue, Aug 12, 2014 at 07:52:07AM +1000, Dave Chinner wrote:
> On Mon, Aug 11, 2014 at 09:20:57AM -0400, Brian Foster wrote:
> > On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote:
> > > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> 
> > > wrote:
> > > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
> .....
> > > >> But I believe, my analysis shows that during the mount sequence XFS 
> > > >> does not
> > > >> wait properly for all the bios to complete, before failing the 
> > > >> mount
> > > >> sequence back to the caller.
> > > >>
> > > >
> > > > As an experiment, what about the following? Compile tested only and 
> > > > not
> > > > safe for general use.
> ...
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index cd7b8ca..fbcf524 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> > > >   * case nothing will ever complete.  It returns the I/O error code, 
> > > > if any, or
> > > >   * 0 if there was no error.
> > > >   */
> > > > -int
> > > > -xfs_buf_iowait(
> > > > -       xfs_buf_t               *bp)
> > > > +static int
> > > > +__xfs_buf_iowait(
> > > > +       struct xfs_buf          *bp,
> > > > +       bool                    skip_error)
> > > >  {
> > > >         trace_xfs_buf_iowait(bp, _RET_IP_);
> > > >
> > > > -       if (!bp->b_error)
> > > > +       if (skip_error || !bp->b_error)
> > > >                 wait_for_completion(&bp->b_iowait);
> > > >
> > > >         trace_xfs_buf_iowait_done(bp, _RET_IP_);
> > > >         return bp->b_error;
> > > >  }
> > > >
> > > > +int
> > > > +xfs_buf_iowait(
> > > > +       struct xfs_buf          *bp)
> > > > +{
> > > > +       return __xfs_buf_iowait(bp, false);
> > > > +}
> > > > +
> > > >  xfs_caddr_t
> > > >  xfs_buf_offset(
> > > >         xfs_buf_t               *bp,
> > > > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> > > >                 bp = list_first_entry(&io_list, struct xfs_buf, 
> > > > b_list);
> > > >
> > > >                 list_del_init(&bp->b_list);
> > > > -               error2 = xfs_buf_iowait(bp);
> > > > +               error2 = __xfs_buf_iowait(bp, true);
> > > >                 xfs_buf_relse(bp);
> > > >                 if (!error)
> > > >                         error = error2;
>
> Not waiting here on buffer error should not matter. Any buffer that
> is under IO and requires completion should be referenced, and that
> means it should be caught and waited on by xfs_wait_buftarg() in the
> mount failure path after log recovery fails.
>

I think that assumes the I/O is successful. Looking through
xlog_recover_buffer_pass2() as an example, we read the buffer which
should return with b_hold == 1. The delwri queue increments the hold and
we xfs_buf_relse() in the return path (i.e., buffer is now held by the
delwri queue awaiting submission).

Sometime later we delwri_submit()... xfs_buf_iorequest() does an
xfs_buf_hold() and xfs_buf_rele() within that single function. The
delwri_submit() releases its hold after xfs_buf_iowait(), which I guess
at that point bp should go onto the lru (b_hold back to 1 in
xfs_buf_rele()). Indeed, the caller has lost scope of the buffer at this
point.

So unless I miss something or got the lifecycle wrong here, which is
easily possible ;), this all hinges on xfs_buf_iowait(). That's where
the last hold forcing the buffer to stay around goes away.
xfs_buftarg_wait_rele() will dispose the buffer if b_hold == 1. If
xfs_buf_iowait() is racy in the event of I/O errors via the bio
callback, I think this path is susceptible just the same.

> > > > ---
> > > I think that this patch fixes the problem. I tried reproducing it like
> > > 30 times, and it doesn't happen with this patch. Dropping this patch
> > > reproduces the problem within 1 or 2 tries. Thanks!
> > > What are next steps? How to make it "safe for general use"?
> > >
> >
> > Ok, thanks for testing. I think that implicates the caller bypassing the
> > expected blocking with the right sequence of log recovery I/Os and
> > device failure. TBH, I'd still like to see the specifics, if possible.
> > Could you come up with a generic reproducer for this? I think a metadump
> > of the fs with the dirty log plus whatever device failure simulation
> > hack you're using would suffice.
>
> The real issue is we don't know exactly what code is being tested
> (it's 3.8 + random bug fix backports + custom code). Even if we have
> a reproducer there's no guarantee it will reproduce on a current
> kernel. IOWs, we are stumbling around in the dark bashing our heads
> against everything in the room, and that just wastes everyone's
> time.
>
> We need a reproducer that triggers on a current, unmodified
> kernel release. You can use dm-faulty to error out all writes just
> like you are doing with your custom code. See
> xfstests::tests/generic/321 and common/dmflakey for how to do this.
> Ideally the reproducer is in a form that xfstests can use....
>
> If you can't reproduce it on an upstream kernel, then git bisect is
> your friend. It will find the commit that fixed the problem you are
> seeing....
>

Ugh, yeah. The fact that this was customized as such apparently went
over my head. I agree completely. This needs to be genericized to a
pristine, preferably current kernel. The experiment patch could be
papering over something completely different.

> > The ideal fix is not yet clear to me.
>
> We are not even that far along - the root cause of the bug is not at
> all clear to me. :/
>

Yeah.. the above was just the theory that motivated the experiment in
the previously posted patch. It of course remains a theory until we can
see the race in action. I was referring to the potential fix for the
raciness of xfs_buf_iowait() with regard to bio errors and the wq iodone
handling, while still asking for a reproducer to confirm the actual
problem. FWIW, I'm not too high on changes in the buf management code,
even a smallish behavior change, without a real trace of some sort that
documents the problem and justifies the change.

Brian

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com 

[-- Attachment #2: dm-linear-custom.c --]
[-- Type: application/octet-stream, Size: 5786 bytes --]

/*
 * Copyright (C) 2001-2003 Sistina Software (UK) Limited.
 *
 * This file is released under the GPL.
 */

#include <linux/module.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/device-mapper.h>

#define DM_MSG_PREFIX "linear-custom"

/*
 * Linear: maps a linear range of a device.
 */
struct linear_c {
	struct dm_dev *dev;
	sector_t start;
};

unsigned int fail_writes = 0;
module_param_named(fail_writes, fail_writes, uint, S_IWUSR | S_IRUSR | S_IRGRP | S_IROTH);
MODULE_PARM_DESC(fail_writes, "When set to non-0, all writes will be failed with ENOSPC");

static struct workqueue_struct *fail_writes_wq;

struct fail_writes_work {
	struct work_struct work;
	struct bio *bio;
};

/*
 * Construct a linear mapping: <dev_path> <offset>
 */
static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	struct linear_c *lc;
	unsigned long long tmp;
	char dummy;

	if (argc != 2) {
		ti->error = "Invalid argument count";
		return -EINVAL;
	}

	lc = kmalloc(sizeof(*lc), GFP_KERNEL);
	if (lc == NULL) {
		ti->error = "dm-linear: Cannot allocate linear context";
		return -ENOMEM;
	}

	if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1) {
		ti->error = "dm-linear: Invalid device sector";
		goto bad;
	}
	lc->start = tmp;

	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), &lc->dev)) {
		ti->error = "dm-linear: Device lookup failed";
		goto bad;
	}

	ti->num_flush_requests = 1;
	ti->num_discard_requests = 1;
	ti->num_write_same_requests = 1;
	ti->private = lc;
	return 0;

      bad:
	kfree(lc);
	return -EINVAL;
}

static void linear_dtr(struct dm_target *ti)
{
	struct linear_c *lc = (struct linear_c *) ti->private;

	dm_put_device(ti, lc->dev);
	kfree(lc);
}

static sector_t linear_map_sector(struct dm_target *ti, sector_t bi_sector)
{
	struct linear_c *lc = ti->private;

	return lc->start + dm_target_offset(ti, bi_sector);
}

static void linear_map_bio(struct dm_target *ti, struct bio *bio)
{
	struct linear_c *lc = ti->private;

	bio->bi_bdev = lc->dev->bdev;
	if (bio_sectors(bio))
		bio->bi_sector = linear_map_sector(ti, bio->bi_sector);
}

void linear_fail_write(struct work_struct *work)
{
	struct fail_writes_work *w = container_of(work, struct fail_writes_work, work);

	bio_endio(w->bio, -ENOSPC);
	kfree(w);
}

static int linear_map(struct dm_target *ti, struct bio *bio)
{
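	/* With fail_writes set, do not remap WRITE bios; hand them to the
	 * workqueue below, which completes them with -ENOSPC. If the work
	 * item cannot be allocated or queued, fall through and remap the
	 * bio normally. */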
	if (bio_rw(bio) == WRITE && fail_writes) {
		struct fail_writes_work *w = kmalloc(sizeof(struct fail_writes_work), GFP_NOIO);
		if (w) {
			bool queued = false;

			INIT_WORK(&w->work, linear_fail_write);
			w->bio = bio;
			queued = queue_work(fail_writes_wq, &w->work);
			if (queued)
				return DM_MAPIO_SUBMITTED;

			kfree(w);
		}
	}

	linear_map_bio(ti, bio);

	return DM_MAPIO_REMAPPED;
}

static void linear_status(struct dm_target *ti, status_type_t type,
			  unsigned status_flags, char *result, unsigned maxlen)
{
	struct linear_c *lc = (struct linear_c *) ti->private;

	switch (type) {
	case STATUSTYPE_INFO:
		result[0] = '\0';
		break;

	case STATUSTYPE_TABLE:
		snprintf(result, maxlen, "%s %llu", lc->dev->name,
				(unsigned long long)lc->start);
		break;
	}
}

static int linear_ioctl(struct dm_target *ti, unsigned int cmd,
			unsigned long arg)
{
	struct linear_c *lc = (struct linear_c *) ti->private;
	struct dm_dev *dev = lc->dev;
	int r = 0;

	/*
	 * Only pass ioctls through if the device sizes match exactly.
	 */
	if (lc->start ||
	    ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT)
		r = scsi_verify_blk_ioctl(NULL, cmd);

	return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
}

static int linear_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
			struct bio_vec *biovec, int max_size)
{
	struct linear_c *lc = ti->private;
	struct request_queue *q = bdev_get_queue(lc->dev->bdev);

	if (!q->merge_bvec_fn)
		return max_size;

	bvm->bi_bdev = lc->dev->bdev;
	bvm->bi_sector = linear_map_sector(ti, bvm->bi_sector);

	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
}

static int linear_iterate_devices(struct dm_target *ti,
				  iterate_devices_callout_fn fn, void *data)
{
	struct linear_c *lc = ti->private;

	return fn(ti, lc->dev, lc->start, ti->len, data);
}

static struct target_type linearc_target = {
	.name   = "linear-custom",
	.version = {1, 2, 1},
	.module = THIS_MODULE,
	.ctr    = linear_ctr,
	.dtr    = linear_dtr,
	.map    = linear_map,
	.status = linear_status,
	.ioctl  = linear_ioctl,
	.merge  = linear_merge,
	.iterate_devices = linear_iterate_devices,
};

int dm_linear_custom_init(void)
{
	int r = 0;

	DMINFO("Going to register target: %s", linearc_target.name);
	r = dm_register_target(&linearc_target);
	if (r < 0) {
		DMERR("register failed %d", r);
		goto out;
	}

	fail_writes_wq = alloc_workqueue("dm_linear_custom_fail_writes_WQ",
		WQ_NON_REENTRANT | WQ_UNBOUND | WQ_MEM_RECLAIM, 
		0/*max_active*/);
	if (fail_writes_wq == NULL) {
		r = -ENOMEM;
		goto out_unreg_target;
	}

	return 0;

out_unreg_target:
	dm_unregister_target(&linearc_target);
out:
	return r;
}

void dm_linear_custom_exit(void)
{
	if (fail_writes_wq) {
		/* actually destroy_workqueue() also does drain_workqueue */
		flush_workqueue(fail_writes_wq);
		destroy_workqueue(fail_writes_wq);
		fail_writes_wq = NULL;
	}

	dm_unregister_target(&linearc_target);
}

module_init(dm_linear_custom_init);
module_exit(dm_linear_custom_exit);

MODULE_AUTHOR("Zadara Storage <team@zadarastorage.com>");
MODULE_DESCRIPTION("A modified dm-linear for BrianF");
MODULE_LICENSE("GPL");


[-- Attachment #3: Makefile --]
[-- Type: application/octet-stream, Size: 405 bytes --]

obj-m += dm-linear-custom.o

ccflags-y += -Wall 			# Enable most warning messages
ccflags-y += -Werror		# Error out the compiler on warnings

KVERSION = $(shell uname -r)
DM_LC_DIR=$(shell pwd)
DM_LC_KO=dm-linear-custom.ko

default:
	$(MAKE) -C /lib/modules/$(KVERSION)/build M=$(DM_LC_DIR) EXTRA_CFLAGS=-g modules

clean:
	$(MAKE) -C /lib/modules/$(KVERSION)/build M=$(DM_LC_DIR) clean



[-- Attachment #4: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-12 12:39                                             ` Alex Lyakas
@ 2014-08-12 19:31                                               ` Brian Foster
  2014-08-12 23:56                                               ` Dave Chinner
  2014-08-13  0:03                                               ` Dave Chinner
  2 siblings, 0 replies; 47+ messages in thread
From: Brian Foster @ 2014-08-12 19:31 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: xfs

On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> Hello Dave, Brian,
> I will describe a generic reproduction that you ask for.
> 
> It was performed on pristine XFS code from 3.8.13, taken from here:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> 

This seems generic enough to me. Could you try it on a more recent kernel?
Dave had mentioned there were fixes in this area of log recovery, so a
bisect might be all that is necessary to track down the patch you need.
Otherwise, we can pick up debugging from something more recent.

Brian

> top commit being:
> commit dbf932a9b316d5b29b3e220e5a30e7a165ad2992
> Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Date:   Sat May 11 13:57:46 2013 -0700
> 
>    Linux 3.8.13
> 
> 
> I made a single (I swear!) code change in XFS:
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 96fcbb8..d756bf6 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3736,6 +3736,9 @@ xlog_recover(
>        if ((error = xlog_find_tail(log, &head_blk, &tail_blk)))
>                return error;
> 
> +       xfs_notice(log->l_mp, "Sleep 10s before xlog_do_recover");
> +       msleep(10000);
> +
>        if (tail_blk != head_blk) {
>                /* There used to be a comment here:
>                 *
> 
> Fresh XFS was formatted on a 20 GB block device within a VM, using:
> mkfs.xfs -f -K /dev/vde -p /etc/zadara/xfs.protofile
> and:
> root@vc-00-00-1383-dev:~# cat /etc/zadara/xfs.protofile
> dummy                   : bootfilename, not used, backward compatibility
> 0 0                             : numbers of blocks and inodes, not used,
> backward compatibility
> d--777 0 0              : set 777 perms for the root dir
> $
> $
> 
> I mounted XFS with the following options:
> rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
> 
> I started a couple of processes writing files sequentially onto this mount
> point, and after few seconds crashed the VM.
> When the VM came up, I took the metadump file and placed it in:
> https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
> 
> Then I set up the following Device Mapper target onto /dev/vde:
> dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> I am attaching the code (and Makefile) of dm-linear-custom target. It is
> exact copy of dm-linear, except that it has a module parameter. With the
> parameter set to 0, this is an identity mapping onto /dev/vde. If the
> parameter is set to non-0, all WRITE bios are failed with ENOSPC. There is a
> workqueue to fail them in a different context (not sure if really needed,
> but that's what our "real" custom
> block device does).
> 
> Now I did:
> mount -o noatime,sync /dev/mapper/VDE /mnt/xfs
> 
> The log recovery flow went into the sleep that I added, and then I did:
> echo 1 > /sys/module/dm_linear_custom/parameters/fail_writes
> 
> Problem reproduced:
> Aug 12 14:23:04 vc-00-00-1383-dev kernel: [  175.000657] XFS (dm-0):
> Mounting Filesystem
> Aug 12 14:23:04 vc-00-00-1383-dev kernel: [  175.026991] XFS (dm-0): Sleep
> 10s before xlog_do_recover
> Aug 12 14:23:14 vc-00-00-1383-dev kernel: [  185.028113] XFS (dm-0):
> Starting recovery (logdev: internal)
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556622] XFS (dm-0):
> metadata I/O error: block 0x2 ("xlog_recover_iodone") error 28 numblks 1
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556675] XFS (dm-0):
> metadata I/O error: block 0x40 ("xlog_recover_iodone") error 28 numblks 16
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556680] XFS (dm-0):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address
> = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556683] XFS (dm-0): I/O
> Error Detected. Shutting down filesystem
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556684] XFS (dm-0): Please
> umount the filesystem and rectify the problem(s)
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556766] XFS (dm-0):
> metadata I/O error: block 0xa00002 ("xlog_recover_iodone") error 5 numblks 1
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556769] XFS (dm-0):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address
> = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556771] XFS (dm-0):
> metadata I/O error: block 0xa00008 ("xlog_recover_iodone") error 5 numblks 8
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556774] XFS (dm-0):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address
> = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556776] XFS (dm-0):
> metadata I/O error: block 0xa00010 ("xlog_recover_iodone") error 5 numblks 8
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556779] XFS (dm-0):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address
> = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556781] XFS (dm-0):
> metadata I/O error: block 0xa00018 ("xlog_recover_iodone") error 5 numblks 8
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556783] XFS (dm-0):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address
> = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556785] XFS (dm-0):
> metadata I/O error: block 0xa00040 ("xlog_recover_iodone") error 5 numblks
> 16
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556788] XFS (dm-0):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address
> = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556790] XFS (dm-0):
> metadata I/O error: block 0xa00050 ("xlog_recover_iodone") error 5 numblks
> 16
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556793] XFS (dm-0):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address
> = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556880] XFS (dm-0):
> metadata I/O error: block 0xa00001 ("xlog_recover_iodone") error 28 numblks
> 1
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556884] XFS (dm-0):
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address
> = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.556977] XFS (dm-0): log
> mount/recovery failed: error 28
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.557215] XFS (dm-0): log
> mount failed
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.573194] XFS ():
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address
> = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.573214] XFS (): metadata
> I/O error: block 0x18 ("xlog_recover_iodone") error 28 numblks 8
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [  188.574685] XFS ():
> xfs_do_force_shutdown(0x1) called from line 377 of file
> /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c.  Return address
> = 0xffffffffa0349f68
> 
> As you see, after mount completes, IO callbacks are still arriving and
> printing junk (empty string in this case). Immediately after that kernel
> dies.
> 
> Is this description generic enough?
> 
> Thanks,
> Alex.
> 
> 
> 
> 
> -----Original Message----- From: Brian Foster
> Sent: 12 August, 2014 3:03 PM
> To: Dave Chinner
> Cc: Alex Lyakas ; xfs@oss.sgi.com
> Subject: Re: use-after-free on log replay failure
> 
> On Tue, Aug 12, 2014 at 07:52:07AM +1000, Dave Chinner wrote:
> >On Mon, Aug 11, 2014 at 09:20:57AM -0400, Brian Foster wrote:
> >> On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote:
> >> > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com>
> >> > wrote:
> >> > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
> >.....
> >> > >> But I believe, my analysis shows that during the mount sequence XFS
> >> > >> does not
> >> > >> wait properly for all the bios to complete, before failing the
> >> > >> mount
> >> > >> sequence back to the caller.
> >> > >>
> >> > >
> >> > > As an experiment, what about the following? Compile tested only and
> >> > > not
> >> > > safe for general use.
> >...
> >> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> >> > > index cd7b8ca..fbcf524 100644
> >> > > --- a/fs/xfs/xfs_buf.c
> >> > > +++ b/fs/xfs/xfs_buf.c
> >> > > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> >> > >   * case nothing will ever complete.  It returns the I/O error code,
> >> > > if any, or
> >> > >   * 0 if there was no error.
> >> > >   */
> >> > > -int
> >> > > -xfs_buf_iowait(
> >> > > -       xfs_buf_t               *bp)
> >> > > +static int
> >> > > +__xfs_buf_iowait(
> >> > > +       struct xfs_buf          *bp,
> >> > > +       bool                    skip_error)
> >> > >  {
> >> > >         trace_xfs_buf_iowait(bp, _RET_IP_);
> >> > >
> >> > > -       if (!bp->b_error)
> >> > > +       if (skip_error || !bp->b_error)
> >> > >                 wait_for_completion(&bp->b_iowait);
> >> > >
> >> > >         trace_xfs_buf_iowait_done(bp, _RET_IP_);
> >> > >         return bp->b_error;
> >> > >  }
> >> > >
> >> > > +int
> >> > > +xfs_buf_iowait(
> >> > > +       struct xfs_buf          *bp)
> >> > > +{
> >> > > +       return __xfs_buf_iowait(bp, false);
> >> > > +}
> >> > > +
> >> > >  xfs_caddr_t
> >> > >  xfs_buf_offset(
> >> > >         xfs_buf_t               *bp,
> >> > > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> >> > >                 bp = list_first_entry(&io_list, struct xfs_buf,
> >> > > b_list);
> >> > >
> >> > >                 list_del_init(&bp->b_list);
> >> > > -               error2 = xfs_buf_iowait(bp);
> >> > > +               error2 = __xfs_buf_iowait(bp, true);
> >> > >                 xfs_buf_relse(bp);
> >> > >                 if (!error)
> >> > >                         error = error2;
> >
> >Not waiting here on buffer error should not matter. Any buffer that
> >is under IO and requires completion should be referenced, and that
> >means it should be caught and waited on by xfs_wait_buftarg() in the
> >mount failure path after log recovery fails.
> >
> 
> I think that assumes the I/O is successful. Looking through
> xlog_recover_buffer_pass2() as an example, we read the buffer which
> should return with b_hold == 1. The delwri queue increments the hold and
> we xfs_buf_relse() in the return path (i.e., buffer is now held by the
> delwri queue awaiting submission).
> 
> Sometime later we delwri_submit()... xfs_buf_iorequest() does an
> xfs_buf_hold() and xfs_buf_rele() within that single function. The
> delwri_submit() releases its hold after xfs_buf_iowait(), which I guess
> at that point bp should go onto the lru (b_hold back to 1 in
> xfs_buf_rele(). Indeed, the caller has lost scope of the buffer at this
> point.
> 
> So unless I miss something or got the lifecycle wrong here, which is
> easily possible ;), this all hinges on xfs_buf_iowait(). That's where
> the last hold forcing the buffer to stay around goes away.
> xfs_buftarg_wait_rele() will dispose the buffer if b_hold == 1. If
> xfs_buf_iowait() is racy in the event of I/O errors via the bio
> callback, I think this path is susceptible just the same.
> 
> >> > > ---
> >> > I think that this patch fixes the problem. I tried reproducing it like
> >> > 30 times, and it doesn't happen with this patch. Dropping this patch
> >> > reproduces the problem within 1 or 2 tries. Thanks!
> >> > What are next steps? How to make it "safe for general use"?
> >> >
> >>
> >> Ok, thanks for testing. I think that implicates the caller bypassing the
> >> expected blocking with the right sequence of log recovery I/Os and
> >> device failure. TBH, I'd still like to see the specifics, if possible.
> >> Could you come up with a generic reproducer for this? I think a metadump
> >> of the fs with the dirty log plus whatever device failure simulation
> >> hack you're using would suffice.
> >
> >The real issue is we don't know exactly what code is being tested
> >(it's 3.8 + random bug fix backports + custom code). Even if we have
> >a reproducer there's no guarantee it will reproduce on a current
> >kernel. IOWs, we are stumbling around in the dark bashing our heads
> >against everything in the room, and that just wastes everyone's
> >time.
> >
> >We need a reproducer that triggers on a current, unmodified
> >kernel release. You can use dm-faulty to error out all writes just
> >like you are doing with your custom code. See
> >xfstests::tests/generic/321 and common/dmflakey for how to do this.
> >Ideally the reproducer is in a form that xfstests can use....
> >
> >If you can't reproduce it on an upstream kernel, then git bisect is
> >your friend. It will find the commit that fixed the problem you are
> >seeing....
> >
> 
> Ugh, yeah. The fact that this was customized as such apparently went
> over my head. I agree completely. This needs to be genericized to a
> pristine, preferably current kernel. The experiment patch could be
> papering over something completely different.
> 
> >> The ideal fix is not yet clear to me.
> >
> >We are not even that far along - the root cause of the bug is not at
> >all clear to me. :/
> >
> 
> Yeah.. the above was just the theory that motivated the experiment in
> the previously posted patch. It of course remains a theory until we can
> see the race in action. I was referring to the potential fix for the
> raciness of xfs_buf_iowait() with regard to bio errors and the wq iodone
> handling, while still asking for a reproducer to confirm the actual
> problem. FWIW, I'm not too high on changes in the buf management code,
> even a smallish behavior change, without a real trace of some sort that
> documents the problem and justifies the change.
> 
> Brian
> 
> >Cheers,
> >
> >Dave.
> >-- 
> >Dave Chinner
> >david@fromorbit.com



> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-12 12:39                                             ` Alex Lyakas
  2014-08-12 19:31                                               ` Brian Foster
@ 2014-08-12 23:56                                               ` Dave Chinner
  2014-08-13 12:59                                                 ` Brian Foster
  2014-08-13 17:07                                                 ` Alex Lyakas
  2014-08-13  0:03                                               ` Dave Chinner
  2 siblings, 2 replies; 47+ messages in thread
From: Dave Chinner @ 2014-08-12 23:56 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Brian Foster, xfs

On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> Hello Dave, Brian,
> I will describe a generic reproduction that you ask for.
> 
> It was performed on pristine XFS code from 3.8.13, taken from here:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
....
> I mounted XFS with the following options:
> rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
> 
> I started a couple of processes writing files sequentially onto this
> mount point, and after few seconds crashed the VM.
> When the VM came up, I took the metadump file and placed it in:
> https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
> 
> Then I set up the following Device Mapper target onto /dev/vde:
> dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> I am attaching the code (and Makefile) of dm-linear-custom target.
> It is exact copy of dm-linear, except that it has a module
> parameter. With the parameter set to 0, this is an identity mapping
> onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> failed with ENOSPC. There is a workqueue to fail them in a different
> context (not sure if really needed, but that's what our "real"
> custom
> block device does).

Well, there you go. That explains it - an asynchronous dispatch error
happening fast enough to race with the synchronous XFS dispatch
processing.

dispatch thread			device workqueue
xfs_buf_hold();
atomic_set(b_io_remaining, 1)
atomic_inc(b_io_remaining)
submit_bio(bio)
queue_work(bio)
xfs_buf_ioend(bp, ....);
  atomic_dec(b_io_remaining)
xfs_buf_rele()
				bio error set to ENOSPC
				  bio->end_io()
				    xfs_buf_bio_endio()
				      bp->b_error = ENOSPC
				      _xfs_buf_ioend(bp, 1);
				        atomic_dec(b_io_remaining)
					  xfs_buf_ioend(bp, 1);
					    queue_work(bp)
xfs_buf_iowait()
 if (bp->b_error) return error;
if (error)
  xfs_buf_relse()
    xfs_buf_rele()
      xfs_buf_free()

And now we have a freed buffer that is queued on the io completion
queue. Basically, it requires the buffer error to be set
asynchronously in the window *between* the dispatch thread decrementing
its I/O count after submission and the wait on the IO.

Not sure what the right fix is yet - removing the bp->b_error check
from xfs_buf_iowait() doesn't solve the problem - it just prevents
this code path from being tripped over by the race condition.

But, just to validate this is the problem, you should be able to
reproduce this on a 3.16 kernel. Can you try that, Alex?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-12 12:39                                             ` Alex Lyakas
  2014-08-12 19:31                                               ` Brian Foster
  2014-08-12 23:56                                               ` Dave Chinner
@ 2014-08-13  0:03                                               ` Dave Chinner
  2014-08-13 13:11                                                 ` Brian Foster
  2 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2014-08-13  0:03 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Brian Foster, xfs

On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> Then I set up the following Device Mapper target onto /dev/vde:
> dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> I am attaching the code (and Makefile) of dm-linear-custom target.
> It is exact copy of dm-linear, except that it has a module
> parameter. With the parameter set to 0, this is an identity mapping
> onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> failed with ENOSPC. There is a workqueue to fail them in a different
> context (not sure if really needed, but that's what our "real"
> custom
> block device does).

FWIW, now I've looked at the dm module, this could easily be added
to the dm-flakey driver by adding a "queue_write_error" option
to it (i.e. similar to the current drop_writes and corrupt_bio_byte
options).
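
Roughly, something like this in the flakey map path (sketch only - the
feature bit, the fail_write_fn handler and the workqueue are hypothetical
names here; the real patch would add the parsing to flakey_ctr() and set
the whole thing up properly, much like the attached dm-linear-custom does):

static int flakey_map(struct dm_target *ti, struct bio *bio)
{
	struct flakey_c *fc = ti->private;

	if (bio_data_dir(bio) == WRITE &&
	    test_bit(QUEUE_WRITE_ERROR, &fc->flags)) {
		struct fail_write_work *w = kmalloc(sizeof(*w), GFP_NOIO);

		if (w) {
			/* defer the error to workqueue context, so the
			 * submitter sees an asynchronous failure */
			INIT_WORK(&w->work, fail_write_fn);	/* bio_endio(bio, -EIO) */
			w->bio = bio;
			queue_work(fail_write_wq, &w->work);
			return DM_MAPIO_SUBMITTED;
		}
		/* couldn't queue it - fall through and remap as usual */
	}

	/* ... existing flakey_map() remap/drop/corrupt logic ... */
	return DM_MAPIO_REMAPPED;
}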

If we add the code there, then we could add a debug-only XFS sysfs
variable to trigger the log recovery sleep, and then use dm-flakey
to queue and error out writes. That gives us a reproducible xfstest
for this condition. Brian, does that sound like a reasonable plan to
you?

Thanks for describing the method you've been using to reproduce the
bug so clearly, Alex.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-12 23:56                                               ` Dave Chinner
@ 2014-08-13 12:59                                                 ` Brian Foster
  2014-08-13 20:59                                                   ` Dave Chinner
  2014-08-13 17:07                                                 ` Alex Lyakas
  1 sibling, 1 reply; 47+ messages in thread
From: Brian Foster @ 2014-08-13 12:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Alex Lyakas, xfs

On Wed, Aug 13, 2014 at 09:56:15AM +1000, Dave Chinner wrote:
> On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> > Hello Dave, Brian,
> > I will describe a generic reproduction that you ask for.
> > 
> > It was performed on pristine XFS code from 3.8.13, taken from here:
> > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> ....
> > I mounted XFS with the following options:
> > rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
> > 
> > I started a couple of processes writing files sequentially onto this
> > mount point, and after few seconds crashed the VM.
> > When the VM came up, I took the metadump file and placed it in:
> > https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
> > 
> > Then I set up the following Device Mapper target onto /dev/vde:
> > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> > I am attaching the code (and Makefile) of dm-linear-custom target.
> > It is exact copy of dm-linear, except that it has a module
> > parameter. With the parameter set to 0, this is an identity mapping
> > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> > failed with ENOSPC. There is a workqueue to fail them in a different
> > context (not sure if really needed, but that's what our "real"
> > custom
> > block device does).
> 
> Well, there you go. That explains it - an asynchronous dispatch error
> happening fast enough to race with the synchronous XFS dispatch
> processing.
> 
> dispatch thread			device workqueue
> xfs_buf_hold();
> atomic_set(b_io_remaining, 1)
> atomic_inc(b_io_remaining)
> submit_bio(bio)
> queue_work(bio)
> xfs_buf_ioend(bp, ....);
>   atomic_dec(b_io_remaining)
> xfs_buf_rele()
> 				bio error set to ENOSPC
> 				  bio->end_io()
> 				    xfs_buf_bio_endio()
> 				      bp->b_error = ENOSPC
> 				      _xfs_buf_ioend(bp, 1);
> 				        atomic_dec(b_io_remaining)
> 					  xfs_buf_ioend(bp, 1);
> 					    queue_work(bp)
> xfs_buf_iowait()
>  if (bp->b_error) return error;
> if (error)
>   xfs_buf_relse()
>     xfs_buf_rele()
>       xfs_buf_free()
> 
> And now we have a freed buffer that is queued on the io completion
> queue. Basically, it requires the buffer error to be set
> asynchronously in the window *between* the dispatch thread decrementing
> its I/O count after submission and the wait on the IO.
> 

That's basically the theory I wanted to test with the experimental
patch, i.e. that the error check races with the iodone workqueue item.

> Not sure what the right fix is yet - removing the bp->b_error check
> from xfs_buf_iowait() doesn't solve the problem - it just prevents
> this code path from being tripped over by the race condition.
> 

Perhaps I'm missing some context... I don't follow how removing the
error check doesn't solve the problem. It clearly closes the race and
perhaps there are other means of doing the same thing, but what part of
the problem does that leave unresolved? E.g., we provide a
synchronization mechanism for an async submission path and an object
(xfs_buf) that is involved with potentially multiple such async (I/O)
operations. The async callback side manages the counts of outstanding
bios etc. to set the state of the buf object correctly and fires a
completion when everything is done. The calling side simply waits on the
completion before it can analyze state of the object. Referring to
anything inside that object that happens to be managed by the buffer I/O
mechanism before the buffer is considered complete just seems generally
racy.

It looks like submit_bio() manages this by providing the error through
the callback (always). It also doesn't look like the submission path is
guaranteed to be synchronous either (consider md, which appears to use
workqueues and kernel threads), so I'm not sure that '...;
xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
you're explicitly looking for a write verifier error or something and
do nothing further on the buf contingent on completion (e.g., freeing it
or something it depends on).

Brian

> But, just to validate this is the problem, you should be able to
> reproduce this on a 3.16 kernel. Can you try that, Alex?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-13  0:03                                               ` Dave Chinner
@ 2014-08-13 13:11                                                 ` Brian Foster
  0 siblings, 0 replies; 47+ messages in thread
From: Brian Foster @ 2014-08-13 13:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Alex Lyakas, xfs

On Wed, Aug 13, 2014 at 10:03:12AM +1000, Dave Chinner wrote:
> On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> > Then I set up the following Device Mapper target onto /dev/vde:
> > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> > I am attaching the code (and Makefile) of dm-linear-custom target.
> > It is exact copy of dm-linear, except that it has a module
> > parameter. With the parameter set to 0, this is an identity mapping
> > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> > failed with ENOSPC. There is a workqueue to fail them in a different
> > context (not sure if really needed, but that's what our "real"
> > custom
> > block device does).
> 
> FWIW, now I've looked at the dm module, this could easily be added
> to the dm-flakey driver by adding a "queue_write_error" option
> to it (i.e. similar to the current drop_writes and corrupt_bio_byte
> options).
> 
> If we add the code there, then we could add a debug-only XFS sysfs
> variable to trigger the log recovery sleep, and then use dm-flakey
> to queue and error out writes. That gives us a reproducable xfstest
> for this condition. Brian, does that sound like a reasonable plan to
> you?
> 

It would be nice if we could avoid this kind of timing hack, but I'll
have to look at the related code and see what we have for options. I'm
also assuming the oops/crash is a consistent behavior when we hit the
race, since that probably defines the failure mode of the test.

But yeah, seems like a reasonable plan in general. Added to the todo
list for once we have this sorted out...

Brian

> Thanks for describing the method you've been using to reproduce the
> bug so clearly, Alex.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: use-after-free on log replay failure
  2014-08-12 23:56                                               ` Dave Chinner
  2014-08-13 12:59                                                 ` Brian Foster
@ 2014-08-13 17:07                                                 ` Alex Lyakas
  1 sibling, 0 replies; 47+ messages in thread
From: Alex Lyakas @ 2014-08-13 17:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, xfs

Hi Dave,
I compiled kernel 3.16 today, top commit being:
commit 19583ca584d6f574384e17fe7613dfaeadcdc4a6
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sun Aug 3 15:25:02 2014 -0700

    Linux 3.16


(also had to use the updated copy of dm-linear from 3.16).

I would say that the problem reproduces, but it looks a little different.
A typical flow is:

Aug 13 19:45:48 vc-00-00-1383-dev kernel: [  143.521383] XFS (dm-0): 
Mounting V4 Filesystem
Aug 13 19:45:48 vc-00-00-1383-dev kernel: [  143.558558] XFS (dm-0): Sleep 
10s before log recovery
Aug 13 19:45:58 vc-00-00-1383-dev kernel: [  153.560139] XFS (dm-0): 
Starting recovery (logdev: internal)
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.085826] XFS (dm-0): 
metadata I/O error: block 0x1 ("xlog_recover_iodone") error 28 numblks 1
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.087336] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 383 of file 
/mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.087341] XFS (dm-0): I/O 
Error Detected. Shutting down filesystem
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.088455] XFS (dm-0): Please 
umount the filesystem and rectify the problem(s)
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.088601] XFS (dm-0): 
metadata I/O error: block 0x8 ("xlog_recover_iodone") error 28 numblks 8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.088605] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 383 of file 
/mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.088658] XFS (dm-0): 
metadata I/O error: block 0x10 ("xlog_recover_iodone") error 28 numblks 8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.088663] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 383 of file 
/mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.088702] XFS (dm-0): 
metadata I/O error: block 0x40 ("xlog_recover_iodone") error 28 numblks 16
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.088714] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 383 of file 
/mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.088719] XFS (dm-0): 
metadata I/O error: block 0xa00060 ("xlog_recover_iodone") error 28 numblks 
16
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.088731] XFS (dm-0): 
xfs_do_force_shutdown(0x1) called from line 383 of file 
/mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c.  Return address = 
0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.096288] XFS (dm-0): log 
mount/recovery failed: error 28
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [  154.096482] XFS (dm-0): log 
mount failed

So there is no clearly visible problem like with my development kernel
(which also had KMEMLEAK enabled, which changes timings significantly).
However, after several seconds the kernel panics sporadically with stacks
like [1], [2], and sometimes KVM dies on me with messages like [3], [4]. So
this test definitely corrupts some internal kernel structures, and with
KMEMLEAK enabled the corruption is more visible.

I put the metadump from this experiment here:
https://drive.google.com/file/d/0ByBy89zr3kJNSGxoc3U4X1lOZ2s/edit?usp=sharing

Thanks,
Alex.


[1]
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.223389] BUG: unable to 
handle kernel paging request at 00007f9e50678000
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] IP: 
[<ffffffff811923a8>] kmem_cache_alloc+0x68/0x160
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] PGD 362b8067 PUD 
7a9d9067 PMD 7849f067 PTE 0
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] Oops: 0000 [#1] SMP
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] Modules linked in: 
xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic 
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 
twofish_common camellia_generic serpent_generic blowfish_generic 
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac 
xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin nfsd 
nfs_acl ppdev dm_multipath psmouse parport_pc serio_raw i2c_piix4 mac_hid lp 
rpcsec_gss_krb5 parport auth_rpcgss oid_registry nfsv4 nfs fscache lockd 
sunrpc floppy
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] CPU: 1 PID: 3302 
Comm: cron Tainted: G           O  3.16.0-999-generic #201408131143
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] Hardware name: 
Bochs Bochs, BIOS Bochs 01/01/2007
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] task: 
ffff88007a9d48c0 ti: ffff88007aeb0000 task.ti: ffff88007aeb0000
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] RIP: 
0010:[<ffffffff811923a8>]  [<ffffffff811923a8>] kmem_cache_alloc+0x68/0x160
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] RSP: 
0018:ffff88007aeb3cc0  EFLAGS: 00010286
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] RAX: 
0000000000000000 RBX: ffff88007a137048 RCX: 000000000001ff04
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] RDX: 
000000000001ff03 RSI: 00000000000000d0 RDI: 0000000000015d20
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] RBP: 
ffff88007aeb3d10 R08: ffff88007fc95d20 R09: 0000000000000003
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] R10: 
0000000000000003 R11: ffff8800362b83d0 R12: ffff88007d001b00
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] R13: 
00007f9e50678000 R14: ffffffff811767fd R15: 00000000000000d0
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] FS: 
00007f9e5066b7c0(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] CS:  0010 DS: 0000 
ES: 0000 CR0: 000000008005003b
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] CR2: 
00007f9e50678000 CR3: 0000000036743000 CR4: 00000000000006e0
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] Stack:
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019]  00007f9e4e9d4fff 
ffffffff811767d6 0000000000000040 0000000000000048
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019]  ffff88007aeb3d40 
ffff88007a137048 ffff88007af7cb80 ffff880079814da8
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019]  ffff88007af7cc38 
ffff88007af7cc48 ffff88007aeb3d40 ffffffff811767fd
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] Call Trace:
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff811767d6>] ? anon_vma_fork+0x56/0x130
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff811767fd>] anon_vma_fork+0x7d/0x130
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff816d0426>] dup_mmap+0x1c5/0x380
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff816d069f>] dup_mm+0xbe/0x155
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff816d07fc>] copy_mm+0xc6/0xe9
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff81054174>] copy_process.part.35+0x694/0xe90
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff810549f0>] copy_process+0x80/0x90
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff81054b12>] do_fork+0x62/0x280
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff812ca6ee>] ? cap_task_setnice+0xe/0x10
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff812cbfd6>] ? security_task_setnice+0x16/0x20
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff8106a648>] ? set_one_prio+0x88/0xd0
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff81054db6>] SyS_clone+0x16/0x20
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff816e9b39>] stub_clone+0x69/0x90
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] 
[<ffffffff816e9812>] ? system_call_fastpath+0x16/0x1b
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] Code: 00 49 8b 50 
08 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 e6 00 00 00 48 85 c0 0f 84 dd 00 00 
00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f 
c7 0f 0f 94 c0 84 c0 74 b5 49
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] RIP 
[<ffffffff811923a8>] kmem_cache_alloc+0x68/0x160
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019]  RSP 
<ffff88007aeb3cc0>
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.224019] CR2: 
00007f9e50678000
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [  156.276967] ---[ end trace 
56118c807adeb6de ]---
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.136602] BUG: unable to 
handle kernel paging request at 00007f9e50678000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.137877] IP: 
[<ffffffff811917ac>] kmem_cache_alloc_trace+0x6c/0x160
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.139042] PGD 79732067 PUD 0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.139645] Oops: 0000 [#2] SMP
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] Modules linked in: 
xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic 
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 
twofish_common camellia_generic serpent_generic blowfish_generic 
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac 
xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin nfsd 
nfs_acl ppdev dm_multipath psmouse parport_pc serio_raw i2c_piix4 mac_hid lp 
rpcsec_gss_krb5 parport auth_rpcgss oid_registry nfsv4 nfs fscache lockd 
sunrpc floppy
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] CPU: 1 PID: 1688 
Comm: whoopsie Tainted: G      D    O  3.16.0-999-generic #201408131143
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] Hardware name: 
Bochs Bochs, BIOS Bochs 01/01/2007
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] task: 
ffff88007c261840 ti: ffff8800799a8000 task.ti: ffff8800799a8000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] RIP: 
0010:[<ffffffff811917ac>]  [<ffffffff811917ac>] 
kmem_cache_alloc_trace+0x6c/0x160
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] RSP: 
0018:ffff8800799abdd8  EFLAGS: 00010286
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] RAX: 
0000000000000000 RBX: ffff88007bd0e080 RCX: 000000000001ff04
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] RDX: 
000000000001ff03 RSI: 00000000000000d0 RDI: 0000000000015d20
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] RBP: 
ffff8800799abe28 R08: ffff88007fc95d20 R09: 0000000000000000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] R10: 
0000000000000000 R11: 0000000000000246 R12: ffff88007d001b00
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] R13: 
00007f9e50678000 R14: ffffffff815bbbdb R15: 00000000000000d0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] FS: 
00007fa0248e37c0(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] CS:  0010 DS: 0000 
ES: 0000 CR0: 0000000080050033
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] CR2: 
00007f9e50678000 CR3: 00000000367c0000 CR4: 00000000000006e0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] Stack:
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050]  ffff88007c7ee780 
ffffffff815bbbbd 0000000000000040 0000000000000280
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050]  ffff8800799abe08 
ffff88007bd0e080 0000000000000000 0000000000000000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050]  0000000000000000 
0000000000000000 ffff8800799abe58 ffffffff815bbbdb
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] Call Trace:
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] 
[<ffffffff815bbbbd>] ? sock_alloc_inode+0x2d/0xd0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] 
[<ffffffff815bbbdb>] sock_alloc_inode+0x4b/0xd0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] 
[<ffffffff811bb326>] alloc_inode+0x26/0xa0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] 
[<ffffffff811bcce3>] new_inode_pseudo+0x13/0x60
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] 
[<ffffffff815bbe8e>] sock_alloc+0x1e/0x80
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] 
[<ffffffff815be2d5>] __sock_create+0x95/0x200
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] 
[<ffffffff815be4a0>] sock_create+0x30/0x40
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] 
[<ffffffff815bf5a6>] SyS_socket+0x36/0xb0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] 
[<ffffffff816e9812>] system_call_fastpath+0x16/0x1b
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] Code: 00 49 8b 50 
08 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 e2 00 00 00 48 85 c0 0f 84 d9 00 00 
00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f 
c7 0f 0f 94 c0 84 c0 74 b5 49
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] RIP 
[<ffffffff811917ac>] kmem_cache_alloc_trace+0x6c/0x160
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050]  RSP 
<ffff8800799abdd8>
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.140050] CR2: 
00007f9e50678000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [  167.185653] ---[ end trace 
56118c807adeb6df ]---


[2]
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.566506] BUG: unable to 
handle kernel paging request at 00007f41d65a100c
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.567769] IP: 
[<ffffffff81406a47>] vring_add_indirect+0x87/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.568859] PGD 7b4a7067 PUD 
7b4a8067 PMD 7b4aa067 PTE 8000000076f5b065
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570055] Oops: 0003 [#1] SMP
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Modules linked in: 
xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic 
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 
twofish_common camellia_generic serpent_generic blowfish_generic 
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac 
xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo nfsd nfs_acl 
dm_round_robin ppdev parport_pc mac_hid dm_multipath i2c_piix4 psmouse 
serio_raw rpcsec_gss_krb5 auth_rpcgss oid_registry lp nfsv4 parport nfs 
fscache lockd sunrpc floppy
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] CPU: 3 PID: 3329 
Comm: mount Tainted: G           O  3.16.0-999-generic #201408131143
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Hardware name: 
Bochs Bochs, BIOS Bochs 01/01/2007
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] task: 
ffff88007b2ac8c0 ti: ffff88007a900000 task.ti: ffff88007a900000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RIP: 
0010:[<ffffffff81406a47>]  [<ffffffff81406a47>] 
vring_add_indirect+0x87/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RSP: 
0018:ffff88007a903688  EFLAGS: 00010006
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RAX: 
ffff88007a9037b8 RBX: ffff88007a903788 RCX: 00007f41d65a1000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RDX: 
000000000001da11 RSI: 0000160000000000 RDI: ffff88007a9037b8
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RBP: 
ffff88007a9036d8 R08: ffff88007fd95d20 R09: ffffffff814069f6
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] R10: 
0000000000000003 R11: ffff88007a903860 R12: ffffffff81406cb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] R13: 
ffff88007c9ae000 R14: 0000000000000000 R15: 00007f41d65a1000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] FS: 
00007f41d67b5800(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] CS:  0010 DS: 0000 
ES: 0000 CR0: 0000000080050033
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] CR2: 
00007f41d65a100c CR3: 000000007a943000 CR4: 00000000000006e0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Stack:
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399]  ffff88007fff8700 
0000000300000202 000000007a9037b8 0000000100000002
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399]  ffff88007a903768 
ffff88007c9ae000 ffff88007c9ae000 ffff88007a903788
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399]  0000000000000001 
0000000000000003 ffff88007a903768 ffffffff814072a2
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Call Trace:
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff814072a2>] virtqueue_add_sgs+0x2f2/0x340
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8147854a>] __virtblk_add_req+0xda/0x1b0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132f307>] ? __bt_get+0xc7/0x1e0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132f47a>] ? bt_get+0x5a/0x180
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132953c>] ? blk_rq_map_sg+0x3c/0x170
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8147871a>] virtio_queue_rq+0xfa/0x250
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132d39f>] __blk_mq_run_hw_queue+0xff/0x260
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132d97d>] blk_mq_run_hw_queue+0x7d/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132dd71>] blk_sq_make_request+0x171/0x2f0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81320605>] generic_make_request.part.75+0x75/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff813215a8>] generic_make_request+0x68/0x70
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81321625>] submit_bio+0x75/0x140
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d2773>] _submit_bh+0x113/0x160
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d27d0>] submit_bh+0x10/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d6135>] block_read_full_page+0x1f5/0x340
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d6d20>] ? I_BDEV+0x10/0x10
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8115c445>] ? __inc_zone_page_state+0x35/0x40
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8113eca4>] ? __add_to_page_cache_locked+0xa4/0x130
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d7798>] blkdev_readpage+0x18/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8114aee8>] read_pages+0xe8/0x100
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8114b1e3>] __do_page_cache_readahead+0x163/0x170
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8114b555>] force_page_cache_readahead+0x75/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8114b5d3>] page_cache_sync_readahead+0x43/0x50
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8113f91e>] do_generic_file_read+0x30e/0x490
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81140c44>] generic_file_read_iter+0xf4/0x150
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8116b4b8>] ? handle_mm_fault+0x48/0x80
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8116edc0>] ? find_vma+0x20/0x80
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d6ec7>] blkdev_read_iter+0x37/0x40
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811a10d8>] new_sync_read+0x78/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811a218b>] vfs_read+0xab/0x180
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811a240f>] SyS_read+0x4f/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff816e9812>] system_call_fastpath+0x16/0x1b
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Code: 00 00 00 00 
48 63 45 c4 48 8b 04 c3 48 85 c0 74 59 0f 1f 00 49 63 ce 48 be 00 00 00 00 
00 16 00 00 48 89 c7 48 c1 e1 04 4c 01 f9 <66> c7 41 0c 01 00 48 8b 10 48 83 
e2 fc 48 01 f2 8b 70 08 48 c1
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RIP 
[<ffffffff81406a47>] vring_add_indirect+0x87/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399]  RSP 
<ffff88007a903688>
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] CR2: 
00007f41d65a100c
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] ---[ end trace 
cc83f9989ae9e2af ]---
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] BUG: unable to 
handle kernel paging request at 0000000079a0b008
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] IP: 
[<ffffffff810d18b0>] acct_collect+0x60/0x1b0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] PGD 7b4ac067 PUD 
7b58c067 PMD 7b4a3067 PTE 0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Oops: 0000 [#2] SMP
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Modules linked in: 
xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic 
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 
twofish_common camellia_generic serpent_generic blowfish_generic 
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac 
xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo nfsd nfs_acl 
dm_round_robin ppdev parport_pc mac_hid dm_multipath i2c_piix4 psmouse 
serio_raw rpcsec_gss_krb5 auth_rpcgss oid_registry lp nfsv4 parport nfs 
fscache lockd sunrpc floppy
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] CPU: 3 PID: 3329 
Comm: mount Tainted: G      D    O  3.16.0-999-generic #201408131143
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Hardware name: 
Bochs Bochs, BIOS Bochs 01/01/2007
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] task: 
ffff88007b2ac8c0 ti: ffff88007a900000 task.ti: ffff88007a900000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RIP: 
0010:[<ffffffff810d18b0>]  [<ffffffff810d18b0>] acct_collect+0x60/0x1b0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RSP: 
0018:ffff88007a903328  EFLAGS: 00010006
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RAX: 
0000000079a0b000 RBX: 0000000000000009 RCX: 0000000000000245
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RDX: 
0000000000000275 RSI: 0000000000000001 RDI: ffff8800362fb860
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RBP: 
ffff88007a903348 R08: 00007ffffffff000 R09: 0000000000000d01
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] R10: 
0000000000000000 R11: 0000000000000000 R12: ffff88007afa4800
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] R13: 
ffff88007b2ac8c0 R14: 0001000101133010 R15: 0000000000000046
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] FS: 
00007f41d67b5800(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] CS:  0010 DS: 0000 
ES: 0000 CR0: 0000000080050033
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] CR2: 
0000000079a0b008 CR3: 000000007a943000 CR4: 00000000000006e0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Stack:
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399]  ffff88007b2ac8c0 
0000000000000009 ffff88007a9035d8 0000000000000001
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399]  ffff88007a903398 
ffffffff810582f7 0000000000000007 0000000000000006
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399]  000000000000000a 
0000000000000046 0000000000000009 ffff88007a9035d8
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Call Trace:
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff810582f7>] do_exit+0x367/0x470
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81006465>] oops_end+0xa5/0xf0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff816cf708>] no_context+0x1be/0x1cd
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff816cf8e9>] __bad_area_nosemaphore+0x1d2/0x1f1
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff816cf9ae>] bad_area_access_error+0x45/0x4e
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81044dfe>] __do_page_fault+0x2fe/0x550
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81210435>] ? kernfs_path+0x55/0x70
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8134a4e3>] ? cfq_find_alloc_queue+0x293/0x430
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81141635>] ? mempool_alloc_slab+0x15/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81141635>] ? mempool_alloc_slab+0x15/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81406cb0>] ? virtqueue_get_buf+0xe0/0xe0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8104516c>] do_page_fault+0xc/0x10
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff816eb222>] page_fault+0x22/0x30
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81406cb0>] ? virtqueue_get_buf+0xe0/0xe0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff814069f6>] ? vring_add_indirect+0x36/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81406a47>] ? vring_add_indirect+0x87/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff814069f6>] ? vring_add_indirect+0x36/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff814072a2>] virtqueue_add_sgs+0x2f2/0x340
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8147854a>] __virtblk_add_req+0xda/0x1b0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132f307>] ? __bt_get+0xc7/0x1e0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132f47a>] ? bt_get+0x5a/0x180
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132953c>] ? blk_rq_map_sg+0x3c/0x170
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8147871a>] virtio_queue_rq+0xfa/0x250
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132d39f>] __blk_mq_run_hw_queue+0xff/0x260
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132d97d>] blk_mq_run_hw_queue+0x7d/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8132dd71>] blk_sq_make_request+0x171/0x2f0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81320605>] generic_make_request.part.75+0x75/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff813215a8>] generic_make_request+0x68/0x70
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81321625>] submit_bio+0x75/0x140
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d2773>] _submit_bh+0x113/0x160
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d27d0>] submit_bh+0x10/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d6135>] block_read_full_page+0x1f5/0x340
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d6d20>] ? I_BDEV+0x10/0x10
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8115c445>] ? __inc_zone_page_state+0x35/0x40
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8113eca4>] ? __add_to_page_cache_locked+0xa4/0x130
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d7798>] blkdev_readpage+0x18/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8114aee8>] read_pages+0xe8/0x100
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8114b1e3>] __do_page_cache_readahead+0x163/0x170
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8114b555>] force_page_cache_readahead+0x75/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8114b5d3>] page_cache_sync_readahead+0x43/0x50
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8113f91e>] do_generic_file_read+0x30e/0x490
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff81140c44>] generic_file_read_iter+0xf4/0x150
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8116b4b8>] ? handle_mm_fault+0x48/0x80
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff8116edc0>] ? find_vma+0x20/0x80
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811d6ec7>] blkdev_read_iter+0x37/0x40
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811a10d8>] new_sync_read+0x78/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811a218b>] vfs_read+0xab/0x180
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff811a240f>] SyS_read+0x4f/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] 
[<ffffffff816e9812>] system_call_fastpath+0x16/0x1b
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Code: 00 74 55 49 
8b bd a0 03 00 00 48 83 c7 60 e8 a8 65 61 00 49 8b 85 a0 03 00 00 48 8b 00 
48 85 c0 74 1d 66 0f 1f 84 00 00 00 00 00 <4c> 03 70 08 4c 2b 30 48 8b 40 10 
48 85 c0 75 f0 49 c1 ee 0a 65
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] RIP 
[<ffffffff810d18b0>] acct_collect+0x60/0x1b0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399]  RSP 
<ffff88007a903328>
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] CR2: 
0000000079a0b008
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] ---[ end trace 
cc83f9989ae9e2b0 ]---
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [  118.570399] Fixing recursive 
fault but reboot is needed!

[3]
(qemu) KVM internal error. Suberror: 1
emulation failure
RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 
RDX=0000000000000623
RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 
RSP=0000000000000000
R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 
R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 
R15=0000000000000000
RIP=000000000000fff0 RFL=00010002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00000000
CS =0033 0000000000000000 ffffffff 00a0fb00 DPL=3 CS64 [-RA]
SS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00000000
FS =0000 00007f0514f7c700 ffffffff 00000000
GS =0000 0000000000000000 ffffffff 00000000
LDT=0000 0000000000000000 ffffffff 00000000
TR =0040 ffff88007fc90380 00002087 00008b00 DPL=0 TSS64-busy
GDT=     ffff88007fc84000 0000007f
IDT=     ffffffffff57c000 00000fff
CR0=80050033 CR2=000000000061cd64 CR3=000000007ae6b000 CR4=000006e0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 
DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01

[4]
(qemu) qemu-system-x86_64: virtio: trying to map MMIO memory

-----Original Message----- 
From: Dave Chinner
Sent: 13 August, 2014 2:56 AM
To: Alex Lyakas
Cc: Brian Foster ; xfs@oss.sgi.com
Subject: Re: use-after-free on log replay failure

On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> Hello Dave, Brian,
> I will describe a generic reproduction that you ask for.
>
> It was performed on pristine XFS code from 3.8.13, taken from here:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
....
> I mounted XFS with the following options:
> rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
>
> I started a couple of processes writing files sequentially onto this
> mount point, and after few seconds crashed the VM.
> When the VM came up, I took the metadump file and placed it in:
> https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
>
> Then I set up the following Device Mapper target onto /dev/vde:
> dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> I am attaching the code (and Makefile) of dm-linear-custom target.
> It is exact copy of dm-linear, except that it has a module
> parameter. With the parameter set to 0, this is an identity mapping
> onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> failed with ENOSPC. There is a workqueue to fail them in a different
> context (not sure if really needed, but that's what our "real"
> custom
> block device does).

Well, there you go. That explains it - an asynchronous dispatch error
happening fast enough to race with the synchronous XFS dispatch
processing.

dispatch thread                 device workqueue
xfs_buf_hold();
atomic_set(b_io_remaining, 1)
atomic_inc(b_io_remaining)
submit_bio(bio)
queue_work(bio)
xfs_buf_ioend(bp, ....);
  atomic_dec(b_io_remaining)
xfs_buf_rele()
                                bio error set to ENOSPC
                                  bio->end_io()
                                    xfs_buf_bio_endio()
                                      bp->b_error = ENOSPC
                                      _xfs_buf_ioend(bp, 1);
                                        atomic_dec(b_io_remaining)
                                          xfs_buf_ioend(bp, 1);
                                            queue_work(bp)
xfs_buf_iowait()
 if (bp->b_error) return error;
if (error)
  xfs_buf_relse()
    xfs_buf_rele()
      xfs_buf_free()

And now we have a freed buffer that is queued on the io completion
queue. Basically, it requires the buffer error to be set
asynchronously *between* the dispatch decrementing its I/O count 
after dispatch, but before we wait on the IO.

Not sure what the right fix is yet - removing the bp->b_error check
from xfs_buf_iowait() doesn't solve the problem - it just prevents
this code path from being tripped over by the race condition.

But, just to validate this is the problem, you should be able to
reproduce this on a 3.16 kernel. Can you try that, Alex?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com 


* Re: use-after-free on log replay failure
  2014-08-13 12:59                                                 ` Brian Foster
@ 2014-08-13 20:59                                                   ` Dave Chinner
  2014-08-13 23:21                                                     ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2014-08-13 20:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: Alex Lyakas, xfs

On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote:
> On Wed, Aug 13, 2014 at 09:56:15AM +1000, Dave Chinner wrote:
> > On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> > > Hello Dave, Brian,
> > > I will describe a generic reproduction that you ask for.
> > > 
> > > It was performed on pristine XFS code from 3.8.13, taken from here:
> > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> > ....
> > > I mounted XFS with the following options:
> > > rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
> > > 
> > > I started a couple of processes writing files sequentially onto this
> > > mount point, and after few seconds crashed the VM.
> > > When the VM came up, I took the metadump file and placed it in:
> > > https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
> > > 
> > > Then I set up the following Device Mapper target onto /dev/vde:
> > > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> > > I am attaching the code (and Makefile) of dm-linear-custom target.
> > > It is exact copy of dm-linear, except that it has a module
> > > parameter. With the parameter set to 0, this is an identity mapping
> > > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> > > failed with ENOSPC. There is a workqueue to fail them in a different
> > > context (not sure if really needed, but that's what our "real"
> > > custom
> > > block device does).
> > 
> > Well, there you go. That explains it - an asynchronous dispatch error
> > happening fast enough to race with the synchronous XFS dispatch
> > processing.
> > 
> > dispatch thread			device workqueue
> > xfs_buf_hold();
> > atomic_set(b_io_remaining, 1)
> > atomic_inc(b_io_remaining)
> > submit_bio(bio)
> > queue_work(bio)
> > xfs_buf_ioend(bp, ....);
> >   atomic_dec(b_io_remaining)
> > xfs_buf_rele()
> > 				bio error set to ENOSPC
> > 				  bio->end_io()
> > 				    xfs_buf_bio_endio()
> > 				      bp->b_error = ENOSPC
> > 				      _xfs_buf_ioend(bp, 1);
> > 				        atomic_dec(b_io_remaining)
> > 					  xfs_buf_ioend(bp, 1);
> > 					    queue_work(bp)
> > xfs_buf_iowait()
> >  if (bp->b_error) return error;
> > if (error)
> >   xfs_buf_relse()
> >     xfs_buf_rele()
> >       xfs_buf_free()
> > 
> > And now we have a freed buffer that is queued on the io completion
> > queue. Basically, it requires the buffer error to be set
> > asynchronously *between* the dispatch decrementing its I/O count
> > after dispatch, but before we wait on the IO.
> > 
> 
> That's basically the theory I wanted to test with the experimental
> patch. E.g., the error check races with the iodone workqueue item.
> 
> > Not sure what the right fix is yet - removing the bp->b_error check
> > from xfs_buf_iowait() doesn't solve the problem - it just prevents
> > this code path from being tripped over by the race condition.
> > 
> 
> Perhaps I'm missing some context... I don't follow how removing the
> error check doesn't solve the problem. It clearly closes the race and
> perhaps there are other means of doing the same thing, but what part of
> the problem does that leave unresolved?

Anything that does:

	xfs_buf_iorequest(bp);
	if (bp->b_error)
		xfs_buf_relse(bp);

is susceptible to the same race condition, based on bp->b_error
being set asynchronously and before the buffer IO completion
processing is complete.


> E.g., we provide a
> synchronization mechanism for an async submission path and an object
> (xfs_buf) that is involved with potentially multiple such async (I/O)
> operations. The async callback side manages the counts of outstanding
> bios etc. to set the state of the buf object correctly and fires a
> completion when everything is done. The calling side simply waits on the
> completion before it can analyze state of the object. Referring to
> anything inside that object that happens to be managed by the buffer I/O
> mechanism before the buffer is considered complete just seems generally
> racy.

The point is that the IO submitter holds the buffer lock and so has
"exclusive" access to the buffer, even after it is submitted. It is
allowed to check the internal state of the buffer at any time, and
it is expected to be sane, including while IO completion processing
is running.

The real issue is that workqueue-based IO completion processing is
not protected by a reference count of any kind for synchronous IO.
It is done with only the reference count of the lock holder held, and so
if the lock holder unlocks and frees the buffer, then that buffer
will be freed.

This issue doesn't exist with B_ASYNC IO submission, because the
B_ASYNC IO owns the reference and the buffer lock and drops them
from the workqueue when the IO completion processing actually
completes...

> It looks like submit_bio() manages this by providing the error through
> the callback (always). It also doesn't look like submission path is
> guaranteed to be synchronous either (consider md, which appears to use
> workqueues and kernel threads)), so I'm not sure that '...;
> xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
> you're explicitly looking for a write verifier error or something and
> do nothing further on the buf contingent on completion (e.g., freeing it
> or something it depends on).

My point remains that it *should be safe*, and the intent is that
the caller should be able to check for submission errors without
being exposed to a use after free situation. That's the bug we need
to fix, not say "you can't check for submission errors on
synchronous IO" to avoid the race condition.....

Cheers,

Dave
-- 
Dave Chinner
david@fromorbit.com


* Re: use-after-free on log replay failure
  2014-08-13 20:59                                                   ` Dave Chinner
@ 2014-08-13 23:21                                                     ` Brian Foster
  2014-08-14  6:14                                                       ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2014-08-13 23:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Alex Lyakas, xfs

On Thu, Aug 14, 2014 at 06:59:29AM +1000, Dave Chinner wrote:
> On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote:
> > On Wed, Aug 13, 2014 at 09:56:15AM +1000, Dave Chinner wrote:
> > > On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> > > > Hello Dave, Brian,
> > > > I will describe a generic reproduction that you ask for.
> > > > 
> > > > It was performed on pristine XFS code from 3.8.13, taken from here:
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> > > ....
> > > > I mounted XFS with the following options:
> > > > rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
> > > > 
> > > > I started a couple of processes writing files sequentially onto this
> > > > mount point, and after few seconds crashed the VM.
> > > > When the VM came up, I took the metadump file and placed it in:
> > > > https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
> > > > 
> > > > Then I set up the following Device Mapper target onto /dev/vde:
> > > > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> > > > I am attaching the code (and Makefile) of dm-linear-custom target.
> > > > It is exact copy of dm-linear, except that it has a module
> > > > parameter. With the parameter set to 0, this is an identity mapping
> > > > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> > > > failed with ENOSPC. There is a workqueue to fail them in a different
> > > > context (not sure if really needed, but that's what our "real"
> > > > custom
> > > > block device does).
> > > 
> > > Well, there you go. That explains it - an asynchronous dispatch error
> > > happening fast enough to race with the synchronous XFS dispatch
> > > processing.
> > > 
> > > dispatch thread			device workqueue
> > > xfs_buf_hold();
> > > atomic_set(b_io_remaining, 1)
> > > atomic_inc(b_io_remaining)
> > > submit_bio(bio)
> > > queue_work(bio)
> > > xfs_buf_ioend(bp, ....);
> > >   atomic_dec(b_io_remaining)
> > > xfs_buf_rele()
> > > 				bio error set to ENOSPC
> > > 				  bio->end_io()
> > > 				    xfs_buf_bio_endio()
> > > 				      bp->b_error = ENOSPC
> > > 				      _xfs_buf_ioend(bp, 1);
> > > 				        atomic_dec(b_io_remaining)
> > > 					  xfs_buf_ioend(bp, 1);
> > > 					    queue_work(bp)
> > > xfs_buf_iowait()
> > >  if (bp->b_error) return error;
> > > if (error)
> > >   xfs_buf_relse()
> > >     xfs_buf_rele()
> > >       xfs_buf_free()
> > > 
> > > And now we have a freed buffer that is queued on the io completion
> > > queue. Basically, it requires the buffer error to be set
> > > asynchronously *between* the dispatch decrementing its I/O count
> > > after dispatch, but before we wait on the IO.
> > > 
> > 
> > That's basically the theory I wanted to test with the experimental
> > patch. E.g., the error check races with the iodone workqueue item.
> > 
> > > Not sure what the right fix is yet - removing the bp->b_error check
> > > from xfs_buf_iowait() doesn't solve the problem - it just prevents
> > > this code path from being tripped over by the race condition.
> > > 
> > 
> > Perhaps I'm missing some context... I don't follow how removing the
> > error check doesn't solve the problem. It clearly closes the race and
> > perhaps there are other means of doing the same thing, but what part of
> > the problem does that leave unresolved?
> 
> Anything that does:
> 
> 	xfs_buf_iorequest(bp);
> 	if (bp->b_error)
> 		xfs_buf_relse(bp);
> 
> is susceptible to the same race condition, based on bp->b_error
> being set asynchronously and before the buffer IO completion
> processing is complete.
> 

Understood, but why would anything do that (as opposed to
xfs_buf_iowait())? I don't see that we do that anywhere today
(the check buried within xfs_buf_iowait() notwithstanding of course).

From what I can see, all it really guarantees is that the submission has
either passed/failed the write verifier, yes?
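
For reference, the check in question has roughly this shape (a sketch
from memory, not a verbatim copy of the function):

int
xfs_buf_iowait(
        struct xfs_buf          *bp)
{
        /* an already-visible error short-circuits the wait */
        if (!bp->b_error)
                wait_for_completion(&bp->b_iowait);
        return bp->b_error;
}

i.e. an early read of bp->b_error can return before the IO completion
side has finished with the buffer, which is exactly the window being
discussed here.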

> 
> > E.g., we provide a
> > synchronization mechanism for an async submission path and an object
> > (xfs_buf) that is involved with potentially multiple such async (I/O)
> > operations. The async callback side manages the counts of outstanding
> > bios etc. to set the state of the buf object correctly and fires a
> > completion when everything is done. The calling side simply waits on the
> > completion before it can analyze state of the object. Referring to
> > anything inside that object that happens to be managed by the buffer I/O
> > mechanism before the buffer is considered complete just seems generally
> > racy.
> 
> The point is that the IO submitter holds the buffer lock and so has
> "exclusive" access to the buffer, even after it is submitted. It is
> allowed to check the internal state of the buffer at any time, and
> it is expected to be sane, including while IO completion processing
> is running.
> 

Fair enough, but if the mechanism is async the submitter clearly knows
that's not failsafe (i.e., passing the check above doesn't mean the I/O
will not fail).

> The real issue is that workqueue-based IO completion processing is
> not protected by a reference count of any kind for synchronous IO.
> It is done with only the reference count of the lock holder held, and so
> if the lock holder unlocks and frees the buffer, then that buffer
> will be freed.
> 

I see, the notion of the work item having a hold/refcount on the buffer
makes sense. But with a submission/wait mechanism that waits on actual
"completion" of the I/O (e.g., complete(bp->b_iowait)), that only offers
protection for the case where a sync I/O submitter doesn't wait. I
suppose that's possible, but also seems like a misuse of sync I/O. E.g.,
why wouldn't that caller do async I/O?

> This issue doesn't exist with B_ASYNC IO submission, because the
> B_ASYNC IO owns the reference and the buffer lock and drops them
> from the workqueue when the IO completion processing actually
> completes...
> 

Indeed. I wasn't clear on the reference ownership nuance between the
sync/async variants of I/O. Thanks for the context. That said, we also
don't have submitters that check for errors on async I/O either.

> > It looks like submit_bio() manages this by providing the error through
> > the callback (always). It also doesn't look like submission path is
> > guaranteed to be synchronous either (consider md, which appears to use
> > workqueues and kernel threads)), so I'm not sure that '...;
> > xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
> > you're explicitly looking for a write verifier error or something and
> > do nothing further on the buf contingent on completion (e.g., freeing it
> > or something it depends on).
> 
> My point remains that it *should be safe*, and the intent is that
> the caller should be able to check for submission errors without
> being exposed to a use after free situation. That's the bug we need
> to fix, not say "you can't check for submission errors on
> synchronous IO" to avoid the race condition.....
> 

Well, technically you can check for submission errors on sync I/O, just
use the code you posted above. :) What we can't currently do is find out
when the I/O subsystem is done with the buffer.

Perhaps the point here is around the semantics of xfs_buf_iowait(). With
a mechanism that is fundamentally async, the sync variant obviously
becomes the async mechanism + some kind of synchronization. I'd expect
that synchronization to not necessarily just tell me whether an error
occurred, but also tell me when the I/O subsystem is done with the
object I've passed (e.g., so I'm free to chuck it, scribble over it, put
it back where I got it, whatever).

My impression is that's the purpose of the b_iowait mechanism.
Otherwise, what's the point of the whole
bio_end_io->buf_ioend->b_iodone->buf_ioend round trip dance?

Brian

> Cheers,
> 
> Dave
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: use-after-free on log replay failure
  2014-08-13 23:21                                                     ` Brian Foster
@ 2014-08-14  6:14                                                       ` Dave Chinner
  2014-08-14 19:05                                                         ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2014-08-14  6:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: Alex Lyakas, xfs

On Wed, Aug 13, 2014 at 07:21:35PM -0400, Brian Foster wrote:
> On Thu, Aug 14, 2014 at 06:59:29AM +1000, Dave Chinner wrote:
> > On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote:
> > > Perhaps I'm missing some context... I don't follow how removing the
> > > error check doesn't solve the problem. It clearly closes the race and
> > > perhaps there are other means of doing the same thing, but what part of
> > > the problem does that leave unresolved?
> > 
> > Anything that does:
> > 
> > 	xfs_buf_iorequest(bp);
> > 	if (bp->b_error)
> > 		xfs_buf_relse(bp);
> > 
> > is susceptible to the same race condition, based on bp->b_error
> > being set asynchronously and before the buffer IO completion
> > processing is complete.
> > 
> 
> > Understood, but why would anything do that (as opposed to
> xfs_buf_iowait())? I don't see that we do that anywhere today
> (the check buried within xfs_buf_iowait() notwithstanding of course).

"Why" is not important - the fact is the caller *owns* the buffer
and so the above fragment of code is valid behaviour. If there is
an error on the buffer after xfs_buf_iorequest() returns on
a synchronous IO, then it's a bug if there is still IO in progress
on that buffer.

We can't run IO completion synchronously from xfs_buf_bio_end_io in
this async dispatch error case - we cannot detect it as any
different from IO completion in interrupt context - and so we need
to have some kind of reference protecting the buffer from being
freed from under the completion.

i.e. the bug is that a synchronous buffer has no active reference
while it is sitting on the completion workqueue - its references
are owned by other contexts that can drop them without regard to
the completion status of the buffer.

For async IO we transfer a reference and the lock to the IO context,
which gets dropped in xfs_buf_iodone_work when all the IO is
complete. Synchronous IO needs this protection, too.

As a proof of concept, adding this to the start of
xfs_buf_iorequest():

+	/*
+	 * synchronous IO needs it's own reference count. async IO
+	 * inherits the submitter's reference count.
+	 */
+	if (!(bp->b_flags & XBF_ASYNC))
+		xfs_buf_hold(bp);

And this to the synchronous IO completion case for
xfs_buf_iodone_work():

	else {
		ASSERT(read && bp->b_ops);
		complete(&bp->b_iowait);
+		xfs_buf_rele(bp);
	}

Should ensure that all IO carries a reference count and the buffer
cannot be freed until all IO processing has been completed.

This means it does not matter what the buffer owner does after
xfs_buf_iorequest() - even unconditionally calling xfs_buf_relse()
will not result in use-after-free as the b_hold count will not go to
zero until the IO completion processing has been finalised.

Fixing the rest of the mess (i.e. determining how to deal with
submission/completion races) is going to require more effort and
thought. For the moment, though, correctly reference counting
buffers will solve the use-after-free without changing any
other behaviour.
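
To pull those two hunks together, the intended flow is roughly the
following (a sketch only, not the real functions; the b_iodone
callback case is omitted):

xfs_buf_iorequest(bp)                   /* submitter, holds the buffer lock */
        if (!(bp->b_flags & XBF_ASYNC))
                xfs_buf_hold(bp);       /* IO-side reference for sync IO */
        /* submit bios; b_io_remaining counts drop as they complete */

xfs_buf_iodone_work(work)               /* completion workqueue */
        if (bp->b_flags & XBF_ASYNC)
                xfs_buf_relse(bp);      /* async: IO owns lock + reference */
        else {
                complete(&bp->b_iowait);        /* wake xfs_buf_iowait() */
                xfs_buf_rele(bp);       /* drop sync IO-side reference */
        }

Either way the IO-side ownership is only dropped once completion
processing has finished with the buffer, so the submitter can drop
its own references whenever it likes.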

> From what I can see, all it really guarantees is that the submission has
> either passed/failed the write verifier, yes?

No.  It can also mean it wasn't rejected by the lower layers as
they process the bio passed by submit_bio(). e.g.  ENODEV, because
the underlying device has been hot-unplugged, EIO because the
buffer is beyond the end of the device, etc.

> > > It looks like submit_bio() manages this by providing the error through
> > > the callback (always). It also doesn't look like submission path is
> > > guaranteed to be synchronous either (consider md, which appears to use
> > > workqueues and kernel threads)), so I'm not sure that '...;
> > > xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
> > > you're explicitly looking for a write verifier error or something and
> > > do nothing further on the buf contingent on completion (e.g., freeing it
> > > or something it depends on).
> > 
> > My point remains that it *should be safe*, and the intent is that
> > the caller should be able to check for submission errors without
> > being exposed to a use after free situation. That's the bug we need
> > to fix, not say "you can't check for submission errors on
> > synchronous IO" to avoid the race condition.....
> > 
> 
> Well, technically you can check for submission errors on sync I/O, just
> use the code you posted above. :) What we can't currently do is find out
> when the I/O subsystem is done with the buffer.

By definition, a buffer marked with an error after submission
processing is complete. It should not need to be waited on, and
therein lies the bug.

> Perhaps the point here is around the semantics of xfs_buf_iowait(). With
> a mechanism that is fundamentally async, the sync variant obviously
> becomes the async mechanism + some kind of synchronization. I'd expect
> that synchronization to not necessarily just tell me whether an error
> occurred, but also tell me when the I/O subsystem is done with the
> object I've passed (e.g., so I'm free to chuck it, scribble over it, put
> it back where I got it, whatever).
>
> My impression is that's the purpose of the b_iowait mechanism.
> Otherwise, what's the point of the whole
> bio_end_io->buf_ioend->b_iodone->buf_ioend round trip dance?

Yes, that's exactly what xfs_buf_iorequest/xfs_buf_iowait() provides
and the b_error indication is an integral part of that
synchronisation mechanism.  Unfortunately, that is also the part of
the mechanism that is racy and causing problems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: use-after-free on log replay failure
  2014-08-14  6:14                                                       ` Dave Chinner
@ 2014-08-14 19:05                                                         ` Brian Foster
  2014-08-14 22:27                                                           ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2014-08-14 19:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Alex Lyakas, xfs

On Thu, Aug 14, 2014 at 04:14:44PM +1000, Dave Chinner wrote:
> On Wed, Aug 13, 2014 at 07:21:35PM -0400, Brian Foster wrote:
> > On Thu, Aug 14, 2014 at 06:59:29AM +1000, Dave Chinner wrote:
> > > On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote:
> > > > Perhaps I'm missing some context... I don't follow how removing the
> > > > error check doesn't solve the problem. It clearly closes the race and
> > > > perhaps there are other means of doing the same thing, but what part of
> > > > the problem does that leave unresolved?
> > > 
> > > Anything that does:
> > > 
> > > 	xfs_buf_iorequest(bp);
> > > 	if (bp->b_error)
> > > 		xfs_buf_relse(bp);
> > > 
> > > is susceptible to the same race condition, based on bp->b_error
> > > being set asynchronously and before the buffer IO completion
> > > processing is complete.
> > > 
> > 
> > > Understood, but why would anything do that (as opposed to
> > xfs_buf_iowait())? I don't see that we do that anywhere today
> > (the check buried within xfs_buf_iowait() notwithstanding of course).
> 
> "Why" is not important - the fact is the caller *owns* the buffer
> and so the above fragment of code is valid behaviour. If there is
> an error on the buffer after xfs_buf_iorequest() returns on
> a synchronous IO, then it's a bug if there is still IO in progress
> on that buffer.
> 

A buffer can consist of multiple I/Os, yes? If so, then it seems quite
possible for one I/O to fail and set an error on the buffer while
another might still be in progress. I don't see how that can be
considered a bug in general.

Even if not, I'd expect to see a comment explaining why any code
fragment such as the above is not broken because that's not
deterministic at all, even with a single I/O. E.g., the error could very
well have been set by the callback where we clearly continue I/O
processing so-to-speak (though we could consider where we currently set
the error in the callback sequence a bug as well).

> We can't run IO completion synchronously from xfs_buf_bio_end_io in
> this async dispatch error case - we cannot detect it as any
> different from IO completion in interrupt context - and so we need
> to have some kind of reference protecting the buffer from being
> freed from under the completion.
> 

Indeed.

> i.e. the bug is that a synchronous buffer has no active reference
> while it is sitting on the completion workqueue - its references
> are owned by other contexts that can drop them without regard to
> the completion status of the buffer.
> 
> For async IO we transfer a reference and the lock to the IO context,
> which gets dropped in xfs_buf_iodone_work when all the IO is
> complete. Synchronous IO needs this protection, too.
> 
> As a proof of concept, adding this to the start of
> xfs_buf_iorequest():
> 
> +	/*
> +	 * synchronous IO needs it's own reference count. async IO
> +	 * inherits the submitter's reference count.
> +	 */
> +	if (!(bp->b_flags & XBF_ASYNC))
> +		xfs_buf_hold(bp);
> 
> And this to the synchronous IO completion case for
> xfs_buf_iodone_work():
> 
> 	else {
> 		ASSERT(read && bp->b_ops);
> 		complete(&bp->b_iowait);
> +		xfs_buf_rele(bp);
> 	}
> 
> Should ensure that all IO carries a reference count and the buffer
> cannot be freed until all IO processing has been completed.
> 
> This means it does not matter what the buffer owner does after
> xfs_buf_iorequest() - even unconditionally calling xfs_buf_relse()
> will not result in use-after-free as the b_hold count will not go to
> zero until the IO completion processing has been finalised.
> 

Makes sense, assuming we handle the possible error cases and whatnot
therein. Thinking some more, suppose we take this ref, submit one I/O
successfully and a subsequent one fails. Then who is responsible for
releasing the reference?

> Fixing the rest of the mess (i.e. determining how to deal with
> submission/completion races) is going to require more effort and
> thought. For the moment, though, correctly reference counting
> buffers will solve the use-after-free without changing any
> other behaviour.
> 
> > From what I can see, all it really guarantees is that the submission has
> > either passed/failed the write verifier, yes?
> 
> > No.  It can also mean it wasn't rejected by the lower layers as
> they process the bio passed by submit_bio(). e.g.  ENODEV, because
> the underlying device has been hot-unplugged, EIO because the
> buffer is beyond the end of the device, etc.
> 

Those are the errors that happen to be synchronously processed today.
That's an implementation detail. submit_bio() is an asynchronous
interface so I don't see any guarantee that will always be the case.
E.g., that's easily broken should somebody decide to defer early end_io
processing to a workqueue. We do a similar thing ourselves for the
reasons you've stated above.

I don't see anything in or around submit_bio() or generic_make_request()
that suggests the interface is anything but async. From
generic_make_request():

/**
 ...
 * generic_make_request() does not return any status.  The
 * success/failure status of the request, along with notification of
 * completion, is delivered asynchronously through the bio->bi_end_io
 * function described (one day) else where.
 *
 ...
 */

> > > > It looks like submit_bio() manages this by providing the error through
> > > > the callback (always). It also doesn't look like submission path is
> > > > guaranteed to be synchronous either (consider md, which appears to use
> > > > workqueues and kernel threads)), so I'm not sure that '...;
> > > > xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
> > > > you're explicitly looking for a write verifier error or something and
> > > > do nothing further on the buf contingent on completion (e.g., freeing it
> > > > or something it depends on).
> > > 
> > > My point remains that it *should be safe*, and the intent is that
> > > the caller should be able to check for submission errors without
> > > being exposed to a use after free situation. That's the bug we need
> > > to fix, not say "you can't check for submission errors on
> > > synchronous IO" to avoid the race condition.....
> > > 
> > 
> > Well, technically you can check for submission errors on sync I/O, just
> > use the code you posted above. :) What we can't currently do is find out
> > when the I/O subsystem is done with the buffer.
> 
> By definition, a buffer marked with an error after submission
> processing is complete. It should not need to be waited on, and
> therein lies the bug.
> 

I suppose that implicates the error processing on the callback side. We
set the error and continue processing on the buffer. Another option
could be to shuffle that around on the callback side, but to me _that_
is an approach that narrowly avoids the race rather than closing it via
use of synchronization.

> > Perhaps the point here is around the semantics of xfs_buf_iowait(). With
> > a mechanism that is fundamentally async, the sync variant obviously
> > becomes the async mechanism + some kind of synchronization. I'd expect
> > that synchronization to not necessarily just tell me whether an error
> > occurred, but also tell me when the I/O subsystem is done with the
> > object I've passed (e.g., so I'm free to chuck it, scribble over it, put
> > it back where I got it, whatever).
> >
> > My impression is that's the purpose of the b_iowait mechanism.
> > Otherwise, what's the point of the whole
> > bio_end_io->buf_ioend->b_iodone->buf_ioend round trip dance?
> 
> Yes, that's exactly what xfs_buf_iorequest/xfs_buf_iowait() provides
> and the b_error indication is an integral part of that
> synchronisation mechanism.  Unfortunately, that is also the part of
> the mechanism that is racy and causing problems.
> 

I don't see how b_iowait itself is racy. It completes when the I/O
completes. The problem is that we've overloaded these mechanisms to
where we attempt to use them for multiple things. b_error can be a
submission error or a deeper I/O error. Adding the b_error check to
xfs_buf_iowait() converts it to the same "has a submission error
occurred? or has any error occurred yet? or wait until all buffer
I/O is complete" non-deterministic semantics.

I agree that the reference count protection for sync I/O sounds useful
and closes a gap (need to think about that some more), but to check for
an error as part of the synchronization mechanism means that the
mechanism simply isn't synchronous.

I don't see any paths outside of I/O submission itself that care
significantly about one general form of error (submission) vs. another
(async I/O error) as opposed to simply whether an error has occurred or
not. The fact that an error _can_ occur at any time until all of the
outstanding bios are completed overrides the fact that some might occur
by the time submission completes, meaning we have to handle the former
for any callers that care about errors in general. Waiting on b_iowait
before checking b_error is an always-safe way to do that.
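
In code, the always-wait form is simply (a sketch, not a tested
patch):

int
xfs_buf_iowait(
        struct xfs_buf          *bp)
{
        /* wait for all buffer IO processing to finish first ... */
        wait_for_completion(&bp->b_iowait);
        /* ... and only then look at the error state */
        return bp->b_error;
}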

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: use-after-free on log replay failure
  2014-08-14 19:05                                                         ` Brian Foster
@ 2014-08-14 22:27                                                           ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2014-08-14 22:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: Alex Lyakas, xfs

On Thu, Aug 14, 2014 at 03:05:22PM -0400, Brian Foster wrote:
> On Thu, Aug 14, 2014 at 04:14:44PM +1000, Dave Chinner wrote:
> > On Wed, Aug 13, 2014 at 07:21:35PM -0400, Brian Foster wrote:
> > > On Thu, Aug 14, 2014 at 06:59:29AM +1000, Dave Chinner wrote:
> > > > On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote:
> > > > > Perhaps I'm missing some context... I don't follow how removing the
> > > > > error check doesn't solve the problem. It clearly closes the race and
> > > > > perhaps there are other means of doing the same thing, but what part of
> > > > > the problem does that leave unresolved?
> > > > 
> > > > Anything that does:
> > > > 
> > > > 	xfs_buf_iorequest(bp);
> > > > 	if (bp->b_error)
> > > > 		xfs_buf_relse(bp);
> > > > 
> > > > is susceptible to the same race condition, based on bp->b_error
> > > > being set asynchronously and before the buffer IO completion
> > > > processing is complete.
> > > > 
> > > 
> > > > Understood, but why would anything do that (as opposed to
> > > xfs_buf_iowait())? I don't see that we do that anywhere today
> > > (the check buried within xfs_buf_iowait() notwithstanding of course).
> > 
> > "Why" is not important - the fact is the caller *owns* the buffer
> > and so the above fragment of code is valid behaviour. If there is
> > an error on the buffer after xfs_buf_iorequest() returns on
> > a synchronous IO, then it's a bug if there is still IO in progress
> > on that buffer.
> > 
> 
> A buffer can consist of multiple I/Os, yes? If so, then it seems quite
> possible for one I/O to fail and set an error on the buffer while
> another might still be in progress. I don't see how that can be
> considered a bug in general.

You miss my point. We have to mark the buffer with an error, but it
should not be visible to the submitter until all IO on the buffer
is done. i.e. setting bp->b_error from the completion path
needs to be deferred until xfs_buf_iodone_work() is run after all
submitted IOs on the buffer have completed.
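
Roughly along these lines (a sketch only - the b_io_error field is
hypothetical here, not existing code):

/*
 * bio completion handler: record the error privately rather than
 * writing bp->b_error directly.
 */
if (error)
        cmpxchg(&bp->b_io_error, 0, error);     /* keep the first error seen */

/*
 * xfs_buf_iodone_work(), which only runs once all the IO counts have
 * gone to zero, then publishes it where the submitter can see it.
 */
if (bp->b_io_error)
        xfs_buf_ioerror(bp, bp->b_io_error);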

> Even if not, I'd expect to see a comment explaining why any code
> fragment such as the above is not broken because that's not
> deterministic at all, even with a single I/O. E.g., the error could very
> well have been set by the callback where we clearly continue I/O
> processing, so to speak (though we could consider where we currently set
> the error in the callback sequence a bug as well).

Chicken, meet Egg.

There's no comment saying the above code is OK or not because we
assume that if we hold the buffer lock it is safe to look at any
state on the buffer at any time. What this thread points out is that
synchronous IO changes the state of the buffer in a context that
does not hold the buffer lock, and so violates that assumption.

> > This means it does not matter what the buffer owner does after
> > xfs_buf_iorequest() - even unconditionally calling xfs_buf_relse()
> > will not result in use-after-free as the b_hold count will not go to
> > zero until the IO completion processing has been finalised.
> > 
> 
> Makes sense, assuming we handle the possible error cases and whatnot
> therein. Thinking some more, suppose we take this ref, submit one I/O
> successfully and a subsequent one fails. Then who is responsible for
> releasing the reference?

xfs_buf_iodone_work() doesn't run until b_io_remaining goes to zero.
That's the context that releases the ref. It doesn't matter how many
are submitted, complete successfully or fail.
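
In other words, something like this rough sketch (hypothetical helper
names, not the actual patch):

	static void
	example_buf_submit(struct xfs_buf *bp)
	{
		/* reference owned by the IO, dropped by the iodone work */
		xfs_buf_hold(bp);
		xfs_buf_iorequest(bp);
	}

	static void
	example_buf_iodone_work(struct xfs_buf *bp)
	{
		/* ... normal completion processing runs first ... */

		/* drop the reference taken at submission */
		xfs_buf_rele(bp);
	}

The submitter can then call xfs_buf_relse() whenever it likes after
xfs_buf_iorequest() returns; the buffer cannot be freed until the last
bio has completed and the iodone work has dropped that reference.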

> > Fixing the rest of the mess (i.e. determining how to deal with
> > submission/completion races) is going to require more effort and
> > thought. For the moment, though, correctly reference counting
> > buffers will solve the use-after-free without changing any
> > other behaviour.
> > 
> > > From what I can see, all it really guarantees is that the submission has
> > > either passed/failed the write verifier, yes?
> > 
> > No.  It can also mean it wasn't rejected by the lower layers as
> > they process the bio passed by submit_bio(). e.g.  ENODEV, because
> > the underlying device has been hot-unplugged, EIO because the
> > buffer is beyond the end of the device, etc.
> > 
> 
> Those are the errors that happen to be synchronously processed today.
> That's an implementation detail. submit_bio() is an asynchronous
> interface so I don't see any guarantee that will always be the case.
> E.g., that's easily broken should somebody decide to defer early end_io
> processing to a workqueue. We do a similar thing ourselves for the
> reasons you've stated above.
>
> I don't see anything in or around submit_bio() or generic_make_request()
> that suggest the interface is anything but async. From
> generic_make_request():
> 
> /**
>  ...
>  * generic_make_request() does not return any status.  The
>  * success/failure status of the request, along with notification of
>  * completion, is delivered asynchronously through the bio->bi_end_io
>  * function described (one day) else where.
>  *
>  ...
>  */

Delivery of the error is through bio->bi_end_io, but that is not
necessarily from an asynchronous context. e.g. the first thing that
generic_make_request() does is run generic_make_request_checks(),
where a failure runs bio_endio() and therefore bio->bi_end_io()
in the submitter's context. i.e. *synchronously*. This is exactly
what the b_io_remaining reference owned by xfs_buf_iorequest() is
being taken for - to prevent the *XFS endio processing* from being
run asynchronously.
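
For illustration, the shape of the submission-side counting is roughly
this (hypothetical helper names, not the actual XFS submission code):

	static void
	example_buf_submit_bios(struct xfs_buf *bp, struct bio **bios, int nr)
	{
		int	i;

		/* count owned by the submitter itself */
		atomic_inc(&bp->b_io_remaining);

		for (i = 0; i < nr; i++) {
			atomic_inc(&bp->b_io_remaining);	/* one per bio */
			/* bi_end_io may run synchronously, right here */
			submit_bio(WRITE, bios[i]);
		}

		/*
		 * Drop the submitter's count.  Completion processing can
		 * only run once every bio has also dropped its count, so
		 * a failure reported synchronously above cannot finalise
		 * the buffer while we are still submitting.
		 */
		if (atomic_dec_and_test(&bp->b_io_remaining))
			example_buf_ioend(bp);	/* hypothetical completion hook */
	}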

> > > > > It looks like submit_bio() manages this by providing the error through
> > > > > the callback (always). It also doesn't look like submission path is
> > > > > guaranteed to be synchronous either (consider md, which appears to use
> > > > > workqueues and kernel threads), so I'm not sure that '...;
> > > > > xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
> > > > > you're explicitly looking for a write verifier error or something and
> > > > > do nothing further on the buf contingent on completion (e.g., freeing it
> > > > > or something it depends on).
> > > > 
> > > > My point remains that it *should be safe*, and the intent is that
> > > > the caller should be able to check for submission errors without
> > > > being exposed to a use after free situation. That's the bug we need
> > > > to fix, not say "you can't check for submission errors on
> > > > synchronous IO" to avoid the race condition.....
> > > > 
> > > 
> > > Well, technically you can check for submission errors on sync I/O, just
> > > use the code you posted above. :) What we can't currently do is find out
> > > when the I/O subsystem is done with the buffer.
> > 
> > By definition, a buffer marked with an error after submission
> > processing is complete. It should not need to be waited on, and
> > therein lies the bug.
> 
> I suppose that implicates the error processing on the callback side. We
> set the error and continue processing on the buffer. Another option
> could be to shuffle that around on the callback side, but to me _that_
> is an approach that narrowly avoids the race rather than closing it via
> use of synchronization.

We need to store the error from the callback somewhere, until
b_io_remaining falls to zero. Right now we are putting it in
b_error, which makes it immediately visible to the submitter without
any synchronisation. Basically, we can't put state into anything
that the submitter is expected to check. We have internal state
fields that are not to be used externally:

        spinlock_t              b_lock;         /* internal state lock */
        unsigned int            b_state;        /* internal state flags */

So what we probably need to do is add an internal:

        int			b_io_error;	/* internal error state */

serialised by b_lock, where we stuff errors from xfs_buf_bio_end_io
and then propagate them to b_error in xfs_buf_iodone_work() when all
the IO is complete....
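
i.e. something along these lines (rough sketch only - the real change
would also have to deal with how the iodone work gets scheduled):

	static void
	example_buf_bio_end_io(struct bio *bio, int error)
	{
		struct xfs_buf	*bp = bio->bi_private;

		if (error) {
			/* internal only - not visible to the submitter yet */
			spin_lock(&bp->b_lock);
			if (!bp->b_io_error)
				bp->b_io_error = error;	/* keep the first error */
			spin_unlock(&bp->b_lock);
		}
		bio_put(bio);

		if (atomic_dec_and_test(&bp->b_io_remaining))
			example_buf_iodone_work(bp);
	}

	static void
	example_buf_iodone_work(struct xfs_buf *bp)
	{
		/* all bios are done - now the error may become visible */
		if (bp->b_io_error)
			xfs_buf_ioerror(bp, bp->b_io_error);

		/* ... the rest of the normal iodone processing ... */
	}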

And with the extra reference count on the buffer, it doesn't matter
if the submitter detects a submission error and releases its
reference to the buffer while there is still other IO in progress.
Hence we solve all the issues without changing the current
submit/wait semantics, or needing to remove the submission error
check from xfs_buf_iowait.


> I don't see any paths outside of I/O submission itself that care
> significantly about one general form of error (submission) vs. another
> (async I/O error) as opposed to simply whether an error has occurred or
> not. The fact that an error _can_ occur at any time until all of the
> outstanding bios are completed overrides the fact that some might occur
> by the time submission completes, meaning we have to handle the former
> for any callers that care about errors in general. Waiting on b_iowait
> before checking b_error is an always-safe way to do that.

Assuming there are no other bugs, it's "always safe". And that's the
key thing - the fact that we actually have a check for an error
before waiting indicates that it hasn't always been safe to wait on
a buffer marked with a submission error.

This code is full of cruft and mess. It's full of history,
band-aids upon band-aids, partially completed cleanups, left over
complexity from times gone past and new complexity layered over the
top of structures not originally designed to support those uses.
Let's fix the use after free bug right away, and then clean up the
cruft and fix the underlying problems so we can guarantee that
"always-safe" behaviour an not have it blow up in our faces in
future....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread

Thread overview: 47+ messages
2013-12-18 18:37 Questions about XFS discard and xfs_free_extent() code (newbie) Alex Lyakas
2013-12-18 23:06 ` Dave Chinner
2013-12-19  9:24   ` Alex Lyakas
2013-12-19 10:55     ` Dave Chinner
2013-12-19 19:24       ` Alex Lyakas
2013-12-21 17:03         ` Chris Murphy
2013-12-24 18:21       ` Alex Lyakas
2013-12-26 23:00         ` Dave Chinner
2014-01-08 18:13           ` Alex Lyakas
2014-01-13  3:02             ` Dave Chinner
2014-01-13 17:44               ` Alex Lyakas
2014-01-13 20:43                 ` Dave Chinner
2014-01-14 13:48                   ` Alex Lyakas
2014-01-15  1:45                     ` Dave Chinner
2014-01-19  9:38                       ` Alex Lyakas
2014-01-19 23:17                         ` Dave Chinner
2014-07-01 15:06                           ` xfs_growfs_data_private memory leak Alex Lyakas
2014-07-01 21:56                             ` Dave Chinner
2014-07-02 12:27                               ` Alex Lyakas
2014-08-04 18:15                                 ` Eric Sandeen
2014-08-06  8:56                                   ` Alex Lyakas
2014-08-04 11:00                             ` use-after-free on log replay failure Alex Lyakas
2014-08-04 14:12                               ` Brian Foster
2014-08-04 23:07                               ` Dave Chinner
2014-08-06 10:05                                 ` Alex Lyakas
2014-08-06 12:32                                   ` Dave Chinner
2014-08-06 14:43                                     ` Alex Lyakas
2014-08-10 16:26                                     ` Alex Lyakas
2014-08-06 12:52                                 ` Alex Lyakas
2014-08-06 15:20                                   ` Brian Foster
2014-08-06 15:28                                     ` Alex Lyakas
2014-08-10 12:20                                     ` Alex Lyakas
2014-08-11 13:20                                       ` Brian Foster
2014-08-11 21:52                                         ` Dave Chinner
2014-08-12 12:03                                           ` Brian Foster
2014-08-12 12:39                                             ` Alex Lyakas
2014-08-12 19:31                                               ` Brian Foster
2014-08-12 23:56                                               ` Dave Chinner
2014-08-13 12:59                                                 ` Brian Foster
2014-08-13 20:59                                                   ` Dave Chinner
2014-08-13 23:21                                                     ` Brian Foster
2014-08-14  6:14                                                       ` Dave Chinner
2014-08-14 19:05                                                         ` Brian Foster
2014-08-14 22:27                                                           ` Dave Chinner
2014-08-13 17:07                                                 ` Alex Lyakas
2014-08-13  0:03                                               ` Dave Chinner
2014-08-13 13:11                                                 ` Brian Foster
