* Questions about XFS discard and xfs_free_extent() code (newbie)
@ 2013-12-18 18:37 Alex Lyakas
  2013-12-18 23:06 ` Dave Chinner

From: Alex Lyakas @ 2013-12-18 18:37 UTC
To: xfs

Greetings XFS developers & community,

I am studying the XFS code, primarily focusing for now on the
free-space allocation and deallocation parts.

I learned that freeing an extent happens like this:
- xfs_free_extent() calls xfs_free_ag_extent(), which attempts to merge
  the freed extent with its left and right neighbours in the by-bno
  btree. The by-size btree is then updated accordingly.
- xfs_free_extent() marks the original (un-merged) extent as "busy" via
  xfs_extent_busy_insert(). This prevents the original extent from
  being allocated. (Except that for metadata allocations such an
  extent, or part of it, can be "unbusied" while it is still not marked
  for discard with XFS_EXTENT_BUSY_DISCARDED.)
- Once the appropriate part of the log is committed, xlog_cil_committed()
  calls xfs_discard_extents(). This discards the extents using the
  synchronous blkdev_issue_discard() API, and only then "unbusies"
  them. This makes sense, because we cannot allow these extents to be
  allocated until the discard has completed.

WRT this flow, I have some questions:

- xfs_free_extent() first inserts the extent into the free-space
  btrees, and only then marks it as busy. How come there is no race
  window here? Can somebody allocate the freed extent before it is
  marked as busy? Or are the free-space btrees somehow locked at this
  point? The code says "validate the extent size is legal now we have
  the agf locked". I more or less see that xfs_alloc_fix_freelist()
  locks *something*, but I don't see xfs_free_extent() unlocking
  anything.

- If xfs_extent_busy_insert() fails to allocate an xfs_extent_busy
  structure, such an extent cannot be discarded, correct?

- xfs_discard_extents() doesn't check the discard granularity of the
  underlying block device, like xfs_ioc_trim() does. So it may send a
  small discard request, which cannot be handled. If it had checked the
  granularity, it could have avoided sending small requests. But the
  thing is that the busy extent might have been merged in the
  free-space btree into a larger extent, which is now suitable for
  discard. I want to attempt the following logic in
  xfs_discard_extents():
  # search the by-bno free-space btree for a larger extent that fully
    encapsulates the busy extent (which we want to discard)
  # if found, check whether some other part of the larger extent is
    still busy (other than the current busy extent we want to discard)
  # if not, send a discard for the larger extent
  Does this make sense? I also think that we need to hold the larger
  extent locked somehow until the discard completes, to prevent
  allocation from the discarded range.

Can anybody please comment on these questions?

Thanks!
Alex.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
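The proposed logic boils down to two interval checks: "does a free extent fully cover the busy extent?" and "does any other busy extent intersect the candidate?". A minimal user-space sketch, with hypothetical names and types (this is not the actual XFS kernel API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative extent: starting block number and length in blocks. */
struct ext {
	uint64_t bno;
	uint64_t len;
};

/* Does the candidate free extent fully encapsulate the busy extent? */
static bool ext_covers(struct ext outer, struct ext inner)
{
	return outer.bno <= inner.bno &&
	       outer.bno + outer.len >= inner.bno + inner.len;
}

/* Do two extents overlap at all? This is the test one would run against
 * every *other* busy extent before discarding the larger free extent. */
static bool ext_overlaps(struct ext a, struct ext b)
{
	return a.bno < b.bno + b.len && b.bno < a.bno + a.len;
}
```

With these helpers, "safe to discard the larger extent" would mean: `ext_covers(larger, busy)` holds and `ext_overlaps(larger, other)` is false for every other busy extent in the AG.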
* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
@ 2013-12-18 23:06 ` Dave Chinner
  2013-12-19  9:24 ` Alex Lyakas

From: Dave Chinner @ 2013-12-18 23:06 UTC
To: Alex Lyakas; +Cc: xfs

On Wed, Dec 18, 2013 at 08:37:29PM +0200, Alex Lyakas wrote:
> Greetings XFS developers & community,
>
> I am studying the XFS code, primarily focusing now at the free-space
> allocation and deallocation parts.
[...]
> - xfs_free_extent first inserts the extent into the free-space
> btrees, and only then marks it as busy. How come there is no race
> window here?

Because the AGF is locked exclusively at this point, meaning only one
process can be modifying the free space tree at this point in time.

> Can somebody allocate the freed extent before it is marked as busy?
> Or the free-space btrees somehow are locked at this point? The code
> says "validate the extent size is legal now we have the agf locked".
> I more or less see that xfs_alloc_fix_freelist() locks *something*,
> but I don't see xfs_free_extent() unlocking anything.

The AGF remains locked until the transaction is committed. The
transaction commit code unlocks items modified in the transaction via
the ->iop_unlock log item callback....

> - If xfs_extent_busy_insert() fails to alloc a xfs_extent_busy
> structure, such extent cannot be discarded, correct?

Correct.

> - xfs_discard_extents() doesn't check the discard granularity of the
> underlying block device, like xfs_ioc_trim() does. So it may send a
> small discard request, which cannot be handled.

Discard is an "advisory" operation - it is never guaranteed to do
anything.

> If it would have checked the granularity, it could have avoided
> sending small requests. But the thing is that the busy extent might
> have been merged in the free-space btree into a larger extent, which
> is now suitable for discard.

Sure, but the busy extent tree tracks extents across multiple
transaction contexts, and we cannot merge extents that are in
different contexts.

> I want to attempt the following logic in xfs_discard_extents():
> # search the "by-bno" free-space btree for a larger extent that fully
> encapsulates the busy extent (which we want to discard)
> # if found, check whether some other part of the larger extent is
> still busy (except for the current busy extent we want to discard)
> # if no, send discard for the larger extent
> Does this make sense? And I think that we need to hold the larger
> extent locked somehow until the discard completes, to prevent
> allocation from the discarded range.

You can't search the freespace btrees in log IO completion context -
that will cause deadlocks, because we can be holding the locks used to
search the freespace trees when we issue a log force and block waiting
for log IO completion to occur, e.g. in xfs_extent_busy_reuse()....

Also, walking the free space btrees can be an IO bound operation -
overhead/latency we absolutely do not want to add to log IO
completion. Further, walking the free space btrees can be a memory
intensive operation (buffers are demand paged from disk), and log IO
completion may be necessary for memory reclaim to make progress in low
memory situations. So adding unbound memory demand to log IO
completion will cause low memory deadlocks, too.

IOWs, adding freespace tree processing to xfs_discard_extents() just
won't work.

What we really need is a smarter block layer implementation of the
discard operation - it needs to be asynchronous, and it needs to
support merging of adjacent discard requests. Now that SATA 3.1
devices are appearing on the market, queued trim operations are
possible. Dispatching discard operations synchronously prevents us
from taking advantage of them. Further, because it's synchronous, the
block layer can't merge adjacent discards, nor batch multiple discard
ranges up into a single TRIM command.

IOWs, what we really need is for the block layer discard code to be
brought up to the capabilities of the hardware on the market first.
Then we will be in a position to optimise the XFS code to use async
dispatch and new IO completion handlers to finish the log IO
completion processing, and at that point we shouldn't need to care
anymore.

Note that XFS already dispatches discards in ascending block order, so
if we issue adjacent discards the block layer will be able to merge
them appropriately. Hence we don't need to add that complexity to
XFS....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
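Because XFS issues discards in ascending block order, a block layer that supported asynchronous discard could coalesce touching or overlapping ranges before building a TRIM command. A hypothetical sketch of that merging over a sorted list (illustrative only, not actual kernel code):

```c
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t len;
};

/* Coalesce an array of ranges, sorted by start, in place: merge entries
 * that touch or overlap, and return the new count. This models what a
 * block layer could do with ascending-order discard requests. */
static size_t merge_ranges(struct range *r, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return 0;
	for (size_t i = 1; i < n; i++) {
		if (r[i].start <= r[out].start + r[out].len) {
			/* Touching or overlapping: extend the current range. */
			uint64_t end = r[i].start + r[i].len;

			if (end > r[out].start + r[out].len)
				r[out].len = end - r[out].start;
		} else {
			/* Gap: start a new output range. */
			r[++out] = r[i];
		}
	}
	return out + 1;
}
```

For example, the adjacent ranges [0,8) and [8,16) would collapse into a single [0,16) discard, while a disjoint [32,36) stays separate.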
* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
@ 2013-12-19  9:24 ` Alex Lyakas
  2013-12-19 10:55 ` Dave Chinner

From: Alex Lyakas @ 2013-12-19 9:24 UTC
To: Dave Chinner; +Cc: xfs

Hi Dave,
Thank you for your comments. I realize now that what I proposed cannot
be done; I need a deeper understanding of how XFS transactions work
(unfortunately, the awesome "XFS Filesystem Structure" doc has a TODO
in the "Journaling Log" section).

Can you please comment on one more question:
Let's say we had such a fully asynchronous "fire-and-forget" discard
operation (I can implement one myself for my block device via a custom
IOCTL). What is wrong with triggering such an operation in
xfs_free_ag_extent(), right after we have merged the freed extent into
a bigger one? I understand that the extent-free intent is not yet
committed to the log at this point. But from the user's point of view,
the extent has been deleted, no? So if the underlying block device
discards the merged extent right away, before the commit to the log,
what issues can this cause?

Thanks,
Alex.
* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
@ 2013-12-19 10:55 ` Dave Chinner
  2013-12-19 19:24 ` Alex Lyakas
  2013-12-24 18:21 ` Alex Lyakas

From: Dave Chinner @ 2013-12-19 10:55 UTC
To: Alex Lyakas; +Cc: xfs

On Thu, Dec 19, 2013 at 11:24:15AM +0200, Alex Lyakas wrote:
> Let's say we had such fully asynchronous "fire-and-forget" discard
> operation (I can implement one myself for my block-device via a
> custom IOCTL). What is wrong if we trigger such operation in
> xfs_free_ag_extent(), right after we have merged the freed extent
> into a bigger one? I understand that the extent-free-intent is not
> yet committed to the log at this point. But from the user's point of
> view, the extent has been deleted, no? So if the underlying block
> device discards the merged extent right away, before committing to
> the log, what issues this can cause?

Think of what happens when a crash occurs immediately after the
discard completes. The freeing of the extent never made it to the log,
so after recovery the file still exists and the user can access it.
Except that its contents are now all different to before the crash
occurred.

IOWs, issuing the discard before the transaction that frees the extent
is on stable storage means we are discarding user data or metadata
before we've guaranteed that the extent free transaction is permanent,
and that means we violate certain guarantees with respect to crash
recovery...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
@ 2013-12-19 19:24 ` Alex Lyakas
  2013-12-21 17:03 ` Chris Murphy

From: Alex Lyakas @ 2013-12-19 19:24 UTC
To: Dave Chinner; +Cc: xfs

Hi Dave,
It makes sense; I agree it might break some guarantees. Although if
the user deleted some blocks in the file, or the whole file, maybe it
is OK not to have a clear promise about what he sees after the crash.
But I agree, these are not clear semantics.

Thanks for the comments,
Alex.
* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
@ 2013-12-21 17:03 ` Chris Murphy

From: Chris Murphy @ 2013-12-21 17:03 UTC
To: xfs

On Dec 19, 2013, at 12:24 PM, Alex Lyakas <alex@zadarastorage.com> wrote:
> Hi Dave,
> It makes sense. I agree it might break some guarantees. Although if
> the user deleted some blocks in the file or the whole file, maybe
> it's ok to not have a clear promise what he sees after the crash.

User perspective: I disagree. Sounds like a possible zombie file
invasion, with no clear way of reversion. The file either needs to be
gone, as in not accessible in user space, or it needs to be present
and intact. There isn't a reasonable expectation for a file to be
resurrected from the dead that's also corrupted. If the file name
isn't also corrupted, the problem is worse: it looks like a legitimate
file, yet it's useless.

The zombie files will be subject to backup and restore, just like
their valid predecessors. All I need is to stumble upon a handful of
these files, which I won't necessarily remember were deleted, to start
assuming I have some sort of weird file system corruption, at which
point at best I'll become really confused, not knowing what to do
next. At worst, I may start throwing hammers that end up causing worse
problems.

Chris Murphy
* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
@ 2013-12-24 18:21 ` Alex Lyakas
  2013-12-26 23:00 ` Dave Chinner

From: Alex Lyakas @ 2013-12-24 18:21 UTC
To: Dave Chinner; +Cc: xfs

Hi Dave,
Reading through the code some more, I see that the extent freed
through xfs_free_extent() can be an XFS metadata extent as well. For
example, xfs_inobt_free_block() frees a block of the AG's free-inode
btree. Also, xfs_bmbt_free_block() frees a generic btree block by
putting it onto the cursor's "to-be-freed" list, which will be dropped
into the free-space btree (by xfs_free_extent) in xfs_bmap_finish().
If we discard such a metadata block before the transaction is
committed to the log and then crash, we might not be able to properly
mount after reboot, is that right? I mean, it's not just that some
file's data block will show zeros to the user instead of the
before-delete data; some XFS btree node (for example) will be wiped in
such a case. Can this happen?

Thanks,
Alex.
* Re: Questions about XFS discard and xfs_free_extent() code (newbie)
@ 2013-12-26 23:00 ` Dave Chinner
  2014-01-08 18:13 ` Alex Lyakas

From: Dave Chinner @ 2013-12-26 23:00 UTC
To: Alex Lyakas; +Cc: xfs

On Tue, Dec 24, 2013 at 08:21:50PM +0200, Alex Lyakas wrote:
> Reading through the code some more, I see that the extent that is
> freed through xfs_free_extent() can be an XFS metadata extent as
> well. For example, xfs_inobt_free_block() frees a block of the AG's
> free-inode btree. Also, xfs_bmbt_free_block() frees a generic btree
> block by putting it onto the cursor's "to-be-freed" list, which will
> be dropped into the free-space btree (by xfs_free_extent) in
> xfs_bmap_finish(). If we discard such metadata block before the
> transaction is committed to the log and we crash, we might not be
> able to properly mount after reboot, is that right?

Yes. The log stores a delta of the transactional changes, and so
requires the previous version of the block to be intact for recovery
to take place.

> I mean it's not
> that some file's data block will show 0s to the user instead of
> before-delete data, but some XFS btree node (for example) will be
> wiped in such case. Can this happen?

Yes, it could. That's what I meant by:

[snip]

> > IOWs, issuing the discard before the transaction that frees the
> > extent is on stable storage means we are discarding user data or
                                                                 ^^
> > metadata before we've guaranteed that the extent free transaction
    ^^^^^^^^
> > is permanent and that means we violate certain guarantees with
> > respect to crash recovery...

The "or metadata" part of the above sentence.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
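The point about delta-based recovery can be made concrete with a toy model (this is not the XFS log format): a log record that covers only the bytes a transaction changed reproduces the correct block only when replayed onto the previous block contents; replayed onto a discarded (zeroed) block, every byte the record does not cover is lost.

```c
#include <stddef.h>
#include <string.h>

/* Toy physical delta: replace d->len bytes at offset d->off. */
struct delta {
	size_t off;
	const char *bytes;
	size_t len;
};

/* Replay the delta onto a block image. Recovery is only correct if
 * "block" still holds the pre-transaction contents. */
static void replay(char *block, const struct delta *d)
{
	memcpy(block + d->off, d->bytes, d->len);
}
```

Replaying onto the intact previous block reconstructs the post-transaction block; replaying onto zeros does not, which is why discarding a metadata block before its freeing transaction is stable can make the filesystem unrecoverable.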
* Re: Questions about XFS discard and xfs_free_extent() code (newbie) 2013-12-26 23:00 ` Dave Chinner @ 2014-01-08 18:13 ` Alex Lyakas 2014-01-13 3:02 ` Dave Chinner 0 siblings, 1 reply; 47+ messages in thread From: Alex Lyakas @ 2014-01-08 18:13 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Hello Dave, Currently I am working on the following approach: Basic idea: make each xfs_extent_busy struct carry potential information about a larger extent-for-discard, i.e., one that is a multiple of discard_granularity. In detail this looks like this: # xfs_free_ag_extent attempts to merge the freed extent into a larger free-extent in the by-bno btree. When this function completes its work, we have [nbno, nlen], which is potentially a larger free-extent. At this point we also know that AGF is locked. # We trim [nbno, nlen] to be a multiple-of and aligned-by discard_granularity (if possible), and we receive [dbno, dlen], which is a very nice extent to discard. # When calling xfs_extent_busy_insert(), we add these two values to the xfs_extent_busy struct. # When the extent-free operation is committed for this busy extent, we know that we can discard this [dbno, dlen] area, unless somebody have allocated an extent, which overlaps this area. To address that, at the end of xfs_alloc_fixup_trees() we do the following: # We know that we are going to allocate [rbno, rlen] from appropriate AG. So at this point, we search the busy extents tree to check if there is a busy extent that holds [dbno, dlen] (this is a multiple-of and aligned-by discard granularity), which overlaps [rbno, rlen] fully or partially. If found, we shrink the [dbno, dlen] area to be still a multiple-of and aligned by discard-granularity, if possible. So we have a new smaller [dbno, dlen] that we still can discard, attached to the same busy extent. Or we discover that the new area is too small to discard, so we forget about it. 
# The allocation flow anyways searches the busy extents tree, so we should be ok WRT to locking order, but adding some extra work. This way, we basically track larger chunks, which are nice to discard. I am aware that I need to handle additional issues like: # A busy extent can be "unbusyied" or shrunk by xfs_extent_busy_update_extent(). We need to update [dbno, dlen] accordingly or delete it fully # To be able to search for [dbno, dlen], we probably need another rbtree (under the same pag->pagb_lock), which tracks large extents for discard. xfs_extent_busy needs additional rbnode. # If during xfs_alloc_fixup_trees() we discover that extent is already being discarded, we need to wait. Assuming we have asynchronous discard, this wait will be short - we only need the block device to queue the discard request, and then we are good to allocate from that area again. One thing I am unsure about, is a scenario like this: # assume discard-granularity=1MB # we have a 1MB almost free, except two 4K blocks, somewhere in the free space # Transaction t1 comes and frees 4K block A, but the 1MB extent is not fully free yet, so nothing to discard # Transaction t2 frees the second 4K block B, now 1MB is free and we attach a [dbno, dlen] to the second busy extent However, I think there is no guarantee that t1 will commit before t2; is that right? But we cannot discard the 1MB extent, before both transactions commit. (One approach to solve this, is to give a sequence number for each xfs_extent_busy extent, and have a background thread that does delayed discards, once all needed busy extents are committed. The delayed discards are also considered in the check that xfs_alloc_fixup_trees() does). What do you think overall about this approach? Is there something fundamental that prevents it from working? Also (if you are still reading:), can you kindly comment this question that I have: # xfs_free_ag_extent() has a "isfl" parameter. 
If it is "true", then this extent is added as usual to the free-space btrees, but the caller doesn't add it as a busy extent. This means that such extent is suitable for allocation right away, without waiting for the log commit? Thank you for helping, Alex. -----Original Message----- From: Dave Chinner Sent: 27 December, 2013 1:00 AM To: Alex Lyakas Cc: xfs@oss.sgi.com Subject: Re: Questions about XFS discard and xfs_free_extent() code (newbie) On Tue, Dec 24, 2013 at 08:21:50PM +0200, Alex Lyakas wrote: > Hi Dave, > Reading through the code some more, I see that the extent that is > freed through xfs_free_extent() can be an XFS metadata extent as > well. > For example, xfs_inobt_free_block() frees a block of the AG's > free-inode btree. Also, xfs_bmbt_free_block() frees a generic btree > block by putting it onto the cursor's "to-be-freed" list, which will > be dropped into the free-space btree (by xfs_free_extent) in > xfs_bmap_finish(). If we discard such metadata block before the > transaction is committed to the log and we crash, we might not be > able to properly mount after reboot, is that right? Yes. The log stores a delta of the transactional changes, and so requires th eprevious version of the block to be intact for revoery to take place. > I mean it's not > that some file's data block will show 0s to the user instead of > before-delete data, but some XFS btree node (for example) will be > wiped in such case. Can this happen? Yes, it could. That's what I meant by: [snip] > > IOWs, issuing the discard before the transaction that frees the > > extent is on stable storage means we are discarding user data or ^^ > > metadata before we've guaranteed that the extent free transaction ^^^^^^^^ > > is permanent and that means we violate certain guarantees with > > respect to crash recovery... The "or metadata" part of the above sentence. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
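[Editorial note: the trimming step proposed in the parent mail — cutting a merged free extent [nbno, nlen] down to a discard-granularity-aligned candidate [dbno, dlen] — is plain alignment arithmetic. A minimal userspace sketch; the names are illustrative, not from the XFS source, and units are filesystem blocks:]

```c
#include <stdint.h>

static inline uint64_t round_up(uint64_t x, uint64_t g)
{
	return (x + g - 1) / g * g;
}

static inline uint64_t round_down(uint64_t x, uint64_t g)
{
	return x / g * g;
}

/*
 * Length of the largest sub-extent of [bno, bno+len) that starts and
 * ends on a 'gran' boundary; 0 means nothing worth discarding.
 */
static uint64_t discardable_len(uint64_t bno, uint64_t len, uint64_t gran)
{
	uint64_t start = round_up(bno, gran);
	uint64_t end = round_down(bno + len, gran);

	return end > start ? end - start : 0;
}

/* Start of that aligned sub-extent (meaningful only if len above > 0). */
static uint64_t discardable_start(uint64_t bno, uint64_t gran)
{
	return round_up(bno, gran);
}
```

With a 256-block granularity, a merged free extent [5, 605) yields the single aligned candidate [256, 512), while [5, 305) yields nothing — exactly the "too small to discard" case discussed later in the thread.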
* Re: Questions about XFS discard and xfs_free_extent() code (newbie) 2014-01-08 18:13 ` Alex Lyakas @ 2014-01-13 3:02 ` Dave Chinner 2014-01-13 17:44 ` Alex Lyakas 0 siblings, 1 reply; 47+ messages in thread From: Dave Chinner @ 2014-01-13 3:02 UTC (permalink / raw) To: Alex Lyakas; +Cc: xfs On Wed, Jan 08, 2014 at 08:13:38PM +0200, Alex Lyakas wrote: > Hello Dave, > > Currently I am working on the following approach: > > Basic idea: make each xfs_extent_busy struct carry potential > information about a larger extent-for-discard, i.e., one that is a > multiple of discard_granularity. You're making a big assumption here that discard_granularity is useful for aggregating large extents. Discard granularity on modern SSDs is a single sector:

$ cat /sys/block/sda/queue/discard_granularity
512
$

I've just checked 4 different types of SSDs I have here (from 3 years old no-name sandforce SSDs to a brand new samsung 840 EVO), and they all give the same result. IOWs, in most cases other than thin provisioning it will not be useful for optimising discards into larger, aligned extents. > In detail this looks like this: > # xfs_free_ag_extent attempts to merge the freed extent into a > larger free-extent in the by-bno btree. When this function > completes its work, we have [nbno, nlen], which is potentially a > larger free-extent. At this point we also know that AGF is locked. > # We trim [nbno, nlen] to be a multiple-of and aligned-by > discard_granularity (if possible), and we receive [dbno, dlen], > which is a very nice extent to discard. 
optimising discard ranges at extent free time ignores the fact that the extent can be immediately reallocated, and so all the optimisation work is wasted. > To address that, at the end of xfs_alloc_fixup_trees() we do the following: > # We know that we are going to allocate [rbno, rlen] from > appropriate AG. So at this point, we search the busy extents tree to > check if there is a busy extent that holds [dbno, dlen] (this is a > multiple-of and aligned-by discard granularity), which overlaps How do you find that? The busy extent tree is indexed on individual extents that have been freed, not the discard granularity ranges they track. > [rbno, rlen] fully or partially. If found, we shrink the [dbno, > dlen] area to be still a multiple-of and aligned by > discard-granularity, if possible. So we have a new smaller [dbno, > dlen] that we still can discard, attached to the same busy extent. > Or we discover that the new area is too small to discard, so we > forget about it. Forget about the discard range, right? We can't ignore the busy extent that covers the range being freed - it must be tracked all the way through to transaction commit completion. > # The allocation flow anyways searches the busy extents tree, so we > should be ok WRT to locking order, but adding some extra work. There are other locking issues to be concerned about than order.... > I am aware that I need to handle additional issues like: > # A busy extent can be "unbusyied" or shrunk by > xfs_extent_busy_update_extent(). We need to update [dbno, dlen] > accordingly or delete it fully > # To be able to search for [dbno, dlen], we probably need another > rbtree (under the same pag->pagb_lock), which tracks large extents > for discard. xfs_extent_busy needs additional rbnode. > # If during xfs_alloc_fixup_trees() we discover that extent is > already being discarded, we need to wait. 
Assuming we have > asynchronous discard, this wait will be short - we only need the > block device to queue the discard request, and then we are good to > allocate from that area again. * multiple busy extents can be found in the one "discard range". Hence there is an n:1 relationship between busy extents and the "discard extent" that might be related to them. Hence if we end up with:

      busy1   busy2   busy3
    +-------+-------+------+
    +----------------------+
           discard1

and then we reuse busy2, then we have to delete discard1 and update busy1 and busy3 not to point at discard1. Indeed, depending on the discard granularity, it might end up:

      busy1           busy3
    +-------+        +------+
    +-------+        +------+
     discard1        discard2

And so the act of having to track optimal "discard ranges" becomes very, very complex. I really don't see any advantage in tracking discard ranges like this, because we can do these optimisations of merging and trimming just before issuing the discards. And realistically, merging and trimming is something the block layer should be doing for us already. > > One thing I am unsure about, is a scenario like this: > # assume discard-granularity=1MB > # we have a 1MB almost free, except two 4K blocks, somewhere in the > free space > # Transaction t1 comes and frees 4K block A, but the 1MB extent is > not fully free yet, so nothing to discard > # Transaction t2 frees the second 4K block B, now 1MB is free and we > attach a [dbno, dlen] to the second busy extent > > However, I think there is no guarantee that t1 will commit before > t2; is that right? Correct. > But we cannot discard the 1MB extent, before both > transactions commit. (One approach to solve this, is to give a > sequence number for each xfs_extent_busy extent, and have a > background thread that does delayed discards, once all needed busy > extents are committed. The delayed discards are also considered in > the check that xfs_alloc_fixup_trees() does). 
We used to have a log sequence number in the busy extent to prevent reuse of a busy extent - it would trigger a log force up to the given LSN before allowing the extent to be reused. It caused significant scalability problems for the busy extent tracking code, and so it was removed and replaced with the non-blocking searches we do now. See:

    ed3b4d6 xfs: Improve scalability of busy extent tracking
    e26f050 xfs: do not immediately reuse busy extent ranges
    97d3ac7 xfs: exact busy extent tracking

i.e. blocking waiting for discard or log IO completion while holding the AGF locked is a major problem for allocation latency determinism. With discards potentially taking seconds, waiting for them to complete while holding the AGF locked will effectively stall parts of the filesystem for long periods of time. That blocking is what the above commits prevent, and by doing this allow us to use the busy extent tree for issuing discard on ranges that have been freed.... > What do you think overall about this approach? Is there something > fundamental that prevents it from working? I'm not convinced that re-introducing busy extent commit sequence tracking and blocking to optimise discard operations is a particularly good idea given the above. > Also (if you are still reading:), can you kindly comment this > question that I have: > # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then > this extent is added as usual to the free-space btrees, but the > caller doesn't add it as a busy extent. This means that such extent > is suitable for allocation right away, without waiting for the log > commit? It means the extent is being moved from the AGFL to the free space btree. Blocks on the AGFL have already gone through free space accounting and busy extent tracking to get to the AGFL, and so there is no need to repeat it when moving it to the free space btrees. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
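[Editorial note: the discard1 → discard1/discard2 split Dave sketches above is mechanical once the reused range is known. A hedged userspace sketch, assuming the discard range is already aligned to the granularity and the reused range lies within it; all names are hypothetical:]

```c
#include <stdint.h>

static inline uint64_t rup(uint64_t x, uint64_t g)
{
	return (x + g - 1) / g * g;
}

static inline uint64_t rdn(uint64_t x, uint64_t g)
{
	return x / g * g;
}

/*
 * A tracked discard range [dbno, dbno+dlen), aligned to 'gran', has
 * [rbno, rbno+rlen) reallocated out of it.  Return how many aligned
 * discard pieces survive (0, 1 or 2), mirroring the discard1/discard2
 * split in Dave's diagram.  A piece survives only if it can still be
 * expressed as a whole multiple of the granularity.
 */
static int surviving_pieces(uint64_t dbno, uint64_t dlen,
			    uint64_t rbno, uint64_t rlen, uint64_t gran)
{
	uint64_t left_end = rdn(rbno, gran);		/* trim left piece */
	uint64_t right_start = rup(rbno + rlen, gran);	/* trim right piece */
	int n = 0;

	if (left_end > dbno)
		n++;
	if (dbno + dlen > right_start)
		n++;
	return n;
}
```

Note how a tiny reuse near one edge (e.g. one block at offset 1 with a granularity of 4) silently destroys the whole left piece — the "very, very complex" bookkeeping Dave objects to.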
* Re: Questions about XFS discard and xfs_free_extent() code (newbie) 2014-01-13 3:02 ` Dave Chinner @ 2014-01-13 17:44 ` Alex Lyakas 2014-01-13 20:43 ` Dave Chinner 0 siblings, 1 reply; 47+ messages in thread From: Alex Lyakas @ 2014-01-13 17:44 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Hi Dave, Thank you for your comments, and for pointing me at the commits. On Mon, Jan 13, 2014 at 5:02 AM, Dave Chinner <david@fromorbit.com> wrote: > On Wed, Jan 08, 2014 at 08:13:38PM +0200, Alex Lyakas wrote: >> Hello Dave, >> >> Currently I am working on the following approach: >> >> Basic idea: make each xfs_extent_busy struct carry potential >> information about a larger extent-for-discard, i.e., one that is a >> multiple of discard_granularity. > > You're making a big assumption here that discard_granularity is > useful for aggregating large extents. Discard granularity on modern > SSDs is a single sector: > > $ cat /sys/block/sda/queue/discard_granularity > 512 > $ > > I've just checked 4 different types of SSDs I have here (from 3 > years old no-name sandforce SSDs to a brand new samsung 840 EVO), > and htey all give the same result. > > IOWs, in most cases other than thin provisioning it will not be > useful for optimising discards into larger, aligned extents. Agree. The use case I primarily look into is, indeed, thin provisioning. In my case, discard granularity can be, for example, 256Kb or even higher. I realize this is not the case with SSDs. > >> In detail this looks like this: >> # xfs_free_ag_extent attempts to merge the freed extent into a >> larger free-extent in the by-bno btree. When this function >> completes its work, we have [nbno, nlen], which is potentially a >> larger free-extent. At this point we also know that AGF is locked. >> # We trim [nbno, nlen] to be a multiple-of and aligned-by >> discard_granularity (if possible), and we receive [dbno, dlen], >> which is a very nice extent to discard. 
>> # When calling xfs_extent_busy_insert(), we add these two values to >> the xfs_extent_busy struct. >> # When the extent-free operation is committed for this busy extent, >> we know that we can discard this [dbno, dlen] area, unless somebody >> have allocated an extent, which overlaps this area. > > This strikes me as optimisation at the wrong time i.e. optimising > discard ranges at extent free time ignores the fact that the extent > can be immediately reallocated, and so all the optimisation work is > wasted. Agree it can be reallocated. Although, based on the code, only a metadata extent can be reallocated immediately, while user data busy extents cannot be "unbusied" until their freeing is committed to the log. I agree it can happen, that a non-busy part of a discard-range can be allocated, and then the whole range cannot be discarded. > >> To address that, at the end of xfs_alloc_fixup_trees() we do the following: >> # We know that we are going to allocate [rbno, rlen] from >> appropriate AG. So at this point, we search the busy extents tree to >> check if there is a busy extent that holds [dbno, dlen] (this is a >> multiple-of and aligned-by discard granularity), which overlaps > > How do you find that? The busy extent tree is indexed on individual > extents that have been freed, not the discard granularity ranges > they track. Agree, that's why I mentioned that another rbtree is needed, that indexes discard-ranges. > >> [rbno, rlen] fully or partially. If found, we shrink the [dbno, >> dlen] area to be still a multiple-of and aligned by >> discard-granularity, if possible. So we have a new smaller [dbno, >> dlen] that we still can discard, attached to the same busy extent. >> Or we discover that the new area is too small to discard, so we >> forget about it. > > Forget about the discard range, right? We can't ignore the busy > extent that covers the range being freed - it must be tracked all > the way through to transaction commit completion. 
Agree that busy extent cannot be forgotten about. I meant that we forget only about the discard range that is related to this busy extent. So when the transaction commit completes, we will "unbusy" the busy extent as usual, but we will not discard anything, because there will be no discard-range connected to this busy extent. > >> # The allocation flow anyways searches the busy extents tree, so we >> should be ok WRT to locking order, but adding some extra work. > > There are other locking issues to be concerned about than order.... > >> I am aware that I need to handle additional issues like: >> # A busy extent can be "unbusyied" or shrunk by >> xfs_extent_busy_update_extent(). We need to update [dbno, dlen] >> accordingly or delete it fully >> # To be able to search for [dbno, dlen], we probably need another >> rbtree (under the same pag->pagb_lock), which tracks large extents >> for discard. xfs_extent_busy needs additional rbnode. >> # If during xfs_alloc_fixup_trees() we discover that extent is >> already being discarded, we need to wait. Assuming we have >> asynchronous discard, this wait will be short - we only need the >> block device to queue the discard request, and then we are good to >> allocate from that area again. > > * multiple busy extents can be found in the one "discard range". > Hence there is a n:1 relationship between the busy extents and the > related "discard extent" that might be related to it. Hence if we > end up with:
>
>       busy1   busy2   busy3
>     +-------+-------+------+
>     +----------------------+
>            discard1
>
> and then we reuse busy2, then we have to delete discard1 and update > busy1 and busy3 not to point at discard1. Indeed, depending on > the discard granularity, it might ned up:
>
>       busy1           busy3
>     +-------+        +------+
>     +-------+        +------+
>      discard1        discard2
>
> And so the act of having to track optimal "discard ranges" becomes > very, very complex. I agree with the examples you have given. 
That's why I realized that the numbering of busy extents is needed. In the first example, the extent that was freed *last* out of busy1/busy2/busy3 will be the one pointing at the discard range. Assume it was busy3. Continuing with your example, even if we fully reuse busy2, we delete it from pagb_tree, but we *never* delete it from the transaction's t_busy list. So we can split discard1 into discard1/2 and keep them connected to busy3. When busy3 commits, we check if busy1 and busy2 have already committed, using the busy extents numbering. If yes, we discard. If not, we "unbusy" busy3 as usual, and we leave discard1/2 in the second rbtree (they have the sequence number of busy3 to know when we can discard them). Similar would have happened if discard1 was originally connected to busy2. Even after full reusage of busy2, busy2 is not deleted until it commits (it is only erased from the pagb_tree). > > I really don't see any advantage in tracking discard ranges like > this, because we can do these optimisations of merging and trimming > just before issuing the discards. And realistically, merging and > trimming is something the block layer should be doing for us > already. > Yes, I realize that XFS mindset is to do pure filesystem work, i.e., arrange blocks of data in files and map them to disk. The rest should be handled by application above and by the storage system below. In your awesome AU2012 talk, you also confirm that mindset. Trouble is that the block layer cannot really merge small discard requests, without the information that you have in AGF btrees. 
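[Editorial note: Alex's point that the block layer lacks the free-space picture can be shown with a toy merge: two sub-granularity frees that happen to be adjacent form one range worth discarding, but only the free-space btree view reveals the adjacency. A sketch under those assumptions — illustrative names, input sorted by block number and non-overlapping:]

```c
#include <stdint.h>

struct ext {
	uint64_t bno;	/* starting block */
	uint64_t len;	/* length in blocks */
};

/*
 * Coalesce runs of exactly-adjacent extents (as the by-bno btree merge
 * in xfs_free_ag_extent would) and return the largest merged length.
 * Seen individually, each input may be below the discard granularity;
 * merged, the run may be well worth discarding.
 */
static uint64_t largest_merged_len(const struct ext *e, int n)
{
	uint64_t best = 0;
	int i = 0;

	while (i < n) {
		uint64_t start = e[i].bno;
		uint64_t end = e[i].bno + e[i].len;

		i++;
		while (i < n && e[i].bno == end) {	/* adjacent: merge */
			end += e[i].len;
			i++;
		}
		if (end - start > best)
			best = end - start;
	}
	return best;
}
```

With a 1MB granularity and 4K frees, the block layer sees a stream of undiscardable 4K requests, while the merged view may expose whole megabytes — which is exactly why the thread keeps the optimisation inside the filesystem.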
>> >> One thing I am unsure about, is a scenario like this: >> # assume discard-granularity=1MB >> # we have a 1MB almost free, except two 4K blocks, somewhere in the >> free space >> # Transaction t1 comes and frees 4K block A, but the 1MB extent is >> not fully free yet, so nothing to discard >> # Transaction t2 frees the second 4K block B, now 1MB is free and we >> attach a [dbno, dlen] to the second busy extent >> >> However, I think there is no guarantee that t1 will commit before >> t2; is that right? > > Correct. > >> But we cannot discard the 1MB extent, before both >> transactions commit. (One approach to solve this, is to give a >> sequence number for each xfs_extent_busy extent, and have a >> background thread that does delayed discards, once all needed busy >> extents are committed. The delayed discards are also considered in >> the check that xfs_alloc_fixup_trees() does). > > We used to have a log sequence number in the busy extent to prevent > reuse of a busy extent - it would trigger a log force up to the > given LSN before allowing the extent to be reused. It caused > significant scalability problems for the busy extent tracking code, > and so it was removed and replaced with the non-blocking searches we > do now. See: > > ed3b4d6 xfs: Improve scalability of busy extent tracking > e26f050 xfs: do not immediately reuse busy extent ranges > 97d3ac7 xfs: exact busy extent tracking > > i.e. blocking waiting for discard or log IO completion while holding > the AGF locked is a major problem for allocation latency > determinism. With discards potentially taking seconds, waiting for > them to complete while holding the AGF locked will effectively stall > parts of the filesystem for long periods of time. That blocking is > what the above commits prevent, and by doing this allow us to use > the busy extent tree for issuing discard on ranges that have been > freed.... > >> What do you think overall about this approach? 
Is there something >> fundamental that prevents it from working? > > I'm not convinced that re-introducing busy extent commit > sequence tracking and blocking to optimise discard operations is a > particularly good idea given the above. I am not sure I am suggesting to block or lock anything (perhaps I am, without realizing that). By and large, I suggest to have another data structure, an rbtree, that tracks discard ranges. This rbtree is loosely connected to the busy extent rbtree. And I suggest three things to do with this new rbtree: - Whenever a busy extent is added, maybe add a discard range to the second rbtree, and attach it to the busy extent (if we got a nice discard range) - For each new allocation, check if something needs to be removed/changed in this rbtree. Yes, I realize this is additional work. - When a busy extent commits, by all means we "unbusy" the extent as usual. But we also check in the second rbtree, whether we can issue a discard for some discard range. Perhaps we can. Or we cannot because of other busy extents, that have not committed yet (the numbering is used to determine that). In that case, we will discard later, when all the needed busy extent commit. Unless new allocation removed/changed this discard range already. But we are not delaying the "unbusying" of the busy extent, and we are not keeping the AGF locked (I think). Also, we are issuing discards in the same place and context where XFS does it today. > >> Also (if you are still reading:), can you kindly comment this >> question that I have: >> # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then >> this extent is added as usual to the free-space btrees, but the >> caller doesn't add it as a busy extent. This means that such extent >> is suitable for allocation right away, without waiting for the log >> commit? > > It means the extent is being moved from the AGFL to the free space > btree. 
blocks on the AGFL have already gone through free space > accounting and busy extent tracking to get to the AGFL, and so there > is no need to repeat it when moving it to the free space btrees. Ok, I realize now that this block has already gone through the busy extent tracking via xfs_allocbt_free_block(). Thanks, Alex. P.S.: Just watched your AU2014 talk. Interesting. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Questions about XFS discard and xfs_free_extent() code (newbie) 2014-01-13 17:44 ` Alex Lyakas @ 2014-01-13 20:43 ` Dave Chinner 2014-01-14 13:48 ` Alex Lyakas 0 siblings, 1 reply; 47+ messages in thread From: Dave Chinner @ 2014-01-13 20:43 UTC (permalink / raw) To: Alex Lyakas; +Cc: xfs On Mon, Jan 13, 2014 at 07:44:13PM +0200, Alex Lyakas wrote: > On Mon, Jan 13, 2014 at 5:02 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Jan 08, 2014 at 08:13:38PM +0200, Alex Lyakas wrote: > Hi Dave, > Thank you for your comments, and for pointing me at the commits. [snip stuff we both understand] > > I really don't see any advantage in tracking discard ranges like > > this, because we can do these optimisations of merging and trimming > > just before issuing the discards. And realistically, merging and > > trimming is something the block layer should be doing for us > > already. > > > Yes, I realize that XFS mindset is to do pure filesystem work, i.e., > arrange blocks of data in files and map them to disk. The rest should > be handled by application above and by the storage system below. In > your awesome AU2012 talk, you also confirm that mindset. Trouble is > that the block layer cannot really merge small discard requests, > without the information that you have in AGF btrees. Ok, I see what you are getting at here - mounting -o discard does not alleviate the need for occasionally running fstrim to issue discards on large free extents that may have formed from lots of small, disjoint extents being freed. IOWs, we cannot perfectly optimise fine grained discards without some form of help from tracking "overlaps" with already freed, discarded space. [snip] > >> What do you think overall about this approach? Is there something > >> fundamental that prevents it from working? > > > > I'm not convinced that re-introducing busy extent commit > > sequence tracking and blocking to optimise discard operations is a > > particularly good idea given the above. 
> I am not sure I am suggesting to block or lock anything (perhaps I am, > without realizing that). By and large, I suggest to have another data > structure, an rbtree, that tracks discard ranges. Right, I understand this. Go back to this comment you had about allocating a range that a discard is currently being issued on: | If during xfs_alloc_fixup_trees() we discover that extent is already | being discarded, we need to wait. Assuming we have asynchronous | discard, this wait will be short - we only need the block device to | queue the discard request, and then we are good to allocate from | that area again That will be blocking with the AGF held, regardless of whether we have asynchronous discard or not. Essentially, background discard can be considered "asynchronous" when viewed from the context of allocation. I'd forgotten that we effectively do that blocking right now xfs_extent_busy_update_extent(), when we trip over an extent being discarded, so this shouldn't be a blocker for a different discard tracking implementation. :) > This rbtree is > loosely connected to the busy extent rbtree. And I suggest three > things to do with this new rbtree: Yes, but lets improve that "loose connection" by making them almost not connected at all. > - Whenever a busy extent is added, maybe add a discard range to the > second rbtree, and attach it to the busy extent (if we got a nice > discard range) > - For each new allocation, check if something needs to be > removed/changed in this rbtree. Yes, I realize this is additional > work. It's not a huge amount of extra work compared to the rest of the allocation path, so I don't see this as a major issue. > - When a busy extent commits, by all means we "unbusy" the extent as > usual. But we also check in the second rbtree, whether we can issue a > discard for some discard range. Perhaps we can. Or we cannot because > of other busy extents, that have not committed yet (the numbering is > used to determine that). 
In that case, we will discard later, when all > the needed busy extent commit. Unless new allocation removed/changed > this discard range already. But we are not delaying the "unbusying" of > the busy extent, and we are not keeping the AGF locked (I think). > Also, we are issuing discards in the same place and context where XFS > does it today. This is where I think the issues lie. We don't want to have to do anything when a busy extent is removed at transaction commit - that's the reason online discard sucks right now. And we want to avoid having to care about transactions and ordering when it comes to tracking discard ranges and issuing them. The way I see it is that if we have a worker thread that periodically walks the discard tree to issue discards, we simply need to do a busy extent tree lookup on the range of each discard being tracked. If there are busy extents that span the discard range, then the free space isn't yet stable and so we can't issue the discard on that range. If there are no busy extents over the discard range then the free space is stable and we can issue the discard. i.e. if we completely dissociate the discard and busy extent tracking and just replace it with a busy extent lookup at discard time then we don't need any sort of reference counting or log sequence tracking on busy extents or discard ranges. FWIW, if we do this then we can change fstrim xfs_trim_extents() to queue up all the work to be done in the background simply by populating the discard tree with all the free space ranges we wish to discard. This will significantly reduce the impact of fstrim on filesystem runtime performance as the AGF will only be held locked long enough to populate the discard tree. And if we do the work per-ag, then we are also parallelising it by allowing discards on multiple AGs to be issued at once and hence it will be significantly faster on devices that can queue TRIM commands (SATA 3.1, SAS and NVMe devices)..... 
The fact that this track and background issue mechanism would allow us to optimise both forms of discard the filesystem supports makes me optimistic that we are on the right path. :) > >> question that I have: > >> # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then > >> this extent is added as usual to the free-space btrees, but the > >> caller doesn't add it as a busy extent. This means that such extent > >> is suitable for allocation right away, without waiting for the log > >> commit? > > > > It means the extent is being moved from the AGFL to the free space > > btree. blocks on the AGFL have already gone through free space > > accounting and busy extent tracking to get to the AGFL, and so there > > is no need to repeat it when moving it to the free space btrees. > Ok, I realize now that this block has already gone through the busy > extent tracking via xfs_allocbt_free_block(). Right, and note that blocks going through that path aren't discarded due to the XFS_EXTENT_BUSY_SKIP_DISCARD flag. This is due to the fact they are being freed to the AGFL and as such are likely to be reused immediately. ;) > P.S.: Just watched your AU2014 talk. Interesting. It was a little bit different. And fun. ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Questions about XFS discard and xfs_free_extent() code (newbie) 2014-01-13 20:43 ` Dave Chinner @ 2014-01-14 13:48 ` Alex Lyakas 2014-01-15 1:45 ` Dave Chinner 0 siblings, 1 reply; 47+ messages in thread From: Alex Lyakas @ 2014-01-14 13:48 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Hi Dave, [snip] > Right, I understand this. Go back to this comment you had about > allocating a range that a discard is currently being issued on: > > | If during xfs_alloc_fixup_trees() we discover that extent is already > | being discarded, we need to wait. Assuming we have asynchronous > | discard, this wait will be short - we only need the block device to > | queue the discard request, and then we are good to allocate from > | that area again > > That will be blocking with the AGF held, regardless of whether we > have asynchronous discard or not. Essentially, background discard > can be considered "asynchronous" when viewed from the context of > allocation. > > I'd forgotten that we effectively do that blocking right now > xfs_extent_busy_update_extent(), when we trip over an extent being > discarded, so this shouldn't be a blocker for a different discard > tracking implementation. :) Exactly, you already do that anyways. And similarly, we will do:

    lock()
    lookup overlapping discard-range
    if range->flags & XFS_EXTENT_BUSY_DISCARDED
        => unlock and sleep and retry...

... and when discarding a discard-range, we mark it with this flag (under lock) and leave it in the tree until discarded. [snip] >> - When a busy extent commits, by all means we "unbusy" the extent as >> usual. But we also check in the second rbtree, whether we can issue a >> discard for some discard range. Perhaps we can. Or we cannot because >> of other busy extents, that have not committed yet (the numbering is >> used to determine that). In that case, we will discard later, when all >> the needed busy extent commit. Unless new allocation removed/changed >> this discard range already. 
But we are not delaying the "unbusying" of >> the busy extent, and we are not keeping the AGF locked (I think). >> Also, we are issuing discards in the same place and context where XFS >> does it today. > > This is where I think the issues lie. We don't want to have to do > anything when a busy extent is removed at transaction commit - > that's the reason online discard sucks right now. And we want to > avoid having to care about transactions and ordering when it comes > to tracking discard ranges and issuing them. > > The way I see it is that if we have a worker thread that > periodically walks the discard tree to issue discards, we simply > need to do a busy extent tree lookup on the range of each discard > being tracked. If there are busy extents that span the discard > range, then the free space isn't yet stable and so we can't issue > the discard on that range. If there are no busy extents over the > discard range then the free space is stable and we can issue the > discard. > > i.e. if we completely dissociate the discard and busy extent > tracking and just replace it with a busy extent lookup at discard > time then we don't need any sort of reference counting or log > sequence tracking on busy extents or discard ranges. Nice. I like the idea of doing the busy extent lookup instead of numbering busy extents and tracking the order of commits. So first of all, this idea can also be applied to what I suggest, i.e., doing the discard at its current place. But instead of tracking busy extent numbers, we: - when a busy extent commits and it has a discard range attached to it, we lookup in the busy extents tree to check for other busy extents overlapping the discard range. Anyways the original code locks the pagb_lock in that context, so we might as well do the search. - if we find an overlapping busy extent, we detach the discard-range from our busy extent and attach it to the overlapping extent. When this overlapping busy extent commits, we will retry the search. 
WRT that I have questions: in xfs_extent_busy_update_extent() we can "unbusy" part of the extent, or even rb_erase() the busy extent from the busy extent tree (it still remains in t_busy list and will be tracked). Q1: why is it ok to do so? Why is it ok for "metadata" to reuse part (or all) of the busy extent before its extent-free-intent is committed? Q2: assume we have two busy extents on the same discard range:

    +--busy1--+  +--busy2--+
    +----------discard1--------+

Assume that xfs_extent_busy_update_extent() fully unbusies busy1. Now busy2 commits, searches for overlapping busy extent, does not find one and discards discard1. I assume it is fine, because: xfs_extent_busy_update_extent() is called before xfs_alloc_fixup_trees(), where I intend to check for overlapping discard range. So if we manage to discard before xfs_alloc_fixup_trees(), it is fine, because XFS has not yet really allocated this space. Otherwise, xfs_alloc_fixup_trees() will knock off discard1 and we will not discard. Works? WRT the worker thread: we need a good strategy for when to wake it. Like use a workqueue and a work item that tells exactly which discard range is now a candidate for discard and needs to be checked for overlapping busy extents? > > FWIW, if we do this then we can change fstrim xfs_trim_extents() to > queue up all the work to be done in the background simply by > populating the discard tree with all the free space ranges we wish > to discard. This will significantly reduce the impact of fstrim on > filesystem runtime performance as the AGF will only be held locked > long enough to populate the discard tree. And if we do the work > per-ag, then we are also parallelising it by allowing discards on > multiple AGs to be issued at once and hence it will be significantly > faster on devices that can queue TRIM commands (SATA 3.1, SAS and > NVMe devices)..... 
If we have a huge filesystem with a lot of ranges to discard, won't this require an unbounded amount of memory to populate this tree? > > The fact that this track and background issue mechanism would allow > us to optimise both forms of discard the filesystem supports makes > me optimistic that we are on the right path. :) > >> >> question that I have: >> >> # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then >> >> this extent is added as usual to the free-space btrees, but the >> >> caller doesn't add it as a busy extent. This means that such extent >> >> is suitable for allocation right away, without waiting for the log >> >> commit? >> > >> > It means the extent is being moved from the AGFL to the free space >> > btree. blocks on the AGFL have already gone through free space >> > accounting and busy extent tracking to get to the AGFL, and so there >> > is no need to repeat it when moving it to the free space btrees. >> Ok, I realize now that this block has already gone through the busy >> extent tracking via xfs_allocbt_free_block(). > > Right, and note that blocks going through that path aren't discarded > due to the XFS_EXTENT_BUSY_SKIP_DISCARD flag. This is due to the > fact they are being freed to the AGFL and as such are likely to be > reused immediately. ;) Yes, and WRT that: is it true to say that the following holds: if we have a busy extent with this flag, then we know the appropriate range is not in the free-space btrees. Because when we insert such a busy extent, we don't drop it into the free-space btrees. As a result, we should never have a discard range that overlaps a busy extent with XFS_EXTENT_BUSY_SKIP_DISCARD. Because all our discard ranges are also free in the free-space btrees. Therefore, busy extents with this flag do not require any special treatment; we can ignore them fully or simply ignore the fact that they have this special flag - they will never have a discard range attached anyway. Thanks! You are very responsive. Alex. 
> >> P.S.: Just watched your AU2014 talk. Interesting. > > It was a little bit different. And fun. ;) > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Questions about XFS discard and xfs_free_extent() code (newbie) 2014-01-14 13:48 ` Alex Lyakas @ 2014-01-15 1:45 ` Dave Chinner 2014-01-19 9:38 ` Alex Lyakas 0 siblings, 1 reply; 47+ messages in thread From: Dave Chinner @ 2014-01-15 1:45 UTC (permalink / raw) To: Alex Lyakas; +Cc: xfs On Tue, Jan 14, 2014 at 03:48:46PM +0200, Alex Lyakas wrote: > >> - When a busy extent commits, by all means we "unbusy" the extent as > >> usual. But we also check in the second rbtree, whether we can issue a > >> discard for some discard range. Perhaps we can. Or we cannot because > >> of other busy extents, that have not committed yet (the numbering is > >> used to determine that). In that case, we will discard later, when all > >> the needed busy extent commit. Unless new allocation removed/changed > >> this discard range already. But we are not delaying the "unbusying" of > >> the busy extent, and we are not keeping the AGF locked (I think). > >> Also, we are issuing discards in the same place and context where XFS > >> does it today. > > > > This is where I think the issues lie. We don't want to have to do > > anything when a busy extent is removed at transaction commit - > > that's the reason online discard sucks right now. And we want to > > avoid having to care about transactions and ordering when it comes > > to tracking discard ranges and issuing them. > > > > The way I see it is that if we have a worker thread that > > periodically walks the discard tree to issue discards, we simply > > need to do a busy extent tree lookup on the range of each discard > > being tracked. If there are busy extents that span the discard > > range, then the free space isn't yet stable and so we can't issue > > the discard on that range. If there are no busy extents over the > > discard range then the free space is stable and we can issue the > > discard. > > > > i.e. 
if we completely dissociate the discard and busy extent > > tracking and just replace it with a busy extent lookup at discard > > time then we don't need any sort of reference counting or log > > sequence tracking on busy extents or discard ranges. > Nice. I like the idea of doing the busy extent lookup instead of > numbering busy extents and tracking the order of commits. > So first of all, this idea can also be applied to what I suggest, > i.e., doing the discard at its current place. But instead of tracking > busy extent numbers, we: > - when a busy extent commits and it has a discard range attached to > it, we lookup in the busy extents tree to check for other busy extents > overlapping the discard range. Anyways the original code locks the > pagb_lock in that context, so we might as well do the search. > - if we find an overlapping busy extent, we detach the discard-range > from our busy extent and attach it to the overlapping extent. When > this overlapping busy extent commits, we will retry the search. I don't think that busy extents should ever have a pointer to the discard range. It's simply not necessary if we are looking up busy extents at discard time. Hence everything to do with discards can be removed from the busy extent tree and moved into the discard tree... > WRT that I have questions: in xfs_extent_busy_update_extent() we can > "unbusy" part of the extent, or even rb_erase() the busy extent from > the busy extent tree (it still remains in t_busy list and will be > tracked). > Q1: why it is ok to do so? why it is ok for "metadata" to reuse part > (or all) of the busy extent before its extent-free-intent is > committed? The metadata changes are logged, and crash recovery allows correct ordering of the free, realloc and modify process for metadata. Hence it doesn't matter that we overwrite the contents of the block before the free transaction is on disk - the correct contents will always be present after recovery. 
We can't do that for user data because we don't log user data. Therefore if we allow user data to overwrite the block while it is still busy, crash recovery may not result in the block having the correct contents (i.e. the transaction that freed the block never reaches the journal) and we hence expose some other user's data or metadata in the file. > Q2: assume we have two busy extents on the same discard range:

> +--busy1--+     +--busy2--+
> +----------discard1--------+

> Assume that xfs_extent_busy_update_extent() fully unbusies busy1. Now > busy2 commits, searches for overlapping busy extent, does not find one > and discards discard1. I assume it is fine, because: I don't think that is correct. If we unbusy busy1 due to reallocation, we cannot issue a discard across that range. It's in use by the filesystem, and discarding that range will result in data or metadata corruption. > xfs_extent_busy_update_extent() is called before > xfs_alloc_fixup_trees() where I intend to check for overlapping > discard range. So if we manage to discard before > xfs_alloc_fixup_trees(), it is fine, because XFS has not yet really > allocated this space. Otherwise, xfs_alloc_fixup_trees() will knock > off discard1 and we will not discard. Works? You need to update the discard ranges at the same place that the busy extents are updated. That is the point that the extent is freed or allocated, and that's the point where the information about the free space the extent was allocated from is available. Hence the discard tree should be updated in the same spot from the same information. That is, on freeing via xfs_free_extent(), the xfs_extent_busy_insert() call needs to be moved inside xfs_free_ag_extent() to where it knows the entire range of the free extent that spans the extent being freed (i.e. the extent after merging). This gives you the ability to round the discard range outwards to the discard granularity. 
At this point, insert the extent being freed into the busy tree, and the discard range into the discard tree. The busy extents don't merge on insert, the discard ranges can merge. Note that this will capture blocks moved from the AGFL to the free space trees, which we don't currently capture for discard. We have to insert the busy extent first, though, because ordering matters when it comes to discard range tree walks - a busy range needs to be added first so that the walk doesn't find a discard range before its corresponding busy extent is added to the tree. When we are allocating, we need to remove discard ranges at the same places where we call xfs_extent_busy_reuse() and *after* a successful call to xfs_alloc_fixup_trees(). i.e. the allocation has not completed until after all the free space information has been updated. The range for discard needs to be trimmed only if the allocation succeeds; similarly, we only need to block on a discard in progress if the allocation succeeds.... > WRT to the worker thread: we need some good strategy when to awake it. > Like use a workqueue and a work item, that tells exactly which discard > range is now a candidate for discard and needs to be checked for > overlapping busy extents? Any extent in the discard range tree is a candidate for discard. Grab the first extent, check it has no busy extents in its range, mark it as being discarded and issue the discard (I'm assuming the tree lock is a mutex here). For the current synchronous discard implementation, on completion we can simply remove the object from the tree and free it. Drop the lock, relax, start again. Once we've walked the entire tree, set up the workqueue to run again some time in the future if there is still more work to be done. If it's empty, just return. We'll start it again when we queue up the first new discard range being inserted into the tree. 
(same way we run periodic inode reclaim workqueues) > > FWIW, if we do this then we can change fstrim xfs_trim_extents() to > > queue up all the work to be done in the background simply by > > populating the discard tree with all the free space ranges we wish > > to discard. This will significantly reduce the impact of fstrim on > > filesystem runtime performance as the AGF will only be held locked > > long enough to populate the discard tree. And if we do the work > > per-ag, then we are also parallelising it by allowing discards on > > multiple AGs to be issued at once and hence it will be significantly > > faster on devices that can queue TRIM commands (SATA 3.1, SAS and > > NVMe devices)..... > If we have a huge filesystem with a lot of ranges to discard, this > will require an un-bound memory amount to populate this tree? In theory, but we're talking about a fairly frequent discard issue here (e.g. every 5s) so the buildup is effectively bounded by time. If it's really a problem, we can bound it by count, too. > > The fact that this track and background issue mechanism would allow > > us to optimise both forms of discard the filesystem supports makes > > me optimistic that we are on the right path. :) > > > >> >> question that I have: > >> >> # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then > >> >> this extent is added as usual to the free-space btrees, but the > >> >> caller doesn't add it as a busy extent. This means that such extent > >> >> is suitable for allocation right away, without waiting for the log > >> >> commit? > >> > > >> > It means the extent is being moved from the AGFL to the free space > >> > btree. blocks on the AGFL have already gone through free space > >> > accounting and busy extent tracking to get to the AGFL, and so there > >> > is no need to repeat it when moving it to the free space btrees. > >> Ok, I realize now that this block has already gone through the busy > >> extent tracking via xfs_allocbt_free_block(). 
> > > > Right, and note that blocks going through that path aren't discarded > > due to the XFS_EXTENT_BUSY_SKIP_DISCARD flag. This is due to the > > fact they are being freed to the AGFL and as such are likely to be > > reused immediately. ;) > Yes, and WRT that: is it true to say that the following holds: > if we have busy extent with this flag, then we know appropriate range > is not in the free-space btrees. Not necessarily true, because while the extent was put on the free list at the time it was marked busy (hence the skip discard), it doesn't mean that it hasn't been migrated back to the free space btree since then. > Because when we insert such busy > extent, we don't drop it into the free-space btrees. As a result, we > should never have a discard range that overlaps a busy extent with > XFS_EXTENT_BUSY_SKIP_DISCARD. Right, but a later call to xfs_alloc_fixup_trees() can move it to the free space tree if the free list was longer than needed for the current transaction. Hence my comment above about us missing discards in that case. ;) > Because all our discard ranges are also > free in the free-space btrees. Therefore, busy extents with this flag > do not require any special treatment; we can ignore them fully or > simply ignore the fact that they have this special flag - they will > never have a discard range attached anyways. Pretty much - if we move discard range updates directly into the btree manipulation functions, then we can remove all knowledge of discards from the busy extent tree as discard ranges consider the free list to be "allocated space".... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Questions about XFS discard and xfs_free_extent() code (newbie) 2014-01-15 1:45 ` Dave Chinner @ 2014-01-19 9:38 ` Alex Lyakas 2014-01-19 23:17 ` Dave Chinner 0 siblings, 1 reply; 47+ messages in thread From: Alex Lyakas @ 2014-01-19 9:38 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Hi Dave, I believe I understand your intent of having the discard tree almost fully disconnected from the busy extents tree. I planned to use the same pagb_lock for the discard tree as well, which solves the atomicity and ordering issues you have described. If we use a different lock (mutex), then some additional care is needed. One thing I like less about such an approach is that we need to be careful of not growing the discard tree too large. If, for some reason, the worker thread is not able to keep up discarding extents while they are queued for discard, we may end up with too many discard ranges in the tree. Also at the same time the user can trigger an offline discard, which will add more discard ranges to the tree. The online discard that you have, in contrast, kind of throttles itself. So if the underlying block device is slow on discarding, the whole system will be slowed down accordingly. [snip] I have one additional question regarding your comment on metadata >> Q1: why it is ok to do so? why it is ok for "metadata" to reuse part >> (or all) of the busy extent before its extent-free-intent is >> committed? > > The metadata changes are logged, and crash recovery allows correct > ordering of the free, realloc and modify process for metadata. Hence > it doesn't matter that we overwrite the contents of the block before > the free transaction is on disk - the correct contents will always > be present after recovery. > > We can't do that for user data because we don't log user data. > Therefore if we allow user data to overwrite the block while it is > still busy, crash recovery may not result in the block having the > correct contents (i.e. 
the transaction that freed the block never > reaches the journal) and we hence expose some other user's data or > metadata in the file. If that is the case, why cannot we just issue an async discard before the busy extent is committed? I understand that if we crashed, we might have knocked off some of the user data (or changed it to some new data). But can XFS get corrupted (unmountable) this way? You said earlier that it can, but now you are saying that reusing a metadata block, before its busy extent commits, is fine. Thanks, Alex. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Questions about XFS discard and xfs_free_extent() code (newbie) 2014-01-19 9:38 ` Alex Lyakas @ 2014-01-19 23:17 ` Dave Chinner 2014-07-01 15:06 ` xfs_growfs_data_private memory leak Alex Lyakas 0 siblings, 1 reply; 47+ messages in thread From: Dave Chinner @ 2014-01-19 23:17 UTC (permalink / raw) To: Alex Lyakas; +Cc: xfs On Sun, Jan 19, 2014 at 11:38:22AM +0200, Alex Lyakas wrote: > Hi Dave, > I believe I understand your intent of having the discard tree almost > fully disconnected from the busy extents tree. I planned to use > the same pagb_lock for the discard tree as well, which solves the > atomicity and ordering issues you have described. If we use a > different lock (mutex), then some additional care is needed. You can probably just change the pagb_lock to a mutex. In the uncontended case they are almost as fast as spinlocks, and I don't think there's much contention on this lock right now... > One thing I like less about such approach, is that we need to be > careful of not growing the discard tree too large. If, for some > reason, the worker thread is not able to keep up discarding extents > while they are queued for discard, we may end up with too many discard > ranges in the tree. If that happens, then we'll burn CPU managing the discard tree on every allocation and waiting for discards to complete. It should self throttle. > Also at the same time user can trigger an offline > discard, which will add more discard ranges to the tree. While the > online discard that you have, kind of throttles itself. So if the > underlying block device is slow on discarding, the whole system will > be slowed down accordingly. Discard already slows the whole system down badly. Moving it to the background won't prevent that - but it will reduce the impact, because discards issued in the transaction commit completion path completely stall the entire filesystem for the duration of the discards. 
This is the primary reason why mounting with -o discard is so slow - discards are done in a path of global serialisation. Moving it to the background avoids this problem, so regardless of how much discard work we queue up, the filesystem is still going to be faster than the current code because we've removed the global serialisation point. In the case of fstrim, this makes fstrim run very quickly because it doesn't issue synchronous discards with the AGF locked. We still do the same amount of discard work in the same amount of time in the background, but we don't block AGs or userspace while we wait for the discards to be done. That's also a pretty major win. IOWs, I think the first step is to push all the discard into the background via its own rbtree as you've been doing. If we add tracepoints and stats to the code, we can track the discard queue depth easily enough with a bit of perf or trace-cmd output scripting, and then determine whether throttling is necessary. We can use the loop device to run tests as it implements discard via hole punching the backing file, and that will tell us pretty quickly if we need additional throttling by looking at the queue depths and the amount of CPU being spent manipulating the discard rbtree (i.e. via perf profiling). Throttling a workqueue isn't hard to do, but there's no point in doing it if it isn't necessary... > [snip] > > I have one additional question regarding your comment on metadata > > >> Q1: why it is ok to do so? why it is ok for "metadata" to reuse part > >> (or all) of the busy extent before its extent-free-intent is > >> committed? > > > > The metadata changes are logged, and crash recovery allows correct > > ordering of the free, realloc and modify process for metadata. Hence > > it doesn't matter that we overwrite the contents of the block before > > the free transaction is on disk - the correct contents will always > > be present after recovery. 
> > > > We can't do that for user data because we don't log user data. > > Therefore if we allow user data to overwrite the block while it is > > still busy, crash recovery may not result in the block having the > > correct contents (i.e. the transaction that freed the block never > > reaches the journal) and we hence expose some other user's data or > > metadata in the file. > If that is the case, why cannot we just issue an async discard before > the busy extent is committed? I understand that if we crashed, we > might have knocked off some of the user data (or changed it to some > new data). Because a user data extent is not free for us to make arbitrary changes to until the transaction that frees it commits. i.e. If we crash then the file must either contain the original data or be completely gone. Go google "null files after crash" and you'll see just how important people consider files being intact after a crash.... > But can XFS get corrupted (unmountable) this way? You said > earlier that it can, but now you are saying that reusing a metadata > block, before its busy extent commits, is fine. No, not exactly. There are 4 different cases to take into account: 1. metadata -> free -> alloc as metadata is the situation where there are no problems reusing the busy extent because all the modifications are in the log. Hence crash recovery always ends up with the correct result no matter where the crash occurs. 2. user data -> free -> alloc as metadata is safe because the metadata can't be written in place until the free transaction and the metadata changes are fully committed in the journal. Hence on a crash the user will either see intact data in their file, or the file will have had the extent removed successfully and the filesystem has safely reused it for metadata. 3. metadata -> free -> alloc as user data and 4. 
user data -> free -> alloc as user data are the problematic cases when it comes to IO ordering - if we allow reallocation of busy extents, then the new userdata can be written to disk before the free/alloc transactions are committed to the journal. If we crash at this point, then after recovery the new data will be pointed to by the old user. If the old user is user data, then we've corrupted their file by exposing some other user's data to them. If the old user is metadata, then we've corrupted the filesystem because the metadata has been overwritten by user data before the journal recovery has read and freed the metadata block. That will cause recovery (and hence mount) to fail. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* xfs_growfs_data_private memory leak 2014-01-19 23:17 ` Dave Chinner @ 2014-07-01 15:06 ` Alex Lyakas 2014-07-01 21:56 ` Dave Chinner 2014-08-04 11:00 ` use-after-free on log replay failure Alex Lyakas 0 siblings, 2 replies; 47+ messages in thread From: Alex Lyakas @ 2014-07-01 15:06 UTC (permalink / raw) To: xfs Greetings, It appears that if xfs_growfs_data_private fails during the "new AG headers" loop, it does not free all the per-AG structures for the new AGs. When XFS is unmounted later, they are not freed as well, because xfs_growfs_data_private did not update the "sb_agcount" field, so xfs_free_perag will not free them. This happens on 3.8.13, but looking at the latest master branch, it seems to have the same issue. Code like [1] in xfs_growfs_data seems to fix the issue. A follow-up question: if xfs_growfs_data_private fails during the loop that updates all the secondary superblocks, what is the consequence? (I am aware that in the latest master branch, the loop is not broken on first error, but attempts to initialize whatever possible.) When will these secondary superblocks get updated? Is there a way to force-update them? Otherwise, what can be the consequence of leaving them not updated? Thanks, Alex.

[1]

/*
 * If we had an error, we might have allocated
 * PAGs, which are >=sb_agcount. We need to free
 * those, because they will not get freed in
 * xfs_free_perag().
 */
if (error) {
	unsigned int n_pags = 0;
	xfs_perag_t *pags[16] = {0};
	xfs_agnumber_t start_agno = mp->m_sb.sb_agcount;

	do {
		unsigned int pag_idx = 0;

		spin_lock(&mp->m_perag_lock);
		n_pags = radix_tree_gang_lookup(&mp->m_perag_tree,
				(void **)pags, start_agno, ARRAY_SIZE(pags));
		for (pag_idx = 0; pag_idx < n_pags; ++pag_idx) {
			xfs_perag_t *deleted = NULL;

			/* for next lookup */
			start_agno = pags[pag_idx]->pag_agno + 1;
			/* nobody should really be touching these AGs... */
			if (WARN_ON(atomic_read(&pags[pag_idx]->pag_ref) > 0)) {
				pags[pag_idx] = NULL;
				continue;
			}
			deleted = radix_tree_delete(&mp->m_perag_tree,
					pags[pag_idx]->pag_agno);
			ASSERT(deleted == pags[pag_idx]);
		}
		spin_unlock(&mp->m_perag_lock);

		/* now delete all those still marked for deletion */
		for (pag_idx = 0; pag_idx < n_pags; ++pag_idx) {
			if (pags[pag_idx])
				call_rcu(&pags[pag_idx]->rcu_head,
					 xfs_free_perag_rcu_cb);
		}
	} while (n_pags > 0);
}

xfs_free_perag_rcu_cb is similar to __xfs_free_perag, but can be called from other files. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: xfs_growfs_data_private memory leak 2014-07-01 15:06 ` xfs_growfs_data_private memory leak Alex Lyakas @ 2014-07-01 21:56 ` Dave Chinner 2014-07-02 12:27 ` Alex Lyakas 2014-08-04 11:00 ` use-after-free on log replay failure Alex Lyakas 1 sibling, 1 reply; 47+ messages in thread From: Dave Chinner @ 2014-07-01 21:56 UTC (permalink / raw) To: Alex Lyakas; +Cc: xfs On Tue, Jul 01, 2014 at 06:06:38PM +0300, Alex Lyakas wrote: > Greetings, > > It appears that if xfs_growfs_data_private fails during the "new AG > headers" loop, it does not free all the per-AG structures for the > new AGs. When XFS is unmounted later, they are not freed as well, > because xfs_growfs_data_private did not update the "sb_agcount" > field, so xfs_free_perag will not free them. This happens on 3.8.13, > but looking at the latest master branch, it seems to have the same > issue. > > Code like [1] in xfs_growfs_data, seems to fix the issue. Why not just do this in the appropriate error stack, like is done inside xfs_initialize_perag() on error?

for (i = oagcount; i < nagcount; i++) {
	pag = radix_tree_delete(&mp->m_perag_tree, i);
	kmem_free(pag);
}

(though it might need RCU freeing) When you have a fix, can you send a proper patch with a sign-off on it? > A follow-up question: if xfs_grows_data_private fails during the > loop that updates all the secondary superblocks, what is the > consequence? (I am aware that in the latest master branch, the loop > is not broken on first error, but attempts to initialize whatever > possible). When these secondary superblocks will get updated? Is > there a way to force-update them? Otherwise, what can be the > consequence of leaving them not updated? The consequence is documented in mainline tree - if we don't update them all, then repair will do the wrong thing. Repair requires a majority of identical secondaries to determine if the primary is correct or out of date. 
The old behaviour of not updating after the first error meant that the majority were old superblocks and so at some time in the future repair could decide your filesystem is smaller than it really is and hence truncate away the grown section of the filesystem. i.e. trigger catastrophic, unrecoverable data loss. Hence it's far better to write every secondary we can than to leave a majority in a bad state.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: xfs_growfs_data_private memory leak 2014-07-01 21:56 ` Dave Chinner @ 2014-07-02 12:27 ` Alex Lyakas 2014-08-04 18:15 ` Eric Sandeen 0 siblings, 1 reply; 47+ messages in thread From: Alex Lyakas @ 2014-07-02 12:27 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Hi Dave, Thank you for your comments. I realize that secondary superblocks are needed mostly for repairing a broken filesystem. However, I don't see that they get updated regularly, i.e., during normal operation they don't seem to get updated at all. I put a print in xfs_sb_write_verify, and it gets called only with: bp->b_bn==XFS_SB_DADDR. So do I understand correctly (also from comments in xfs_growfs_data_private), that it is safe to operate a filesystem while having broken secondary superblocks? For me, it appears to mount properly, and all the data seems to be there, but xfs_check complains like:

bad sb magic # 0xc2a4baf2 in ag 6144
bad sb version # 0x4b5d in ag 6144
blocks 6144/65536..2192631388 out of range
blocks 6144/65536..2192631388 claimed by block 6144/0
bad sb magic # 0xb20f3079 in ag 6145
bad sb version # 0x6505 in ag 6145
blocks 6145/65536..3530010017 out of range
blocks 6145/65536..3530010017 claimed by block 6145/0
...

Also, if secondary superblocks do not get updated regularly, and there is no way to ask an operational XFS to update them, then during repair we may not find a good secondary superblock. As for the patch, I cannot post a patch against the upstream kernel, because I am running an older kernel. Unfortunately, I cannot qualify an upstream patch properly in a reasonable time. Is there a value in posting a patch against 3.8.13? Otherwise, it's fine by me if somebody else posts it and takes the credit. Thanks, Alex. 
-----Original Message----- From: Dave Chinner Sent: 02 July, 2014 12:56 AM To: Alex Lyakas Cc: xfs@oss.sgi.com Subject: Re: xfs_growfs_data_private memory leak On Tue, Jul 01, 2014 at 06:06:38PM +0300, Alex Lyakas wrote: > Greetings, > > It appears that if xfs_growfs_data_private fails during the "new AG > headers" loop, it does not free all the per-AG structures for the > new AGs. When XFS is unmounted later, they are not freed as well, > because xfs_growfs_data_private did not update the "sb_agcount" > field, so xfs_free_perag will not free them. This happens on 3.8.13, > but looking at the latest master branch, it seems to have the same > issue. > > Code like [1] in xfs_growfs_data, seems to fix the issue. Why not just do this in the appropriate error stack, like is done inside xfs_initialize_perag() on error?

for (i = oagcount; i < nagcount; i++) {
	pag = radix_tree_delete(&mp->m_perag_tree, i);
	kmem_free(pag);
}

(though it might need RCU freeing) When you have a fix, can you send a proper patch with a sign-off on it? > A follow-up question: if xfs_grows_data_private fails during the > loop that updates all the secondary superblocks, what is the > consequence? (I am aware that in the latest master branch, the loop > is not broken on first error, but attempts to initialize whatever > possible). When these secondary superblocks will get updated? Is > there a way to force-update them? Otherwise, what can be the > consequence of leaving them not updated? The consequence is documented in mainline tree - if we don't update them all, then repair will do the wrong thing. Repair requires a majority of identical secondaries to determine if the primary is correct or out of date. The old behaviour of not updating after the first error meant that the majority were old superblocks and so at some time in the future repair could decide your filesystem is smaller than it really is and hence truncate away the grown section of the filesystem. i.e. 
trigger catastrophic, unrecoverable data loss. Hence it's far better to write every secondary we can than to leave a majority in a bad state.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: xfs_growfs_data_private memory leak 2014-07-02 12:27 ` Alex Lyakas @ 2014-08-04 18:15 ` Eric Sandeen 2014-08-06 8:56 ` Alex Lyakas 0 siblings, 1 reply; 47+ messages in thread
From: Eric Sandeen @ 2014-08-04 18:15 UTC (permalink / raw)
To: Alex Lyakas, Dave Chinner; +Cc: xfs

On 7/2/14, 7:27 AM, Alex Lyakas wrote:
> Hi Dave,
> Thank you for your comments.
>
> I realize that secondary superblocks are needed mostly for repairing

s/mostly/only/

> a broken filesystem. However, I don't see that they get updated
> regularly, i.e., during normal operation they don't seem to get
> updated at all. I put a print in xfs_sb_write_verify, and it gets
> called only with: bp->b_bn == XFS_SB_DADDR.

See the comments above verify_sb(): not everything is validated, so not
everything needs to be constantly updated. It's just the basic fs
geometry, not counters, etc.

/*
 * verify a superblock -- does not verify root inode #
 * can only check that geometry info is internally
 * consistent.  because of growfs, that's no guarantee
 * of correctness (e.g. geometry may have changed)
 *
 * fields verified or consistency checked:
 *
 *                      sb_magicnum
 *                      sb_versionnum
 *                      sb_inprogress
 *                      sb_blocksize    (as a group)
 *                      sb_blocklog
 *
 * geometry info -      sb_dblocks      (as a group)
 *                      sb_agcount
 *                      sb_agblocks
 *                      sb_agblklog
 *
 * inode info -         sb_inodesize    (x-checked with geo info)
 *                      sb_inopblock
 *
 * sector size info -   sb_sectsize
 *                      sb_sectlog
 *                      sb_logsectsize
 *                      sb_logsectlog
 *
 * not checked here -   sb_rootino
 *                      sb_fname
 *                      sb_fpack
 *                      sb_logstart
 *                      sb_uuid
 *
 *                      ALL real-time fields
 *                      final 4 summary counters
 */

> So do I understand correctly (also from comments in
> xfs_growfs_data_private), that it is safe to operate a filesystem
> while having broken secondary superblocks?
For me, it appears to > mount properly, and all the data seems to be there, but xfs_check > complains like: > bad sb magic # 0xc2a4baf2 in ag 6144 > bad sb version # 0x4b5d in ag 6144 > blocks 6144/65536..2192631388 out of range > blocks 6144/65536..2192631388 claimed by block 6144/0 > bad sb magic # 0xb20f3079 in ag 6145 > bad sb version # 0x6505 in ag 6145 > blocks 6145/65536..3530010017 out of range > blocks 6145/65536..3530010017 claimed by block 6145/0 > ... some of that looks more serious than "just" bad backup sb's. But the bad secondaries shouldn't cause runtime problems AFAIK. > Also, if secondary superblocks do not get updated regularly, and > there is no way to ask an operational XFS to update them, then during > repair we may not find a good secondary superblock. You seem to have 6144 (!) allocation groups; one would hope that a majority of those supers would be "good" and the others will be properly corrected by an xfs_repair. > As for the patch, I cannot post a patch against the upstream kernel, > because I am running an older kernel. Unfortunately, I cannot qualify > an upstream patch properly in a reasonable time. Is there a value in > posting a patch against 3.8.13? Otherwise, it's fine by me if > somebody else posts it and takes the credit. If the patch applies cleanly to both kernels, probably fine to go ahead and post it, with that caveat. -Eric > Thanks, > Alex. > > > > -----Original Message----- From: Dave Chinner > Sent: 02 July, 2014 12:56 AM > To: Alex Lyakas > Cc: xfs@oss.sgi.com > Subject: Re: xfs_growfs_data_private memory leak > > On Tue, Jul 01, 2014 at 06:06:38PM +0300, Alex Lyakas wrote: >> Greetings, >> >> It appears that if xfs_growfs_data_private fails during the "new AG >> headers" loop, it does not free all the per-AG structures for the >> new AGs. When XFS is unmounted later, they are not freed as well, >> because xfs_growfs_data_private did not update the "sb_agcount" >> field, so xfs_free_perag will not free them. 
This happens on 3.8.13, >> but looking at the latest master branch, it seems to have the same >> issue. >> >> Code like [1] in xfs_growfs_data, seems to fix the issue. > > Why not just do this in the appropriate error stack, like is > done inside xfs_initialize_perag() on error? > > for (i = oagcount; i < nagcount; i++) { > pag = radix_tree_delete(&mp->m_perag_tree, index); > kmem_free(pag); > } > > (though it might need RCU freeing) > > When you have a fix, can you send a proper patch with a sign-off on > it? > >> A follow-up question: if xfs_grows_data_private fails during the >> loop that updates all the secondary superblocks, what is the >> consequence? (I am aware that in the latest master branch, the loop >> is not broken on first error, but attempts to initialize whatever >> possible). When these secondary superblocks will get updated? Is >> there a way to force-update them? Otherwise, what can be the >> consequence of leaving them not updated? > > The consequence is documented in mainline tree - if we don't update > them all, then repair will do the wrong thing. Repair requires a > majority iof identical secondaries to determine if the primary is > correct or out of date. The old behaviour of not updating after the > first error meant that the majority were old superblocks and so at > some time in the future repair could decide your filesystem is > smaller than it really is and hence truncate away the grown section > of the filesystem. i.e. trigger catastrophic, unrecoverable data > loss. > > Hence it's far better to write every seconday we can than to leave > a majority in a bad state.... > > Cheers, > > Dave. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: xfs_growfs_data_private memory leak 2014-08-04 18:15 ` Eric Sandeen @ 2014-08-06 8:56 ` Alex Lyakas 0 siblings, 0 replies; 47+ messages in thread
From: Alex Lyakas @ 2014-08-06 8:56 UTC (permalink / raw)
To: Eric Sandeen; +Cc: xfs

Hi Eric,
What I was trying to achieve is to allow aborting an XFS resize in the
middle. I am not sure whether a typical user needs this feature, but for
my use case it is needed. XFS resize is performed in three steps:
1) Initialize the newly-added AGs (AGFL, BNO btree, CNT btree etc.)
2) Commit this by updating the primary superblock
3) Update all the existing secondary superblocks, and initialize all the
newly-added secondary superblocks

So I added a custom IOCTL, which can tell this sequence to abort
gracefully (just checking a flag on each loop iteration). Aborting
during step 1 works perfectly (apart from the memory leak that I
discovered and fixed). Since the primary superblock has not been
updated, XFS doesn't know about the newly-added AGs, so it doesn't care
what's written there. Resize can be restarted, and XFS will redo this
step from the beginning.

Aborting during step 3 can leave some of the secondary superblocks
totally uninitialized. However, from the XFS standpoint, the resize has
been committed already. That's where my question came from. I realize
that the kernel can crash during step 3, leading to the same results as
with my code that aborts this step. Anyway, since I was not sure about
these secondary superblocks, I did not add an exit point at this step.
(Again, realize that we can crash during this step anyway.)

Thanks,
Alex.

-----Original Message-----
From: Eric Sandeen
Sent: 04 August, 2014 9:15 PM
To: Alex Lyakas; Dave Chinner
Cc: xfs@oss.sgi.com
Subject: Re: xfs_growfs_data_private memory leak

On 7/2/14, 7:27 AM, Alex Lyakas wrote:
> Hi Dave,
> Thank you for your comments.
>
> I realize that secondary superblocks are needed mostly for repairing

s/mostly/only/

> a broken filesystem.
However, I don't see that they get updated > regularly, i.e., during normal operation they don't seem to get > updated at all. I put a print in xfs_sb_write_verify, and it gets > called only with: bp->b_bn==XFS_SB_DADDR. See the comments above verify_sb(): not everything is validated, so not everything needs to be constantly updated. It's just the basic fs geometry, not counters, etc. /* * verify a superblock -- does not verify root inode # * can only check that geometry info is internally * consistent. because of growfs, that's no guarantee * of correctness (e.g. geometry may have changed) * * fields verified or consistency checked: * * sb_magicnum * * sb_versionnum * * sb_inprogress * * sb_blocksize (as a group) * sb_blocklog * * geometry info - sb_dblocks (as a group) * sb_agcount * sb_agblocks * sb_agblklog * * inode info - sb_inodesize (x-checked with geo info) * sb_inopblock * * sector size info - * sb_sectsize * sb_sectlog * sb_logsectsize * sb_logsectlog * * not checked here - * sb_rootino * sb_fname * sb_fpack * sb_logstart * sb_uuid * * ALL real-time fields * final 4 summary counters */ > So do I understand correctly (also from comments in > xfs_growfs_data_private), that it is safe to operate a filesystem > while having broken secondary superblocks? For me, it appears to > mount properly, and all the data seems to be there, but xfs_check > complains like: > bad sb magic # 0xc2a4baf2 in ag 6144 > bad sb version # 0x4b5d in ag 6144 > blocks 6144/65536..2192631388 out of range > blocks 6144/65536..2192631388 claimed by block 6144/0 > bad sb magic # 0xb20f3079 in ag 6145 > bad sb version # 0x6505 in ag 6145 > blocks 6145/65536..3530010017 out of range > blocks 6145/65536..3530010017 claimed by block 6145/0 > ... some of that looks more serious than "just" bad backup sb's. But the bad secondaries shouldn't cause runtime problems AFAIK. 
> Also, if secondary superblocks do not get updated regularly, and > there is no way to ask an operational XFS to update them, then during > repair we may not find a good secondary superblock. You seem to have 6144 (!) allocation groups; one would hope that a majority of those supers would be "good" and the others will be properly corrected by an xfs_repair. > As for the patch, I cannot post a patch against the upstream kernel, > because I am running an older kernel. Unfortunately, I cannot qualify > an upstream patch properly in a reasonable time. Is there a value in > posting a patch against 3.8.13? Otherwise, it's fine by me if > somebody else posts it and takes the credit. If the patch applies cleanly to both kernels, probably fine to go ahead and post it, with that caveat. -Eric > Thanks, > Alex. > > > > -----Original Message----- From: Dave Chinner > Sent: 02 July, 2014 12:56 AM > To: Alex Lyakas > Cc: xfs@oss.sgi.com > Subject: Re: xfs_growfs_data_private memory leak > > On Tue, Jul 01, 2014 at 06:06:38PM +0300, Alex Lyakas wrote: >> Greetings, >> >> It appears that if xfs_growfs_data_private fails during the "new AG >> headers" loop, it does not free all the per-AG structures for the >> new AGs. When XFS is unmounted later, they are not freed as well, >> because xfs_growfs_data_private did not update the "sb_agcount" >> field, so xfs_free_perag will not free them. This happens on 3.8.13, >> but looking at the latest master branch, it seems to have the same >> issue. >> >> Code like [1] in xfs_growfs_data, seems to fix the issue. > > Why not just do this in the appropriate error stack, like is > done inside xfs_initialize_perag() on error? > > for (i = oagcount; i < nagcount; i++) { > pag = radix_tree_delete(&mp->m_perag_tree, index); > kmem_free(pag); > } > > (though it might need RCU freeing) > > When you have a fix, can you send a proper patch with a sign-off on > it? 
> >> A follow-up question: if xfs_grows_data_private fails during the >> loop that updates all the secondary superblocks, what is the >> consequence? (I am aware that in the latest master branch, the loop >> is not broken on first error, but attempts to initialize whatever >> possible). When these secondary superblocks will get updated? Is >> there a way to force-update them? Otherwise, what can be the >> consequence of leaving them not updated? > > The consequence is documented in mainline tree - if we don't update > them all, then repair will do the wrong thing. Repair requires a > majority iof identical secondaries to determine if the primary is > correct or out of date. The old behaviour of not updating after the > first error meant that the majority were old superblocks and so at > some time in the future repair could decide your filesystem is > smaller than it really is and hence truncate away the grown section > of the filesystem. i.e. trigger catastrophic, unrecoverable data > loss. > > Hence it's far better to write every seconday we can than to leave > a majority in a bad state.... > > Cheers, > > Dave. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* use-after-free on log replay failure 2014-07-01 15:06 ` xfs_growfs_data_private memory leak Alex Lyakas 2014-07-01 21:56 ` Dave Chinner @ 2014-08-04 11:00 ` Alex Lyakas 2014-08-04 14:12 ` Brian Foster 2014-08-04 23:07 ` Dave Chinner 1 sibling, 2 replies; 47+ messages in thread From: Alex Lyakas @ 2014-08-04 11:00 UTC (permalink / raw) To: xfs Greetings, we had a log replay failure due to some errors that the underlying block device returned: [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 ("xlog_recover_iodone") error 28 numblks 16 [49133.802495] XFS (dm-95): log mount/recovery failed: error 28 [49133.802644] XFS (dm-95): log mount failed and then kernel panicked [1]. Looking at the code, when xfs_mountfs() fails, xfs_fs_fill_super() goes and cleans up and eventually frees the "xfs_mount" structure. But then xfs_buf_iodone_work() can still be delivered through "xfslogd_workqueue", which is static and not per-XFS. But this callback has a pointer to "xfs_mount", and may try to access it as in [1]. Does this analysis sound correct? Kernel is 3.8.13, but looking at the XFS master branch, it might have the same issue. Should we flush this static workqueue before unmounting? Thanks, Alex. 
[1] [49133.804546] general protection fault: 0000 [#1] SMP [49133.808033] xcbc rmd160 crypto_null af_key xfrm_algo scsi_dh cirrus psmouse ttm drm_kms_helper serio_raw drm i2c_piix4 sysimgblt virtio_balloon sysfillrect syscopyarea nfsd(OF) kvm nfs_acl auth_rpcgss nfs fscache microcode lockd mac_hid sunrpc lp parport floppy ixgbevf(OF) [49133.808033] CPU 2 [49133.808033] Pid: 2907, comm: kworker/2:1H Tainted: GF W O 3.8.13-030813-generic #201305111843 Bochs Bochs [49133.808033] RIP: 0010:[<ffffffff813582fb>] [<ffffffff813582fb>] strnlen+0xb/0x30 [49133.808033] RSP: 0018:ffff8801e31c5b08 EFLAGS: 00010086 [49133.808033] RAX: 0000000000000000 RBX: ffffffff81e4e527 RCX: 0000000000000000 [49133.808033] RDX: 640000450008cf9d RSI: ffffffffffffffff RDI: 640000450008cf9d [49133.808033] RBP: ffff8801e31c5b08 R08: 000000000000ffff R09: 000000000000ffff [49133.808033] R10: 0000000000000000 R11: 0000000000000ffe R12: 640000450008cf9d [49133.808033] R13: ffffffff81e4e900 R14: 0000000000000000 R15: 000000000000ffff [49133.808033] FS: 0000000000000000(0000) GS:ffff88021fd00000(0000) knlGS:0000000000000000 [49133.808033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [49133.808033] CR2: 00007fa4a91abd80 CR3: 000000020e783000 CR4: 00000000000006e0 [49133.808033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [49133.808033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [49133.808033] Process kworker/2:1H (pid: 2907, threadinfo ffff8801e31c4000, task ffff8802124f45c0) [49133.808033] Stack: [49133.808033] ffff8801e31c5b48 ffffffff81359d8e ffff8801e31c5b28 ffffffff81e4e527 [49133.808033] ffffffffa0636a1e ffff8801e31c5c80 ffffffffa0636a1e ffffffff81e4e900 [49133.808033] ffff8801e31c5bc8 ffffffff8135af89 ffff8801e31c5bc8 ffffffff8105a4e7 [49133.808033] Call Trace: [49133.808033] [<ffffffff81359d8e>] string.isra.4+0x3e/0xd0 [49133.808033] [<ffffffff8135af89>] vsnprintf+0x219/0x640 [49133.808033] [<ffffffff8105a4e7>] ? 
msg_print_text+0xb7/0x1b0 [49133.808033] [<ffffffff8135b471>] vscnprintf+0x11/0x30 [49133.808033] [<ffffffff8105b3b1>] vprintk_emit+0xc1/0x490 [49133.808033] [<ffffffff8105b460>] ? vprintk_emit+0x170/0x490 [49133.808033] [<ffffffff816d5848>] printk+0x61/0x63 [49133.808033] [<ffffffffa05ba261>] __xfs_printk+0x31/0x50 [xfs] [49133.808033] [<ffffffffa05ba4b3>] xfs_notice+0x53/0x60 [xfs] [49133.808033] [<ffffffffa05b14a5>] xfs_do_force_shutdown+0xf5/0x180 [xfs] [49133.808033] [<ffffffffa05f6c38>] ? xlog_recover_iodone+0x48/0x70 [xfs] [49133.808033] [<ffffffffa05f6c38>] xlog_recover_iodone+0x48/0x70 [xfs] [49133.808033] [<ffffffffa05a749d>] xfs_buf_iodone_work+0x4d/0xa0 [xfs] [49133.808033] [<ffffffff81078b81>] process_one_work+0x141/0x490 [49133.808033] [<ffffffff81079b48>] worker_thread+0x168/0x400 [49133.808033] [<ffffffff810799e0>] ? manage_workers+0x120/0x120 [49133.808033] [<ffffffff8107f050>] kthread+0xc0/0xd0 [49133.808033] [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0 [49133.808033] [<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0 [49133.808033] [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-04 11:00 ` use-after-free on log replay failure Alex Lyakas @ 2014-08-04 14:12 ` Brian Foster 2014-08-04 23:07 ` Dave Chinner 1 sibling, 0 replies; 47+ messages in thread From: Brian Foster @ 2014-08-04 14:12 UTC (permalink / raw) To: Alex Lyakas; +Cc: xfs On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote: > Greetings, > > we had a log replay failure due to some errors that the underlying block > device returned: > [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 > ("xlog_recover_iodone") error 28 numblks 16 > [49133.802495] XFS (dm-95): log mount/recovery failed: error 28 > [49133.802644] XFS (dm-95): log mount failed > > and then kernel panicked [1]. > > Looking at the code, when xfs_mountfs() fails, xfs_fs_fill_super() goes and > cleans up and eventually frees the "xfs_mount" structure. But then > xfs_buf_iodone_work() can still be delivered through "xfslogd_workqueue", > which is static and not per-XFS. But this callback has a pointer to > "xfs_mount", and may try to access it as in [1]. Does this analysis sound > correct? Kernel is 3.8.13, but looking at the XFS master branch, it might > have the same issue. > Seems possible... we call xfs_buf_delwri_submit() via log recovery which does an xfs_buf_iowait() on each buffer for synchronous I/O. xfs_buf_iowait() doesn't wait on b_iowait if b_error is set, however. In the callback side, xfs_buf_ioerror() is called before the _xfs_buf_ioend() sequence that calls into the workqueue. Perhaps the error can be detected by the iowait before the xfslogd_workqueue job can run..? > Should we flush this static workqueue before unmounting? > It's not totally clear that fixes the problem since we set b_error on the buffer before we even schedule on the workqueue. Perhaps it depends on the nature of the race. Another option could be to not use the wq on error, but that may or may not be ideal for other contexts besides log recovery. 
It looks like the bio code always returns error via the callback and the xlog_recover_iodone() handler does call back into ioend, presumably to do the io completion. It might not be appropriate to remove the b_error bypass in xfs_buf_iowait() as we have ioerror() callers on the buffer that can occur before I/O submission, but I wonder if it could be made conditional for this particular case. That would be an interesting experiment at least to see if it fixes this problem. Brian > Thanks, > Alex. > > > [1] > [49133.804546] general protection fault: 0000 [#1] SMP > [49133.808033] xcbc rmd160 crypto_null af_key xfrm_algo scsi_dh cirrus > psmouse ttm drm_kms_helper serio_raw drm i2c_piix4 sysimgblt virtio_balloon > sysfillrect syscopyarea nfsd(OF) kvm nfs_acl auth_rpcgss nfs fscache > microcode lockd mac_hid sunrpc lp parport floppy ixgbevf(OF) > [49133.808033] CPU 2 > [49133.808033] Pid: 2907, comm: kworker/2:1H Tainted: GF W O > 3.8.13-030813-generic #201305111843 Bochs Bochs > [49133.808033] RIP: 0010:[<ffffffff813582fb>] [<ffffffff813582fb>] > strnlen+0xb/0x30 > [49133.808033] RSP: 0018:ffff8801e31c5b08 EFLAGS: 00010086 > [49133.808033] RAX: 0000000000000000 RBX: ffffffff81e4e527 RCX: > 0000000000000000 > [49133.808033] RDX: 640000450008cf9d RSI: ffffffffffffffff RDI: > 640000450008cf9d > [49133.808033] RBP: ffff8801e31c5b08 R08: 000000000000ffff R09: > 000000000000ffff > [49133.808033] R10: 0000000000000000 R11: 0000000000000ffe R12: > 640000450008cf9d > [49133.808033] R13: ffffffff81e4e900 R14: 0000000000000000 R15: > 000000000000ffff > [49133.808033] FS: 0000000000000000(0000) GS:ffff88021fd00000(0000) > knlGS:0000000000000000 > [49133.808033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [49133.808033] CR2: 00007fa4a91abd80 CR3: 000000020e783000 CR4: > 00000000000006e0 > [49133.808033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [49133.808033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: > 0000000000000400 > [49133.808033] 
Process kworker/2:1H (pid: 2907, threadinfo ffff8801e31c4000, > task ffff8802124f45c0) > [49133.808033] Stack: > [49133.808033] ffff8801e31c5b48 ffffffff81359d8e ffff8801e31c5b28 > ffffffff81e4e527 > [49133.808033] ffffffffa0636a1e ffff8801e31c5c80 ffffffffa0636a1e > ffffffff81e4e900 > [49133.808033] ffff8801e31c5bc8 ffffffff8135af89 ffff8801e31c5bc8 > ffffffff8105a4e7 > [49133.808033] Call Trace: > [49133.808033] [<ffffffff81359d8e>] string.isra.4+0x3e/0xd0 > [49133.808033] [<ffffffff8135af89>] vsnprintf+0x219/0x640 > [49133.808033] [<ffffffff8105a4e7>] ? msg_print_text+0xb7/0x1b0 > [49133.808033] [<ffffffff8135b471>] vscnprintf+0x11/0x30 > [49133.808033] [<ffffffff8105b3b1>] vprintk_emit+0xc1/0x490 > [49133.808033] [<ffffffff8105b460>] ? vprintk_emit+0x170/0x490 > [49133.808033] [<ffffffff816d5848>] printk+0x61/0x63 > [49133.808033] [<ffffffffa05ba261>] __xfs_printk+0x31/0x50 [xfs] > [49133.808033] [<ffffffffa05ba4b3>] xfs_notice+0x53/0x60 [xfs] > [49133.808033] [<ffffffffa05b14a5>] xfs_do_force_shutdown+0xf5/0x180 [xfs] > [49133.808033] [<ffffffffa05f6c38>] ? xlog_recover_iodone+0x48/0x70 [xfs] > [49133.808033] [<ffffffffa05f6c38>] xlog_recover_iodone+0x48/0x70 [xfs] > [49133.808033] [<ffffffffa05a749d>] xfs_buf_iodone_work+0x4d/0xa0 [xfs] > [49133.808033] [<ffffffff81078b81>] process_one_work+0x141/0x490 > [49133.808033] [<ffffffff81079b48>] worker_thread+0x168/0x400 > [49133.808033] [<ffffffff810799e0>] ? manage_workers+0x120/0x120 > [49133.808033] [<ffffffff8107f050>] kthread+0xc0/0xd0 > [49133.808033] [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0 > [49133.808033] [<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0 > [49133.808033] [<ffffffff8107ef90>] ? 
flush_kthread_worker+0xb0/0xb0 > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-04 11:00 ` use-after-free on log replay failure Alex Lyakas 2014-08-04 14:12 ` Brian Foster @ 2014-08-04 23:07 ` Dave Chinner 2014-08-06 10:05 ` Alex Lyakas 2014-08-06 12:52 ` Alex Lyakas 1 sibling, 2 replies; 47+ messages in thread From: Dave Chinner @ 2014-08-04 23:07 UTC (permalink / raw) To: Alex Lyakas; +Cc: xfs On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote: > Greetings, > > we had a log replay failure due to some errors that the underlying > block device returned: > [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 > ("xlog_recover_iodone") error 28 numblks 16 > [49133.802495] XFS (dm-95): log mount/recovery failed: error 28 > [49133.802644] XFS (dm-95): log mount failed #define ENOSPC 28 /* No space left on device */ You're getting an ENOSPC as a metadata IO error during log recovery? Thin provisioning problem, perhaps, and the error is occurring on submission rather than completion? If so: 8d6c121 xfs: fix buffer use after free on IO error Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-04 23:07 ` Dave Chinner @ 2014-08-06 10:05 ` Alex Lyakas 2014-08-06 12:32 ` Dave Chinner 2014-08-06 12:52 ` Alex Lyakas 1 sibling, 1 reply; 47+ messages in thread From: Alex Lyakas @ 2014-08-06 10:05 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Hi Dave, On Tue, Aug 5, 2014 at 2:07 AM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote: >> Greetings, >> >> we had a log replay failure due to some errors that the underlying >> block device returned: >> [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 >> ("xlog_recover_iodone") error 28 numblks 16 >> [49133.802495] XFS (dm-95): log mount/recovery failed: error 28 >> [49133.802644] XFS (dm-95): log mount failed > > #define ENOSPC 28 /* No space left on device */ > > You're getting an ENOSPC as a metadata IO error during log recovery? > Thin provisioning problem, perhaps, Yes, it is a thin provisioning problem (which I already know the cause for). > and the error is occurring on > submission rather than completion? If so: > > 8d6c121 xfs: fix buffer use after free on IO error I am not sure what do you mean by "submission rather than completion". Do you mean that xfs_buf_ioapply_map() returns without submitting any bios? In that case, no, bios are submitted to the block device, and it fails them through a different context with ENOSPC error. I will still try the patch you mentioned, because it also looks relevant to another question I addressed to you earlier in: http://oss.sgi.com/archives/xfs/2013-11/msg00648.html Thanks, Alex. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-06 10:05 ` Alex Lyakas @ 2014-08-06 12:32 ` Dave Chinner 2014-08-06 14:43 ` Alex Lyakas 2014-08-10 16:26 ` Alex Lyakas 0 siblings, 2 replies; 47+ messages in thread
From: Dave Chinner @ 2014-08-06 12:32 UTC (permalink / raw)
To: Alex Lyakas; +Cc: xfs

On Wed, Aug 06, 2014 at 01:05:34PM +0300, Alex Lyakas wrote:
> Hi Dave,
>
> On Tue, Aug 5, 2014 at 2:07 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
> >> Greetings,
> >>
> >> we had a log replay failure due to some errors that the underlying
> >> block device returned:
> >> [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
> >> ("xlog_recover_iodone") error 28 numblks 16
> >> [49133.802495] XFS (dm-95): log mount/recovery failed: error 28
> >> [49133.802644] XFS (dm-95): log mount failed
> >
> > #define ENOSPC 28 /* No space left on device */
> >
> > You're getting an ENOSPC as a metadata IO error during log recovery?
> > Thin provisioning problem, perhaps,
> Yes, it is a thin provisioning problem (which I already know the cause for).
>
> > and the error is occurring on
> > submission rather than completion? If so:
> >
> > 8d6c121 xfs: fix buffer use after free on IO error
> I am not sure what do you mean by "submission rather than completion".
> Do you mean that xfs_buf_ioapply_map() returns without submitting any
> bios?

No, that the bio submission results in immediate failure (e.g. the
device goes away, so submission results in ENODEV). Hence when
_xfs_buf_ioapply() releases its IO reference it is the only
remaining reference to the buffer, and so completion processing is
run immediately, i.e. inline from the submission path.

Normally IO errors are reported through the bio in IO completion
interrupt context, i.e. the IO is completed by the hardware and the
error status is attached to the bio, which is then completed and we
get into XFS that way.
The IO submission context is long gone at this point....

> In that case, no, bios are submitted to the block device, and it
> fails them through a different context with ENOSPC error. I will still
> try the patch you mentioned, because it also looks relevant to another
> question I addressed to you earlier in:
> http://oss.sgi.com/archives/xfs/2013-11/msg00648.html

No, that's a different problem.

9c23ecc xfs: unmount does not wait for shutdown during unmount

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-06 12:32 ` Dave Chinner @ 2014-08-06 14:43 ` Alex Lyakas 2014-08-10 16:26 ` Alex Lyakas 1 sibling, 0 replies; 47+ messages in thread From: Alex Lyakas @ 2014-08-06 14:43 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Hello Dave, I applied this patch manually to 3.8.13, but still hitting exact same issue with my reproduction. I still have xlog_recover_iodone callbacks delivered through xfs_buf_bio_end_io scheduling it through xfslogd_workqueue. Some of the callbacks arrive before mount completes: Aug 6 17:31:33 dev kernel: [ 3258.774970] XFS (dm-19): metadata I/O error: block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1 Aug 6 17:31:33 dev kernel: [ 3258.776687] XFS (dm-19): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0743848 Aug 6 17:31:33 dev kernel: [ 3258.777262] XFS (dm-19): I/O Error Detected. Shutting down filesystem Aug 6 17:31:33 dev kernel: [ 3258.778369] XFS (dm-19): Please umount the filesystem and rectify the problem(s) Aug 6 17:31:33 dev kernel: [ 3258.779634] XFS (dm-19): metadata I/O error: block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1 Aug 6 17:31:33 dev kernel: [ 3258.781929] XFS (dm-19): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0743848 Aug 6 17:31:33 dev kernel: [ 3258.781939] XFS (dm-19): metadata I/O error: block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8 Aug 6 17:31:33 dev kernel: [ 3258.784235] XFS (dm-19): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. 
Return address = 0xffffffffa0743848 Aug 6 17:31:33 dev kernel: [ 3258.784260] XFS (dm-19): metadata I/O error: block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8 Aug 6 17:31:33 dev kernel: [ 3258.784389] XFS (dm-19): log mount/recovery failed: error 28 Aug 6 17:31:33 dev kernel: [ 3258.784549] XFS (dm-19): log mount failed And some arrive afterwards, and print garbage. Aug 6 17:31:33 dev kernel: [ 3258.786398] XFS (ˆm&_): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0743848 Aug 6 17:31:33 dev kernel: [ 3258.786404] XFS (ˆm&_): metadata I/O error: block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8 Aug 6 17:31:33 dev kernel: [ 3258.788575] XFS (ˆm&_): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0743848 Aug 6 17:31:33 dev kernel: [ 3258.788614] XFS (ˆm&_): metadata I/O error: block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16 Aug 6 17:31:33 dev kernel: [ 3258.790849] XFS (ˆm&_): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0743848 Maybe some additional patch is needed for my kernel? Thanks, Alex. 
-----Original Message----- From: Dave Chinner Sent: 06 August, 2014 3:32 PM To: Alex Lyakas Cc: xfs@oss.sgi.com Subject: Re: use-after-free on log replay failure On Wed, Aug 06, 2014 at 01:05:34PM +0300, Alex Lyakas wrote: > Hi Dave, > > On Tue, Aug 5, 2014 at 2:07 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote: > >> Greetings, > >> > >> we had a log replay failure due to some errors that the underlying > >> block device returned: > >> [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 > >> ("xlog_recover_iodone") error 28 numblks 16 > >> [49133.802495] XFS (dm-95): log mount/recovery failed: error 28 > >> [49133.802644] XFS (dm-95): log mount failed > > > > #define ENOSPC 28 /* No space left on device */ > > > > You're getting an ENOSPC as a metadata IO error during log recovery? > > Thin provisioning problem, perhaps, > Yes, it is a thin provisioning problem (which I already know the cause > for). > > > and the error is occurring on > > submission rather than completion? If so: > > > > 8d6c121 xfs: fix buffer use after free on IO error > I am not sure what do you mean by "submission rather than completion". > Do you mean that xfs_buf_ioapply_map() returns without submitting any > bios? No, that the bio submission results in immediate failure (e.g. the device goes away, so submission results in ENODEV). Hence when _xfs_buf_ioapply() releases it's IO reference itis the only remaining reference to the buffer and so completion processing is run immediately. i.e. inline from the submission path. Normally IO errors are reported through the bio in IO completion interrupt context. i.e the IO is completed by the hardware and the error status is attached to bio, which is then completed and we get into XFS that way. The IO submision context is long gone at this point.... > In that case, no, bios are submitted to the block device, and it > fails them through a different context with ENOSPC error. 
I will still > try the patch you mentioned, because it also looks relevant to another > question I addressed to you earlier in: > http://oss.sgi.com/archives/xfs/2013-11/msg00648.html No, that's a different problem. 9c23ecc xfs: unmount does not wait for shutdown during unmount Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-06 12:32 ` Dave Chinner 2014-08-06 14:43 ` Alex Lyakas @ 2014-08-10 16:26 ` Alex Lyakas 1 sibling, 0 replies; 47+ messages in thread From: Alex Lyakas @ 2014-08-10 16:26 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Hello Dave, On Wed, Aug 6, 2014 at 3:32 PM, Dave Chinner <david@fromorbit.com> wrote: > On Wed, Aug 06, 2014 at 01:05:34PM +0300, Alex Lyakas wrote: >> Hi Dave, >> >> On Tue, Aug 5, 2014 at 2:07 AM, Dave Chinner <david@fromorbit.com> wrote: >> > On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote: >> >> Greetings, >> >> >> >> we had a log replay failure due to some errors that the underlying >> >> block device returned: >> >> [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 >> >> ("xlog_recover_iodone") error 28 numblks 16 >> >> [49133.802495] XFS (dm-95): log mount/recovery failed: error 28 >> >> [49133.802644] XFS (dm-95): log mount failed >> > >> > #define ENOSPC 28 /* No space left on device */ >> > >> > You're getting an ENOSPC as a metadata IO error during log recovery? >> > Thin provisioning problem, perhaps, >> Yes, it is a thin provisioning problem (which I already know the cause for). >> >> > and the error is occurring on >> > submission rather than completion? If so: >> > >> > 8d6c121 xfs: fix buffer use after free on IO error >> I am not sure what do you mean by "submission rather than completion". >> Do you mean that xfs_buf_ioapply_map() returns without submitting any >> bios? > > No, that the bio submission results in immediate failure (e.g. the > device goes away, so submission results in ENODEV). Hence when > _xfs_buf_ioapply() releases it's IO reference itis the only > remaining reference to the buffer and so completion processing is > run immediately. i.e. inline from the submission path. > > Normally IO errors are reported through the bio in IO completion > interrupt context. 
i.e the IO is completed by the hardware and the > error status is attached to bio, which is then completed and we get > into XFS that way. The IO submision context is long gone at this > point.... > >> In that case, no, bios are submitted to the block device, and it >> fails them through a different context with ENOSPC error. I will still >> try the patch you mentioned, because it also looks relevant to another >> question I addressed to you earlier in: >> http://oss.sgi.com/archives/xfs/2013-11/msg00648.html > > No, that's a different problem. > > 9c23ecc xfs: unmount does not wait for shutdown during unmount Yes, this patch appears to fix the problem that I reported in the past. XFS survives the unmount and kmemleak is also happy. Thanks! Is this patch safe to apply to 3.8.13? Thanks, Alex. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
* Re: use-after-free on log replay failure 2014-08-04 23:07 ` Dave Chinner 2014-08-06 10:05 ` Alex Lyakas @ 2014-08-06 12:52 ` Alex Lyakas 2014-08-06 15:20 ` Brian Foster 1 sibling, 1 reply; 47+ messages in thread From: Alex Lyakas @ 2014-08-06 12:52 UTC (permalink / raw) To: Dave Chinner; +Cc: bfoster, xfs Hello Dave and Brian, Dave, I tried the patch you suggested, but it does not fix the issue. I did some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be called from xfs_buf_iorequest(), which the patch fixes, but also from xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty easily. The flow that I have is like this: - xlog_recover() calls xlog_find_tail(). This works alright. - Now I add a small sleep before calling xlog_do_recover(), and meanwhile I instruct my block device to return ENOSPC for any WRITE from now on. What seems to happen is that several WRITE bios are submitted and they all fail. When they do, they reach xfs_buf_ioend() through a stack like this: Aug 6 15:23:07 dev kernel: [ 304.410528] [56]xfs*[xfs_buf_ioend:1056] XFS(dm-19): Scheduling xfs_buf_iodone_work on error Aug 6 15:23:07 dev kernel: [ 304.410534] Pid: 56, comm: kworker/u:1 Tainted: G W O 3.8.13-557-generic #1382000791 Aug 6 15:23:07 dev kernel: [ 304.410537] Call Trace: Aug 6 15:23:07 dev kernel: [ 304.410587] [<ffffffffa04d6654>] xfs_buf_ioend+0x1a4/0x1b0 [xfs] Aug 6 15:23:07 dev kernel: [ 304.410621] [<ffffffffa04d6685>] _xfs_buf_ioend+0x25/0x30 [xfs] Aug 6 15:23:07 dev kernel: [ 304.410643] [<ffffffffa04d6b3d>] xfs_buf_bio_end_io+0x3d/0x50 [xfs] Aug 6 15:23:07 dev kernel: [ 304.410652] [<ffffffff811c3d8d>] bio_endio+0x1d/0x40 ... At this point, they are scheduled to run xlog_recover_iodone through xfslogd_workqueue. 
The first callback that gets called, calls xfs_do_force_shutdown in stack like this: Aug 6 15:23:07 dev kernel: [ 304.411791] XFS (dm-19): metadata I/O error: block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1 Aug 6 15:23:07 dev kernel: [ 304.413493] XFS (dm-19): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0526848 Aug 6 15:23:07 dev kernel: [ 304.413837] [<ffffffffa04e0b60>] xfs_do_force_shutdown+0x40/0x180 [xfs] Aug 6 15:23:07 dev kernel: [ 304.413870] [<ffffffffa0526848>] ? xlog_recover_iodone+0x48/0x70 [xfs] Aug 6 15:23:07 dev kernel: [ 304.413902] [<ffffffffa0526848>] xlog_recover_iodone+0x48/0x70 [xfs] Aug 6 15:23:07 dev kernel: [ 304.413923] [<ffffffffa04d645d>] xfs_buf_iodone_work+0x4d/0xa0 [xfs] Aug 6 15:23:07 dev kernel: [ 304.413930] [<ffffffff81077a11>] process_one_work+0x141/0x4a0 Aug 6 15:23:07 dev kernel: [ 304.413937] [<ffffffff810789e8>] worker_thread+0x168/0x410 Aug 6 15:23:07 dev kernel: [ 304.413943] [<ffffffff81078880>] ? manage_workers+0x120/0x120 Aug 6 15:23:07 dev kernel: [ 304.413949] [<ffffffff8107df10>] kthread+0xc0/0xd0 Aug 6 15:23:07 dev kernel: [ 304.413954] [<ffffffff8107de50>] ? flush_kthread_worker+0xb0/0xb0 Aug 6 15:23:07 dev kernel: [ 304.413976] [<ffffffff816ab86c>] ret_from_fork+0x7c/0xb0 Aug 6 15:23:07 dev kernel: [ 304.413986] [<ffffffff8107de50>] ? flush_kthread_worker+0xb0/0xb0 Aug 6 15:23:07 dev kernel: [ 304.413990] ---[ end trace 988d698520e1fa81 ]--- Aug 6 15:23:07 dev kernel: [ 304.414012] XFS (dm-19): I/O Error Detected. 
Shutting down filesystem Aug 6 15:23:07 dev kernel: [ 304.415936] XFS (dm-19): Please umount the filesystem and rectify the problem(s) But the rest of the callbacks also arrive: Aug 6 15:23:07 dev kernel: [ 304.417812] XFS (dm-19): metadata I/O error: block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1 Aug 6 15:23:07 dev kernel: [ 304.420420] XFS (dm-19): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0526848 Aug 6 15:23:07 dev kernel: [ 304.420427] XFS (dm-19): metadata I/O error: block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8 Aug 6 15:23:07 dev kernel: [ 304.422708] XFS (dm-19): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0526848 Aug 6 15:23:07 dev kernel: [ 304.422738] XFS (dm-19): metadata I/O error: block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8 The mount sequence fails and goes back to the caller: Aug 6 15:23:07 dev kernel: [ 304.423438] XFS (dm-19): log mount/recovery failed: error 28 Aug 6 15:23:07 dev kernel: [ 304.423757] XFS (dm-19): log mount failed But there are still additional callbacks to deliver, which the mount sequence did not wait for! Aug 6 15:23:07 dev kernel: [ 304.425717] XFS (\x10@dR): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0526848 Aug 6 15:23:07 dev kernel: [ 304.425723] XFS (\x10@dR): metadata I/O error: block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8 Aug 6 15:23:07 dev kernel: [ 304.428239] XFS (\x10@dR): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0526848 Aug 6 15:23:07 dev kernel: [ 304.428246] XFS (\x10@dR): metadata I/O error: block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16 Notice the junk that they are printing! 
Naturally, because xfs_mount structure has been kfreed. Finally the kernel crashes (instead of printing junk), because the xfs_mount structure is gone, but the callback tries to access it (printing the name): Aug 6 15:23:07 dev kernel: [ 304.430796] general protection fault: 0000 [#1] SMP Aug 6 15:23:07 dev kernel: [ 304.432035] Modules linked in: xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O) scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 twofish_common camellia_generic serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache lockd sunrpc lp parport floppy Aug 6 15:23:07 dev kernel: [ 304.432035] CPU 1 Aug 6 15:23:07 dev kernel: [ 304.432035] Pid: 133, comm: kworker/1:1H Tainted: G W O 3.8.13-557-generic #1382000791 Bochs Bochs Aug 6 15:23:07 dev kernel: [ 304.432035] RIP: 0010:[<ffffffff8133c2cb>] [<ffffffff8133c2cb>] strnlen+0xb/0x30 Aug 6 15:23:07 dev kernel: [ 304.432035] RSP: 0018:ffff880035461b08 EFLAGS: 00010086 Aug 6 15:23:07 dev kernel: [ 304.432035] RAX: 0000000000000000 RBX: ffffffff81e6a4e7 RCX: 0000000000000000 Aug 6 15:23:07 dev kernel: [ 304.432035] RDX: e4e8390a265c0000 RSI: ffffffffffffffff RDI: e4e8390a265c0000 Aug 6 15:23:07 dev kernel: [ 304.432035] RBP: ffff880035461b08 R08: 000000000000ffff R09: 000000000000ffff Aug 6 15:23:07 dev kernel: [ 304.432035] R10: 0000000000000000 R11: 00000000000004cd R12: e4e8390a265c0000 Aug 6 15:23:07 dev kernel: [ 304.432035] R13: ffffffff81e6a8c0 R14: 0000000000000000 R15: 000000000000ffff Aug 6 
15:23:07 dev kernel: [ 304.432035] FS: 0000000000000000(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000 Aug 6 15:23:07 dev kernel: [ 304.432035] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Aug 6 15:23:07 dev kernel: [ 304.432035] CR2: 00007fc902ffbfd8 CR3: 000000007702a000 CR4: 00000000000006e0 Aug 6 15:23:07 dev kernel: [ 304.432035] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Aug 6 15:23:07 dev kernel: [ 304.432035] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Aug 6 15:23:07 dev kernel: [ 304.432035] Process kworker/1:1H (pid: 133, threadinfo ffff880035460000, task ffff880035412e00) Aug 6 15:23:07 dev kernel: [ 304.432035] Stack: Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461b48 ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7 Aug 6 15:23:07 dev kernel: [ 304.432035] ffffffffa0566cba ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0 Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461bc8 ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040 Aug 6 15:23:07 dev kernel: [ 304.432035] Call Trace: Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133dd5e>] string.isra.4+0x3e/0xd0 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133ef59>] vsnprintf+0x219/0x640 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133f441>] vscnprintf+0x11/0x30 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105a971>] vprintk_emit+0xc1/0x490 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105aa20>] ? vprintk_emit+0x170/0x490 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8168b992>] printk+0x61/0x63 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9bf1>] __xfs_printk+0x31/0x50 [xfs] Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9e43>] xfs_notice+0x53/0x60 [xfs] Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e0c15>] xfs_do_force_shutdown+0xf5/0x180 [xfs] Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] ? 
xlog_recover_iodone+0x48/0x70 [xfs] Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] xlog_recover_iodone+0x48/0x70 [xfs] Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04d645d>] xfs_buf_iodone_work+0x4d/0xa0 [xfs] Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81077a11>] process_one_work+0x141/0x4a0 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff810789e8>] worker_thread+0x168/0x410 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81078880>] ? manage_workers+0x120/0x120 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107df10>] kthread+0xc0/0xd0 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ? flush_kthread_worker+0xb0/0xb0 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff816ab86c>] ret_from_fork+0x7c/0xb0 Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ? flush_kthread_worker+0xb0/0xb0 Aug 6 15:23:07 dev kernel: [ 304.432035] Code: 31 c0 80 3f 00 55 48 89 e5 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31 c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee 01 80 38 00 74 Aug 6 15:23:07 dev kernel: [ 304.432035] RIP [<ffffffff8133c2cb>] strnlen+0xb/0x30 Aug 6 15:23:07 dev kernel: [ 304.432035] RSP <ffff880035461b08> So previously you said: "So, something is corrupting memory and stamping all over the XFS structures." and also "given you have a bunch of out of tree modules loaded (and some which are experiemental) suggests that you have a problem with your storage...". But I believe my analysis shows that, during the mount sequence, XFS does not properly wait for all the bios to complete before failing the mount back to the caller. Thanks, Alex. 
-----Original Message----- From: Dave Chinner Sent: 05 August, 2014 2:07 AM To: Alex Lyakas Cc: xfs@oss.sgi.com Subject: Re: use-after-free on log replay failure On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote: > Greetings, > > we had a log replay failure due to some errors that the underlying > block device returned: > [49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 > ("xlog_recover_iodone") error 28 numblks 16 > [49133.802495] XFS (dm-95): log mount/recovery failed: error 28 > [49133.802644] XFS (dm-95): log mount failed #define ENOSPC 28 /* No space left on device */ You're getting an ENOSPC as a metadata IO error during log recovery? Thin provisioning problem, perhaps, and the error is occurring on submission rather than completion? If so: 8d6c121 xfs: fix buffer use after free on IO error Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: use-after-free on log replay failure 2014-08-06 12:52 ` Alex Lyakas @ 2014-08-06 15:20 ` Brian Foster 2014-08-06 15:28 ` Alex Lyakas 2014-08-10 12:20 ` Alex Lyakas 0 siblings, 2 replies; 47+ messages in thread From: Brian Foster @ 2014-08-06 15:20 UTC (permalink / raw) To: Alex Lyakas; +Cc: xfs On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote: > Hello Dave and Brian, > > Dave, I tried the patch you suggested, but it does not fix the issue. I did > some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be > called from xfs_buf_iorequest(), which the patch fixes, but also from > xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty > easily. The flow that I have is like this: > - xlog_recover() calls xlog_find_tail(). This works alright. What's the purpose of a sleep here? > - Now I add a small sleep before calling xlog_do_recover(), and meanwhile I > instruct my block device to return ENOSPC for any WRITE from now on. > > What seems to happen is that several WRITE bios are submitted and they all > fail. When they do, they reach xfs_buf_ioend() through a stack like this: > > Aug 6 15:23:07 dev kernel: [ 304.410528] [56]xfs*[xfs_buf_ioend:1056] > XFS(dm-19): Scheduling xfs_buf_iodone_work on error > Aug 6 15:23:07 dev kernel: [ 304.410534] Pid: 56, comm: kworker/u:1 > Tainted: G W O 3.8.13-557-generic #1382000791 > Aug 6 15:23:07 dev kernel: [ 304.410537] Call Trace: > Aug 6 15:23:07 dev kernel: [ 304.410587] [<ffffffffa04d6654>] > xfs_buf_ioend+0x1a4/0x1b0 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.410621] [<ffffffffa04d6685>] > _xfs_buf_ioend+0x25/0x30 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.410643] [<ffffffffa04d6b3d>] > xfs_buf_bio_end_io+0x3d/0x50 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.410652] [<ffffffff811c3d8d>] > bio_endio+0x1d/0x40 > ... > > At this point, they are scheduled to run xlog_recover_iodone through > xfslogd_workqueue. 
> The first callback that gets called, calls xfs_do_force_shutdown in stack > like this: > > Aug 6 15:23:07 dev kernel: [ 304.411791] XFS (dm-19): metadata I/O error: > block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1 > Aug 6 15:23:07 dev kernel: [ 304.413493] XFS (dm-19): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > 0xffffffffa0526848 > Aug 6 15:23:07 dev kernel: [ 304.413837] [<ffffffffa04e0b60>] > xfs_do_force_shutdown+0x40/0x180 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.413870] [<ffffffffa0526848>] ? > xlog_recover_iodone+0x48/0x70 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.413902] [<ffffffffa0526848>] > xlog_recover_iodone+0x48/0x70 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.413923] [<ffffffffa04d645d>] > xfs_buf_iodone_work+0x4d/0xa0 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.413930] [<ffffffff81077a11>] > process_one_work+0x141/0x4a0 > Aug 6 15:23:07 dev kernel: [ 304.413937] [<ffffffff810789e8>] > worker_thread+0x168/0x410 > Aug 6 15:23:07 dev kernel: [ 304.413943] [<ffffffff81078880>] ? > manage_workers+0x120/0x120 > Aug 6 15:23:07 dev kernel: [ 304.413949] [<ffffffff8107df10>] > kthread+0xc0/0xd0 > Aug 6 15:23:07 dev kernel: [ 304.413954] [<ffffffff8107de50>] ? > flush_kthread_worker+0xb0/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.413976] [<ffffffff816ab86c>] > ret_from_fork+0x7c/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.413986] [<ffffffff8107de50>] ? > flush_kthread_worker+0xb0/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.413990] ---[ end trace 988d698520e1fa81 > ]--- > Aug 6 15:23:07 dev kernel: [ 304.414012] XFS (dm-19): I/O Error Detected. 
> Shutting down filesystem > Aug 6 15:23:07 dev kernel: [ 304.415936] XFS (dm-19): Please umount the > filesystem and rectify the problem(s) > > But the rest of the callbacks also arrive: > Aug 6 15:23:07 dev kernel: [ 304.417812] XFS (dm-19): metadata I/O error: > block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1 > Aug 6 15:23:07 dev kernel: [ 304.420420] XFS (dm-19): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > 0xffffffffa0526848 > Aug 6 15:23:07 dev kernel: [ 304.420427] XFS (dm-19): metadata I/O error: > block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8 > Aug 6 15:23:07 dev kernel: [ 304.422708] XFS (dm-19): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > 0xffffffffa0526848 > Aug 6 15:23:07 dev kernel: [ 304.422738] XFS (dm-19): metadata I/O error: > block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8 > > The mount sequence fails and goes back to the caller: > Aug 6 15:23:07 dev kernel: [ 304.423438] XFS (dm-19): log mount/recovery > failed: error 28 > Aug 6 15:23:07 dev kernel: [ 304.423757] XFS (dm-19): log mount failed > > But there are still additional callbacks to deliver, which the mount > sequence did not wait for! > Aug 6 15:23:07 dev kernel: [ 304.425717] XFS (\x10@dR): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > 0xffffffffa0526848 > Aug 6 15:23:07 dev kernel: [ 304.425723] XFS (\x10@dR): metadata I/O error: > block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8 > Aug 6 15:23:07 dev kernel: [ 304.428239] XFS (\x10@dR): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. 
Return address = > 0xffffffffa0526848 > Aug 6 15:23:07 dev kernel: [ 304.428246] XFS (\x10@dR): metadata I/O error: > block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16 > > Notice the junk that they are printing! Naturally, because xfs_mount > structure has been kfreed. > > Finally the kernel crashes (instead of printing junk), because the xfs_mount > structure is gone, but the callback tries to access it (printing the name): > > Aug 6 15:23:07 dev kernel: [ 304.430796] general protection fault: 0000 > [#1] SMP > Aug 6 15:23:07 dev kernel: [ 304.432035] Modules linked in: xfrm_user > xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O) > scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c > raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq > async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic > twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 > twofish_common camellia_generic serpent_generic blowfish_generic > blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc > rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat > fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw > parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache > lockd sunrpc lp parport floppy > Aug 6 15:23:07 dev kernel: [ 304.432035] CPU 1 > Aug 6 15:23:07 dev kernel: [ 304.432035] Pid: 133, comm: kworker/1:1H > Tainted: G W O 3.8.13-557-generic #1382000791 Bochs Bochs > Aug 6 15:23:07 dev kernel: [ 304.432035] RIP: 0010:[<ffffffff8133c2cb>] > [<ffffffff8133c2cb>] strnlen+0xb/0x30 > Aug 6 15:23:07 dev kernel: [ 304.432035] RSP: 0018:ffff880035461b08 > EFLAGS: 00010086 > Aug 6 15:23:07 dev kernel: [ 304.432035] RAX: 0000000000000000 RBX: > ffffffff81e6a4e7 RCX: 0000000000000000 > Aug 6 15:23:07 dev kernel: [ 304.432035] RDX: e4e8390a265c0000 RSI: > ffffffffffffffff RDI: e4e8390a265c0000 > Aug 6 15:23:07 dev kernel: [ 304.432035] 
RBP: ffff880035461b08 R08: > 000000000000ffff R09: 000000000000ffff > Aug 6 15:23:07 dev kernel: [ 304.432035] R10: 0000000000000000 R11: > 00000000000004cd R12: e4e8390a265c0000 > Aug 6 15:23:07 dev kernel: [ 304.432035] R13: ffffffff81e6a8c0 R14: > 0000000000000000 R15: 000000000000ffff > Aug 6 15:23:07 dev kernel: [ 304.432035] FS: 0000000000000000(0000) > GS:ffff88007fc80000(0000) knlGS:0000000000000000 > Aug 6 15:23:07 dev kernel: [ 304.432035] CS: 0010 DS: 0000 ES: 0000 CR0: > 000000008005003b > Aug 6 15:23:07 dev kernel: [ 304.432035] CR2: 00007fc902ffbfd8 CR3: > 000000007702a000 CR4: 00000000000006e0 > Aug 6 15:23:07 dev kernel: [ 304.432035] DR0: 0000000000000000 DR1: > 0000000000000000 DR2: 0000000000000000 > Aug 6 15:23:07 dev kernel: [ 304.432035] DR3: 0000000000000000 DR6: > 00000000ffff0ff0 DR7: 0000000000000400 > Aug 6 15:23:07 dev kernel: [ 304.432035] Process kworker/1:1H (pid: 133, > threadinfo ffff880035460000, task ffff880035412e00) > Aug 6 15:23:07 dev kernel: [ 304.432035] Stack: > Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461b48 > ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7 > Aug 6 15:23:07 dev kernel: [ 304.432035] ffffffffa0566cba > ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0 > Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461bc8 > ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040 > Aug 6 15:23:07 dev kernel: [ 304.432035] Call Trace: > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133dd5e>] > string.isra.4+0x3e/0xd0 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133ef59>] > vsnprintf+0x219/0x640 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133f441>] > vscnprintf+0x11/0x30 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105a971>] > vprintk_emit+0xc1/0x490 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105aa20>] ? 
> vprintk_emit+0x170/0x490 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8168b992>] > printk+0x61/0x63 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9bf1>] > __xfs_printk+0x31/0x50 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9e43>] > xfs_notice+0x53/0x60 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e0c15>] > xfs_do_force_shutdown+0xf5/0x180 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] ? > xlog_recover_iodone+0x48/0x70 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] > xlog_recover_iodone+0x48/0x70 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04d645d>] > xfs_buf_iodone_work+0x4d/0xa0 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81077a11>] > process_one_work+0x141/0x4a0 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff810789e8>] > worker_thread+0x168/0x410 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81078880>] ? > manage_workers+0x120/0x120 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107df10>] > kthread+0xc0/0xd0 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ? > flush_kthread_worker+0xb0/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff816ab86c>] > ret_from_fork+0x7c/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ? > flush_kthread_worker+0xb0/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.432035] Code: 31 c0 80 3f 00 55 48 89 e5 > 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31 > c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee > 01 80 38 00 74 > Aug 6 15:23:07 dev kernel: [ 304.432035] RIP [<ffffffff8133c2cb>] > strnlen+0xb/0x30 > Aug 6 15:23:07 dev kernel: [ 304.432035] RSP <ffff880035461b08> > > > So previously you said: "So, something is corrupting memory and stamping all > over the XFS structures." 
and also "given you have a bunch of out of tree > modules loaded (and some which are experiemental) suggests that you have a > problem with your storage...". > > But I believe, my analysis shows that during the mount sequence XFS does not > wait properly for all the bios to complete, before failing the mount > sequence back to the caller. > As an experiment, what about the following? Compile tested only and not safe for general use. What might help more is to see if you can create a reproducer on a recent, clean kernel. Perhaps a metadump of your reproducer fs combined with whatever block device ENOSPC hack you're using would do it. Brian

---8<---
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..fbcf524 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
 * case nothing will ever complete. It returns the I/O error code, if any, or
 * 0 if there was no error.
 */
-int
-xfs_buf_iowait(
-	xfs_buf_t		*bp)
+static int
+__xfs_buf_iowait(
+	struct xfs_buf		*bp,
+	bool			skip_error)
 {
 	trace_xfs_buf_iowait(bp, _RET_IP_);

-	if (!bp->b_error)
+	if (skip_error || !bp->b_error)
 		wait_for_completion(&bp->b_iowait);

 	trace_xfs_buf_iowait_done(bp, _RET_IP_);
 	return bp->b_error;
 }

+int
+xfs_buf_iowait(
+	struct xfs_buf		*bp)
+{
+	return __xfs_buf_iowait(bp, false);
+}
+
 xfs_caddr_t
 xfs_buf_offset(
 	xfs_buf_t		*bp,
@@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
 		bp = list_first_entry(&io_list, struct xfs_buf, b_list);
 		list_del_init(&bp->b_list);
-		error2 = xfs_buf_iowait(bp);
+		error2 = __xfs_buf_iowait(bp, true);
 		xfs_buf_relse(bp);
 		if (!error)
 			error = error2;
---

> Thanks, > Alex. 
> > > -----Original Message----- From: Dave Chinner > Sent: 05 August, 2014 2:07 AM > To: Alex Lyakas > Cc: xfs@oss.sgi.com > Subject: Re: use-after-free on log replay failure > > On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote: > >Greetings, > > > >we had a log replay failure due to some errors that the underlying > >block device returned: > >[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 > >("xlog_recover_iodone") error 28 numblks 16 > >[49133.802495] XFS (dm-95): log mount/recovery failed: error 28 > >[49133.802644] XFS (dm-95): log mount failed > > #define ENOSPC 28 /* No space left on device */ > > You're getting an ENOSPC as a metadata IO error during log recovery? > Thin provisioning problem, perhaps, and the error is occurring on > submission rather than completion? If so: > > 8d6c121 xfs: fix buffer use after free on IO error > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
* Re: use-after-free on log replay failure 2014-08-06 15:20 ` Brian Foster @ 2014-08-06 15:28 ` Alex Lyakas 2014-08-10 12:20 ` Alex Lyakas 1 sibling, 0 replies; 47+ messages in thread From: Alex Lyakas @ 2014-08-06 15:28 UTC (permalink / raw) To: Brian Foster; +Cc: xfs Hi Brian, The purpose of the sleep is like this: I added a module parameter to my block device, which says "from now on fail all WRITEs with ENOSPC". If I set this parameter before mounting, then xlog_find_tail() fails. This does not reproduce my problem. So I want xlog_find_tail() to succeed, and then start returning errors. So I need the sleep so that I can set my parameter, before actual log recovery is attempted:) Will try what you suggest. Thanks, Alex. -----Original Message----- From: Brian Foster Sent: 06 August, 2014 6:20 PM To: Alex Lyakas Cc: Dave Chinner ; xfs@oss.sgi.com Subject: Re: use-after-free on log replay failure On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote: > Hello Dave and Brian, > > Dave, I tried the patch you suggested, but it does not fix the issue. I > did > some further digging, and it appears that _xfs_buf_ioend(schedule=1) can > be > called from xfs_buf_iorequest(), which the patch fixes, but also from > xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty > easily. The flow that I have is like this: > - xlog_recover() calls xlog_find_tail(). This works alright. What's the purpose of a sleep here? > - Now I add a small sleep before calling xlog_do_recover(), and meanwhile > I > instruct my block device to return ENOSPC for any WRITE from now on. > > What seems to happen is that several WRITE bios are submitted and they all > fail. 
When they do, they reach xfs_buf_ioend() through a stack like this: > > Aug 6 15:23:07 dev kernel: [ 304.410528] [56]xfs*[xfs_buf_ioend:1056] > XFS(dm-19): Scheduling xfs_buf_iodone_work on error > Aug 6 15:23:07 dev kernel: [ 304.410534] Pid: 56, comm: kworker/u:1 > Tainted: G W O 3.8.13-557-generic #1382000791 > Aug 6 15:23:07 dev kernel: [ 304.410537] Call Trace: > Aug 6 15:23:07 dev kernel: [ 304.410587] [<ffffffffa04d6654>] > xfs_buf_ioend+0x1a4/0x1b0 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.410621] [<ffffffffa04d6685>] > _xfs_buf_ioend+0x25/0x30 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.410643] [<ffffffffa04d6b3d>] > xfs_buf_bio_end_io+0x3d/0x50 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.410652] [<ffffffff811c3d8d>] > bio_endio+0x1d/0x40 > ... > > At this point, they are scheduled to run xlog_recover_iodone through > xfslogd_workqueue. > The first callback that gets called, calls xfs_do_force_shutdown in stack > like this: > > Aug 6 15:23:07 dev kernel: [ 304.411791] XFS (dm-19): metadata I/O > error: > block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1 > Aug 6 15:23:07 dev kernel: [ 304.413493] XFS (dm-19): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > 0xffffffffa0526848 > Aug 6 15:23:07 dev kernel: [ 304.413837] [<ffffffffa04e0b60>] > xfs_do_force_shutdown+0x40/0x180 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.413870] [<ffffffffa0526848>] ? > xlog_recover_iodone+0x48/0x70 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.413902] [<ffffffffa0526848>] > xlog_recover_iodone+0x48/0x70 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.413923] [<ffffffffa04d645d>] > xfs_buf_iodone_work+0x4d/0xa0 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.413930] [<ffffffff81077a11>] > process_one_work+0x141/0x4a0 > Aug 6 15:23:07 dev kernel: [ 304.413937] [<ffffffff810789e8>] > worker_thread+0x168/0x410 > Aug 6 15:23:07 dev kernel: [ 304.413943] [<ffffffff81078880>] ? 
> manage_workers+0x120/0x120 > Aug 6 15:23:07 dev kernel: [ 304.413949] [<ffffffff8107df10>] > kthread+0xc0/0xd0 > Aug 6 15:23:07 dev kernel: [ 304.413954] [<ffffffff8107de50>] ? > flush_kthread_worker+0xb0/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.413976] [<ffffffff816ab86c>] > ret_from_fork+0x7c/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.413986] [<ffffffff8107de50>] ? > flush_kthread_worker+0xb0/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.413990] ---[ end trace 988d698520e1fa81 > ]--- > Aug 6 15:23:07 dev kernel: [ 304.414012] XFS (dm-19): I/O Error > Detected. > Shutting down filesystem > Aug 6 15:23:07 dev kernel: [ 304.415936] XFS (dm-19): Please umount the > filesystem and rectify the problem(s) > > But the rest of the callbacks also arrive: > Aug 6 15:23:07 dev kernel: [ 304.417812] XFS (dm-19): metadata I/O > error: > block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1 > Aug 6 15:23:07 dev kernel: [ 304.420420] XFS (dm-19): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > 0xffffffffa0526848 > Aug 6 15:23:07 dev kernel: [ 304.420427] XFS (dm-19): metadata I/O > error: > block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8 > Aug 6 15:23:07 dev kernel: [ 304.422708] XFS (dm-19): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > 0xffffffffa0526848 > Aug 6 15:23:07 dev kernel: [ 304.422738] XFS (dm-19): metadata I/O > error: > block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8 > > The mount sequence fails and goes back to the caller: > Aug 6 15:23:07 dev kernel: [ 304.423438] XFS (dm-19): log mount/recovery > failed: error 28 > Aug 6 15:23:07 dev kernel: [ 304.423757] XFS (dm-19): log mount failed > > But there are still additional callbacks to deliver, which the mount > sequence did not wait for! 
> Aug 6 15:23:07 dev kernel: [ 304.425717] XFS (\x10@dR): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > 0xffffffffa0526848 > Aug 6 15:23:07 dev kernel: [ 304.425723] XFS (\x10@dR): metadata I/O error: > block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8 > Aug 6 15:23:07 dev kernel: [ 304.428239] XFS (\x10@dR): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > 0xffffffffa0526848 > Aug 6 15:23:07 dev kernel: [ 304.428246] XFS (\x10@dR): metadata I/O error: > block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16 > > Notice the junk that they are printing! Naturally, because xfs_mount > structure has been kfreed. > > Finally the kernel crashes (instead of printing junk), because the > xfs_mount > structure is gone, but the callback tries to access it (printing the > name): > > Aug 6 15:23:07 dev kernel: [ 304.430796] general protection fault: 0000 > [#1] SMP > Aug 6 15:23:07 dev kernel: [ 304.432035] Modules linked in: xfrm_user > xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O) > scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c > raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq > async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic > twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 > twofish_common camellia_generic serpent_generic blowfish_generic > blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc > rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat > fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw > parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache > lockd sunrpc lp parport floppy > Aug 6 15:23:07 dev kernel: [ 304.432035] CPU 1 > Aug 6 15:23:07 dev kernel: [ 304.432035] Pid: 133, comm: kworker/1:1H > Tainted: G W O 
3.8.13-557-generic #1382000791 Bochs Bochs > Aug 6 15:23:07 dev kernel: [ 304.432035] RIP: 0010:[<ffffffff8133c2cb>] > [<ffffffff8133c2cb>] strnlen+0xb/0x30 > Aug 6 15:23:07 dev kernel: [ 304.432035] RSP: 0018:ffff880035461b08 > EFLAGS: 00010086 > Aug 6 15:23:07 dev kernel: [ 304.432035] RAX: 0000000000000000 RBX: > ffffffff81e6a4e7 RCX: 0000000000000000 > Aug 6 15:23:07 dev kernel: [ 304.432035] RDX: e4e8390a265c0000 RSI: > ffffffffffffffff RDI: e4e8390a265c0000 > Aug 6 15:23:07 dev kernel: [ 304.432035] RBP: ffff880035461b08 R08: > 000000000000ffff R09: 000000000000ffff > Aug 6 15:23:07 dev kernel: [ 304.432035] R10: 0000000000000000 R11: > 00000000000004cd R12: e4e8390a265c0000 > Aug 6 15:23:07 dev kernel: [ 304.432035] R13: ffffffff81e6a8c0 R14: > 0000000000000000 R15: 000000000000ffff > Aug 6 15:23:07 dev kernel: [ 304.432035] FS: 0000000000000000(0000) > GS:ffff88007fc80000(0000) knlGS:0000000000000000 > Aug 6 15:23:07 dev kernel: [ 304.432035] CS: 0010 DS: 0000 ES: 0000 > CR0: > 000000008005003b > Aug 6 15:23:07 dev kernel: [ 304.432035] CR2: 00007fc902ffbfd8 CR3: > 000000007702a000 CR4: 00000000000006e0 > Aug 6 15:23:07 dev kernel: [ 304.432035] DR0: 0000000000000000 DR1: > 0000000000000000 DR2: 0000000000000000 > Aug 6 15:23:07 dev kernel: [ 304.432035] DR3: 0000000000000000 DR6: > 00000000ffff0ff0 DR7: 0000000000000400 > Aug 6 15:23:07 dev kernel: [ 304.432035] Process kworker/1:1H (pid: 133, > threadinfo ffff880035460000, task ffff880035412e00) > Aug 6 15:23:07 dev kernel: [ 304.432035] Stack: > Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461b48 > ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7 > Aug 6 15:23:07 dev kernel: [ 304.432035] ffffffffa0566cba > ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0 > Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461bc8 > ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040 > Aug 6 15:23:07 dev kernel: [ 304.432035] Call Trace: > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133dd5e>] > 
string.isra.4+0x3e/0xd0 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133ef59>] > vsnprintf+0x219/0x640 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133f441>] > vscnprintf+0x11/0x30 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105a971>] > vprintk_emit+0xc1/0x490 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105aa20>] ? > vprintk_emit+0x170/0x490 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8168b992>] > printk+0x61/0x63 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9bf1>] > __xfs_printk+0x31/0x50 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9e43>] > xfs_notice+0x53/0x60 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e0c15>] > xfs_do_force_shutdown+0xf5/0x180 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] ? > xlog_recover_iodone+0x48/0x70 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] > xlog_recover_iodone+0x48/0x70 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04d645d>] > xfs_buf_iodone_work+0x4d/0xa0 [xfs] > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81077a11>] > process_one_work+0x141/0x4a0 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff810789e8>] > worker_thread+0x168/0x410 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81078880>] ? > manage_workers+0x120/0x120 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107df10>] > kthread+0xc0/0xd0 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ? > flush_kthread_worker+0xb0/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff816ab86c>] > ret_from_fork+0x7c/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ? 
> flush_kthread_worker+0xb0/0xb0 > Aug 6 15:23:07 dev kernel: [ 304.432035] Code: 31 c0 80 3f 00 55 48 89 > e5 > 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31 > c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 > ee > 01 80 38 00 74 > Aug 6 15:23:07 dev kernel: [ 304.432035] RIP [<ffffffff8133c2cb>] > strnlen+0xb/0x30 > Aug 6 15:23:07 dev kernel: [ 304.432035] RSP <ffff880035461b08> > > > So previously you said: "So, something is corrupting memory and stamping > all > over the XFS structures." and also "given you have a bunch of out of tree > modules loaded (and some which are experiemental) suggests that you have a > problem with your storage...". > > But I believe, my analysis shows that during the mount sequence XFS does > not > wait properly for all the bios to complete, before failing the mount > sequence back to the caller. > As an experiment, what about the following? Compile tested only and not safe for general use. What might help more is to see if you can create a reproducer on a recent, clean kernel. Perhaps a metadump of your reproducer fs combined with whatever block device ENOSPC hack you're using would do it. Brian ---8<--- diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index cd7b8ca..fbcf524 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1409,19 +1409,27 @@ xfs_buf_iorequest( * case nothing will ever complete. It returns the I/O error code, if any, or * 0 if there was no error. 
*/ -int -xfs_buf_iowait( - xfs_buf_t *bp) +static int +__xfs_buf_iowait( + struct xfs_buf *bp, + bool skip_error) { trace_xfs_buf_iowait(bp, _RET_IP_); - if (!bp->b_error) + if (skip_error || !bp->b_error) wait_for_completion(&bp->b_iowait); trace_xfs_buf_iowait_done(bp, _RET_IP_); return bp->b_error; } +int +xfs_buf_iowait( + struct xfs_buf *bp) +{ + return __xfs_buf_iowait(bp, false); +} + xfs_caddr_t xfs_buf_offset( xfs_buf_t *bp, @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit( bp = list_first_entry(&io_list, struct xfs_buf, b_list); list_del_init(&bp->b_list); - error2 = xfs_buf_iowait(bp); + error2 = __xfs_buf_iowait(bp, true); xfs_buf_relse(bp); if (!error) error = error2; --- > Thanks, > Alex. > > > > -----Original Message----- From: Dave Chinner > Sent: 05 August, 2014 2:07 AM > To: Alex Lyakas > Cc: xfs@oss.sgi.com > Subject: Re: use-after-free on log replay failure > > On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote: > >Greetings, > > > >we had a log replay failure due to some errors that the underlying > >block device returned: > >[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 > >("xlog_recover_iodone") error 28 numblks 16 > >[49133.802495] XFS (dm-95): log mount/recovery failed: error 28 > >[49133.802644] XFS (dm-95): log mount failed > > #define ENOSPC 28 /* No space left on device */ > > You're getting an ENOSPC as a metadata IO error during log recovery? > Thin provisioning problem, perhaps, and the error is occurring on > submission rather than completion? If so: > > 8d6c121 xfs: fix buffer use after free on IO error > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 47+ messages in thread
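[Editor's note: to make the intent of Brian's patch concrete, here is a purely illustrative userspace model in Python. None of these names are real XFS or kernel APIs; `Buf`, `buf_iowait`, and `completion` are toy stand-ins that only mirror the shape of the patched `__xfs_buf_iowait()`. The point it sketches: when an error is latched on the buffer before the waiter runs, the old `xfs_buf_iowait()` returns without waiting for the in-flight completion, while the `skip_error` variant blocks until the completion has actually fired — which is what makes it safe to tear shared structures down afterwards.]

```python
import threading
import time

class Buf:
    """Toy stand-in for struct xfs_buf: an error code plus a completion."""
    def __init__(self):
        self.b_error = 0                    # models bp->b_error
        self.b_iowait = threading.Event()   # models bp->b_iowait

def buf_iowait(bp, skip_error):
    # Mirrors the patched __xfs_buf_iowait(): skip the wait only when
    # there is no error AND the caller did not ask to wait regardless.
    if skip_error or not bp.b_error:
        bp.b_iowait.wait()
    return bp.b_error

def completion(bp):
    # Models the bio end_io path completing "late", off-CPU.
    time.sleep(0.05)
    bp.b_iowait.set()

bp = Buf()
bp.b_error = 28                 # ENOSPC latched before the waiter runs
t = threading.Thread(target=completion, args=(bp,))
t.start()

# Old behaviour: with an error latched, return immediately; the
# completion is still in flight at this point.
buf_iowait(bp, skip_error=False)
print("completion still in flight after plain wait:",
      not bp.b_iowait.is_set())

# Patched behaviour: block until the completion has actually run;
# only then is it safe to free the structures it references.
err = buf_iowait(bp, skip_error=True)
print("completion observed before teardown, error =", err)
t.join()
```

Under this model, the `xfs_buf_delwri_submit()` hunk of the patch corresponds to switching the second call's `skip_error` to `True`: errored buffers are no longer skipped by the waiter, so teardown cannot race their completions.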
* Re: use-after-free on log replay failure 2014-08-06 15:20 ` Brian Foster 2014-08-06 15:28 ` Alex Lyakas @ 2014-08-10 12:20 ` Alex Lyakas 2014-08-11 13:20 ` Brian Foster 1 sibling, 1 reply; 47+ messages in thread From: Alex Lyakas @ 2014-08-10 12:20 UTC (permalink / raw) To: Brian Foster; +Cc: xfs Hi Brian, On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> wrote: > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote: >> Hello Dave and Brian, >> >> Dave, I tried the patch you suggested, but it does not fix the issue. I did >> some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be >> called from xfs_buf_iorequest(), which the patch fixes, but also from >> xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty >> easily. The flow that I have is like this: >> - xlog_recover() calls xlog_find_tail(). This works alright. > > What's the purpose of a sleep here? > >> - Now I add a small sleep before calling xlog_do_recover(), and meanwhile I >> instruct my block device to return ENOSPC for any WRITE from now on. >> >> What seems to happen is that several WRITE bios are submitted and they all >> fail. When they do, they reach xfs_buf_ioend() through a stack like this: >> >> Aug 6 15:23:07 dev kernel: [ 304.410528] [56]xfs*[xfs_buf_ioend:1056] >> XFS(dm-19): Scheduling xfs_buf_iodone_work on error >> Aug 6 15:23:07 dev kernel: [ 304.410534] Pid: 56, comm: kworker/u:1 >> Tainted: G W O 3.8.13-557-generic #1382000791 >> Aug 6 15:23:07 dev kernel: [ 304.410537] Call Trace: >> Aug 6 15:23:07 dev kernel: [ 304.410587] [<ffffffffa04d6654>] >> xfs_buf_ioend+0x1a4/0x1b0 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.410621] [<ffffffffa04d6685>] >> _xfs_buf_ioend+0x25/0x30 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.410643] [<ffffffffa04d6b3d>] >> xfs_buf_bio_end_io+0x3d/0x50 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.410652] [<ffffffff811c3d8d>] >> bio_endio+0x1d/0x40 >> ... 
>> >> At this point, they are scheduled to run xlog_recover_iodone through >> xfslogd_workqueue. >> The first callback that gets called, calls xfs_do_force_shutdown in stack >> like this: >> >> Aug 6 15:23:07 dev kernel: [ 304.411791] XFS (dm-19): metadata I/O error: >> block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1 >> Aug 6 15:23:07 dev kernel: [ 304.413493] XFS (dm-19): >> xfs_do_force_shutdown(0x1) called from line 377 of file >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = >> 0xffffffffa0526848 >> Aug 6 15:23:07 dev kernel: [ 304.413837] [<ffffffffa04e0b60>] >> xfs_do_force_shutdown+0x40/0x180 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.413870] [<ffffffffa0526848>] ? >> xlog_recover_iodone+0x48/0x70 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.413902] [<ffffffffa0526848>] >> xlog_recover_iodone+0x48/0x70 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.413923] [<ffffffffa04d645d>] >> xfs_buf_iodone_work+0x4d/0xa0 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.413930] [<ffffffff81077a11>] >> process_one_work+0x141/0x4a0 >> Aug 6 15:23:07 dev kernel: [ 304.413937] [<ffffffff810789e8>] >> worker_thread+0x168/0x410 >> Aug 6 15:23:07 dev kernel: [ 304.413943] [<ffffffff81078880>] ? >> manage_workers+0x120/0x120 >> Aug 6 15:23:07 dev kernel: [ 304.413949] [<ffffffff8107df10>] >> kthread+0xc0/0xd0 >> Aug 6 15:23:07 dev kernel: [ 304.413954] [<ffffffff8107de50>] ? >> flush_kthread_worker+0xb0/0xb0 >> Aug 6 15:23:07 dev kernel: [ 304.413976] [<ffffffff816ab86c>] >> ret_from_fork+0x7c/0xb0 >> Aug 6 15:23:07 dev kernel: [ 304.413986] [<ffffffff8107de50>] ? >> flush_kthread_worker+0xb0/0xb0 >> Aug 6 15:23:07 dev kernel: [ 304.413990] ---[ end trace 988d698520e1fa81 >> ]--- >> Aug 6 15:23:07 dev kernel: [ 304.414012] XFS (dm-19): I/O Error Detected. 
>> Shutting down filesystem >> Aug 6 15:23:07 dev kernel: [ 304.415936] XFS (dm-19): Please umount the >> filesystem and rectify the problem(s) >> >> But the rest of the callbacks also arrive: >> Aug 6 15:23:07 dev kernel: [ 304.417812] XFS (dm-19): metadata I/O error: >> block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1 >> Aug 6 15:23:07 dev kernel: [ 304.420420] XFS (dm-19): >> xfs_do_force_shutdown(0x1) called from line 377 of file >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = >> 0xffffffffa0526848 >> Aug 6 15:23:07 dev kernel: [ 304.420427] XFS (dm-19): metadata I/O error: >> block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8 >> Aug 6 15:23:07 dev kernel: [ 304.422708] XFS (dm-19): >> xfs_do_force_shutdown(0x1) called from line 377 of file >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = >> 0xffffffffa0526848 >> Aug 6 15:23:07 dev kernel: [ 304.422738] XFS (dm-19): metadata I/O error: >> block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8 >> >> The mount sequence fails and goes back to the caller: >> Aug 6 15:23:07 dev kernel: [ 304.423438] XFS (dm-19): log mount/recovery >> failed: error 28 >> Aug 6 15:23:07 dev kernel: [ 304.423757] XFS (dm-19): log mount failed >> >> But there are still additional callbacks to deliver, which the mount >> sequence did not wait for! >> Aug 6 15:23:07 dev kernel: [ 304.425717] XFS ( @dR): >> xfs_do_force_shutdown(0x1) called from line 377 of file >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = >> 0xffffffffa0526848 >> Aug 6 15:23:07 dev kernel: [ 304.425723] XFS ( @dR): metadata I/O error: >> block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8 >> Aug 6 15:23:07 dev kernel: [ 304.428239] XFS ( @dR): >> xfs_do_force_shutdown(0x1) called from line 377 of file >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. 
Return address = >> 0xffffffffa0526848 >> Aug 6 15:23:07 dev kernel: [ 304.428246] XFS ( @dR): metadata I/O error: >> block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16 >> >> Notice the junk that they are printing! Naturally, because xfs_mount >> structure has been kfreed. >> >> Finally the kernel crashes (instead of printing junk), because the xfs_mount >> structure is gone, but the callback tries to access it (printing the name): >> >> Aug 6 15:23:07 dev kernel: [ 304.430796] general protection fault: 0000 >> [#1] SMP >> Aug 6 15:23:07 dev kernel: [ 304.432035] Modules linked in: xfrm_user >> xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O) >> scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c >> raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq >> async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic >> twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 >> twofish_common camellia_generic serpent_generic blowfish_generic >> blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc >> rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat >> fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw >> parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache >> lockd sunrpc lp parport floppy >> Aug 6 15:23:07 dev kernel: [ 304.432035] CPU 1 >> Aug 6 15:23:07 dev kernel: [ 304.432035] Pid: 133, comm: kworker/1:1H >> Tainted: G W O 3.8.13-557-generic #1382000791 Bochs Bochs >> Aug 6 15:23:07 dev kernel: [ 304.432035] RIP: 0010:[<ffffffff8133c2cb>] >> [<ffffffff8133c2cb>] strnlen+0xb/0x30 >> Aug 6 15:23:07 dev kernel: [ 304.432035] RSP: 0018:ffff880035461b08 >> EFLAGS: 00010086 >> Aug 6 15:23:07 dev kernel: [ 304.432035] RAX: 0000000000000000 RBX: >> ffffffff81e6a4e7 RCX: 0000000000000000 >> Aug 6 15:23:07 dev kernel: [ 304.432035] RDX: e4e8390a265c0000 RSI: >> ffffffffffffffff RDI: e4e8390a265c0000 >> Aug 6 
15:23:07 dev kernel: [ 304.432035] RBP: ffff880035461b08 R08: >> 000000000000ffff R09: 000000000000ffff >> Aug 6 15:23:07 dev kernel: [ 304.432035] R10: 0000000000000000 R11: >> 00000000000004cd R12: e4e8390a265c0000 >> Aug 6 15:23:07 dev kernel: [ 304.432035] R13: ffffffff81e6a8c0 R14: >> 0000000000000000 R15: 000000000000ffff >> Aug 6 15:23:07 dev kernel: [ 304.432035] FS: 0000000000000000(0000) >> GS:ffff88007fc80000(0000) knlGS:0000000000000000 >> Aug 6 15:23:07 dev kernel: [ 304.432035] CS: 0010 DS: 0000 ES: 0000 CR0: >> 000000008005003b >> Aug 6 15:23:07 dev kernel: [ 304.432035] CR2: 00007fc902ffbfd8 CR3: >> 000000007702a000 CR4: 00000000000006e0 >> Aug 6 15:23:07 dev kernel: [ 304.432035] DR0: 0000000000000000 DR1: >> 0000000000000000 DR2: 0000000000000000 >> Aug 6 15:23:07 dev kernel: [ 304.432035] DR3: 0000000000000000 DR6: >> 00000000ffff0ff0 DR7: 0000000000000400 >> Aug 6 15:23:07 dev kernel: [ 304.432035] Process kworker/1:1H (pid: 133, >> threadinfo ffff880035460000, task ffff880035412e00) >> Aug 6 15:23:07 dev kernel: [ 304.432035] Stack: >> Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461b48 >> ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7 >> Aug 6 15:23:07 dev kernel: [ 304.432035] ffffffffa0566cba >> ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0 >> Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461bc8 >> ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040 >> Aug 6 15:23:07 dev kernel: [ 304.432035] Call Trace: >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133dd5e>] >> string.isra.4+0x3e/0xd0 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133ef59>] >> vsnprintf+0x219/0x640 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133f441>] >> vscnprintf+0x11/0x30 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105a971>] >> vprintk_emit+0xc1/0x490 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105aa20>] ? 
>> vprintk_emit+0x170/0x490 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8168b992>] >> printk+0x61/0x63 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9bf1>] >> __xfs_printk+0x31/0x50 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9e43>] >> xfs_notice+0x53/0x60 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e0c15>] >> xfs_do_force_shutdown+0xf5/0x180 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] ? >> xlog_recover_iodone+0x48/0x70 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] >> xlog_recover_iodone+0x48/0x70 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04d645d>] >> xfs_buf_iodone_work+0x4d/0xa0 [xfs] >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81077a11>] >> process_one_work+0x141/0x4a0 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff810789e8>] >> worker_thread+0x168/0x410 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81078880>] ? >> manage_workers+0x120/0x120 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107df10>] >> kthread+0xc0/0xd0 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ? >> flush_kthread_worker+0xb0/0xb0 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff816ab86c>] >> ret_from_fork+0x7c/0xb0 >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ? >> flush_kthread_worker+0xb0/0xb0 >> Aug 6 15:23:07 dev kernel: [ 304.432035] Code: 31 c0 80 3f 00 55 48 89 e5 >> 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31 >> c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee >> 01 80 38 00 74 >> Aug 6 15:23:07 dev kernel: [ 304.432035] RIP [<ffffffff8133c2cb>] >> strnlen+0xb/0x30 >> Aug 6 15:23:07 dev kernel: [ 304.432035] RSP <ffff880035461b08> >> >> >> So previously you said: "So, something is corrupting memory and stamping all >> over the XFS structures." 
and also "given you have a bunch of out of tree >> modules loaded (and some which are experiemental) suggests that you have a >> problem with your storage...". >> >> But I believe, my analysis shows that during the mount sequence XFS does not >> wait properly for all the bios to complete, before failing the mount >> sequence back to the caller. >> > > As an experiment, what about the following? Compile tested only and not > safe for general use. > > What might help more is to see if you can create a reproducer on a > recent, clean kernel. Perhaps a metadump of your reproducer fs combined > with whatever block device ENOSPC hack you're using would do it. > > Brian > > ---8<--- > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index cd7b8ca..fbcf524 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest( > * case nothing will ever complete. It returns the I/O error code, if any, or > * 0 if there was no error. > */ > -int > -xfs_buf_iowait( > - xfs_buf_t *bp) > +static int > +__xfs_buf_iowait( > + struct xfs_buf *bp, > + bool skip_error) > { > trace_xfs_buf_iowait(bp, _RET_IP_); > > - if (!bp->b_error) > + if (skip_error || !bp->b_error) > wait_for_completion(&bp->b_iowait); > > trace_xfs_buf_iowait_done(bp, _RET_IP_); > return bp->b_error; > } > > +int > +xfs_buf_iowait( > + struct xfs_buf *bp) > +{ > + return __xfs_buf_iowait(bp, false); > +} > + > xfs_caddr_t > xfs_buf_offset( > xfs_buf_t *bp, > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit( > bp = list_first_entry(&io_list, struct xfs_buf, b_list); > > list_del_init(&bp->b_list); > - error2 = xfs_buf_iowait(bp); > + error2 = __xfs_buf_iowait(bp, true); > xfs_buf_relse(bp); > if (!error) > error = error2; > > --- I think that this patch fixes the problem. I tried reproducing it like 30 times, and it doesn't happen with this patch. Dropping this patch reproduces the problem within 1 or 2 tries. Thanks! What are next steps? How to make it "safe for general use"? 
Thanks, Alex. > >> Thanks, >> Alex. >> >> >> >> -----Original Message----- From: Dave Chinner >> Sent: 05 August, 2014 2:07 AM >> To: Alex Lyakas >> Cc: xfs@oss.sgi.com >> Subject: Re: use-after-free on log replay failure >> >> On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote: >> >Greetings, >> > >> >we had a log replay failure due to some errors that the underlying >> >block device returned: >> >[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 >> >("xlog_recover_iodone") error 28 numblks 16 >> >[49133.802495] XFS (dm-95): log mount/recovery failed: error 28 >> >[49133.802644] XFS (dm-95): log mount failed >> >> #define ENOSPC 28 /* No space left on device */ >> >> You're getting an ENOSPC as a metadata IO error during log recovery? >> Thin provisioning problem, perhaps, and the error is occurring on >> submission rather than completion? If so: >> >> 8d6c121 xfs: fix buffer use after free on IO error >> >> Cheers, >> >> Dave. >> -- >> Dave Chinner >> david@fromorbit.com >> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
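[Editor's note: the failure mode confirmed above — `xlog_recover_iodone` callbacks dispatched through `xfslogd_workqueue` outliving the kfreed `xfs_mount`, hence the "\x10@dR" junk in the logs — can be modeled in userspace. A hedged Python sketch follows; `Mount`, `submit`, and `endio` are invented toy names, not XFS code. It shows the invariant the fix restores: every outstanding I/O completion must run before the mount structure is torn down, even though the first error already failed the mount.]

```python
import threading
import time

class Mount:
    """Toy xfs_mount: callbacks must never run after it is freed."""
    def __init__(self):
        self.freed = False
        self.name = "dm-19"

def submit(mount, delays, log):
    """Submit fake WRITEs that all fail after their delay; each
    completion touches the mount (as xlog_recover_iodone does when it
    prints the device name)."""
    def endio(delay):
        time.sleep(delay)
        log.append((mount.freed, mount.name))  # reads mount state
    threads = [threading.Thread(target=endio, args=(d,)) for d in delays]
    for t in threads:
        t.start()
    return threads

log = []
m = Mount()
pending = submit(m, [0.01, 0.03, 0.05], log)

# The fix, in miniature: even though the mount has already failed,
# wait for *every* outstanding completion before tearing down.
for t in pending:
    t.join()
m.freed = True                  # only now would kfree(mp) be safe

assert all(freed is False for freed, _ in log)
print("all %d completions ran before the mount was freed" % len(log))
```

Without the `join` loop (i.e. returning the mount failure to the caller while bios are still in flight), the completions would observe `freed == True` and read a stale name — exactly the junk-then-GPF sequence in the traces above.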
* Re: use-after-free on log replay failure 2014-08-10 12:20 ` Alex Lyakas @ 2014-08-11 13:20 ` Brian Foster 2014-08-11 21:52 ` Dave Chinner 0 siblings, 1 reply; 47+ messages in thread From: Brian Foster @ 2014-08-11 13:20 UTC (permalink / raw) To: Alex Lyakas; +Cc: xfs On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote: > Hi Brian, > > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> wrote: > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote: > >> Hello Dave and Brian, > >> > >> Dave, I tried the patch you suggested, but it does not fix the issue. I did > >> some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be > >> called from xfs_buf_iorequest(), which the patch fixes, but also from > >> xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty > >> easily. The flow that I have is like this: > >> - xlog_recover() calls xlog_find_tail(). This works alright. > > > > What's the purpose of a sleep here? > > > >> - Now I add a small sleep before calling xlog_do_recover(), and meanwhile I > >> instruct my block device to return ENOSPC for any WRITE from now on. > >> > >> What seems to happen is that several WRITE bios are submitted and they all > >> fail. 
When they do, they reach xfs_buf_ioend() through a stack like this: > >> > >> Aug 6 15:23:07 dev kernel: [ 304.410528] [56]xfs*[xfs_buf_ioend:1056] > >> XFS(dm-19): Scheduling xfs_buf_iodone_work on error > >> Aug 6 15:23:07 dev kernel: [ 304.410534] Pid: 56, comm: kworker/u:1 > >> Tainted: G W O 3.8.13-557-generic #1382000791 > >> Aug 6 15:23:07 dev kernel: [ 304.410537] Call Trace: > >> Aug 6 15:23:07 dev kernel: [ 304.410587] [<ffffffffa04d6654>] > >> xfs_buf_ioend+0x1a4/0x1b0 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.410621] [<ffffffffa04d6685>] > >> _xfs_buf_ioend+0x25/0x30 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.410643] [<ffffffffa04d6b3d>] > >> xfs_buf_bio_end_io+0x3d/0x50 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.410652] [<ffffffff811c3d8d>] > >> bio_endio+0x1d/0x40 > >> ... > >> > >> At this point, they are scheduled to run xlog_recover_iodone through > >> xfslogd_workqueue. > >> The first callback that gets called, calls xfs_do_force_shutdown in stack > >> like this: > >> > >> Aug 6 15:23:07 dev kernel: [ 304.411791] XFS (dm-19): metadata I/O error: > >> block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1 > >> Aug 6 15:23:07 dev kernel: [ 304.413493] XFS (dm-19): > >> xfs_do_force_shutdown(0x1) called from line 377 of file > >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > >> 0xffffffffa0526848 > >> Aug 6 15:23:07 dev kernel: [ 304.413837] [<ffffffffa04e0b60>] > >> xfs_do_force_shutdown+0x40/0x180 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.413870] [<ffffffffa0526848>] ? 
> >> xlog_recover_iodone+0x48/0x70 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.413902] [<ffffffffa0526848>] > >> xlog_recover_iodone+0x48/0x70 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.413923] [<ffffffffa04d645d>] > >> xfs_buf_iodone_work+0x4d/0xa0 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.413930] [<ffffffff81077a11>] > >> process_one_work+0x141/0x4a0 > >> Aug 6 15:23:07 dev kernel: [ 304.413937] [<ffffffff810789e8>] > >> worker_thread+0x168/0x410 > >> Aug 6 15:23:07 dev kernel: [ 304.413943] [<ffffffff81078880>] ? > >> manage_workers+0x120/0x120 > >> Aug 6 15:23:07 dev kernel: [ 304.413949] [<ffffffff8107df10>] > >> kthread+0xc0/0xd0 > >> Aug 6 15:23:07 dev kernel: [ 304.413954] [<ffffffff8107de50>] ? > >> flush_kthread_worker+0xb0/0xb0 > >> Aug 6 15:23:07 dev kernel: [ 304.413976] [<ffffffff816ab86c>] > >> ret_from_fork+0x7c/0xb0 > >> Aug 6 15:23:07 dev kernel: [ 304.413986] [<ffffffff8107de50>] ? > >> flush_kthread_worker+0xb0/0xb0 > >> Aug 6 15:23:07 dev kernel: [ 304.413990] ---[ end trace 988d698520e1fa81 > >> ]--- > >> Aug 6 15:23:07 dev kernel: [ 304.414012] XFS (dm-19): I/O Error Detected. > >> Shutting down filesystem > >> Aug 6 15:23:07 dev kernel: [ 304.415936] XFS (dm-19): Please umount the > >> filesystem and rectify the problem(s) > >> > >> But the rest of the callbacks also arrive: > >> Aug 6 15:23:07 dev kernel: [ 304.417812] XFS (dm-19): metadata I/O error: > >> block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1 > >> Aug 6 15:23:07 dev kernel: [ 304.420420] XFS (dm-19): > >> xfs_do_force_shutdown(0x1) called from line 377 of file > >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > >> 0xffffffffa0526848 > >> Aug 6 15:23:07 dev kernel: [ 304.420427] XFS (dm-19): metadata I/O error: > >> block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8 > >> Aug 6 15:23:07 dev kernel: [ 304.422708] XFS (dm-19): > >> xfs_do_force_shutdown(0x1) called from line 377 of file > >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. 
Return address = > >> 0xffffffffa0526848 > >> Aug 6 15:23:07 dev kernel: [ 304.422738] XFS (dm-19): metadata I/O error: > >> block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8 > >> > >> The mount sequence fails and goes back to the caller: > >> Aug 6 15:23:07 dev kernel: [ 304.423438] XFS (dm-19): log mount/recovery > >> failed: error 28 > >> Aug 6 15:23:07 dev kernel: [ 304.423757] XFS (dm-19): log mount failed > >> > >> But there are still additional callbacks to deliver, which the mount > >> sequence did not wait for! > >> Aug 6 15:23:07 dev kernel: [ 304.425717] XFS ( @dR): > >> xfs_do_force_shutdown(0x1) called from line 377 of file > >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > >> 0xffffffffa0526848 > >> Aug 6 15:23:07 dev kernel: [ 304.425723] XFS ( @dR): metadata I/O error: > >> block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8 > >> Aug 6 15:23:07 dev kernel: [ 304.428239] XFS ( @dR): > >> xfs_do_force_shutdown(0x1) called from line 377 of file > >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address = > >> 0xffffffffa0526848 > >> Aug 6 15:23:07 dev kernel: [ 304.428246] XFS ( @dR): metadata I/O error: > >> block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16 > >> > >> Notice the junk that they are printing! Naturally, because xfs_mount > >> structure has been kfreed. 
> >> > >> Finally the kernel crashes (instead of printing junk), because the xfs_mount > >> structure is gone, but the callback tries to access it (printing the name): > >> > >> Aug 6 15:23:07 dev kernel: [ 304.430796] general protection fault: 0000 > >> [#1] SMP > >> Aug 6 15:23:07 dev kernel: [ 304.432035] Modules linked in: xfrm_user > >> xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O) > >> scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c > >> raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq > >> async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic > >> twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 > >> twofish_common camellia_generic serpent_generic blowfish_generic > >> blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc > >> rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat > >> fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw > >> parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache > >> lockd sunrpc lp parport floppy > >> Aug 6 15:23:07 dev kernel: [ 304.432035] CPU 1 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] Pid: 133, comm: kworker/1:1H > >> Tainted: G W O 3.8.13-557-generic #1382000791 Bochs Bochs > >> Aug 6 15:23:07 dev kernel: [ 304.432035] RIP: 0010:[<ffffffff8133c2cb>] > >> [<ffffffff8133c2cb>] strnlen+0xb/0x30 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] RSP: 0018:ffff880035461b08 > >> EFLAGS: 00010086 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] RAX: 0000000000000000 RBX: > >> ffffffff81e6a4e7 RCX: 0000000000000000 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] RDX: e4e8390a265c0000 RSI: > >> ffffffffffffffff RDI: e4e8390a265c0000 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] RBP: ffff880035461b08 R08: > >> 000000000000ffff R09: 000000000000ffff > >> Aug 6 15:23:07 dev kernel: [ 304.432035] R10: 0000000000000000 R11: > >> 00000000000004cd R12: e4e8390a265c0000 > 
>> Aug 6 15:23:07 dev kernel: [ 304.432035] R13: ffffffff81e6a8c0 R14: > >> 0000000000000000 R15: 000000000000ffff > >> Aug 6 15:23:07 dev kernel: [ 304.432035] FS: 0000000000000000(0000) > >> GS:ffff88007fc80000(0000) knlGS:0000000000000000 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] CS: 0010 DS: 0000 ES: 0000 CR0: > >> 000000008005003b > >> Aug 6 15:23:07 dev kernel: [ 304.432035] CR2: 00007fc902ffbfd8 CR3: > >> 000000007702a000 CR4: 00000000000006e0 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] DR0: 0000000000000000 DR1: > >> 0000000000000000 DR2: 0000000000000000 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] DR3: 0000000000000000 DR6: > >> 00000000ffff0ff0 DR7: 0000000000000400 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] Process kworker/1:1H (pid: 133, > >> threadinfo ffff880035460000, task ffff880035412e00) > >> Aug 6 15:23:07 dev kernel: [ 304.432035] Stack: > >> Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461b48 > >> ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] ffffffffa0566cba > >> ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461bc8 > >> ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] Call Trace: > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133dd5e>] > >> string.isra.4+0x3e/0xd0 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133ef59>] > >> vsnprintf+0x219/0x640 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133f441>] > >> vscnprintf+0x11/0x30 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105a971>] > >> vprintk_emit+0xc1/0x490 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105aa20>] ? 
> >> vprintk_emit+0x170/0x490 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8168b992>] > >> printk+0x61/0x63 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9bf1>] > >> __xfs_printk+0x31/0x50 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9e43>] > >> xfs_notice+0x53/0x60 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e0c15>] > >> xfs_do_force_shutdown+0xf5/0x180 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] ? > >> xlog_recover_iodone+0x48/0x70 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] > >> xlog_recover_iodone+0x48/0x70 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04d645d>] > >> xfs_buf_iodone_work+0x4d/0xa0 [xfs] > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81077a11>] > >> process_one_work+0x141/0x4a0 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff810789e8>] > >> worker_thread+0x168/0x410 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81078880>] ? > >> manage_workers+0x120/0x120 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107df10>] > >> kthread+0xc0/0xd0 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ? > >> flush_kthread_worker+0xb0/0xb0 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff816ab86c>] > >> ret_from_fork+0x7c/0xb0 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ? > >> flush_kthread_worker+0xb0/0xb0 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] Code: 31 c0 80 3f 00 55 48 89 e5 > >> 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31 > >> c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee > >> 01 80 38 00 74 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] RIP [<ffffffff8133c2cb>] > >> strnlen+0xb/0x30 > >> Aug 6 15:23:07 dev kernel: [ 304.432035] RSP <ffff880035461b08> > >> > >> > >> So previously you said: "So, something is corrupting memory and stamping all > >> over the XFS structures." 
and also "given you have a bunch of out of tree
> >> modules loaded (and some which are experimental) suggests that you have a
> >> problem with your storage...".
> >>
> >> But I believe, my analysis shows that during the mount sequence XFS does not
> >> wait properly for all the bios to complete, before failing the mount
> >> sequence back to the caller.
> >>
> >
> > As an experiment, what about the following? Compile tested only and not
> > safe for general use.
> >
> > What might help more is to see if you can create a reproducer on a
> > recent, clean kernel. Perhaps a metadump of your reproducer fs combined
> > with whatever block device ENOSPC hack you're using would do it.
> >
> > Brian
> >
> > ---8<---
> >
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index cd7b8ca..fbcf524 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> >  * case nothing will ever complete. It returns the I/O error code, if any, or
> >  * 0 if there was no error.
> >  */
> > -int
> > -xfs_buf_iowait(
> > -	xfs_buf_t		*bp)
> > +static int
> > +__xfs_buf_iowait(
> > +	struct xfs_buf		*bp,
> > +	bool			skip_error)
> >  {
> >  	trace_xfs_buf_iowait(bp, _RET_IP_);
> >
> > -	if (!bp->b_error)
> > +	if (skip_error || !bp->b_error)
> >  		wait_for_completion(&bp->b_iowait);
> >
> >  	trace_xfs_buf_iowait_done(bp, _RET_IP_);
> >  	return bp->b_error;
> >  }
> >
> > +int
> > +xfs_buf_iowait(
> > +	struct xfs_buf		*bp)
> > +{
> > +	return __xfs_buf_iowait(bp, false);
> > +}
> > +
> >  xfs_caddr_t
> >  xfs_buf_offset(
> >  	xfs_buf_t		*bp,
> > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> >  		bp = list_first_entry(&io_list, struct xfs_buf, b_list);
> >
> >  		list_del_init(&bp->b_list);
> > -		error2 = xfs_buf_iowait(bp);
> > +		error2 = __xfs_buf_iowait(bp, true);
> >  		xfs_buf_relse(bp);
> >  		if (!error)
> >  			error = error2;
> >
> > ---
> I think that this patch fixes the problem. I tried reproducing it like
> 30 times, and it doesn't happen with this patch.
Dropping this patch > reproduces the problem within 1 or 2 tries. Thanks! > What are next steps? How to make it "safe for general use"? > Ok, thanks for testing. I think that implicates the caller bypassing the expected blocking with the right sequence of log recovery I/Os and device failure. TBH, I'd still like to see the specifics, if possible. Could you come up with a generic reproducer for this? I think a metadump of the fs with the dirty log plus whatever device failure simulation hack you're using would suffice. The ideal fix is not yet clear to me. Technically, we could always find a way to customize this particular path to rely on b_iowait since that appears safe, but that could just be a band aid over a larger problem. I'll need to step back and stare at this code some more to try and understand the layering better, then follow up with something when things are more clear. Brian > Thanks, > Alex. > > > > > > > > >> Thanks, > >> Alex. > >> > >> > >> > >> -----Original Message----- From: Dave Chinner > >> Sent: 05 August, 2014 2:07 AM > >> To: Alex Lyakas > >> Cc: xfs@oss.sgi.com > >> Subject: Re: use-after-free on log replay failure > >> > >> On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote: > >> >Greetings, > >> > > >> >we had a log replay failure due to some errors that the underlying > >> >block device returned: > >> >[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180 > >> >("xlog_recover_iodone") error 28 numblks 16 > >> >[49133.802495] XFS (dm-95): log mount/recovery failed: error 28 > >> >[49133.802644] XFS (dm-95): log mount failed > >> > >> #define ENOSPC 28 /* No space left on device */ > >> > >> You're getting an ENOSPC as a metadata IO error during log recovery? > >> Thin provisioning problem, perhaps, and the error is occurring on > >> submission rather than completion? If so: > >> > >> 8d6c121 xfs: fix buffer use after free on IO error > >> > >> Cheers, > >> > >> Dave. 
> >> -- > >> Dave Chinner > >> david@fromorbit.com > >> > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-11 13:20 ` Brian Foster @ 2014-08-11 21:52 ` Dave Chinner 2014-08-12 12:03 ` Brian Foster 0 siblings, 1 reply; 47+ messages in thread From: Dave Chinner @ 2014-08-11 21:52 UTC (permalink / raw) To: Brian Foster; +Cc: Alex Lyakas, xfs On Mon, Aug 11, 2014 at 09:20:57AM -0400, Brian Foster wrote: > On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote: > > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> wrote: > > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote: ..... > > >> But I believe, my analysis shows that during the mount sequence XFS does not > > >> wait properly for all the bios to complete, before failing the mount > > >> sequence back to the caller. > > >> > > > > > > As an experiment, what about the following? Compile tested only and not > > > safe for general use. ... > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > > index cd7b8ca..fbcf524 100644 > > > --- a/fs/xfs/xfs_buf.c > > > +++ b/fs/xfs/xfs_buf.c > > > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest( > > > * case nothing will ever complete. It returns the I/O error code, if any, or > > > * 0 if there was no error. 
> > > */ > > > -int > > > -xfs_buf_iowait( > > > - xfs_buf_t *bp) > > > +static int > > > +__xfs_buf_iowait( > > > + struct xfs_buf *bp, > > > + bool skip_error) > > > { > > > trace_xfs_buf_iowait(bp, _RET_IP_); > > > > > > - if (!bp->b_error) > > > + if (skip_error || !bp->b_error) > > > wait_for_completion(&bp->b_iowait); > > > > > > trace_xfs_buf_iowait_done(bp, _RET_IP_); > > > return bp->b_error; > > > } > > > > > > +int > > > +xfs_buf_iowait( > > > + struct xfs_buf *bp) > > > +{ > > > + return __xfs_buf_iowait(bp, false); > > > +} > > > + > > > xfs_caddr_t > > > xfs_buf_offset( > > > xfs_buf_t *bp, > > > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit( > > > bp = list_first_entry(&io_list, struct xfs_buf, b_list); > > > > > > list_del_init(&bp->b_list); > > > - error2 = xfs_buf_iowait(bp); > > > + error2 = __xfs_buf_iowait(bp, true); > > > xfs_buf_relse(bp); > > > if (!error) > > > error = error2; Not waiting here on buffer error should not matter. Any buffer that is under IO and requires completion should be referenced, and that means it should be caught and waited on by xfs_wait_buftarg() in the mount failure path after log recovery fails. > > > --- > > I think that this patch fixes the problem. I tried reproducing it like > > 30 times, and it doesn't happen with this patch. Dropping this patch > > reproduces the problem within 1 or 2 tries. Thanks! > > What are next steps? How to make it "safe for general use"? > > > > Ok, thanks for testing. I think that implicates the caller bypassing the > expected blocking with the right sequence of log recovery I/Os and > device failure. TBH, I'd still like to see the specifics, if possible. > Could you come up with a generic reproducer for this? I think a metadump > of the fs with the dirty log plus whatever device failure simulation > hack you're using would suffice. The real issue is we don't know exactly what code is being tested (it's 3.8 + random bug fix backports + custom code). 
Even if we have a reproducer there's no guarantee it will reproduce on a current kernel. IOWs, we are stumbling around in the dark bashing our heads against everything in the room, and that just wastes everyone's time. We need a reproducer that triggers on a current, unmodified kernel release. You can use dm-faulty to error out all writes just like you are doing with your custom code. See xfstests::tests/generic/321 and common/dmflakey for how to do this. Ideally the reproducer is in a form that xfstests can use.... If you can't reproduce it on an upstream kernel, then git bisect is your friend. It will find the commit that fixed the problem you are seeing.... > The ideal fix is not yet clear to me. We are not even that far along - the root cause of the bug is not at all clear to me. :/ Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-11 21:52 ` Dave Chinner @ 2014-08-12 12:03 ` Brian Foster 2014-08-12 12:39 ` Alex Lyakas 0 siblings, 1 reply; 47+ messages in thread From: Brian Foster @ 2014-08-12 12:03 UTC (permalink / raw) To: Dave Chinner; +Cc: Alex Lyakas, xfs On Tue, Aug 12, 2014 at 07:52:07AM +1000, Dave Chinner wrote: > On Mon, Aug 11, 2014 at 09:20:57AM -0400, Brian Foster wrote: > > On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote: > > > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> wrote: > > > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote: > ..... > > > >> But I believe, my analysis shows that during the mount sequence XFS does not > > > >> wait properly for all the bios to complete, before failing the mount > > > >> sequence back to the caller. > > > >> > > > > > > > > As an experiment, what about the following? Compile tested only and not > > > > safe for general use. > ... > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > > > index cd7b8ca..fbcf524 100644 > > > > --- a/fs/xfs/xfs_buf.c > > > > +++ b/fs/xfs/xfs_buf.c > > > > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest( > > > > * case nothing will ever complete. It returns the I/O error code, if any, or > > > > * 0 if there was no error. 
> > > > */ > > > > -int > > > > -xfs_buf_iowait( > > > > - xfs_buf_t *bp) > > > > +static int > > > > +__xfs_buf_iowait( > > > > + struct xfs_buf *bp, > > > > + bool skip_error) > > > > { > > > > trace_xfs_buf_iowait(bp, _RET_IP_); > > > > > > > > - if (!bp->b_error) > > > > + if (skip_error || !bp->b_error) > > > > wait_for_completion(&bp->b_iowait); > > > > > > > > trace_xfs_buf_iowait_done(bp, _RET_IP_); > > > > return bp->b_error; > > > > } > > > > > > > > +int > > > > +xfs_buf_iowait( > > > > + struct xfs_buf *bp) > > > > +{ > > > > + return __xfs_buf_iowait(bp, false); > > > > +} > > > > + > > > > xfs_caddr_t > > > > xfs_buf_offset( > > > > xfs_buf_t *bp, > > > > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit( > > > > bp = list_first_entry(&io_list, struct xfs_buf, b_list); > > > > > > > > list_del_init(&bp->b_list); > > > > - error2 = xfs_buf_iowait(bp); > > > > + error2 = __xfs_buf_iowait(bp, true); > > > > xfs_buf_relse(bp); > > > > if (!error) > > > > error = error2; > > Not waiting here on buffer error should not matter. Any buffer that > is under IO and requires completion should be referenced, and that > means it should be caught and waited on by xfs_wait_buftarg() in the > mount failure path after log recovery fails. > I think that assumes the I/O is successful. Looking through xlog_recover_buffer_pass2() as an example, we read the buffer which should return with b_hold == 1. The delwri queue increments the hold and we xfs_buf_relse() in the return path (i.e., buffer is now held by the delwri queue awaiting submission). Sometime later we delwri_submit()... xfs_buf_iorequest() does an xfs_buf_hold() and xfs_buf_rele() within that single function. The delwri_submit() releases its hold after xfs_buf_iowait(), which I guess at that point bp should go onto the lru (b_hold back to 1 in xfs_buf_rele(). Indeed, the caller has lost scope of the buffer at this point. 
So unless I miss something or got the lifecycle wrong here, which is easily possible ;), this all hinges on xfs_buf_iowait(). That's where the last hold forcing the buffer to stay around goes away. xfs_buftarg_wait_rele() will dispose the buffer if b_hold == 1. If xfs_buf_iowait() is racy in the event of I/O errors via the bio callback, I think this path is susceptible just the same. > > > > --- > > > I think that this patch fixes the problem. I tried reproducing it like > > > 30 times, and it doesn't happen with this patch. Dropping this patch > > > reproduces the problem within 1 or 2 tries. Thanks! > > > What are next steps? How to make it "safe for general use"? > > > > > > > Ok, thanks for testing. I think that implicates the caller bypassing the > > expected blocking with the right sequence of log recovery I/Os and > > device failure. TBH, I'd still like to see the specifics, if possible. > > Could you come up with a generic reproducer for this? I think a metadump > > of the fs with the dirty log plus whatever device failure simulation > > hack you're using would suffice. > > The real issue is we don't know exactly what code is being tested > (it's 3.8 + random bug fix backports + custom code). Even if we have > a reproducer there's no guarantee it will reproduce on a current > kernel. IOWs, we are stumbling around in the dark bashing our heads > against everything in the room, and that just wastes everyone's > time. > > We need a reproducer that triggers on a current, unmodified > kernel release. You can use dm-faulty to error out all writes just > like you are doing with your custom code. See > xfstests::tests/generic/321 and common/dmflakey for to do this. > Ideally the reproducer is in a form that xfstests can use.... > > If you can't reproduce it on an upstream kernel, then git bisect is > your friend. It will find the commit that fixed the problem you are > seeing.... > Ugh, yeah. The fact that this was customized as such apparently went over my head. 
I agree completely. This needs to be genericized to a pristine, preferably current kernel. The experiment patch could be papering over something completely different. > > The ideal fix is not yet clear to me. > > We are not even that far along - the root cause of the bug is not at > all clear to me. :/ > Yeah.. the above was just the theory that motivated the experiment in the previously posted patch. It of course remains a theory until we can see the race in action. I was referring to the potential fix for the raciness of xfs_buf_iowait() with regard to bio errors and the wq iodone handling, while still asking for a reproducer to confirm the actual problem. FWIW, I'm not too high on changes in the buf management code, even a smallish behavior change, without a real trace of some sort that documents the problem and justifies the change. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-12 12:03 ` Brian Foster @ 2014-08-12 12:39 ` Alex Lyakas 2014-08-12 19:31 ` Brian Foster ` (2 more replies) 0 siblings, 3 replies; 47+ messages in thread From: Alex Lyakas @ 2014-08-12 12:39 UTC (permalink / raw) To: Brian Foster, Dave Chinner; +Cc: xfs

[-- Attachment #1: Type: text/plain, Size: 13483 bytes --]

Hello Dave, Brian,
I will describe the generic reproduction that you asked for. It was performed on pristine XFS code from 3.8.13, taken from here:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
top commit being:

commit dbf932a9b316d5b29b3e220e5a30e7a165ad2992
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Sat May 11 13:57:46 2013 -0700

    Linux 3.8.13

I made a single (I swear!) code change in XFS:

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 96fcbb8..d756bf6 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3736,6 +3736,9 @@ xlog_recover(
 	if ((error = xlog_find_tail(log, &head_blk, &tail_blk)))
 		return error;

+	xfs_notice(log->l_mp, "Sleep 10s before xlog_do_recover");
+	msleep(10000);
+
 	if (tail_blk != head_blk) {
 		/* There used to be a comment here:
 		 *

Fresh XFS was formatted on a 20 GB block device within a VM, using:
mkfs.xfs -f -K /dev/vde -p /etc/zadara/xfs.protofile
and:
root@vc-00-00-1383-dev:~# cat /etc/zadara/xfs.protofile
dummy : bootfilename, not used, backward compatibility
0 0 : numbers of blocks and inodes, not used, backward compatibility
d--777 0 0 : set 777 perms for the root dir
$
$

I mounted XFS with the following options: rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
I started a couple of processes writing files sequentially onto this mount point, and after a few seconds crashed the VM. 
When the VM came up, I took the metadump file and placed it in: https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing Then I set up the following Device Mapper target onto /dev/vde: dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0" I am attaching the code (and Makefile) of dm-linear-custom target. It is exact copy of dm-linear, except that it has a module parameter. With the parameter set to 0, this is an identity mapping onto /dev/vde. If the parameter is set to non-0, all WRITE bios are failed with ENOSPC. There is a workqueue to fail them in a different context (not sure if really needed, but that's what our "real" custom block device does). Now I did: mount -o noatime,sync /dev/mapper/VDE /mnt/xfs The log recovery flow went into the sleep that I added, and then I did: echo 1 > /sys/module/dm_linear_custom/parameters/fail_writes Problem reproduced: Aug 12 14:23:04 vc-00-00-1383-dev kernel: [ 175.000657] XFS (dm-0): Mounting Filesystem Aug 12 14:23:04 vc-00-00-1383-dev kernel: [ 175.026991] XFS (dm-0): Sleep 10s before xlog_do_recover Aug 12 14:23:14 vc-00-00-1383-dev kernel: [ 185.028113] XFS (dm-0): Starting recovery (logdev: internal) Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556622] XFS (dm-0): metadata I/O error: block 0x2 ("xlog_recover_iodone") error 28 numblks 1 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556675] XFS (dm-0): metadata I/O error: block 0x40 ("xlog_recover_iodone") error 28 numblks 16 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556680] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556683] XFS (dm-0): I/O Error Detected. 
Shutting down filesystem Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556684] XFS (dm-0): Please umount the filesystem and rectify the problem(s) Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556766] XFS (dm-0): metadata I/O error: block 0xa00002 ("xlog_recover_iodone") error 5 numblks 1 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556769] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556771] XFS (dm-0): metadata I/O error: block 0xa00008 ("xlog_recover_iodone") error 5 numblks 8 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556774] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556776] XFS (dm-0): metadata I/O error: block 0xa00010 ("xlog_recover_iodone") error 5 numblks 8 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556779] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556781] XFS (dm-0): metadata I/O error: block 0xa00018 ("xlog_recover_iodone") error 5 numblks 8 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556783] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556785] XFS (dm-0): metadata I/O error: block 0xa00040 ("xlog_recover_iodone") error 5 numblks 16 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556788] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. 
Return address = 0xffffffffa0349f68 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556790] XFS (dm-0): metadata I/O error: block 0xa00050 ("xlog_recover_iodone") error 5 numblks 16 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556793] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556880] XFS (dm-0): metadata I/O error: block 0xa00001 ("xlog_recover_iodone") error 28 numblks 1 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556884] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556977] XFS (dm-0): log mount/recovery failed: error 28 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.557215] XFS (dm-0): log mount failed Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.573194] XFS (): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.573214] XFS (): metadata I/O error: block 0x18 ("xlog_recover_iodone") error 28 numblks 8 Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.574685] XFS (): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68 As you see, after mount completes, IO callbacks are still arriving and printing junk (empty string in this case). Immediately after that kernel dies. Is this description generic enough? Thanks, Alex. 
[-- Attachment #2: dm-linear-custom.c --]
[-- Type: application/octet-stream, Size: 5786 bytes --]

/*
 * Copyright (C) 2001-2003 Sistina Software (UK) Limited.
 *
 * This file is released under the GPL.
 */

#include <linux/module.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/device-mapper.h>

#define DM_MSG_PREFIX "linear-custom"

/*
 * Linear: maps a linear range of a device.
 */
struct linear_c {
	struct dm_dev *dev;
	sector_t start;
};

unsigned int fail_writes = 0;
module_param_named(fail_writes, fail_writes, uint,
		   S_IWUSR | S_IRUSR | S_IRGRP | S_IROTH);
MODULE_PARM_DESC(fail_writes,
		 "When set to non-0, all writes will be failed with ENOSPC");

static struct workqueue_struct *fail_writes_wq;

struct fail_writes_work {
	struct work_struct work;
	struct bio *bio;
};

/*
 * Construct a linear mapping: <dev_path> <offset>
 */
static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	struct linear_c *lc;
	unsigned long long tmp;
	char dummy;

	if (argc != 2) {
		ti->error = "Invalid argument count";
		return -EINVAL;
	}

	lc = kmalloc(sizeof(*lc), GFP_KERNEL);
	if (lc == NULL) {
		ti->error = "dm-linear: Cannot allocate linear context";
		return -ENOMEM;
	}

	if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1) {
		ti->error = "dm-linear: Invalid device sector";
		goto bad;
	}
	lc->start = tmp;

	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), &lc->dev)) {
		ti->error = "dm-linear: Device lookup failed";
		goto bad;
	}

	ti->num_flush_requests = 1;
	ti->num_discard_requests = 1;
	ti->num_write_same_requests = 1;
	ti->private = lc;
	return 0;

bad:
	kfree(lc);
	return -EINVAL;
}

static void linear_dtr(struct dm_target *ti)
{
	struct linear_c *lc = (struct linear_c *) ti->private;

	dm_put_device(ti, lc->dev);
	kfree(lc);
}

static sector_t linear_map_sector(struct dm_target *ti, sector_t bi_sector)
{
	struct linear_c *lc = ti->private;

	return lc->start + dm_target_offset(ti, bi_sector);
}

static void linear_map_bio(struct dm_target *ti, struct bio *bio)
{
	struct linear_c *lc = ti->private;

	bio->bi_bdev = lc->dev->bdev;
	if (bio_sectors(bio))
		bio->bi_sector = linear_map_sector(ti, bio->bi_sector);
}

void linear_fail_write(struct work_struct *work)
{
	struct fail_writes_work *w =
		container_of(work, struct fail_writes_work, work);

	bio_endio(w->bio, -ENOSPC);
	kfree(w);
}

static int linear_map(struct dm_target *ti, struct bio *bio)
{
	if (bio_rw(bio) == WRITE && fail_writes) {
		struct fail_writes_work *w =
			kmalloc(sizeof(struct fail_writes_work), GFP_NOIO);

		if (w) {
			bool queued = false;

			INIT_WORK(&w->work, linear_fail_write);
			w->bio = bio;
			queued = queue_work(fail_writes_wq, &w->work);
			if (queued)
				return DM_MAPIO_SUBMITTED;
			kfree(w);
		}
	}

	linear_map_bio(ti, bio);
	return DM_MAPIO_REMAPPED;
}

static void linear_status(struct dm_target *ti, status_type_t type,
			  unsigned status_flags, char *result, unsigned maxlen)
{
	struct linear_c *lc = (struct linear_c *) ti->private;

	switch (type) {
	case STATUSTYPE_INFO:
		result[0] = '\0';
		break;

	case STATUSTYPE_TABLE:
		snprintf(result, maxlen, "%s %llu", lc->dev->name,
			 (unsigned long long)lc->start);
		break;
	}
}

static int linear_ioctl(struct dm_target *ti, unsigned int cmd,
			unsigned long arg)
{
	struct linear_c *lc = (struct linear_c *) ti->private;
	struct dm_dev *dev = lc->dev;
	int r = 0;

	/*
	 * Only pass ioctls through if the device sizes match exactly.
	 */
	if (lc->start ||
	    ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT)
		r = scsi_verify_blk_ioctl(NULL, cmd);

	return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
}

static int linear_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
			struct bio_vec *biovec, int max_size)
{
	struct linear_c *lc = ti->private;
	struct request_queue *q = bdev_get_queue(lc->dev->bdev);

	if (!q->merge_bvec_fn)
		return max_size;

	bvm->bi_bdev = lc->dev->bdev;
	bvm->bi_sector = linear_map_sector(ti, bvm->bi_sector);

	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
}

static int linear_iterate_devices(struct dm_target *ti,
				  iterate_devices_callout_fn fn, void *data)
{
	struct linear_c *lc = ti->private;

	return fn(ti, lc->dev, lc->start, ti->len, data);
}

static struct target_type linearc_target = {
	.name    = "linear-custom",
	.version = {1, 2, 1},
	.module  = THIS_MODULE,
	.ctr     = linear_ctr,
	.dtr     = linear_dtr,
	.map     = linear_map,
	.status  = linear_status,
	.ioctl   = linear_ioctl,
	.merge   = linear_merge,
	.iterate_devices = linear_iterate_devices,
};

int dm_linear_custom_init(void)
{
	int r = 0;

	DMINFO("Going to register target: %s", linearc_target.name);
	r = dm_register_target(&linearc_target);
	if (r < 0) {
		DMERR("register failed %d", r);
		goto out;
	}

	fail_writes_wq = alloc_workqueue("dm_linear_custom_fail_writes_WQ",
					WQ_NON_REENTRANT | WQ_UNBOUND | WQ_MEM_RECLAIM,
					0/*max_active*/);
	if (fail_writes_wq == NULL) {
		r = -ENOMEM;
		goto out_unreg_target;
	}

	return 0;

out_unreg_target:
	dm_unregister_target(&linearc_target);
out:
	return r;
}

void dm_linear_custom_exit(void)
{
	if (fail_writes_wq) {
		/* actually destroy_workqueue() also does drain_workqueue */
		flush_workqueue(fail_writes_wq);
		destroy_workqueue(fail_writes_wq);
		fail_writes_wq = NULL;
	}
	dm_unregister_target(&linearc_target);
}

module_init(dm_linear_custom_init);
module_exit(dm_linear_custom_exit);

MODULE_AUTHOR("Zadara Storage <team@zadarastorage.com>");
MODULE_DESCRIPTION("A modified dm-linear for BrianF");
MODULE_LICENSE("GPL");

[-- Attachment #3: Makefile --]
[-- Type: application/octet-stream, Size: 405 bytes --]

obj-m += dm-linear-custom.o

ccflags-y += -Wall	# Enable most warning messages
ccflags-y += -Werror	# Error out the compiler on warnings

KVERSION = $(shell uname -r)
DM_LC_DIR=$(shell pwd)
DM_LC_KO=dm-linear-custom.ko

default:
	$(MAKE) -C /lib/modules/$(KVERSION)/build M=$(DM_LC_DIR) EXTRA_CFLAGS=-g modules

clean:
	$(MAKE) -C /lib/modules/$(KVERSION)/build M=$(DM_LC_DIR) clean

[-- Attachment #4: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-12 12:39 ` Alex Lyakas @ 2014-08-12 19:31 ` Brian Foster 2014-08-12 23:56 ` Dave Chinner 2014-08-13 0:03 ` Dave Chinner 2 siblings, 0 replies; 47+ messages in thread From: Brian Foster @ 2014-08-12 19:31 UTC (permalink / raw) To: Alex Lyakas; +Cc: xfs On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote: > Hello Dave, Brian, > I will describe a generic reproduction that you ask for. > > It was performed on pristine XFS code from 3.8.13, taken from here: > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git > This seems generic enough to me. Could you try on a more recent kernel? Dave had mentioned there were fixes in this area of log recovery, so a bisect might be all that is necessary to track down the patch you need. Otherwise, we can pick up debugging from something more recent. Brian > top commit being: > commit dbf932a9b316d5b29b3e220e5a30e7a165ad2992 > Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Date: Sat May 11 13:57:46 2013 -0700 > > Linux 3.8.13 > > > I made a single (I swear!) 
code change in XFS:
>
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 96fcbb8..d756bf6 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3736,6 +3736,9 @@ xlog_recover(
>  	if ((error = xlog_find_tail(log, &head_blk, &tail_blk)))
>  		return error;
>
> +	xfs_notice(log->l_mp, "Sleep 10s before xlog_do_recover");
> +	msleep(10000);
> +
>  	if (tail_blk != head_blk) {
>  		/* There used to be a comment here:
>  		 *
>
> Fresh XFS was formatted on a 20 GB block device within a VM, using:
> mkfs.xfs -f -K /dev/vde -p /etc/zadara/xfs.protofile
> and:
> root@vc-00-00-1383-dev:~# cat /etc/zadara/xfs.protofile
> dummy            : bootfilename, not used, backward compatibility
> 0 0              : numbers of blocks and inodes, not used,
>                    backward compatibility
> d--777 0 0       : set 777 perms for the root dir
> $
> $
>
> I mounted XFS with the following options:
> rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
>
> I started a couple of processes writing files sequentially onto this mount
> point, and after few seconds crashed the VM.
> When the VM came up, I took the metadump file and placed it in:
> https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
>
> Then I set up the following Device Mapper target onto /dev/vde:
> dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> I am attaching the code (and Makefile) of dm-linear-custom target. It is
> exact copy of dm-linear, except that it has a module parameter. With the
> parameter set to 0, this is an identity mapping onto /dev/vde. If the
> parameter is set to non-0, all WRITE bios are failed with ENOSPC. There is a
> workqueue to fail them in a different context (not sure if really needed,
> but that's what our "real" custom
> block device does).
> > Now I did: > mount -o noatime,sync /dev/mapper/VDE /mnt/xfs > > The log recovery flow went into the sleep that I added, and then I did: > echo 1 > /sys/module/dm_linear_custom/parameters/fail_writes > > Problem reproduced: > Aug 12 14:23:04 vc-00-00-1383-dev kernel: [ 175.000657] XFS (dm-0): > Mounting Filesystem > Aug 12 14:23:04 vc-00-00-1383-dev kernel: [ 175.026991] XFS (dm-0): Sleep > 10s before xlog_do_recover > Aug 12 14:23:14 vc-00-00-1383-dev kernel: [ 185.028113] XFS (dm-0): > Starting recovery (logdev: internal) > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556622] XFS (dm-0): > metadata I/O error: block 0x2 ("xlog_recover_iodone") error 28 numblks 1 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556675] XFS (dm-0): > metadata I/O error: block 0x40 ("xlog_recover_iodone") error 28 numblks 16 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556680] XFS (dm-0): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address > = 0xffffffffa0349f68 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556683] XFS (dm-0): I/O > Error Detected. Shutting down filesystem > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556684] XFS (dm-0): Please > umount the filesystem and rectify the problem(s) > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556766] XFS (dm-0): > metadata I/O error: block 0xa00002 ("xlog_recover_iodone") error 5 numblks 1 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556769] XFS (dm-0): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. 
Return address > = 0xffffffffa0349f68 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556771] XFS (dm-0): > metadata I/O error: block 0xa00008 ("xlog_recover_iodone") error 5 numblks 8 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556774] XFS (dm-0): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address > = 0xffffffffa0349f68 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556776] XFS (dm-0): > metadata I/O error: block 0xa00010 ("xlog_recover_iodone") error 5 numblks 8 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556779] XFS (dm-0): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address > = 0xffffffffa0349f68 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556781] XFS (dm-0): > metadata I/O error: block 0xa00018 ("xlog_recover_iodone") error 5 numblks 8 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556783] XFS (dm-0): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address > = 0xffffffffa0349f68 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556785] XFS (dm-0): > metadata I/O error: block 0xa00040 ("xlog_recover_iodone") error 5 numblks > 16 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556788] XFS (dm-0): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address > = 0xffffffffa0349f68 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556790] XFS (dm-0): > metadata I/O error: block 0xa00050 ("xlog_recover_iodone") error 5 numblks > 16 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556793] XFS (dm-0): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. 
Return address > = 0xffffffffa0349f68 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556880] XFS (dm-0): > metadata I/O error: block 0xa00001 ("xlog_recover_iodone") error 28 numblks > 1 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556884] XFS (dm-0): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address > = 0xffffffffa0349f68 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556977] XFS (dm-0): log > mount/recovery failed: error 28 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.557215] XFS (dm-0): log > mount failed > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.573194] XFS (): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address > = 0xffffffffa0349f68 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.573214] XFS (): metadata > I/O error: block 0x18 ("xlog_recover_iodone") error 28 numblks 8 > Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.574685] XFS (): > xfs_do_force_shutdown(0x1) called from line 377 of file > /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address > = 0xffffffffa0349f68 > > As you see, after mount completes, IO callbacks are still arriving and > printing junk (empty string in this case). Immediately after that kernel > dies. > > Is this description generic enough? > > Thanks, > Alex. > > > > > -----Original Message----- From: Brian Foster > Sent: 12 August, 2014 3:03 PM > To: Dave Chinner > Cc: Alex Lyakas ; xfs@oss.sgi.com > Subject: Re: use-after-free on log replay failure > > On Tue, Aug 12, 2014 at 07:52:07AM +1000, Dave Chinner wrote: > >On Mon, Aug 11, 2014 at 09:20:57AM -0400, Brian Foster wrote: > >> On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote: > >> > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> > > > >wrote: > >> > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote: > >..... 
> >> > >> But I believe, my analysis shows that during the mount sequence XFS
> >> > >> does not
> >> > >> wait properly for all the bios to complete, before failing the
> >> > >> mount
> >> > >> sequence back to the caller.
> >> > >>
> >> > >
> >> > > As an experiment, what about the following? Compile tested only and
> >> > > not
> >> > > safe for general use.
> ...
> >> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> >> > > index cd7b8ca..fbcf524 100644
> >> > > --- a/fs/xfs/xfs_buf.c
> >> > > +++ b/fs/xfs/xfs_buf.c
> >> > > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> >> > >   * case nothing will ever complete. It returns the I/O error code,
> >> > > if any, or
> >> > >   * 0 if there was no error.
> >> > >   */
> >> > > -int
> >> > > -xfs_buf_iowait(
> >> > > -	xfs_buf_t		*bp)
> >> > > +static int
> >> > > +__xfs_buf_iowait(
> >> > > +	struct xfs_buf		*bp,
> >> > > +	bool			skip_error)
> >> > >  {
> >> > >  	trace_xfs_buf_iowait(bp, _RET_IP_);
> >> > >
> >> > > -	if (!bp->b_error)
> >> > > +	if (skip_error || !bp->b_error)
> >> > >  		wait_for_completion(&bp->b_iowait);
> >> > >
> >> > >  	trace_xfs_buf_iowait_done(bp, _RET_IP_);
> >> > >  	return bp->b_error;
> >> > >  }
> >> > >
> >> > > +int
> >> > > +xfs_buf_iowait(
> >> > > +	struct xfs_buf		*bp)
> >> > > +{
> >> > > +	return __xfs_buf_iowait(bp, false);
> >> > > +}
> >> > > +
> >> > >  xfs_caddr_t
> >> > >  xfs_buf_offset(
> >> > >  	xfs_buf_t		*bp,
> >> > > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> >> > >  		bp = list_first_entry(&io_list, struct xfs_buf, b_list);
> >> > >
> >> > >  		list_del_init(&bp->b_list);
> >> > > -		error2 = xfs_buf_iowait(bp);
> >> > > +		error2 = __xfs_buf_iowait(bp, true);
> >> > >  		xfs_buf_relse(bp);
> >> > >  		if (!error)
> >> > >  			error = error2;
> >
> >Not waiting here on buffer error should not matter.
Any buffer that > >is under IO and requires completion should be referenced, and that > >means it should be caught and waited on by xfs_wait_buftarg() in the > >mount failure path after log recovery fails. > > > > I think that assumes the I/O is successful. Looking through > xlog_recover_buffer_pass2() as an example, we read the buffer which > should return with b_hold == 1. The delwri queue increments the hold and > we xfs_buf_relse() in the return path (i.e., buffer is now held by the > delwri queue awaiting submission). > > Sometime later we delwri_submit()... xfs_buf_iorequest() does an > xfs_buf_hold() and xfs_buf_rele() within that single function. The > delwri_submit() releases its hold after xfs_buf_iowait(), which I guess > at that point bp should go onto the lru (b_hold back to 1 in > xfs_buf_rele(). Indeed, the caller has lost scope of the buffer at this > point. > > So unless I miss something or got the lifecycle wrong here, which is > easily possible ;), this all hinges on xfs_buf_iowait(). That's where > the last hold forcing the buffer to stay around goes away. > xfs_buftarg_wait_rele() will dispose the buffer if b_hold == 1. If > xfs_buf_iowait() is racy in the event of I/O errors via the bio > callback, I think this path is susceptible just the same. > > >> > > --- > >> > I think that this patch fixes the problem. I tried reproducing it like > >> > 30 times, and it doesn't happen with this patch. Dropping this patch > >> > reproduces the problem within 1 or 2 tries. Thanks! > >> > What are next steps? How to make it "safe for general use"? > >> > > >> > >> Ok, thanks for testing. I think that implicates the caller bypassing the > >> expected blocking with the right sequence of log recovery I/Os and > >> device failure. TBH, I'd still like to see the specifics, if possible. > >> Could you come up with a generic reproducer for this? 
I think a metadump > >> of the fs with the dirty log plus whatever device failure simulation > >> hack you're using would suffice. > > > >The real issue is we don't know exactly what code is being tested > >(it's 3.8 + random bug fix backports + custom code). Even if we have > >a reproducer there's no guarantee it will reproduce on a current > >kernel. IOWs, we are stumbling around in the dark bashing our heads > >against everything in the room, and that just wastes everyone's > >time. > > > >We need a reproducer that triggers on a current, unmodified > >kernel release. You can use dm-faulty to error out all writes just > >like you are doing with your custom code. See > >xfstests::tests/generic/321 and common/dmflakey for to do this. > >Ideally the reproducer is in a form that xfstests can use.... > > > >If you can't reproduce it on an upstream kernel, then git bisect is > >your friend. It will find the commit that fixed the problem you are > >seeing.... > > > > Ugh, yeah. The fact that this was customized as such apparently went > over my head. I agree completely. This needs to be genericized to a > pristine, preferably current kernel. The experiment patch could be > papering over something completely different. > > >> The ideal fix is not yet clear to me. > > > >We are not even that far along - the root cause of the bug is not at > >all clear to me. :/ > > > > Yeah.. the above was just the theory that motivated the experiment in > the previously posted patch. It of course remains a theory until we can > see the race in action. I was referring to the potential fix for the > raciness of xfs_buf_iowait() with regard to bio errors and the wq iodone > handling, while still asking for a reproducer to confirm the actual > problem. FWIW, I'm not too high on changes in the buf management code, > even a smallish behavior change, without a real trace of some sort that > documents the problem and justifies the change. > > Brian > > >Cheers, > > > >Dave. 
> >-- > >Dave Chinner > >david@fromorbit.com > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-12 12:39 ` Alex Lyakas 2014-08-12 19:31 ` Brian Foster @ 2014-08-12 23:56 ` Dave Chinner 2014-08-13 12:59 ` Brian Foster 2014-08-13 17:07 ` Alex Lyakas 2014-08-13 0:03 ` Dave Chinner 2 siblings, 2 replies; 47+ messages in thread From: Dave Chinner @ 2014-08-12 23:56 UTC (permalink / raw) To: Alex Lyakas; +Cc: Brian Foster, xfs On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote: > Hello Dave, Brian, > I will describe a generic reproduction that you ask for. > > It was performed on pristine XFS code from 3.8.13, taken from here: > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git .... > I mounted XFS with the following options: > rw,sync,noatime,wsync,attr2,inode64,noquota 0 0 > > I started a couple of processes writing files sequentially onto this > mount point, and after few seconds crashed the VM. > When the VM came up, I took the metadump file and placed it in: > https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing > > Then I set up the following Device Mapper target onto /dev/vde: > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0" > I am attaching the code (and Makefile) of dm-linear-custom target. > It is exact copy of dm-linear, except that it has a module > parameter. With the parameter set to 0, this is an identity mapping > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are > failed with ENOSPC. There is a workqueue to fail them in a different > context (not sure if really needed, but that's what our "real" > custom > block device does). Well, there you go. That explains it - an asynchronous dispatch error happening fast enough to race with the synchronous XFS dispatch processing.

dispatch thread				device workqueue
xfs_buf_hold();
atomic_set(b_io_remaining, 1)
atomic_inc(b_io_remaining)
submit_bio(bio)
  queue_work(bio)
xfs_buf_ioend(bp, ....);
  atomic_dec(b_io_remaining)
xfs_buf_rele()
					bio error set to ENOSPC
					bio->end_io()
					  xfs_buf_bio_endio()
					    bp->b_error = ENOSPC
					    _xfs_buf_ioend(bp, 1);
					      atomic_dec(b_io_remaining)
					        xfs_buf_ioend(bp, 1);
					          queue_work(bp)
xfs_buf_iowait()
  if (bp->b_error) return error;
if (error)
  xfs_buf_relse()
    xfs_buf_rele()
      xfs_buf_free()

And now we have a freed buffer that is queued on the io completion queue. Basically, it requires the buffer error to be set asynchronously *between* the dispatch decrementing its I/O count after dispatch, but before we wait on the IO.

Not sure what the right fix is yet - removing the bp->b_error check from xfs_buf_iowait() doesn't solve the problem - it just prevents this code path from being tripped over by the race condition.

But, just to validate this is the problem, you should be able to reproduce this on a 3.16 kernel. Can you try that, Alex?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-12 23:56 ` Dave Chinner @ 2014-08-13 12:59 ` Brian Foster 2014-08-13 20:59 ` Dave Chinner 2014-08-13 17:07 ` Alex Lyakas 1 sibling, 1 reply; 47+ messages in thread From: Brian Foster @ 2014-08-13 12:59 UTC (permalink / raw) To: Dave Chinner; +Cc: Alex Lyakas, xfs On Wed, Aug 13, 2014 at 09:56:15AM +1000, Dave Chinner wrote: > On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote: > > Hello Dave, Brian, > > I will describe a generic reproduction that you ask for. > > > > It was performed on pristine XFS code from 3.8.13, taken from here: > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git > .... > > I mounted XFS with the following options: > > rw,sync,noatime,wsync,attr2,inode64,noquota 0 0 > > > > I started a couple of processes writing files sequentially onto this > > mount point, and after few seconds crashed the VM. > > When the VM came up, I took the metadump file and placed it in: > > https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing > > > > Then I set up the following Device Mapper target onto /dev/vde: > > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0" > > I am attaching the code (and Makefile) of dm-linear-custom target. > > It is exact copy of dm-linear, except that it has a module > > parameter. With the parameter set to 0, this is an identity mapping > > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are > > failed with ENOSPC. There is a workqueue to fail them in a different > > context (not sure if really needed, but that's what our "real" > > custom > > block device does). > > Well, they you go. That explains it - an asynchronous dispatch error > happening fast enough to race with the synchronous XFS dispatch > processing. 
>
> dispatch thread				device workqueue
> xfs_buf_hold();
> atomic_set(b_io_remaining, 1)
> atomic_inc(b_io_remaining)
> submit_bio(bio)
>   queue_work(bio)
> xfs_buf_ioend(bp, ....);
>   atomic_dec(b_io_remaining)
> xfs_buf_rele()
> 					bio error set to ENOSPC
> 					bio->end_io()
> 					  xfs_buf_bio_endio()
> 					    bp->b_error = ENOSPC
> 					    _xfs_buf_ioend(bp, 1);
> 					      atomic_dec(b_io_remaining)
> 					        xfs_buf_ioend(bp, 1);
> 					          queue_work(bp)
> xfs_buf_iowait()
>   if (bp->b_error) return error;
> if (error)
>   xfs_buf_relse()
>     xfs_buf_rele()
>       xfs_buf_free()
>
> And now we have a freed buffer that is queued on the io completion
> queue. Basically, it requires the buffer error to be set
> asynchronously *between* the dispatch decrementing it's I/O count
> after dispatch, but before we wait on the IO.
>

That's basically the theory I wanted to test with the experimental patch. E.g., the error check races with the iodone workqueue item.

> Not sure what the right fix is yet - removing the bp->b_error check
> from xfs_buf_iowait() doesn't solve the problem - it just prevents
> this code path from being tripped over by the race condition.
>

Perhaps I'm missing some context... I don't follow how removing the error check doesn't solve the problem. It clearly closes the race and perhaps there are other means of doing the same thing, but what part of the problem does that leave unresolved? E.g., we provide a synchronization mechanism for an async submission path and an object (xfs_buf) that is involved with potentially multiple such async (I/O) operations. The async callback side manages the counts of outstanding bios etc. to set the state of the buf object correctly and fires a completion when everything is done. The calling side simply waits on the completion before it can analyze state of the object. Referring to anything inside that object that happens to be managed by the buffer I/O mechanism before the buffer is considered complete just seems generally racy.
It looks like submit_bio() manages this by providing the error through the callback (always). It also doesn't look like the submission path is guaranteed to be synchronous either (consider md, which appears to use workqueues and kernel threads), so I'm not sure that '...; xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless you're explicitly looking for a write verifier error or something and do nothing further on the buf contingent on completion (e.g., freeing it or something it depends on). Brian > But, just to validate this is the problem, you should be able to > reproduce this on a 3.16 kernel. Can you try that, Alex? > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-13 12:59 ` Brian Foster @ 2014-08-13 20:59 ` Dave Chinner 2014-08-13 23:21 ` Brian Foster 0 siblings, 1 reply; 47+ messages in thread From: Dave Chinner @ 2014-08-13 20:59 UTC (permalink / raw) To: Brian Foster; +Cc: Alex Lyakas, xfs On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote: > On Wed, Aug 13, 2014 at 09:56:15AM +1000, Dave Chinner wrote: > > On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote: > > > Hello Dave, Brian, > > > I will describe a generic reproduction that you ask for. > > > > > > It was performed on pristine XFS code from 3.8.13, taken from here: > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git > > .... > > > I mounted XFS with the following options: > > > rw,sync,noatime,wsync,attr2,inode64,noquota 0 0 > > > > > > I started a couple of processes writing files sequentially onto this > > > mount point, and after few seconds crashed the VM. > > > When the VM came up, I took the metadump file and placed it in: > > > https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing > > > > > > Then I set up the following Device Mapper target onto /dev/vde: > > > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0" > > > I am attaching the code (and Makefile) of dm-linear-custom target. > > > It is exact copy of dm-linear, except that it has a module > > > parameter. With the parameter set to 0, this is an identity mapping > > > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are > > > failed with ENOSPC. There is a workqueue to fail them in a different > > > context (not sure if really needed, but that's what our "real" > > > custom > > > block device does). > > > > Well, they you go. That explains it - an asynchronous dispatch error > > happening fast enough to race with the synchronous XFS dispatch > > processing. 
> >
> > dispatch thread				device workqueue
> > xfs_buf_hold();
> > atomic_set(b_io_remaining, 1)
> > atomic_inc(b_io_remaining)
> > submit_bio(bio)
> >   queue_work(bio)
> > xfs_buf_ioend(bp, ....);
> >   atomic_dec(b_io_remaining)
> > xfs_buf_rele()
> > 					bio error set to ENOSPC
> > 					bio->end_io()
> > 					  xfs_buf_bio_endio()
> > 					    bp->b_error = ENOSPC
> > 					    _xfs_buf_ioend(bp, 1);
> > 					      atomic_dec(b_io_remaining)
> > 					        xfs_buf_ioend(bp, 1);
> > 					          queue_work(bp)
> > xfs_buf_iowait()
> >   if (bp->b_error) return error;
> > if (error)
> >   xfs_buf_relse()
> >     xfs_buf_rele()
> >       xfs_buf_free()
> >
> > And now we have a freed buffer that is queued on the io completion
> > queue. Basically, it requires the buffer error to be set
> > asynchronously *between* the dispatch decrementing it's I/O count
> > after dispatch, but before we wait on the IO.
> >
> That's basically the theory I wanted to test with the experimental
> patch. E.g., the error check races with the iodone workqueue item.
>
> > Not sure what the right fix is yet - removing the bp->b_error check
> > from xfs_buf_iowait() doesn't solve the problem - it just prevents
> > this code path from being tripped over by the race condition.
> >
> Perhaps I'm missing some context... I don't follow how removing the
> error check doesn't solve the problem. It clearly closes the race and
> perhaps there are other means of doing the same thing, but what part of
> the problem does that leave unresolved?

Anything that does:

	xfs_buf_iorequest(bp);
	if (bp->b_error)
		xfs_buf_relse(bp);

is susceptible to the same race condition, based on bp->b_error being set asynchronously and before the buffer IO completion processing is complete.

> E.g., we provide a
> synchronization mechanism for an async submission path and an object
> (xfs_buf) that is involved with potentially multiple such async (I/O)
> operations. The async callback side manages the counts of outstanding
> bios etc.
to set the state of the buf object correctly and fires a > completion when everything is done. The calling side simply waits on the > completion before it can analyze state of the object. Referring to > anything inside that object that happens to be managed by the buffer I/O > mechanism before the buffer is considered complete just seems generally > racy. The point is that the IO submitter holds the buffer lock and so has "exclusive" access to the buffer, even after it is submitted. It is allowed to check the internal state of the buffer at any time, and it is expected to be sane, including while IO completion processing is running. The real issue is that workqueue based IO completion processing is not protected by a reference count of any kind for synchronous IO. It is done with only the reference count of the lock holder held, and so if the lock holder unlocks and releases the buffer, it can be freed while completion processing is still pending. This issue doesn't exist with B_ASYNC IO submission, because the B_ASYNC IO owns the reference and the buffer lock and drops them from the workqueue when the IO completion processing actually completes... > It looks like submit_bio() manages this by providing the error through > the callback (always). It also doesn't look like submission path is > guaranteed to be synchronous either (consider md, which appears to use > workqueues and kernel threads)), so I'm not sure that '...; > xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless > you're explicitly looking for a write verifier error or something and > do nothing further on the buf contingent on completion (e.g., freeing it > or something it depends on). My point remains that it *should be safe*, and the intent is that the caller should be able to check for submission errors without being exposed to a use-after-free situation. That's the bug we need to fix, not say "you can't check for submission errors on synchronous IO" to avoid the race condition.....
Cheers, Dave -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure
2014-08-13 20:59 ` Dave Chinner
@ 2014-08-13 23:21 ` Brian Foster
2014-08-14 6:14 ` Dave Chinner
0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2014-08-13 23:21 UTC (permalink / raw)
To: Dave Chinner; +Cc: Alex Lyakas, xfs

On Thu, Aug 14, 2014 at 06:59:29AM +1000, Dave Chinner wrote:
> On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote:
> > On Wed, Aug 13, 2014 at 09:56:15AM +1000, Dave Chinner wrote:
> > > On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> > > > Hello Dave, Brian,
> > > > I will describe a generic reproduction that you ask for.
> > > >
> > > > It was performed on pristine XFS code from 3.8.13, taken from here:
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> > > ....
> > > > I mounted XFS with the following options:
> > > > rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
> > > >
> > > > I started a couple of processes writing files sequentially onto this
> > > > mount point, and after a few seconds crashed the VM.
> > > > When the VM came up, I took the metadump file and placed it in:
> > > > https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
> > > >
> > > > Then I set up the following Device Mapper target onto /dev/vde:
> > > > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> > > > I am attaching the code (and Makefile) of the dm-linear-custom target.
> > > > It is an exact copy of dm-linear, except that it has a module
> > > > parameter. With the parameter set to 0, this is an identity mapping
> > > > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> > > > failed with ENOSPC. There is a workqueue to fail them in a different
> > > > context (not sure if really needed, but that's what our "real" custom
> > > > block device does).
> > >
> > > Well, there you go.
> > > That explains it - an asynchronous dispatch error
> > > happening fast enough to race with the synchronous XFS dispatch
> > > processing.
> > >
> > > dispatch thread				device workqueue
> > > xfs_buf_hold();
> > > atomic_set(b_io_remaining, 1)
> > > atomic_inc(b_io_remaining)
> > > submit_bio(bio)
> > > 					queue_work(bio)
> > > xfs_buf_ioend(bp, ....);
> > >   atomic_dec(b_io_remaining)
> > > xfs_buf_rele()
> > > 					bio error set to ENOSPC
> > > 					bio->end_io()
> > > 					  xfs_buf_bio_endio()
> > > 					    bp->b_error = ENOSPC
> > > 					    _xfs_buf_ioend(bp, 1);
> > > 					      atomic_dec(b_io_remaining)
> > > 					      xfs_buf_ioend(bp, 1);
> > > 					        queue_work(bp)
> > > xfs_buf_iowait()
> > >   if (bp->b_error) return error;
> > > if (error)
> > >   xfs_buf_relse()
> > >     xfs_buf_rele()
> > >       xfs_buf_free()
> > >
> > > And now we have a freed buffer that is queued on the io completion
> > > queue. Basically, it requires the buffer error to be set
> > > asynchronously *between* the dispatch decrementing its I/O count
> > > after dispatch, but before we wait on the IO.
> > >
> >
> > That's basically the theory I wanted to test with the experimental
> > patch. E.g., the error check races with the iodone workqueue item.
> >
> > > Not sure what the right fix is yet - removing the bp->b_error check
> > > from xfs_buf_iowait() doesn't solve the problem - it just prevents
> > > this code path from being tripped over by the race condition.
> > >
> > Perhaps I'm missing some context... I don't follow how removing the
> > error check doesn't solve the problem. It clearly closes the race and
> > perhaps there are other means of doing the same thing, but what part of
> > the problem does that leave unresolved?
>
> Anything that does:
>
> 	xfs_buf_iorequest(bp);
> 	if (bp->b_error)
> 		xfs_buf_relse(bp);
>
> is susceptible to the same race condition, based on bp->b_error
> being set asynchronously and before the buffer IO completion
> processing is complete.
>

Understood, but why would anything do that (as opposed to
xfs_buf_iowait())?
I don't see that we do that anywhere today (the check buried within
xfs_buf_iowait() notwithstanding of course).

From what I can see, all it really guarantees is that the submission has
either passed/failed the write verifier, yes?

> > E.g., we provide a
> > synchronization mechanism for an async submission path and an object
> > (xfs_buf) that is involved with potentially multiple such async (I/O)
> > operations. The async callback side manages the counts of outstanding
> > bios etc. to set the state of the buf object correctly and fires a
> > completion when everything is done. The calling side simply waits on the
> > completion before it can analyze the state of the object. Referring to
> > anything inside that object that happens to be managed by the buffer I/O
> > mechanism before the buffer is considered complete just seems generally
> > racy.
>
> The point is that the IO submitter holds the buffer lock and so has
> "exclusive" access to the buffer, even after it is submitted. It is
> allowed to check the internal state of the buffer at any time, and
> it is expected to be sane, including while IO completion processing
> is running.
>

Fair enough, but if the mechanism is async the submitter clearly knows
that's not failsafe (i.e., passing the check above doesn't mean the I/O
will not fail).

> The real issue is that workqueue based IO completion processing is
> not protected by a reference count of any kind for synchronous IO.
> It is done with only the reference count of the lock holder held, and so
> if the lock holder unlocks and frees the buffer, then that buffer
> will be freed.
>

I see, the notion of the work item having a hold/refcount on the buffer
makes sense. But with a submission/wait mechanism that waits on actual
"completion" of the I/O (e.g., complete(bp->b_iowait)), that only offers
protection for the case where a sync I/O submitter doesn't wait. I
suppose that's possible, but also seems like a misuse of sync I/O.
E.g., why wouldn't that caller do async I/O?

> This issue doesn't exist with B_ASYNC IO submission, because the
> B_ASYNC IO owns the reference and the buffer lock and drops them
> from the workqueue when the IO completion processing actually
> completes...
>

Indeed. I wasn't clear on the reference ownership nuance between the
sync/async variants of I/O. Thanks for the context. That said, we also
don't have submitters that check for errors on async I/O either.

> > It looks like submit_bio() manages this by providing the error through
> > the callback (always). It also doesn't look like the submission path is
> > guaranteed to be synchronous either (consider md, which appears to use
> > workqueues and kernel threads), so I'm not sure that '...;
> > xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
> > you're explicitly looking for a write verifier error or something and
> > do nothing further on the buf contingent on completion (e.g., freeing it
> > or something it depends on).
>
> My point remains that it *should be safe*, and the intent is that
> the caller should be able to check for submission errors without
> being exposed to a use after free situation. That's the bug we need
> to fix, not say "you can't check for submission errors on
> synchronous IO" to avoid the race condition.....
>

Well, technically you can check for submission errors on sync I/O, just
use the code you posted above. :) What we can't currently do is find out
when the I/O subsystem is done with the buffer.

Perhaps the point here is around the semantics of xfs_buf_iowait(). With
a mechanism that is fundamentally async, the sync variant obviously
becomes the async mechanism + some kind of synchronization. I'd expect
that synchronization to not necessarily just tell me whether an error
occurred, but also tell me when the I/O subsystem is done with the
object I've passed (e.g., so I'm free to chuck it, scribble over it, put
it back where I got it, whatever).
My impression is that's the purpose of the b_iowait mechanism.
Otherwise, what's the point of the whole
bio_end_io->buf_ioend->b_iodone->buf_ioend round trip dance?

Brian

> Cheers,
>
> Dave
> --
> Dave Chinner
> david@fromorbit.com
* Re: use-after-free on log replay failure 2014-08-13 23:21 ` Brian Foster @ 2014-08-14 6:14 ` Dave Chinner 2014-08-14 19:05 ` Brian Foster 0 siblings, 1 reply; 47+ messages in thread From: Dave Chinner @ 2014-08-14 6:14 UTC (permalink / raw) To: Brian Foster; +Cc: Alex Lyakas, xfs On Wed, Aug 13, 2014 at 07:21:35PM -0400, Brian Foster wrote: > On Thu, Aug 14, 2014 at 06:59:29AM +1000, Dave Chinner wrote: > > On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote: > > > Perhaps I'm missing some context... I don't follow how removing the > > > error check doesn't solve the problem. It clearly closes the race and > > > perhaps there are other means of doing the same thing, but what part of > > > the problem does that leave unresolved? > > > > Anything that does: > > > > xfs_buf_iorequest(bp); > > if (bp->b_error) > > xfs_buf_relse(bp); > > > > is susceptible to the same race condition. based on bp->b_error > > being set asynchronously and before the buffer IO completion > > processing is complete. > > > > Understood, by why would anything do that (as opposed to > xfs_buf_iowait())? I don't see that we do that anywhere today > (the check buried within xfs_buf_iowait() notwithstanding of course). "Why" is not important - the fact is the caller *owns* the buffer and so the above fragment of code is valid behaviour. If there is an error on the buffer after xfs_buf_iorequest() request returns on a synchronous IO, then it's a bug if there is still IO in progress on that buffer. We can't run IO completion synchronously from xfs_buf_bio_end_io in this async dispatch error case - we cannot detect it as any different from IO completion in interrupt context - and so we need to have some kind of reference protecting the buffer from being freed from under the completion. i.e. 
the bug is that a synchronous buffer has no active reference
while it is sitting on the completion workqueue - its references
are owned by other contexts that can drop them without regard to
the completion status of the buffer.

For async IO we transfer a reference and the lock to the IO context,
which gets dropped in xfs_buf_iodone_work when all the IO is
complete. Synchronous IO needs this protection, too.

As a proof of concept, adding this to the start of
xfs_buf_iorequest():

+	/*
+	 * synchronous IO needs its own reference count. async IO
+	 * inherits the submitter's reference count.
+	 */
+	if (!(bp->b_flags & XBF_ASYNC))
+		xfs_buf_hold(bp);

And this to the synchronous IO completion case for
xfs_buf_iodone_work():

 	else {
 		ASSERT(read && bp->b_ops);
 		complete(&bp->b_iowait);
+		xfs_buf_rele(bp);
 	}

Should ensure that all IO carries a reference count and the buffer
cannot be freed until all IO processing has been completed.

This means it does not matter what the buffer owner does after
xfs_buf_iorequest() - even unconditionally calling xfs_buf_relse()
will not result in use-after-free as the b_hold count will not go to
zero until the IO completion processing has been finalised.

Fixing the rest of the mess (i.e. determining how to deal with
submission/completion races) is going to require more effort and
thought. For the moment, though, correctly reference counting
buffers will solve the use-after-free without changing any
other behaviour.

> From what I can see, all it really guarantees is that the submission has
> either passed/failed the write verifier, yes?

No. It can also mean it wasn't rejected by the lower layers as
they process the bio passed by submit_bio(). e.g. ENODEV, because
the underlying device has been hot-unplugged, EIO because the
buffer is beyond the end of the device, etc.

> > > It looks like submit_bio() manages this by providing the error through
> > > the callback (always).
It also doesn't look like submission path is > > > guaranteed to be synchronous either (consider md, which appears to use > > > workqueues and kernel threads)), so I'm not sure that '...; > > > xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless > > > you're explicitly looking for a write verifier error or something and > > > do nothing further on the buf contingent on completion (e.g., freeing it > > > or something it depends on). > > > > My point remains that it *should be safe*, and the intent is that > > the caller should be able to check for submission errors without > > being exposed to a use after free situation. That's the bug we need > > to fix, not say "you can't check for submission errors on > > synchronous IO" to avoid the race condition..... > > > > Well, technically you can check for submission errors on sync I/O, just > use the code you posted above. :) What we can't currently do is find out > when the I/O subsystem is done with the buffer. By definition, a buffer marked with an error after submission processing is complete. It should not need to be waited on, and there-in lies the bug. > Perhaps the point here is around the semantics of xfs_buf_iowait(). With > a mechanism that is fundamentally async, the sync variant obviously > becomes the async mechanism + some kind of synchronization. I'd expect > that synchronization to not necessarily just tell me whether an error > occurred, but also tell me when the I/O subsystem is done with the > object I've passed (e.g., so I'm free to chuck it, scribble over it, put > it back where I got it, whatever). > > My impression is that's the purpose of the b_iowait mechanism. > Otherwise, what's the point of the whole > bio_end_io->buf_ioend->b_iodone->buf_ioend round trip dance? Yes, that's exactly what xfs_buf_iorequest/xfs_buf_iowait() provides and the b_error indication is an integral part of that synchronisation mechanism. 
Unfortunately, that is also the part of
the mechanism that is racy and causing problems.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: use-after-free on log replay failure 2014-08-14 6:14 ` Dave Chinner @ 2014-08-14 19:05 ` Brian Foster 2014-08-14 22:27 ` Dave Chinner 0 siblings, 1 reply; 47+ messages in thread From: Brian Foster @ 2014-08-14 19:05 UTC (permalink / raw) To: Dave Chinner; +Cc: Alex Lyakas, xfs On Thu, Aug 14, 2014 at 04:14:44PM +1000, Dave Chinner wrote: > On Wed, Aug 13, 2014 at 07:21:35PM -0400, Brian Foster wrote: > > On Thu, Aug 14, 2014 at 06:59:29AM +1000, Dave Chinner wrote: > > > On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote: > > > > Perhaps I'm missing some context... I don't follow how removing the > > > > error check doesn't solve the problem. It clearly closes the race and > > > > perhaps there are other means of doing the same thing, but what part of > > > > the problem does that leave unresolved? > > > > > > Anything that does: > > > > > > xfs_buf_iorequest(bp); > > > if (bp->b_error) > > > xfs_buf_relse(bp); > > > > > > is susceptible to the same race condition. based on bp->b_error > > > being set asynchronously and before the buffer IO completion > > > processing is complete. > > > > > > > Understood, by why would anything do that (as opposed to > > xfs_buf_iowait())? I don't see that we do that anywhere today > > (the check buried within xfs_buf_iowait() notwithstanding of course). > > "Why" is not important - the fact is the caller *owns* the buffer > and so the above fragment of code is valid behaviour. If there is > an error on the buffer after xfs_buf_iorequest() request returns on > a synchronous IO, then it's a bug if there is still IO in progress > on that buffer. > A buffer can consist of multiple I/Os, yes? If so, then it seems quite possible for one I/O to fail and set an error on the buffer while another might still be in progress. I don't see how that can be considered a bug in general. 
Even if not, I'd expect to see a comment explaining why any code fragment such as the above is not broken because that's not deterministic at all, even with a single I/O. E.g., the error could very well have been set by the callback where we clearly continue I/O processing so-to-speak (though we could consider where we currently set the error in the callback sequence a bug as well). > We can't run IO completion synchronously from xfs_buf_bio_end_io in > this async dispatch error case - we cannot detect it as any > different from IO completion in interrupt context - and so we need > to have some kind of reference protecting the buffer from being > freed from under the completion. > Indeed. > i.e. the bug is that a synchronous buffer has no active reference > while it is sitting on the completion workqueue - it's references > are owned by other contexts that can drop them without regard to > the completion status of the buffer. > > For async IO we transfer a reference and the lock to the IO context, > which gets dropped in xfs_buf_iodone_work when all the IO is > complete. Synchronous IO needs this protection, too. > > As a proof of concept, adding this to the start of > xfs_buf_iorequest(): > > + /* > + * synchronous IO needs it's own reference count. async IO > + * inherits the submitter's reference count. > + */ > + if (!(bp->b_flags & XBF_ASYNC)) > + xfs_buf_hold(bp); > > And this to the synchronous IO completion case for > xfs_buf_iodone_work(): > > else { > ASSERT(read && bp->b_ops); > complete(&bp->b_iowait); > + xfs_buf_rele(bp); > } > > Should ensure that all IO carries a reference count and the buffer > cannot be freed until all IO processing has been completed. > > This means it does not matter what the buffer owner does after > xfs_buf_iorequest() - even unconditionally calling xfs_buf_relse() > will not result in use-after-free as the b_hold count will not go to > zero until the IO completion processing has been finalised. 
>

Makes sense, assuming we handle the possible error cases and whatnot
therein. Thinking some more, suppose we take this ref, submit one I/O
successfully and a subsequent one fails. Then who is responsible for
releasing the reference?

> Fixing the rest of the mess (i.e. determining how to deal with
> submission/completion races) is going to require more effort and
> thought. For the moment, though, correctly reference counting
> buffers will solve the use-after-free without changing any
> other behaviour.
>
> > From what I can see, all it really guarantees is that the submission has
> > either passed/failed the write verifier, yes?
>
> No. It can also mean it wasn't rejected by the lower layers as
> they process the bio passed by submit_bio(). e.g. ENODEV, because
> the underlying device has been hot-unplugged, EIO because the
> buffer is beyond the end of the device, etc.
>

Those are the errors that happen to be synchronously processed today.
That's an implementation detail. submit_bio() is an asynchronous
interface so I don't see any guarantee that will always be the case.
E.g., that's easily broken should somebody decide to defer early end_io
processing to a workqueue. We do a similar thing ourselves for the
reasons you've stated above.

I don't see anything in or around submit_bio() or generic_make_request()
that suggests the interface is anything but async. From
generic_make_request():

/**
 ...
 * generic_make_request() does not return any status. The
 * success/failure status of the request, along with notification of
 * completion, is delivered asynchronously through the bio->bi_end_io
 * function described (one day) else where.
 *
 ...
 */

> > > > It looks like submit_bio() manages this by providing the error through
> > > > the callback (always).
It also doesn't look like submission path is > > > > guaranteed to be synchronous either (consider md, which appears to use > > > > workqueues and kernel threads)), so I'm not sure that '...; > > > > xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless > > > > you're explicitly looking for a write verifier error or something and > > > > do nothing further on the buf contingent on completion (e.g., freeing it > > > > or something it depends on). > > > > > > My point remains that it *should be safe*, and the intent is that > > > the caller should be able to check for submission errors without > > > being exposed to a use after free situation. That's the bug we need > > > to fix, not say "you can't check for submission errors on > > > synchronous IO" to avoid the race condition..... > > > > > > > Well, technically you can check for submission errors on sync I/O, just > > use the code you posted above. :) What we can't currently do is find out > > when the I/O subsystem is done with the buffer. > > By definition, a buffer marked with an error after submission > processing is complete. It should not need to be waited on, and > there-in lies the bug. > I suppose that implicates the error processing on the callback side. We set the error and continue processing on the buffer. Another option could be to shuffle that around on the callback side, but to me _that_ is an approach that narrowly avoids the race rather than closing it via use of synchronization. > > Perhaps the point here is around the semantics of xfs_buf_iowait(). With > > a mechanism that is fundamentally async, the sync variant obviously > > becomes the async mechanism + some kind of synchronization. I'd expect > > that synchronization to not necessarily just tell me whether an error > > occurred, but also tell me when the I/O subsystem is done with the > > object I've passed (e.g., so I'm free to chuck it, scribble over it, put > > it back where I got it, whatever). 
> > > > My impression is that's the purpose of the b_iowait mechanism. > > Otherwise, what's the point of the whole > > bio_end_io->buf_ioend->b_iodone->buf_ioend round trip dance? > > Yes, that's exactly what xfs_buf_iorequest/xfs_buf_iowait() provides > and the b_error indication is an integral part of that > synchronisation mechanism. Unfortunately, that is also the part of > the mechanism that is racy and causing problems. > I don't see how b_iowait itself is racy. It completes when the I/O completes. The problem is that we've overloaded these mechanisms to where we attempt to use them for multiple things. b_error can be a submission error or a deeper I/O error. Adding the b_error check to xfs_buf_iowait() converts it to the same "has a submission error occurred? or has any error occurred yet? or wait until all buffer I/O is complete" non-deterministic semantics. I agree that the reference count protection for sync I/O sounds useful and closes a gap (need to think about that some more), but to check for an error as part of the synchronization mechanism means that the mechanism simply isn't synchronous. I don't see any paths outside of I/O submission itself that care significantly about one general form of error (submission) vs. another (async I/O error) as opposed to simply whether an error has occurred or not. The fact that an error _can_ occur at any time until all of the outstanding bios are completed overrides the fact that some might occur by the time submission completes, meaning we have to handle the former for any callers that care about errors in general. Waiting on b_iowait before checking b_error is an always-safe way to do that. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-14 19:05 ` Brian Foster @ 2014-08-14 22:27 ` Dave Chinner 0 siblings, 0 replies; 47+ messages in thread From: Dave Chinner @ 2014-08-14 22:27 UTC (permalink / raw) To: Brian Foster; +Cc: Alex Lyakas, xfs On Thu, Aug 14, 2014 at 03:05:22PM -0400, Brian Foster wrote: > On Thu, Aug 14, 2014 at 04:14:44PM +1000, Dave Chinner wrote: > > On Wed, Aug 13, 2014 at 07:21:35PM -0400, Brian Foster wrote: > > > On Thu, Aug 14, 2014 at 06:59:29AM +1000, Dave Chinner wrote: > > > > On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote: > > > > > Perhaps I'm missing some context... I don't follow how removing the > > > > > error check doesn't solve the problem. It clearly closes the race and > > > > > perhaps there are other means of doing the same thing, but what part of > > > > > the problem does that leave unresolved? > > > > > > > > Anything that does: > > > > > > > > xfs_buf_iorequest(bp); > > > > if (bp->b_error) > > > > xfs_buf_relse(bp); > > > > > > > > is susceptible to the same race condition. based on bp->b_error > > > > being set asynchronously and before the buffer IO completion > > > > processing is complete. > > > > > > > > > > Understood, by why would anything do that (as opposed to > > > xfs_buf_iowait())? I don't see that we do that anywhere today > > > (the check buried within xfs_buf_iowait() notwithstanding of course). > > > > "Why" is not important - the fact is the caller *owns* the buffer > > and so the above fragment of code is valid behaviour. If there is > > an error on the buffer after xfs_buf_iorequest() request returns on > > a synchronous IO, then it's a bug if there is still IO in progress > > on that buffer. > > > > A buffer can consist of multiple I/Os, yes? If so, then it seems quite > possible for one I/O to fail and set an error on the buffer while > another might still be in progress. I don't see how that can be > considered a bug in general. You miss my point. 
We have to mark the buffer with an error, but it should not be visible to the submitter until all IO on the buffer is done. i.e. setting bp->b_error from the completion path needs to be deferred until xfs_buf_iodone_work() is run after all submitted IOs on the buffer have completed. > Even if not, I'd expect to see a comment explaining why any code > fragment such as the above is not broken because that's not > deterministic at all, even with a single I/O. E.g., the error could very > well have been set by the callback where we clearly continue I/O > processing so-to-speak (though we could consider where we currently set > the error in the callback sequence a bug as well). Chicken, meet Egg. There's no comment saying the above code is OK or not because we assume that if we hold the buffer lock it is safe to look at any state on the buffer at any time. What this thread points out is that synchronous IO changes the state of the buffer in a context that does not hold the buffer lock, and so violates that assumption. > > This means it does not matter what the buffer owner does after > > xfs_buf_iorequest() - even unconditionally calling xfs_buf_relse() > > will not result in use-after-free as the b_hold count will not go to > > zero until the IO completion processing has been finalised. > > > > Makes sense, assuming we handle the possible error cases and whatnot > therein. Thinking some more, suppose we take this ref, submit one I/O > successfully and a subsequent fails. Then who is responsible for > releasing the reference? xfs_buf_iodone_work() doesn't run until b_io_remaining goes to zero. That's the context that releases the ref. It doesn't matter how many are submitted, complete successfully or fail. > > Fixing the rest of the mess (i.e. determining how to deal with > > submission/completion races) is going to require more effort and > > thought. 
For the moment, though, correctly reference counting > > buffers will solve the use-after-free without changing any > > other behaviour. > > > > > From what I can see, all it really guarantees is that the submission has > > > either passed/failed the write verifier, yes? > > > > No. It can also mean it wasn't rejected by the lower layersi as > > they process the bio passed by submit_bio(). e.g. ENODEV, because > > the underlying device has been hot-unplugged, EIO because the > > buffer is beyond the end of the device, etc. > > > > Those are the errors that happen to be synchronously processed today. > That's an implementation detail. submit_bio() is an asynchronous > interface so I don't see any guarantee that will always be the case. > E.g., that's easily broken should somebody decide to defer early end_io > processing to a workqueue. We do a similar thing ourselves for the > reasons you've stated above. > > I don't see anything in or around submit_bio() or generic_make_request() > that suggest the interface is anything but async. From > generic_make_request(): > > /** > ... > * generic_make_request() does not return any status. The > * success/failure status of the request, along with notification of > * completion, is delivered asynchronously through the bio->bi_end_io > * function described (one day) else where. > * > ... > */ Delivery of the error is through bio->bi_end_io, but that is not necessarily from an asynchronous context. e.g. the first thing that generic_make_request() does is run generic_make_request_checks(), where a failure runs bio_endio() and therefore bio->bi_end_io() in the submitter's context. i.e. *synchronously*. This is exactly what the b_io_remaining reference owned by xfs_buf_iorequest() is being taken for - to prevent the *XFS endio processing* from being run asynchronously. > > > > > It looks like submit_bio() manages this by providing the error through > > > > > the callback (always). 
It also doesn't look like submission path is > > > > > guaranteed to be synchronous either (consider md, which appears to use > > > > > workqueues and kernel threads)), so I'm not sure that '...; > > > > > xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless > > > > > you're explicitly looking for a write verifier error or something and > > > > > do nothing further on the buf contingent on completion (e.g., freeing it > > > > > or something it depends on). > > > > > > > > My point remains that it *should be safe*, and the intent is that > > > > the caller should be able to check for submission errors without > > > > being exposed to a use after free situation. That's the bug we need > > > > to fix, not say "you can't check for submission errors on > > > > synchronous IO" to avoid the race condition..... > > > > > > > > > > Well, technically you can check for submission errors on sync I/O, just > > > use the code you posted above. :) What we can't currently do is find out > > > when the I/O subsystem is done with the buffer. > > > > By definition, a buffer marked with an error after submission > > processing is complete. It should not need to be waited on, and > > there-in lies the bug. > > I suppose that implicates the error processing on the callback side. We > set the error and continue processing on the buffer. Another option > could be to shuffle that around on the callback side, but to me _that_ > is an approach that narrowly avoids the race rather than closing it via > use of synchronization. We need to store the error from the callback somewhere, until b_io_remaining falls to zero. Right now we are putting it in b_error, which makes it immediately visible to the submitter without any synchronisation. Basically, we can't put state into anything that the submitter is expected to check. 
We have internal state fields that are not to be used externally:

	spinlock_t		b_lock;		/* internal state lock */
	unsigned int		b_state;	/* internal state flags */

So what we probably need to do is add an internal:

	int			b_io_error;	/* internal error state */

serialised by b_lock, where we stuff errors from xfs_buf_bio_end_io
and then propagate them to b_error in xfs_buf_iodone_work() when all
the IO is complete....

And with the extra reference count on the buffer, it doesn't matter
if the submitter detects a submission error and releases its
reference to the buffer while there is still other IO in progress.
Hence we solve all the issues without changing the current
submit/wait semantics, or needing to remove the submission error
check from xfs_buf_iowait.

> I don't see any paths outside of I/O submission itself that care
> significantly about one general form of error (submission) vs. another
> (async I/O error) as opposed to simply whether an error has occurred or
> not. The fact that an error _can_ occur at any time until all of the
> outstanding bios are completed overrides the fact that some might occur
> by the time submission completes, meaning we have to handle the former
> for any callers that care about errors in general. Waiting on b_iowait
> before checking b_error is an always-safe way to do that.

Assuming there are no other bugs, it's "always safe". And that's the
key thing - the fact that we actually have a check for an error
before waiting indicates that it hasn't always been safe to wait on
a buffer marked with a submission error.

This code is full of cruft and mess. It's full of history, band-aids
upon band-aids, partially completed cleanups, left over complexity
from times gone past and new complexity layered over the top of
structures not originally designed to support those uses.
Let's fix the use-after-free bug right away, and then clean up the cruft and fix the underlying problems so we can guarantee that "always-safe" behaviour and not have it blow up in our faces in future.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: use-after-free on log replay failure 2014-08-12 23:56 ` Dave Chinner 2014-08-13 12:59 ` Brian Foster @ 2014-08-13 17:07 ` Alex Lyakas 1 sibling, 0 replies; 47+ messages in thread From: Alex Lyakas @ 2014-08-13 17:07 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, xfs

Hi Dave,
I compiled kernel 3.16 today, top commit being:

commit 19583ca584d6f574384e17fe7613dfaeadcdc4a6
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sun Aug 3 15:25:02 2014 -0700

    Linux 3.16

(also had to use the updated copy of dm-linear from 3.16). I would say
that the problem reproduces, but it looks a little different. A typical
flow is:

Aug 13 19:45:48 vc-00-00-1383-dev kernel: [ 143.521383] XFS (dm-0): Mounting V4 Filesystem
Aug 13 19:45:48 vc-00-00-1383-dev kernel: [ 143.558558] XFS (dm-0): Sleep 10s before log recovery
Aug 13 19:45:58 vc-00-00-1383-dev kernel: [ 153.560139] XFS (dm-0): Starting recovery (logdev: internal)
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.085826] XFS (dm-0): metadata I/O error: block 0x1 ("xlog_recover_iodone") error 28 numblks 1
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.087336] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 383 of file /mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.087341] XFS (dm-0): I/O Error Detected. Shutting down filesystem
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088455] XFS (dm-0): Please umount the filesystem and rectify the problem(s)
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088601] XFS (dm-0): metadata I/O error: block 0x8 ("xlog_recover_iodone") error 28 numblks 8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088605] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 383 of file /mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088658] XFS (dm-0): metadata I/O error: block 0x10 ("xlog_recover_iodone") error 28 numblks 8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088663] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 383 of file /mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088702] XFS (dm-0): metadata I/O error: block 0x40 ("xlog_recover_iodone") error 28 numblks 16
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088714] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 383 of file /mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088719] XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_iodone") error 28 numblks 16
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088731] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 383 of file /mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.096288] XFS (dm-0): log mount/recovery failed: error 28
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.096482] XFS (dm-0): log mount failed

So there is no clearly visible problem like with my development kernel
(which also had KMEMLEAK enabled, which changes timings significantly).
However, after several seconds, the kernel panics sporadically with
stacks like [1], [2], and sometimes KVM dies on me with messages like
[3], [4]. So definitely this test corrupts some internal kernel
structures, and with KMEMLEAK enabled the corruption is clearer.

I put the metadump from this experiment here:
https://drive.google.com/file/d/0ByBy89zr3kJNSGxoc3U4X1lOZ2s/edit?usp=sharing

Thanks,
Alex.
[1] Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.223389] BUG: unable to handle kernel paging request at 00007f9e50678000 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] IP: [<ffffffff811923a8>] kmem_cache_alloc+0x68/0x160 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] PGD 362b8067 PUD 7a9d9067 PMD 7849f067 PTE 0 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Oops: 0000 [#1] SMP Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Modules linked in: xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 twofish_common camellia_generic serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin nfsd nfs_acl ppdev dm_multipath psmouse parport_pc serio_raw i2c_piix4 mac_hid lp rpcsec_gss_krb5 parport auth_rpcgss oid_registry nfsv4 nfs fscache lockd sunrpc floppy Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] CPU: 1 PID: 3302 Comm: cron Tainted: G O 3.16.0-999-generic #201408131143 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] task: ffff88007a9d48c0 ti: ffff88007aeb0000 task.ti: ffff88007aeb0000 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RIP: 0010:[<ffffffff811923a8>] [<ffffffff811923a8>] kmem_cache_alloc+0x68/0x160 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RSP: 0018:ffff88007aeb3cc0 EFLAGS: 00010286 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RAX: 0000000000000000 RBX: ffff88007a137048 RCX: 000000000001ff04 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RDX: 000000000001ff03 RSI: 00000000000000d0 RDI: 0000000000015d20 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RBP: ffff88007aeb3d10 R08: ffff88007fc95d20 R09: 0000000000000003 Aug 13 19:46:01 vc-00-00-1383-dev kernel: 
[ 156.224019] R10: 0000000000000003 R11: ffff8800362b83d0 R12: ffff88007d001b00 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] R13: 00007f9e50678000 R14: ffffffff811767fd R15: 00000000000000d0 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] FS: 00007f9e5066b7c0(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] CR2: 00007f9e50678000 CR3: 0000000036743000 CR4: 00000000000006e0 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Stack: Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] 00007f9e4e9d4fff ffffffff811767d6 0000000000000040 0000000000000048 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] ffff88007aeb3d40 ffff88007a137048 ffff88007af7cb80 ffff880079814da8 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] ffff88007af7cc38 ffff88007af7cc48 ffff88007aeb3d40 ffffffff811767fd Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Call Trace: Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff811767d6>] ? anon_vma_fork+0x56/0x130 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff811767fd>] anon_vma_fork+0x7d/0x130 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff816d0426>] dup_mmap+0x1c5/0x380 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff816d069f>] dup_mm+0xbe/0x155 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff816d07fc>] copy_mm+0xc6/0xe9 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff81054174>] copy_process.part.35+0x694/0xe90 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff810549f0>] copy_process+0x80/0x90 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff81054b12>] do_fork+0x62/0x280 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff812ca6ee>] ? 
cap_task_setnice+0xe/0x10 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff812cbfd6>] ? security_task_setnice+0x16/0x20 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff8106a648>] ? set_one_prio+0x88/0xd0 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff81054db6>] SyS_clone+0x16/0x20 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff816e9b39>] stub_clone+0x69/0x90 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] [<ffffffff816e9812>] ? system_call_fastpath+0x16/0x1b Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Code: 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 e6 00 00 00 48 85 c0 0f 84 dd 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b5 49 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RIP [<ffffffff811923a8>] kmem_cache_alloc+0x68/0x160 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RSP <ffff88007aeb3cc0> Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] CR2: 00007f9e50678000 Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.276967] ---[ end trace 56118c807adeb6de ]--- Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.136602] BUG: unable to handle kernel paging request at 00007f9e50678000 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.137877] IP: [<ffffffff811917ac>] kmem_cache_alloc_trace+0x6c/0x160 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.139042] PGD 79732067 PUD 0 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.139645] Oops: 0000 [#2] SMP Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] Modules linked in: xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 twofish_common camellia_generic serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin nfsd nfs_acl ppdev dm_multipath psmouse 
parport_pc serio_raw i2c_piix4 mac_hid lp rpcsec_gss_krb5 parport auth_rpcgss oid_registry nfsv4 nfs fscache lockd sunrpc floppy Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] CPU: 1 PID: 1688 Comm: whoopsie Tainted: G D O 3.16.0-999-generic #201408131143 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] task: ffff88007c261840 ti: ffff8800799a8000 task.ti: ffff8800799a8000 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RIP: 0010:[<ffffffff811917ac>] [<ffffffff811917ac>] kmem_cache_alloc_trace+0x6c/0x160 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RSP: 0018:ffff8800799abdd8 EFLAGS: 00010286 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RAX: 0000000000000000 RBX: ffff88007bd0e080 RCX: 000000000001ff04 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RDX: 000000000001ff03 RSI: 00000000000000d0 RDI: 0000000000015d20 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RBP: ffff8800799abe28 R08: ffff88007fc95d20 R09: 0000000000000000 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] R10: 0000000000000000 R11: 0000000000000246 R12: ffff88007d001b00 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] R13: 00007f9e50678000 R14: ffffffff815bbbdb R15: 00000000000000d0 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] FS: 00007fa0248e37c0(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] CR2: 00007f9e50678000 CR3: 00000000367c0000 CR4: 00000000000006e0 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] Stack: Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] ffff88007c7ee780 ffffffff815bbbbd 0000000000000040 0000000000000280 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] ffff8800799abe08 ffff88007bd0e080 
0000000000000000 0000000000000000 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] 0000000000000000 0000000000000000 ffff8800799abe58 ffffffff815bbbdb Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] Call Trace: Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] [<ffffffff815bbbbd>] ? sock_alloc_inode+0x2d/0xd0 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] [<ffffffff815bbbdb>] sock_alloc_inode+0x4b/0xd0 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] [<ffffffff811bb326>] alloc_inode+0x26/0xa0 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] [<ffffffff811bcce3>] new_inode_pseudo+0x13/0x60 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] [<ffffffff815bbe8e>] sock_alloc+0x1e/0x80 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] [<ffffffff815be2d5>] __sock_create+0x95/0x200 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] [<ffffffff815be4a0>] sock_create+0x30/0x40 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] [<ffffffff815bf5a6>] SyS_socket+0x36/0xb0 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] [<ffffffff816e9812>] system_call_fastpath+0x16/0x1b Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] Code: 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 e2 00 00 00 48 85 c0 0f 84 d9 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b5 49 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RIP [<ffffffff811917ac>] kmem_cache_alloc_trace+0x6c/0x160 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RSP <ffff8800799abdd8> Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] CR2: 00007f9e50678000 Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.185653] ---[ end trace 56118c807adeb6df ]--- [2] Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.566506] BUG: unable to handle kernel paging request at 00007f41d65a100c Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.567769] IP: [<ffffffff81406a47>] vring_add_indirect+0x87/0x210 
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.568859] PGD 7b4a7067 PUD 7b4a8067 PMD 7b4aa067 PTE 8000000076f5b065 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570055] Oops: 0003 [#1] SMP Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Modules linked in: xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 twofish_common camellia_generic serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo nfsd nfs_acl dm_round_robin ppdev parport_pc mac_hid dm_multipath i2c_piix4 psmouse serio_raw rpcsec_gss_krb5 auth_rpcgss oid_registry lp nfsv4 parport nfs fscache lockd sunrpc floppy Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CPU: 3 PID: 3329 Comm: mount Tainted: G O 3.16.0-999-generic #201408131143 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] task: ffff88007b2ac8c0 ti: ffff88007a900000 task.ti: ffff88007a900000 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RIP: 0010:[<ffffffff81406a47>] [<ffffffff81406a47>] vring_add_indirect+0x87/0x210 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RSP: 0018:ffff88007a903688 EFLAGS: 00010006 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RAX: ffff88007a9037b8 RBX: ffff88007a903788 RCX: 00007f41d65a1000 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RDX: 000000000001da11 RSI: 0000160000000000 RDI: ffff88007a9037b8 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RBP: ffff88007a9036d8 R08: ffff88007fd95d20 R09: ffffffff814069f6 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] R10: 0000000000000003 R11: ffff88007a903860 R12: ffffffff81406cb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] R13: ffff88007c9ae000 R14: 0000000000000000 R15: 00007f41d65a1000 Aug 13 
19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] FS: 00007f41d67b5800(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CR2: 00007f41d65a100c CR3: 000000007a943000 CR4: 00000000000006e0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Stack: Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ffff88007fff8700 0000000300000202 000000007a9037b8 0000000100000002 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ffff88007a903768 ffff88007c9ae000 ffff88007c9ae000 ffff88007a903788 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] 0000000000000001 0000000000000003 ffff88007a903768 ffffffff814072a2 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Call Trace: Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff814072a2>] virtqueue_add_sgs+0x2f2/0x340 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8147854a>] __virtblk_add_req+0xda/0x1b0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132f307>] ? __bt_get+0xc7/0x1e0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132f47a>] ? bt_get+0x5a/0x180 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132953c>] ? 
blk_rq_map_sg+0x3c/0x170 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8147871a>] virtio_queue_rq+0xfa/0x250 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132d39f>] __blk_mq_run_hw_queue+0xff/0x260 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132d97d>] blk_mq_run_hw_queue+0x7d/0xb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132dd71>] blk_sq_make_request+0x171/0x2f0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81320605>] generic_make_request.part.75+0x75/0xb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff813215a8>] generic_make_request+0x68/0x70 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81321625>] submit_bio+0x75/0x140 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d2773>] _submit_bh+0x113/0x160 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d27d0>] submit_bh+0x10/0x20 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d6135>] block_read_full_page+0x1f5/0x340 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d6d20>] ? I_BDEV+0x10/0x10 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8115c445>] ? __inc_zone_page_state+0x35/0x40 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8113eca4>] ? 
__add_to_page_cache_locked+0xa4/0x130 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d7798>] blkdev_readpage+0x18/0x20 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8114aee8>] read_pages+0xe8/0x100 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8114b1e3>] __do_page_cache_readahead+0x163/0x170 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8114b555>] force_page_cache_readahead+0x75/0xb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8114b5d3>] page_cache_sync_readahead+0x43/0x50 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8113f91e>] do_generic_file_read+0x30e/0x490 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81140c44>] generic_file_read_iter+0xf4/0x150 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8116b4b8>] ? handle_mm_fault+0x48/0x80 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8116edc0>] ? find_vma+0x20/0x80 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d6ec7>] blkdev_read_iter+0x37/0x40 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811a10d8>] new_sync_read+0x78/0xb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811a218b>] vfs_read+0xab/0x180 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811a240f>] SyS_read+0x4f/0xb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff816e9812>] system_call_fastpath+0x16/0x1b Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Code: 00 00 00 00 48 63 45 c4 48 8b 04 c3 48 85 c0 74 59 0f 1f 00 49 63 ce 48 be 00 00 00 00 00 16 00 00 48 89 c7 48 c1 e1 04 4c 01 f9 <66> c7 41 0c 01 00 48 8b 10 48 83 e2 fc 48 01 f2 8b 70 08 48 c1 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RIP [<ffffffff81406a47>] vring_add_indirect+0x87/0x210 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RSP <ffff88007a903688> Aug 13 19:49:06 vc-00-00-1383-dev kernel: 
[ 118.570399] CR2: 00007f41d65a100c Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ---[ end trace cc83f9989ae9e2af ]--- Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] BUG: unable to handle kernel paging request at 0000000079a0b008 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] IP: [<ffffffff810d18b0>] acct_collect+0x60/0x1b0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] PGD 7b4ac067 PUD 7b58c067 PMD 7b4a3067 PTE 0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Oops: 0000 [#2] SMP Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Modules linked in: xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64 twofish_common camellia_generic serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo nfsd nfs_acl dm_round_robin ppdev parport_pc mac_hid dm_multipath i2c_piix4 psmouse serio_raw rpcsec_gss_krb5 auth_rpcgss oid_registry lp nfsv4 parport nfs fscache lockd sunrpc floppy Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CPU: 3 PID: 3329 Comm: mount Tainted: G D O 3.16.0-999-generic #201408131143 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] task: ffff88007b2ac8c0 ti: ffff88007a900000 task.ti: ffff88007a900000 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RIP: 0010:[<ffffffff810d18b0>] [<ffffffff810d18b0>] acct_collect+0x60/0x1b0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RSP: 0018:ffff88007a903328 EFLAGS: 00010006 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RAX: 0000000079a0b000 RBX: 0000000000000009 RCX: 0000000000000245 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RDX: 0000000000000275 RSI: 0000000000000001 RDI: ffff8800362fb860 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 
118.570399] RBP: ffff88007a903348 R08: 00007ffffffff000 R09: 0000000000000d01 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007afa4800 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] R13: ffff88007b2ac8c0 R14: 0001000101133010 R15: 0000000000000046 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] FS: 00007f41d67b5800(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CR2: 0000000079a0b008 CR3: 000000007a943000 CR4: 00000000000006e0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Stack: Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ffff88007b2ac8c0 0000000000000009 ffff88007a9035d8 0000000000000001 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ffff88007a903398 ffffffff810582f7 0000000000000007 0000000000000006 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] 000000000000000a 0000000000000046 0000000000000009 ffff88007a9035d8 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Call Trace: Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff810582f7>] do_exit+0x367/0x470 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81006465>] oops_end+0xa5/0xf0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff816cf708>] no_context+0x1be/0x1cd Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff816cf8e9>] __bad_area_nosemaphore+0x1d2/0x1f1 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff816cf9ae>] bad_area_access_error+0x45/0x4e Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81044dfe>] __do_page_fault+0x2fe/0x550 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81210435>] ? kernfs_path+0x55/0x70 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8134a4e3>] ? 
cfq_find_alloc_queue+0x293/0x430 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81141635>] ? mempool_alloc_slab+0x15/0x20 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81141635>] ? mempool_alloc_slab+0x15/0x20 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81406cb0>] ? virtqueue_get_buf+0xe0/0xe0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8104516c>] do_page_fault+0xc/0x10 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff816eb222>] page_fault+0x22/0x30 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81406cb0>] ? virtqueue_get_buf+0xe0/0xe0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff814069f6>] ? vring_add_indirect+0x36/0x210 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81406a47>] ? vring_add_indirect+0x87/0x210 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff814069f6>] ? vring_add_indirect+0x36/0x210 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff814072a2>] virtqueue_add_sgs+0x2f2/0x340 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8147854a>] __virtblk_add_req+0xda/0x1b0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132f307>] ? __bt_get+0xc7/0x1e0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132f47a>] ? bt_get+0x5a/0x180 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132953c>] ? 
blk_rq_map_sg+0x3c/0x170 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8147871a>] virtio_queue_rq+0xfa/0x250 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132d39f>] __blk_mq_run_hw_queue+0xff/0x260 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132d97d>] blk_mq_run_hw_queue+0x7d/0xb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8132dd71>] blk_sq_make_request+0x171/0x2f0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81320605>] generic_make_request.part.75+0x75/0xb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff813215a8>] generic_make_request+0x68/0x70 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81321625>] submit_bio+0x75/0x140 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d2773>] _submit_bh+0x113/0x160 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d27d0>] submit_bh+0x10/0x20 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d6135>] block_read_full_page+0x1f5/0x340 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d6d20>] ? I_BDEV+0x10/0x10 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8115c445>] ? __inc_zone_page_state+0x35/0x40 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8113eca4>] ? 
__add_to_page_cache_locked+0xa4/0x130 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d7798>] blkdev_readpage+0x18/0x20 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8114aee8>] read_pages+0xe8/0x100 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8114b1e3>] __do_page_cache_readahead+0x163/0x170 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8114b555>] force_page_cache_readahead+0x75/0xb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8114b5d3>] page_cache_sync_readahead+0x43/0x50 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8113f91e>] do_generic_file_read+0x30e/0x490 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff81140c44>] generic_file_read_iter+0xf4/0x150 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8116b4b8>] ? handle_mm_fault+0x48/0x80 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff8116edc0>] ? find_vma+0x20/0x80 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811d6ec7>] blkdev_read_iter+0x37/0x40 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811a10d8>] new_sync_read+0x78/0xb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811a218b>] vfs_read+0xab/0x180 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff811a240f>] SyS_read+0x4f/0xb0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] [<ffffffff816e9812>] system_call_fastpath+0x16/0x1b Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Code: 00 74 55 49 8b bd a0 03 00 00 48 83 c7 60 e8 a8 65 61 00 49 8b 85 a0 03 00 00 48 8b 00 48 85 c0 74 1d 66 0f 1f 84 00 00 00 00 00 <4c> 03 70 08 4c 2b 30 48 8b 40 10 48 85 c0 75 f0 49 c1 ee 0a 65 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RIP [<ffffffff810d18b0>] acct_collect+0x60/0x1b0 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RSP <ffff88007a903328> Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 
118.570399] CR2: 0000000079a0b008 Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ---[ end trace cc83f9989ae9e2b0 ]--- Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Fixing recursive fault but reboot is needed! [3] (qemu) KVM internal error. Suberror: 1 emulation failure RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000623 RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000 R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000 R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000 RIP=000000000000fff0 RFL=00010002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0 ES =0000 0000000000000000 ffffffff 00000000 CS =0033 0000000000000000 ffffffff 00a0fb00 DPL=3 CS64 [-RA] SS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA] DS =0000 0000000000000000 ffffffff 00000000 FS =0000 00007f0514f7c700 ffffffff 00000000 GS =0000 0000000000000000 ffffffff 00000000 LDT=0000 0000000000000000 ffffffff 00000000 TR =0040 ffff88007fc90380 00002087 00008b00 DPL=0 TSS64-busy GDT= ffff88007fc84000 0000007f IDT= ffffffffff57c000 00000fff CR0=80050033 CR2=000000000061cd64 CR3=000000007ae6b000 CR4=000006e0 DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 DR6=00000000ffff0ff0 DR7=0000000000000400 EFER=0000000000000d01 [4] (qemu) qemu-system-x86_64: virtio: trying to map MMIO memory -----Original Message----- From: Dave Chinner Sent: 13 August, 2014 2:56 AM To: Alex Lyakas Cc: Brian Foster ; xfs@oss.sgi.com Subject: Re: use-after-free on log replay failure On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote: > Hello Dave, Brian, > I will describe a generic reproduction that you ask for. > > It was performed on pristine XFS code from 3.8.13, taken from here: > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git .... 
> I mounted XFS with the following options:
> rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
>
> I started a couple of processes writing files sequentially onto this
> mount point, and after a few seconds crashed the VM.
> When the VM came up, I took the metadump file and placed it in:
> https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
>
> Then I set up the following Device Mapper target onto /dev/vde:
> dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> I am attaching the code (and Makefile) of dm-linear-custom target.
> It is an exact copy of dm-linear, except that it has a module
> parameter. With the parameter set to 0, this is an identity mapping
> onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> failed with ENOSPC. There is a workqueue to fail them in a different
> context (not sure if really needed, but that's what our "real" custom
> block device does).

Well, there you go. That explains it - an asynchronous dispatch error
happening fast enough to race with the synchronous XFS dispatch
processing:

dispatch thread                         device workqueue
xfs_buf_hold();
atomic_set(b_io_remaining, 1)
atomic_inc(b_io_remaining)
submit_bio(bio)
  queue_work(bio)
xfs_buf_ioend(bp, ....);
  atomic_dec(b_io_remaining)
xfs_buf_rele()
                                        bio error set to ENOSPC
                                        bio->end_io()
                                          xfs_buf_bio_endio()
                                            bp->b_error = ENOSPC
                                            _xfs_buf_ioend(bp, 1);
                                              atomic_dec(b_io_remaining)
                                              xfs_buf_ioend(bp, 1);
                                                queue_work(bp)
xfs_buf_iowait()
  if (bp->b_error) return error;
if (error)
  xfs_buf_relse()
    xfs_buf_rele()
      xfs_buf_free()

And now we have a freed buffer that is queued on the io completion
queue. Basically, it requires the buffer error to be set asynchronously
*between* the dispatch decrementing its I/O count after dispatch, but
before we wait on the IO.

Not sure what the right fix is yet - removing the bp->b_error check
from xfs_buf_iowait() doesn't solve the problem - it just prevents this
code path from being tripped over by the race condition.
But, just to validate this is the problem, you should be able to
reproduce this on a 3.16 kernel. Can you try that, Alex?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: use-after-free on log replay failure
  2014-08-12 12:39 ` Alex Lyakas
  2014-08-12 19:31 ` Brian Foster
  2014-08-12 23:56 ` Dave Chinner
@ 2014-08-13  0:03 ` Dave Chinner
  2014-08-13 13:11 ` Brian Foster
2 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2014-08-13 0:03 UTC (permalink / raw)
To: Alex Lyakas; +Cc: Brian Foster, xfs

On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> Then I set up the following Device Mapper target onto /dev/vde:
> dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> I am attaching the code (and Makefile) of dm-linear-custom target.
> It is exact copy of dm-linear, except that it has a module
> parameter. With the parameter set to 0, this is an identity mapping
> onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> failed with ENOSPC. There is a workqueue to fail them in a different
> context (not sure if really needed, but that's what our "real"
> custom
> block device does).

FWIW, now I've looked at the dm module, this could easily be added
to the dm-flakey driver by adding a "queue_write_error" option to it
(i.e. similar to the current drop_writes and corrupt_bio_byte
options).

If we add the code there, then we could add a debug-only XFS sysfs
variable to trigger the log recovery sleep, and then use dm-flakey
to queue and error out writes. That gives us a reproducible xfstest
for this condition. Brian, does that sound like a reasonable plan to
you?

Thanks for describing the method you've been using to reproduce the
bug so clearly, Alex.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
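For reference, the existing dm-flakey feature flags Dave mentions are set through the table string. The sketch below builds such a table: "drop_writes" is a real dm-flakey feature, while "queue_write_error" is only the proposal from this thread and does not exist upstream; the device name, size, and up/down intervals are illustrative assumptions borrowed from Alex's reproduction.

```shell
#!/bin/sh
# Sketch: build a dm-flakey table string (assumptions noted above).
DEV=/dev/vde         # backing device from Alex's reproduction
SECTORS=41943040     # table length in 512-byte sectors
UP=1                 # seconds the device behaves normally
DOWN=1               # seconds the configured feature kicks in
# dm-flakey table format:
#   <start> <len> flakey <dev> <offset> <up> <down> [<num_features> <feature>...]
TABLE="0 $SECTORS flakey $DEV 0 $UP $DOWN 1 drop_writes"
echo "$TABLE"
# Creating the mapping needs root and a real device, e.g.:
#   dmsetup create VDE --table "$TABLE"
# With the proposed (hypothetical) option, the feature list would
# instead end in: 1 queue_write_error
```

The point of the proposal is that the error-queueing behaviour would live in dm-flakey alongside drop_writes, so xfstests could drive it with an ordinary dmsetup table instead of a custom out-of-tree target.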
* Re: use-after-free on log replay failure
  2014-08-13  0:03 ` Dave Chinner
@ 2014-08-13 13:11 ` Brian Foster
0 siblings, 0 replies; 47+ messages in thread
From: Brian Foster @ 2014-08-13 13:11 UTC (permalink / raw)
To: Dave Chinner; +Cc: Alex Lyakas, xfs

On Wed, Aug 13, 2014 at 10:03:12AM +1000, Dave Chinner wrote:
> On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> > Then I set up the following Device Mapper target onto /dev/vde:
> > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> > I am attaching the code (and Makefile) of dm-linear-custom target.
> > It is exact copy of dm-linear, except that it has a module
> > parameter. With the parameter set to 0, this is an identity mapping
> > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> > failed with ENOSPC. There is a workqueue to fail them in a different
> > context (not sure if really needed, but that's what our "real"
> > custom
> > block device does).
>
> FWIW, now I've looked at the dm module, this could easily be added
> to the dm-flakey driver by adding a "queue_write_error" option
> to it (i.e. similar to the current drop_writes and corrupt_bio_byte
> options).
>
> If we add the code there, then we could add a debug-only XFS sysfs
> variable to trigger the log recovery sleep, and then use dm-flakey
> to queue and error out writes. That gives us a reproducible xfstest
> for this condition. Brian, does that sound like a reasonable plan to
> you?
>

It would be nice if we could avoid this kind of timing hack, but I'll
have to look at the related code and see what we have for options. I'm
also assuming the oops/crash is a consistent behavior when we hit the
race, since that probably defines the failure mode of the test. But
yeah, seems like a reasonable plan in general. Added to the todo list
for once we have this sorted out...

Brian

> Thanks for describing the method you've been using to reproduce the
> bug so clearly, Alex.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
end of thread, other threads:[~2014-08-14 22:28 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-18 18:37 Questions about XFS discard and xfs_free_extent() code (newbie) Alex Lyakas
2013-12-18 23:06 ` Dave Chinner
2013-12-19  9:24 ` Alex Lyakas
2013-12-19 10:55 ` Dave Chinner
2013-12-19 19:24 ` Alex Lyakas
2013-12-21 17:03 ` Chris Murphy
2013-12-24 18:21 ` Alex Lyakas
2013-12-26 23:00 ` Dave Chinner
2014-01-08 18:13 ` Alex Lyakas
2014-01-13  3:02 ` Dave Chinner
2014-01-13 17:44 ` Alex Lyakas
2014-01-13 20:43 ` Dave Chinner
2014-01-14 13:48 ` Alex Lyakas
2014-01-15  1:45 ` Dave Chinner
2014-01-19  9:38 ` Alex Lyakas
2014-01-19 23:17 ` Dave Chinner
2014-07-01 15:06 ` xfs_growfs_data_private memory leak Alex Lyakas
2014-07-01 21:56 ` Dave Chinner
2014-07-02 12:27 ` Alex Lyakas
2014-08-04 18:15 ` Eric Sandeen
2014-08-06  8:56 ` Alex Lyakas
2014-08-04 11:00 ` use-after-free on log replay failure Alex Lyakas
2014-08-04 14:12 ` Brian Foster
2014-08-04 23:07 ` Dave Chinner
2014-08-06 10:05 ` Alex Lyakas
2014-08-06 12:32 ` Dave Chinner
2014-08-06 14:43 ` Alex Lyakas
2014-08-10 16:26 ` Alex Lyakas
2014-08-06 12:52 ` Alex Lyakas
2014-08-06 15:20 ` Brian Foster
2014-08-06 15:28 ` Alex Lyakas
2014-08-10 12:20 ` Alex Lyakas
2014-08-11 13:20 ` Brian Foster
2014-08-11 21:52 ` Dave Chinner
2014-08-12 12:03 ` Brian Foster
2014-08-12 12:39 ` Alex Lyakas
2014-08-12 19:31 ` Brian Foster
2014-08-12 23:56 ` Dave Chinner
2014-08-13 12:59 ` Brian Foster
2014-08-13 20:59 ` Dave Chinner
2014-08-13 23:21 ` Brian Foster
2014-08-14  6:14 ` Dave Chinner
2014-08-14 19:05 ` Brian Foster
2014-08-14 22:27 ` Dave Chinner
2014-08-13 17:07 ` Alex Lyakas
2014-08-13  0:03 ` Dave Chinner
2014-08-13 13:11 ` Brian Foster
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.