* Issues with delalloc->real extent allocation
@ 2011-01-14  0:29 Dave Chinner
  2011-01-14 16:40 ` Geoffrey Wehrman
  2011-01-14 21:43 ` bpm
  0 siblings, 2 replies; 30+ messages in thread

From: Dave Chinner @ 2011-01-14  0:29 UTC
To: xfs

Folks,

I've noticed a few suspicious things while trying to reproduce the
allocate-in-the-middle-of-a-delalloc-extent problem.

Firstly, preallocation into delalloc ranges behaves in an unexpected
and dangerous way. Preallocation uses unwritten extents to avoid stale
data exposure, but when issued into a range that is covered by a
delalloc extent, it does not use unwritten extents. The code that
causes this is in xfs_bmapi():

	/*
	 * Determine state of extent, and the filesystem.
	 * A wasdelay extent has been initialized, so
	 * shouldn't be flagged as unwritten.
	 */
	if (wr && xfs_sb_version_hasextflgbit(&mp->m_sb)) {
		if (!wasdelay && (flags & XFS_BMAPI_PREALLOC))
			got.br_state = XFS_EXT_UNWRITTEN;
	}

This seems incorrect to me - a "wasdelay" extent has not yet been
initialised: there's data in memory, but there is nothing on disk and
we may not write it for some time. If we crash after this transaction
is written but before any data is written, we expose stale data.

Not only that, it allocates the _entire_ delalloc extent that spans
the preallocation range, even when the preallocation range is only 1
block and the delalloc extent covers gigabytes. Hence we actually
expose a much greater range of the file to stale data during a crash
than just the preallocated range. Not good.

Secondly, I think we have the same
expose-the-entire-delalloc-extent-to-stale-data-exposure problem in
->writepage. This one, however, is due to using BMAPI_ENTIRE to
allocate the entire delalloc extent the first time any part of it is
written to, even if we are only writing a single page (i.e.
wbc->nr_to_write = 1) and the delalloc extent covers gigabytes. So,
same problem when we crash.
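[Editor's note: the state decision quoted above can be modeled in a few
lines of userspace C. This is a sketch only, not the real xfs_bmapi();
`proposed_state()` is an assumption drawn from the argument that a
wasdelay extent's data exists only in memory and so should still be
allocated unwritten.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the extent-state decision, not kernel code. */
enum ext_state { EXT_NORM, EXT_UNWRITTEN };

/* Current behaviour per the snippet above: a wasdelay extent is never
 * marked unwritten, even for preallocation into a delalloc range. */
static enum ext_state current_state(bool has_extflg, bool wasdelay,
				    bool prealloc)
{
	if (has_extflg && !wasdelay && prealloc)
		return EXT_UNWRITTEN;
	return EXT_NORM;
}

/* Proposed behaviour (assumption, following the argument above): the
 * data backing a wasdelay extent is only in memory, so the allocation
 * should be unwritten to avoid exposing stale blocks after a crash. */
static enum ext_state proposed_state(bool has_extflg, bool wasdelay,
				     bool prealloc)
{
	if (has_extflg && (prealloc || wasdelay))
		return EXT_UNWRITTEN;
	return EXT_NORM;
}
```

The first assertion below is exactly the stale-data hole being
described: preallocation over a delalloc range yields a normal
(initialised-looking) extent with no data on disk.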
Finally, I think the extsize-based problem exposed by test 229 is also
a result of allocating space we have no pages covering in the page
cache (triggered by BMAPI_ENTIRE allocation), so the allocated space
is never zeroed and hence exposes stale data.

A possible solution to all of these problems is to allocate delalloc
space as unwritten extents. It's easy to see how this solves the first
two cases, but the last one is a bit less obvious. That one is solved
by the fact that unwritten extent conversion does not get extent-size
aligned, so any block we don't write data for will remain in the
unwritten state and therefore correctly return zeros for any read...

The issues with this approach are really about the cost of unwritten
extent conversion and the impact on low memory operation. The cost is
mainly a CPU cost due to delayed logging, but it will be significant
if we have to do an unwritten conversion on every IO completion.
Batching is definitely possible and would mostly alleviate the
overhead, but it does add complexity, especially w.r.t. unwritten
extent conversion sometimes requiring allocation to complete.

The impact on low memory behaviour is less easy to define and handle.
Extent conversion can block needing memory, so if we need to complete
the extent conversion to free memory we deadlock. For delalloc writes,
we can probably mark IO complete and pages clean before we do the
conversion in a batched, async manner. That would leave us with the
same structure as we currently have, but adds more complexity because
we now need to track extent ranges in a "conversion pending" state for
subsequent reads and sync operations. i.e. an extent in a pending
state is treated like a real extent for the purposes of reading, but
conversion needs to be completed during a sync operation.

Another possibility for just the ->writepage problem is that we could
make delalloc allocation in ->writepage more like an EFI/EFD type
operation.
That is, we allocate the space and log all the changes in an
allocation intent transaction (basically the same as the current
allocation transaction), and then add an allocation done transaction
that is issued at IO completion; because the done transaction is just
log manipulation, it won't require IO to complete. Effectively this
would be logging all the xfs_ioend structures. Then in log recovery we
simply convert any range that does not have a done transaction to
unwritten, thereby preventing stale data exposure.

The down side to this approach is that it needs new transactions and
log recovery code, so it is not backwards compatible. I think it is
probably a simpler solution with lower overhead compared to allocating
unwritten extents everywhere. It doesn't solve the extsize problem,
but that probably should be handled up at the ->aio_write level, not
at the ->writepage level. Similarly, the preallocation issue could
easily be handled with small changes to xfs_bmapi()....

I'm sure there are other ways to solve these problems, but these two
are the ones that come to mind for me. I'm open to other solutions or
ways to improve on these ones, especially if they are simpler. ;)
Anyone got any ideas or improvements?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
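[Editor's note: the recovery rule in the intent/done scheme above can
be sketched in userspace. All structure and function names here are
hypothetical; the real implementation would log and replay
transactions, not walk arrays.]

```c
#include <assert.h>

/* Hypothetical model of intent/done recovery: each allocation logs an
 * intent record, and IO completion logs a done record with the same
 * id. On recovery, any allocation whose done record never reached the
 * log has its range converted to unwritten, so the stale disk blocks
 * read back as zeros instead of old data. */
struct alloc_intent {
	int	id;
	long	start, len;	/* allocated range, in blocks */
	int	done;		/* set if a matching done record exists */
};

/* Returns the number of ranges recovery would flip to unwritten. */
static int recover_intents(struct alloc_intent *log, int nintents,
			   const int *done_ids, int ndones)
{
	int converted = 0;

	for (int i = 0; i < nintents; i++) {
		for (int j = 0; j < ndones; j++) {
			if (log[i].id == done_ids[j])
				log[i].done = 1;
		}
		if (!log[i].done)
			converted++;	/* would convert [start, start+len) */
	}
	return converted;
}
```

The key property is the one Dave describes: a crash between allocation
and data writeout leaves an intent without a done, and that range
alone gets converted to unwritten.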
* Re: Issues with delalloc->real extent allocation
From: Geoffrey Wehrman @ 2011-01-14 16:40 UTC
To: Dave Chinner; +Cc: xfs

On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
| This seems to be incorrect to me - a "wasdelay" extent has not yet
| been initialised - there's data in memory, but there is nothing on
| disk and we may not write it for some time. If we crash after this
| transaction is written but before any data is written, we expose
| stale data.
|
| Not only that, it allocates the _entire_ delalloc extent that spans
| the preallocation range, even when the preallocation range is only 1
| block and the delalloc extent covers gigabytes. Hence we actually
| expose a much greater range of the file to stale data exposure
| during a crash than just the preallocated range. Not good.
|
| Secondly, I think we have the same expose-the-entire-delalloc-extent
| -to-stale-data-exposure problem in ->writepage. This one, however,
| is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
| the first time any part of it is written to. Even if we are only
| writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
| extent covers gigabytes. So, same problem when we crash.
|
| Finally, I think the extsize based problem exposed by test 229 is
| also a result of allocating space we have no pages covering in the
| page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
| space is never zeroed and hence exposes stale data.

There used to be an XFS_BMAPI_EXACT flag that wasn't ever used. What
would be the effects of re-creating this flag and using it in
writepage to prevent the
expose-the-entire-delalloc-extent-to-stale-data-exposure problem?
This wouldn't solve the exposure of stale data for a crash that occurs
after the extent conversion but before the data is written out. The
quantity of data exposed is potentially much smaller, however.

Also, I'm not saying using XFS_BMAPI_EXACT is feasible. I have a very
minimal understanding of the writepage code path.

-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com
* Re: Issues with delalloc->real extent allocation
From: Dave Chinner @ 2011-01-14 22:59 UTC
To: Geoffrey Wehrman; +Cc: xfs

On Fri, Jan 14, 2011 at 10:40:16AM -0600, Geoffrey Wehrman wrote:
> On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> | This seems to be incorrect to me - a "wasdelay" extent has not yet
> | been initialised - there's data in memory, but there is nothing on
> | disk and we may not write it for some time. If we crash after this
> | transaction is written but before any data is written, we expose
> | stale data.
> |
> | Not only that, it allocates the _entire_ delalloc extent that spans
> | the preallocation range, even when the preallocation range is only 1
> | block and the delalloc extent covers gigabytes. Hence we actually
> | expose a much greater range of the file to stale data exposure
> | during a crash than just the preallocated range. Not good.
> |
> | Secondly, I think we have the same expose-the-entire-delalloc-extent
> | -to-stale-data-exposure problem in ->writepage. This one, however,
> | is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
> | the first time any part of it is written to. Even if we are only
> | writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
> | extent covers gigabytes. So, same problem when we crash.
> |
> | Finally, I think the extsize based problem exposed by test 229 is
> | also a result of allocating space we have no pages covering in the
> | page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
> | space is never zeroed and hence exposes stale data.
>
> There used to be an XFS_BMAPI_EXACT flag that wasn't ever used. What
> would be the effects of re-creating this flag and using it in
> writepage to prevent the
> expose-the-entire-delalloc-extent-to-stale-data-exposure problem?
> This wouldn't solve the exposure of stale data for a crash that
> occurs after the extent conversion but before the data is written
> out. The quantity of data exposed is potentially much smaller,
> however.

Definitely a possibility, Geoffrey, and not one that I thought of. I
agree that it would significantly minimise the amount of stale data
exposed on a crash, but there do seem to be some downsides. Ben has
already pointed out the increased cost of allocations, and I can also
think of a couple of others:

- I'm not sure it's the right solution to the extsize issue, because
  it will prevent extent-sized and aligned allocations from occurring,
  which is exactly what extsize is supposed to provide. I'm also not
  sure it would work with the realtime device, as all allocation there
  has to be at least rtextent_size sized and aligned rather than
  aligned to a page boundary. I need to check how the rt allocation
  code handles sub-rtextent size writes - that may point to the
  solution for the general case here.

- I think that using the XFS_BMAPI_EXACT semantic will possibly make
  the speculative EOF preallocation worthless. Delayed allocation
  conversion will never convert the speculative delalloc preallocation
  into real extents (beyond EOF) and we'll see increased fragmentation
  in many common workloads due to that....

> Also, I'm not saying using XFS_BMAPI_EXACT is feasible. I have a very
> minimal understanding of the writepage code path.

I think there are situations where this does make sense, but given the
potential issues I'm not sure it is a solution that can be extended to
the general case. A good discussion point on a different angle,
though. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
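[Editor's note: the alignment concern above reduces to the rounding
arithmetic an extsize (or rtextsize) allocator performs. This is
generic illustrative arithmetic, not XFS code, and the numbers in the
usage note are made up.]

```c
#include <assert.h>

/* Generic alignment arithmetic (illustrative, not XFS code): an
 * extsize-style allocator rounds the requested block range out to
 * aligned boundaries. An "exact" allocation covering only the dirty
 * pages defeats precisely that rounding. */
static long align_down(long blk, long extsize)
{
	return blk - (blk % extsize);
}

static long align_up(long blk, long extsize)
{
	long rem = blk % extsize;

	return rem ? blk + (extsize - rem) : blk;
}
```

For example, a one-block write at block 5 with a 16-block extsize is
rounded out to blocks [0, 16); an exact allocation would cover only
[5, 6), losing the alignment and size guarantee extsize exists to
provide.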
* Re: Issues with delalloc->real extent allocation
From: Geoffrey Wehrman @ 2011-01-15  4:16 UTC
To: Dave Chinner; +Cc: xfs

On Sat, Jan 15, 2011 at 09:59:07AM +1100, Dave Chinner wrote:
| On Fri, Jan 14, 2011 at 10:40:16AM -0600, Geoffrey Wehrman wrote:
| > Also, I'm not saying using XFS_BMAPI_EXACT is feasible. I have a
| > very minimal understanding of the writepage code path.
|
| I think there are situations where this does make sense, but given
| the potential issues I'm not sure it is a solution that can be
| extended to the general case. A good discussion point on a different
| angle, though. ;)

You've convinced me that XFS_BMAPI_EXACT is not the optimal solution.

Upon further consideration, I do like your proposal to make delalloc
allocation more like an intent/done type operation. The compatibility
issues aren't all that bad. As long as the filesystem is unmounted
clean, there is no need for the next mount to do log recovery, and
therefore no need to have any knowledge of the new transactions.

-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com
* Re: Issues with delalloc->real extent allocation
From: Dave Chinner @ 2011-01-17  5:18 UTC
To: Geoffrey Wehrman; +Cc: xfs

On Fri, Jan 14, 2011 at 10:16:29PM -0600, Geoffrey Wehrman wrote:
> On Sat, Jan 15, 2011 at 09:59:07AM +1100, Dave Chinner wrote:
> | On Fri, Jan 14, 2011 at 10:40:16AM -0600, Geoffrey Wehrman wrote:
> | > Also, I'm not saying using XFS_BMAPI_EXACT is feasible. I have a
> | > very minimal understanding of the writepage code path.
> |
> | I think there are situations where this does make sense, but given
> | the potential issues I'm not sure it is a solution that can be
> | extended to the general case. A good discussion point on a
> | different angle, though. ;)
>
> You've convinced me that XFS_BMAPI_EXACT is not the optimal solution.
>
> Upon further consideration, I do like your proposal to make delalloc
> allocation more like an intent/done type operation. The compatibility
> issues aren't all that bad. As long as the filesystem is unmounted
> clean, there is no need for the next mount to do log recovery and
> therefore no need to have any knowledge of the new transactions.

That is a good observation. If there is agreement that this is a
strong enough backwards compatibility guarantee (it's good enough for
me), then I think that I will start to prototype this approach.

However, this does not solve the extsize allocation issues where we
don't have dirty pages in the page cache covering parts of the delayed
allocation extent, so we still need a solution for that. I'm tending
towards zeroing in .aio_write as the simplest solution because it
doesn't cause buffer head/extent tree mapping mismatches, and it would
use the above intent/done operations for crash resilience so there's
no additional, rarely used code path to test through .writepage. Does
that sound reasonable?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: Issues with delalloc->real extent allocation
From: Geoffrey Wehrman @ 2011-01-17 14:37 UTC
To: Dave Chinner; +Cc: xfs

On Mon, Jan 17, 2011 at 04:18:28PM +1100, Dave Chinner wrote:
| On Fri, Jan 14, 2011 at 10:16:29PM -0600, Geoffrey Wehrman wrote:
| > On Sat, Jan 15, 2011 at 09:59:07AM +1100, Dave Chinner wrote:
| > | On Fri, Jan 14, 2011 at 10:40:16AM -0600, Geoffrey Wehrman wrote:
| > | > Also, I'm not saying using XFS_BMAPI_EXACT is feasible. I have
| > | > a very minimal understanding of the writepage code path.
| > |
| > | I think there are situations where this does make sense, but
| > | given the potential issues I'm not sure it is a solution that can
| > | be extended to the general case. A good discussion point on a
| > | different angle, though. ;)
| >
| > You've convinced me that XFS_BMAPI_EXACT is not the optimal
| > solution.
| >
| > Upon further consideration, I do like your proposal to make
| > delalloc allocation more like an intent/done type operation. The
| > compatibility issues aren't all that bad. As long as the filesystem
| > is unmounted clean, there is no need for the next mount to do log
| > recovery and therefore no need to have any knowledge of the new
| > transactions.
|
| That is a good observation. If there is agreement that this is a
| strong enough backwards compatibility guarantee (it's good enough
| for me), then I think that I will start to prototype this approach.

I'm not sure how a version of XFS without the new log recovery code
will behave if it encounters a log with the new transactions. I assume
it will gracefully abort log recovery and fail the mount with a report
of a corrupt log. I have no objection to this compatibility guarantee.
| However, this does not solve the extsize allocation issues where we
| don't have dirty pages in the page cache covering parts of the
| delayed allocation extent so we still need a solution for that. I'm
| tending towards zeroing in .aio_write as the simplest solution
| because it doesn't cause buffer head/extent tree mapping mismatches,
| and it would use the above intent/done operations for crash
| resilience so there's no additional, rarely used code path to test
| through .writepage. Does that sound reasonable?

Zeroing in .aio_write will create zeroed pages covering the entire
allocation, correct? This seems like a reasonable and straightforward
approach. I wish I had thought of it myself!

-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com
* Re: Issues with delalloc->real extent allocation
From: Dave Chinner @ 2011-01-18  0:24 UTC
To: Geoffrey Wehrman; +Cc: xfs

On Mon, Jan 17, 2011 at 08:37:08AM -0600, Geoffrey Wehrman wrote:
> On Mon, Jan 17, 2011 at 04:18:28PM +1100, Dave Chinner wrote:
> | On Fri, Jan 14, 2011 at 10:16:29PM -0600, Geoffrey Wehrman wrote:
> | > On Sat, Jan 15, 2011 at 09:59:07AM +1100, Dave Chinner wrote:
> | > | On Fri, Jan 14, 2011 at 10:40:16AM -0600, Geoffrey Wehrman wrote:
> | > | > Also, I'm not saying using XFS_BMAPI_EXACT is feasible. I
> | > | > have a very minimal understanding of the writepage code path.
> | > |
> | > | I think there are situations where this does make sense, but
> | > | given the potential issues I'm not sure it is a solution that
> | > | can be extended to the general case. A good discussion point on
> | > | a different angle, though. ;)
> | >
> | > You've convinced me that XFS_BMAPI_EXACT is not the optimal
> | > solution.
> | >
> | > Upon further consideration, I do like your proposal to make
> | > delalloc allocation more like an intent/done type operation. The
> | > compatibility issues aren't all that bad. As long as the
> | > filesystem is unmounted clean, there is no need for the next
> | > mount to do log recovery and therefore no need to have any
> | > knowledge of the new transactions.
> |
> | That is a good observation. If there is agreement that this is a
> | strong enough backwards compatibility guarantee (it's good enough
> | for me), then I think that I will start to prototype this approach.
>
> I'm not sure how a version of XFS without the new log recovery code
> will behave if it encounters a log with the new transactions. I
> assume it will gracefully abort log recovery and fail the mount with
> a report of a corrupt log. I have no objection to this compatibility
> guarantee.
It will do the same as you describe for the old log recovery code, so
there should be no new problems there.

> | However, this does not solve the extsize allocation issues where we
> | don't have dirty pages in the page cache covering parts of the
> | delayed allocation extent so we still need a solution for that. I'm
> | tending towards zeroing in .aio_write as the simplest solution
> | because it doesn't cause buffer head/extent tree mapping
> | mismatches, and it would use the above intent/done operations for
> | crash resilience so there's no additional, rarely used code path to
> | test through .writepage. Does that sound reasonable?
>
> Zeroing in .aio_write will create zeroed pages covering the entire
> allocation, correct?

Yes, though it only needs to zero the regions that the write does not
cover itself - no need to zero what we're about to put data into. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
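[Editor's note: "only zero what the write doesn't cover" is simple
interval arithmetic. A hedged userspace sketch, not XFS code, assuming
the aligned allocation is [astart, aend) and the write itself is
[wstart, wend) in blocks:]

```c
#include <assert.h>

/* Sketch of the zeroing rule above (not XFS code): only the head and
 * tail regions of the allocation that the write leaves uncovered need
 * zeroed pages; the middle is about to receive real data anyway. */
struct blk_range {
	long	start, end;	/* half-open: [start, end) */
};

/* Fills out[] with at most two regions to zero; returns the count. */
static int regions_to_zero(long astart, long aend, long wstart, long wend,
			   struct blk_range out[2])
{
	int n = 0;

	if (wstart > astart)
		out[n++] = (struct blk_range){ astart, wstart };  /* head */
	if (wend < aend)
		out[n++] = (struct blk_range){ wend, aend };      /* tail */
	return n;
}
```

So a write of blocks [4, 8) into a 16-block aligned allocation zeroes
only [0, 4) and [8, 16), and a write covering the whole allocation
zeroes nothing.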
* Re: Issues with delalloc->real extent allocation
From: Geoffrey Wehrman @ 2011-01-18 14:30 UTC
To: Dave Chinner; +Cc: xfs

On Tue, Jan 18, 2011 at 11:24:37AM +1100, Dave Chinner wrote:
| On Mon, Jan 17, 2011 at 08:37:08AM -0600, Geoffrey Wehrman wrote:
| > On Mon, Jan 17, 2011 at 04:18:28PM +1100, Dave Chinner wrote:
| > | However, this does not solve the extsize allocation issues where
| > | we don't have dirty pages in the page cache covering parts of the
| > | delayed allocation extent so we still need a solution for that.
| > | I'm tending towards zeroing in .aio_write as the simplest
| > | solution because it doesn't cause buffer head/extent tree mapping
| > | mismatches, and it would use the above intent/done operations for
| > | crash resilience so there's no additional, rarely used code path
| > | to test through .writepage. Does that sound reasonable?
| >
| > Zeroing in .aio_write will create zeroed pages covering the entire
| > allocation, correct?
|
| Yes, though it only needs to zero the regions that the write does
| not cover itself - no need to zero what we're about to put data
| into. ;)

Glad you were able to understand what I meant. Something I didn't
think of earlier, though: what happens when I try to use an 8 GB
extsize on a system with only 4 GB of memory? I'm not really worried
about this pathological case, but I do wonder what the effects will be
of allocating what could be significant quantities of memory in
.aio_write.

-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com
* Re: Issues with delalloc->real extent allocation
From: Christoph Hellwig @ 2011-01-18 20:40 UTC
To: Geoffrey Wehrman; +Cc: xfs

On Tue, Jan 18, 2011 at 08:30:00AM -0600, Geoffrey Wehrman wrote:
> Glad you were able to understand what I meant. Something I didn't
> think of earlier, though: what happens when I try to use an 8 GB
> extsize on a system with only 4 GB of memory? I'm not really worried
> about this pathological case, but I do wonder what the effects will
> be of allocating what could be significant quantities of memory in
> .aio_write.

I think for large regions we'd be much better off to only zero the
blocks on disk, not in memory - for example like the code in
xfs_zero_remaining_bytes does.
* Re: Issues with delalloc->real extent allocation
From: Dave Chinner @ 2011-01-18 22:03 UTC
To: Christoph Hellwig; +Cc: xfs

On Tue, Jan 18, 2011 at 03:40:09PM -0500, Christoph Hellwig wrote:
> On Tue, Jan 18, 2011 at 08:30:00AM -0600, Geoffrey Wehrman wrote:
> > Glad you were able to understand what I meant. Something I didn't
> > think of earlier, though: what happens when I try to use an 8 GB
> > extsize on a system with only 4 GB of memory? I'm not really
> > worried about this pathological case, but I do wonder what the
> > effects will be of allocating what could be significant quantities
> > of memory in .aio_write.
>
> I think for large regions we'd be much better off to only zero the
> blocks on disk, not in memory - for example like the code in
> xfs_zero_remaining_bytes does.

That doesn't help us, because the .writepage allocation code will
still allocate extsize-aligned extents and expose the problem that we
have blocks on disk with no data in the page cache covering them. The
point of zeroing at .aio_write is that it avoids the problem of
.writepage allocating blocks we don't have dirty pages for.

My preferred optimisation for this problem is that once we get above a
certain extsize threshold we simply preallocate extsized and aligned
chunks that cover the entire range of the write instead of writing
zeros. That preserves the extsize allocation alignment, and unaligned
writes and future IO see the space as unwritten and hence get zeroed
correctly...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
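[Editor's note: Dave's optimisation amounts to a size cutoff. A sketch
follows; the threshold value and all names are assumptions for
illustration, not anything in XFS.]

```c
#include <assert.h>

/* Decision sketch for the optimisation above (names and threshold are
 * hypothetical): small aligned allocations get in-memory zeroed pages
 * covering the uncovered regions, while large ones are preallocated
 * as unwritten instead, so the unwritten blocks read back as zeros
 * without tying up memory proportional to the extsize. */
enum cover_method {
	COVER_ZERO_PAGES,		/* zero uncovered pages in memory */
	COVER_UNWRITTEN_PREALLOC	/* preallocate aligned unwritten chunk */
};

static enum cover_method pick_cover_method(long alloc_blocks,
					   long threshold_blocks)
{
	return alloc_blocks > threshold_blocks ? COVER_UNWRITTEN_PREALLOC
					       : COVER_ZERO_PAGES;
}
```

Under this rule, Geoffrey's pathological 8 GB extsize on a 4 GB box
falls on the unwritten-preallocation side and never allocates page
cache for the whole chunk.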
* Re: Issues with delalloc->real extent allocation
From: bpm @ 2011-01-14 21:43 UTC
To: Dave Chinner; +Cc: xfs

Hi Dave,

On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> I've noticed a few suspicious things trying to reproduce the
> allocate-in-the-middle-of-a-delalloc-extent problem,
...
> Secondly, I think we have the same expose-the-entire-delalloc-extent
> -to-stale-data-exposure problem in ->writepage. This one, however,
> is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
> the first time any part of it is written to. Even if we are only
> writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
> extent covers gigabytes. So, same problem when we crash.
>
> Finally, I think the extsize based problem exposed by test 229 is
> also a result of allocating space we have no pages covering in the
> page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
> space is never zeroed and hence exposes stale data.

This is precisely the bug I was going after when I hit the
allocate-in-the-middle-of-a-delalloc-extent bug. This is a race
between block_prepare_write/__xfs_get_blocks and
writepage/xfs_page_state_convert. When xfs_page_state_convert
allocates a real extent for a page toward the beginning of a delalloc
extent, XFS_BMAPI converts the entire delalloc extent. Any subsequent
writes into the page cache toward the end of this freshly allocated
extent will see a written extent instead of delalloc and read the
block from disk into the page before writing over it. If the write
does not cover the entire page, garbage from disk will be exposed into
the page cache.

<snip>

> I'm sure there are other ways to solve these problems, but these two
> are the ones that come to mind for me.
> I'm open to other solutions or ways to improve on these ones,
> especially if they are simpler. ;) Anyone got any ideas or
> improvements?

The direction I've been taking is to use XFS_BMAPI_EXACT in
*xfs_iomap_write_allocate to ensure that an extent covering exactly
the pages we're prepared to write out immediately is allocated and the
rest of the delalloc extent is left as is. This exercises some of the
btree code more heavily and led to the discovery of the
allocate-in-the-middle-of-a-delalloc-extent bug. It also presents a
performance issue which I've tried to resolve by extending
xfs_probe_cluster to probe delalloc extents -- lock up all of the
pages to be converted before performing the allocation and hold those
locks until they are submitted for writeback. It's not very pretty but
it resolves the corruption.

There is still the issue of crashes... This could be solved by
converting from delalloc to unwritten in xfs_page_state_convert in
this very exact way and then to written in the io completion handler.
Never go delalloc->written directly.

I have not had luck reproducing this on TOT xfs and have come to
realize that this is because it doesn't do speculative preallocation
of larger delalloc extents unless you are using extsize... which I
haven't tried.

Regards,
	Ben
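[Editor's note: Ben's crash-safety rule - delalloc converts to
unwritten at writeback, unwritten converts to written only at IO
completion, and never delalloc to written directly - can be modeled as
a tiny state machine. A sketch with hypothetical names, not kernel
code:]

```c
#include <assert.h>

/* Model of the conversion rule above: writeback may move a range from
 * delalloc to unwritten, and IO completion may move it from unwritten
 * to written. The direct delalloc->written step is rejected, because
 * a crash between allocation and data writeout would then expose
 * stale disk blocks as file data. */
enum ext_state { EXT_DELALLOC, EXT_UNWRITTEN, EXT_WRITTEN };

/* Returns 0 on a legal transition, -1 otherwise. */
static int ext_transition(enum ext_state *state, enum ext_state to)
{
	if (*state == EXT_DELALLOC && to == EXT_UNWRITTEN) {
		*state = to;	/* writeback: allocate as unwritten */
		return 0;
	}
	if (*state == EXT_UNWRITTEN && to == EXT_WRITTEN) {
		*state = to;	/* IO completion: data is on disk */
		return 0;
	}
	return -1;		/* everything else is disallowed */
}
```

The forbidden transition in the first assertion is exactly the
delalloc->written shortcut that makes a crash window dangerous.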
* Re: Issues with delalloc->real extent allocation
From: bpm @ 2011-01-14 23:32 UTC
To: Dave Chinner; +Cc: xfs

Hi Dave,

On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@sgi.com wrote:
> On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> > I'm sure there are other ways to solve these problems, but these
> > two are the ones that come to mind for me. I'm open to other
> > solutions or ways to improve on these ones, especially if they are
> > simpler. ;) Anyone got any ideas or improvements?
>
> The direction I've been taking is to use XFS_BMAPI_EXACT in
> *xfs_iomap_write_allocate to ensure that an extent covering exactly
> the pages we're prepared to write out immediately is allocated and
> the rest of the delalloc extent is left as is. This exercises some of
> the btree code more heavily and led to the discovery of the
> allocate-in-the-middle-of-a-delalloc-extent bug. It also presents a
> performance issue which I've tried to resolve by extending
> xfs_probe_cluster to probe delalloc extents -- lock up all of the
> pages to be converted before performing the allocation and hold those
> locks until they are submitted for writeback. It's not very pretty
> but it resolves the corruption.

Here's the xfs_page_state_convert side of the patch so you can get an
idea of what I am trying for, and how ugly it is. ;^) I have not
ported the xfs_iomap_write_allocate bits yet. It is against an older
version of xfs... but you get the idea.
-Ben

Index: sles11-src/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- sles11-src.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ sles11-src/fs/xfs/linux-2.6/xfs_aops.c
@@ -585,41 +585,61 @@ STATIC unsigned int
 xfs_probe_page(
 	struct page		*page,
 	unsigned int		pg_offset,
-	int			mapped)
+	int			mapped,
+	unsigned int		type,
+	int			*entire)
 {
+	struct buffer_head	*bh, *head;
 	int			ret = 0;
 
 	if (PageWriteback(page))
 		return 0;
+	if (!page->mapping)
+		return 0;
+	if (!PageDirty(page))
+		return 0;
+	if (!page_has_buffers(page))
+		return 0;
 
-	if (page->mapping && PageDirty(page)) {
-		if (page_has_buffers(page)) {
-			struct buffer_head	*bh, *head;
-
-			bh = head = page_buffers(page);
-			do {
-				if (!buffer_uptodate(bh))
-					break;
-				if (mapped != buffer_mapped(bh))
-					break;
-				ret += bh->b_size;
-				if (ret >= pg_offset)
-					break;
-			} while ((bh = bh->b_this_page) != head);
-		} else
-			ret = mapped ? 0 : PAGE_CACHE_SIZE;
-	}
+	*entire = 0;
+	bh = head = page_buffers(page);
+	do {
+		if (!buffer_uptodate(bh))
+			break;
+		if (buffer_mapped(bh) != mapped)
+			break;
+		if (type == IOMAP_UNWRITTEN && !buffer_unwritten(bh))
+			break;
+		if (type == IOMAP_DELAY && !buffer_delay(bh))
+			break;
+		if (type == IOMAP_NEW && !buffer_dirty(bh))
+			break;
+
+		ret += bh->b_size;
+
+		if (ret >= pg_offset)
+			break;
+	} while ((bh = bh->b_this_page) != head);
+
+	if (bh == head)
+		*entire = 1;
 
 	return ret;
 }
 
+#define MAX_WRITEBACK_PAGES	1024
+
 STATIC size_t
 xfs_probe_cluster(
 	struct inode		*inode,
 	struct page		*startpage,
 	struct buffer_head	*bh,
 	struct buffer_head	*head,
-	int			mapped)
+	int			mapped,
+	unsigned int		type,
+	struct page		**pages,
+	int			*pagecount,
+	struct writeback_control *wbc)
 {
 	struct pagevec		pvec;
 	pgoff_t			tindex, tlast, tloff;
@@ -628,8 +648,15 @@ xfs_probe_cluster(
 
 	/* First sum forwards in this page */
 	do {
-		if (!buffer_uptodate(bh) || (mapped != buffer_mapped(bh)))
+		if (!buffer_uptodate(bh) || (mapped != buffer_mapped(bh))) {
+			return total;
+		} else if (type == IOMAP_UNWRITTEN && !buffer_unwritten(bh)) {
 			return total;
+		} else if (type == IOMAP_DELAY && !buffer_delay(bh)) {
+			return total;
+		} else if (type == IOMAP_NEW && !buffer_dirty(bh)) {
+			return total;
+		}
 		total += bh->b_size;
 	} while ((bh = bh->b_this_page) != head);
 
@@ -637,8 +664,9 @@ xfs_probe_cluster(
 	tlast = i_size_read(inode) >> PAGE_CACHE_SHIFT;
 	tindex = startpage->index + 1;
 
-	/* Prune this back to avoid pathological behavior */
-	tloff = min(tlast, startpage->index + 64);
+	/* Prune this back to avoid pathological behavior, subtract 1 for the
+	 * first page. */
+	tloff = min(tlast, startpage->index + (pgoff_t)MAX_WRITEBACK_PAGES);
 
 	pagevec_init(&pvec, 0);
 	while (!done && tindex <= tloff) {
@@ -647,10 +675,10 @@ xfs_probe_cluster(
 		if (!pagevec_lookup(&pvec, inode->i_mapping, tindex, len))
 			break;
 
-		for (i = 0; i < pagevec_count(&pvec); i++) {
+		for (i = 0; i < pagevec_count(&pvec) && !done; i++) {
 			struct page	*page = pvec.pages[i];
 			size_t		pg_offset, pg_len = 0;
-
+			int		entire = 0;
 
 			if (tindex == tlast) {
 				pg_offset =
 					i_size_read(inode) & (PAGE_CACHE_SIZE - 1);
@@ -658,20 +686,39 @@ xfs_probe_cluster(
 				if (!pg_offset) {
 					done = 1;
 					break;
 				}
-			} else
+			} else {
 				pg_offset = PAGE_CACHE_SIZE;
-
-			if (page->index == tindex && trylock_page(page)) {
-				pg_len = xfs_probe_page(page, pg_offset, mapped);
-				unlock_page(page);
 			}
+			if (page->index == tindex &&
+			    *pagecount < MAX_WRITEBACK_PAGES - 1 &&
+			    trylock_page(page)) {
+				pg_len = xfs_probe_page(page, pg_offset,
+							mapped, type, &entire);
+				if (pg_len) {
+					pages[(*pagecount)++] = page;
+
+				} else {
+					unlock_page(page);
+				}
+			}
 
 			if (!pg_len) {
 				done = 1;
 				break;
 			}
 
 			total += pg_len;
+
+			/*
+			 * if probe did not succeed on all buffers in the page
+			 * we don't want to probe subsequent pages. This
+			 * ensures that we don't have a mix of buffer types in
+			 * the iomap.
+			 */
+			if (!entire) {
+				done = 1;
+				break;
+			}
 			tindex++;
 		}
@@ -683,56 +730,19 @@ xfs_probe_cluster(
 }
 
 /*
- * Test if a given page is suitable for writing as part of an unwritten
- * or delayed allocate extent.
- */
-STATIC int
-xfs_is_delayed_page(
-	struct page		*page,
-	unsigned int		type)
-{
-	if (PageWriteback(page))
-		return 0;
-
-	if (page->mapping && page_has_buffers(page)) {
-		struct buffer_head	*bh, *head;
-		int			acceptable = 0;
-
-		bh = head = page_buffers(page);
-		do {
-			if (buffer_unwritten(bh))
-				acceptable = (type == IOMAP_UNWRITTEN);
-			else if (buffer_delay(bh))
-				acceptable = (type == IOMAP_DELAY);
-			else if (buffer_dirty(bh) && buffer_mapped(bh))
-				acceptable = (type == IOMAP_NEW);
-			else
-				break;
-		} while ((bh = bh->b_this_page) != head);
-
-		if (acceptable)
-			return 1;
-	}
-
-	return 0;
-}
-
-/*
  * Allocate & map buffers for page given the extent map. Write it out.
  * except for the original page of a writepage, this is called on
  * delalloc/unwritten pages only, for the original page it is possible
  * that the page has no mapping at all.
  */
-STATIC int
+STATIC void
 xfs_convert_page(
 	struct inode		*inode,
 	struct page		*page,
-	loff_t			tindex,
 	xfs_iomap_t		*mp,
 	xfs_ioend_t		**ioendp,
 	struct writeback_control *wbc,
-	int			startio,
-	int			all_bh)
+	int			startio)
 {
 	struct buffer_head	*bh, *head;
 	xfs_off_t		end_offset;
@@ -740,20 +750,9 @@ xfs_convert_page(
 	unsigned int		type;
 	int			bbits = inode->i_blkbits;
 	int			len, page_dirty;
-	int			count = 0, done = 0, uptodate = 1;
+	int			count = 0, uptodate = 1;
 	xfs_off_t		offset = page_offset(page);
 
-	if (page->index != tindex)
-		goto fail;
-	if (! trylock_page(page))
-		goto fail;
-	if (PageWriteback(page))
-		goto fail_unlock_page;
-	if (page->mapping != inode->i_mapping)
-		goto fail_unlock_page;
-	if (!xfs_is_delayed_page(page, (*ioendp)->io_type))
-		goto fail_unlock_page;
-
 	/*
 	 * page_dirty is initially a count of buffers on the page before
 	 * EOF and is decremented as we move each into a cleanable state.
@@ -770,6 +769,8 @@ xfs_convert_page( end_offset = min_t(unsigned long long, (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT, i_size_read(inode)); + end_offset = min_t(unsigned long long, end_offset, + (mp->iomap_offset + mp->iomap_bsize)); len = 1 << inode->i_blkbits; p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1), @@ -779,12 +780,12 @@ xfs_convert_page( bh = head = page_buffers(page); do { - if (offset >= end_offset) + if (offset >= end_offset) { break; + } if (!buffer_uptodate(bh)) uptodate = 0; if (!(PageUptodate(page) || buffer_uptodate(bh))) { - done = 1; continue; } @@ -794,10 +795,7 @@ xfs_convert_page( else type = IOMAP_DELAY; - if (!xfs_iomap_valid(mp, offset)) { - done = 1; - continue; - } + BUG_ON(!xfs_iomap_valid(mp, offset)); ASSERT(!(mp->iomap_flags & IOMAP_HOLE)); ASSERT(!(mp->iomap_flags & IOMAP_DELAY)); @@ -805,7 +803,7 @@ xfs_convert_page( xfs_map_at_offset(bh, offset, bbits, mp); if (startio) { xfs_add_to_ioend(inode, bh, offset, - type, ioendp, done); + type, ioendp, 0 /* !done */); } else { set_buffer_dirty(bh); unlock_buffer(bh); @@ -814,15 +812,14 @@ xfs_convert_page( page_dirty--; count++; } else { + WARN_ON(!xfs_iomap_valid(mp, offset)); type = IOMAP_NEW; - if (buffer_mapped(bh) && all_bh && startio) { + if (buffer_mapped(bh) && buffer_dirty(bh) && startio) { lock_buffer(bh); xfs_add_to_ioend(inode, bh, offset, - type, ioendp, done); + type, ioendp, 0 /* !done */); count++; page_dirty--; - } else { - done = 1; } } } while (offset += len, (bh = bh->b_this_page) != head); @@ -838,19 +835,16 @@ xfs_convert_page( wbc->nr_to_write--; if (bdi_write_congested(bdi)) { wbc->encountered_congestion = 1; - done = 1; } else if (wbc->nr_to_write <= 0) { - done = 1; + /* XXX ignore nr_to_write + done = 1; */ } } + /* unlocks page */ xfs_start_page_writeback(page, wbc, !page_dirty, count); + } else { + unlock_page(page); } - - return done; - fail_unlock_page: - unlock_page(page); - fail: - return 1; } /* @@ -860,33 +854,17 @@ 
xfs_convert_page( STATIC void xfs_cluster_write( struct inode *inode, - pgoff_t tindex, + struct page **pages, + int pagecount, xfs_iomap_t *iomapp, xfs_ioend_t **ioendp, struct writeback_control *wbc, - int startio, - int all_bh, - pgoff_t tlast) + int startio) { - struct pagevec pvec; - int done = 0, i; - - pagevec_init(&pvec, 0); - while (!done && tindex <= tlast) { - unsigned len = min_t(pgoff_t, PAGEVEC_SIZE, tlast - tindex + 1); - - if (!pagevec_lookup(&pvec, inode->i_mapping, tindex, len)) - break; - - for (i = 0; i < pagevec_count(&pvec); i++) { - done = xfs_convert_page(inode, pvec.pages[i], tindex++, - iomapp, ioendp, wbc, startio, all_bh); - if (done) - break; - } + int i; - pagevec_release(&pvec); - cond_resched(); + for (i = 0; i < pagecount; i++) { + xfs_convert_page(inode, pages[i], iomapp, ioendp, wbc, startio); } } @@ -908,7 +886,6 @@ xfs_cluster_write( * bh->b_states's will not agree and only ones setup by BPW/BCW will have * valid state, thus the whole page must be written out thing. */ - STATIC int xfs_page_state_convert( struct inode *inode, @@ -924,12 +901,13 @@ xfs_page_state_convert( unsigned long p_offset = 0; unsigned int type; __uint64_t end_offset; - pgoff_t end_index, last_index, tlast; + pgoff_t end_index, last_index; ssize_t size, len; int flags, err, iomap_valid = 0, uptodate = 1; int page_dirty, count = 0; int trylock = 0; - int all_bh = unmapped; + struct page **pages; + int pagecount = 0, i; if (startio) { if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking) @@ -949,6 +927,9 @@ xfs_page_state_convert( } } + pages = kmem_zalloc(sizeof(struct page*) * MAX_WRITEBACK_PAGES, KM_NOFS); + + /* * page_dirty is initially a count of buffers on the page before * EOF and is decremented as we move each into a cleanable state. @@ -1036,14 +1017,23 @@ xfs_page_state_convert( * for unwritten extent conversion. 
*/ new_ioend = 1; - if (type == IOMAP_NEW) { - size = xfs_probe_cluster(inode, - page, bh, head, 0); - } else { - size = len; + size = 0; + if (type == IOMAP_NEW && !pagecount) { + size = xfs_probe_cluster(inode, page, + bh, head, + 0 /* !mapped */, type, + pages, &pagecount, wbc); + } else if ((type == IOMAP_DELAY || + type == IOMAP_UNWRITTEN) && + !pagecount) { + size = xfs_probe_cluster(inode, page, + bh, head, + 1 /* mapped */, type, + pages, &pagecount, wbc); } - err = xfs_map_blocks(inode, offset, size, + err = xfs_map_blocks(inode, offset, + size ? size : len, &iomap, flags); if (err) goto error; @@ -1072,9 +1062,16 @@ xfs_page_state_convert( */ if (!iomap_valid || flags != BMAPI_READ) { flags = BMAPI_READ; - size = xfs_probe_cluster(inode, page, bh, - head, 1); - err = xfs_map_blocks(inode, offset, size, + size = 0; + if (!pagecount) { + size = xfs_probe_cluster(inode, page, + bh, head, + 1 /* mapped */, + IOMAP_NEW, + pages, &pagecount, wbc); + } + err = xfs_map_blocks(inode, offset, + size ? 
size : len, &iomap, flags); if (err) goto error; @@ -1092,8 +1089,6 @@ xfs_page_state_convert( type = IOMAP_NEW; if (!test_and_set_bit(BH_Lock, &bh->b_state)) { ASSERT(buffer_mapped(bh)); - if (iomap_valid) - all_bh = 1; xfs_add_to_ioend(inode, bh, offset, type, &ioend, !iomap_valid); page_dirty--; @@ -1104,6 +1099,8 @@ xfs_page_state_convert( } else if ((buffer_uptodate(bh) || PageUptodate(page)) && (unmapped || startio)) { iomap_valid = 0; + } else { + WARN_ON(1); } if (!iohead) @@ -1117,13 +1114,11 @@ xfs_page_state_convert( if (startio) xfs_start_page_writeback(page, wbc, 1, count); - if (ioend && iomap_valid) { - offset = (iomap.iomap_offset + iomap.iomap_bsize - 1) >> - PAGE_CACHE_SHIFT; - tlast = min_t(pgoff_t, offset, last_index); - xfs_cluster_write(inode, page->index + 1, &iomap, &ioend, - wbc, startio, all_bh, tlast); + if (ioend && iomap_valid && pagecount) { + xfs_cluster_write(inode, pages, pagecount, &iomap, &ioend, + wbc, startio); } + kmem_free(pages, sizeof(struct page *) * MAX_WRITEBACK_PAGES); if (iohead) xfs_submit_ioend(iohead); @@ -1133,7 +1128,12 @@ xfs_page_state_convert( error: if (iohead) xfs_cancel_ioend(iohead); - + if (pages) { + for (i = 0; i < pagecount; i++) { + unlock_page(pages[i]); + } + kmem_free(pages, sizeof(struct page *) * MAX_WRITEBACK_PAGES); + } return err; } _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Issues with delalloc->real extent allocation 2011-01-14 21:43 ` bpm 2011-01-14 23:32 ` bpm @ 2011-01-14 23:50 ` bpm 2011-01-14 23:55 ` Dave Chinner 2011-01-17 0:28 ` Lachlan McIlroy 3 siblings, 0 replies; 30+ messages in thread From: bpm @ 2011-01-14 23:50 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@sgi.com wrote: > On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote: > > I've noticed a few suspicious things trying to reproduce the > > allocate-in-the-middle-of-a-delalloc-extent, > ... > > Secondly, I think we have the same expose-the-entire-delalloc-extent > > -to-stale-data-exposure problem in ->writepage. This onnne, however, > > is due to using BMAPI_ENTIRE to allocate the entire delalloc extent > > the first time any part of it is written to. Even if we are only > > writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc > > extent covers gigabytes. So, same problem when we crash. > > > > Finally, I think the extsize based problem exposed by test 229 is a > > also a result of allocating space we have no pages covering in the > > page cache (triggered by BMAPI_ENTIRE allocation) so the allocated > > space is never zeroed and hence exposes stale data. > > This is precisely the bug I was going after when I hit the > allocate-in-the-middle-of-a-delalloc-extent bug. This is a race between > block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state > convert. When xfs_page_state_convert allocates a real extent for a page > toward the beginning of a delalloc extent, XFS_BMAPI converts the entire > delalloc extent. Any subsequent writes into the page cache toward the > end of this freshly allocated extent will see a written extent instead > of delalloc and read the block from disk into the page before writing > over it. If the write does not cover the entire page garbage from disk > will be exposed into the page cache. Here is a test case to reproduce the corruption. 
I have only been able to reproduce it by writing the file on an nfs
client served from xfs that is allocating large delalloc extents.

-Ben

*** the writer

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	char	*filename = argv[1];
	off_t	seekdist = 3071;	/* less than a page, nice and odd */
	off_t	max_offset = 1024 * 1024 * 1024;	/* 1 gig */
	off_t	current_offset = 0;
	char	buf[] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n";
	int	fd;

	printf("writing to %s\n", filename);
	printf("strlen is %zu\n", strlen(buf));

	fd = open(filename, O_RDWR|O_CREAT, 0644);
	if (fd == -1) {
		perror(filename);
		return -1;
	}

	while ((current_offset = lseek(fd, seekdist, SEEK_END)) > 0 &&
	       current_offset < max_offset) {
		if (write(fd, buf, strlen(buf)) < (ssize_t)strlen(buf)) {
			perror("write 'a'");
			return -1;
		}
	}
	close(fd);
	return 0;
}

*** the reader

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	char	*filename = argv[1];
	off_t	seekdist = 3071;	/* less than a page, nice and odd */
	off_t	max_offset = 1024 * 1024 * 1024;	/* 1 gig */
	off_t	current_offset = 0;
	char	buf[] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n";
	char	readbuf[4096];
	int	fd, i;

	printf("reading from %s\n", filename);

	fd = open(filename, O_RDONLY);
	if (fd == -1) {
		perror(filename);
		return -1;
	}

	while (current_offset < max_offset) {
		ssize_t nread = read(fd, readbuf, seekdist);
		if (nread != seekdist) {
			perror("read nulls");
			return -1;
		}
		for (i = 0; i < seekdist; i++) {
			if (readbuf[i] != '\0') {
				printf("found non-null at %lld\n%s\n",
				       (long long)(current_offset + i),
				       &readbuf[i]);
				break; // return -1;
			}
		}
		current_offset += nread;

		nread = read(fd, readbuf, strlen(buf));
		if (nread != (ssize_t)strlen(buf)) {
			perror("read a");
			return -1;
		}
		if (strncmp(readbuf, buf, strlen(buf))) {
			printf("didn't match at %lld\n%s\n",
			       (long long)(current_offset + nread), readbuf);
			// return -1;
		}
		current_offset += nread;
	}
	close(fd);
	return 0;
}
* Re: Issues with delalloc->real extent allocation 2011-01-14 21:43 ` bpm 2011-01-14 23:32 ` bpm 2011-01-14 23:50 ` bpm @ 2011-01-14 23:55 ` Dave Chinner 2011-01-17 20:12 ` bpm 2011-01-18 20:47 ` Christoph Hellwig 2011-01-17 0:28 ` Lachlan McIlroy 3 siblings, 2 replies; 30+ messages in thread From: Dave Chinner @ 2011-01-14 23:55 UTC (permalink / raw) To: bpm; +Cc: xfs On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@sgi.com wrote: > Hi Dave, > > On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote: > > I've noticed a few suspicious things trying to reproduce the > > allocate-in-the-middle-of-a-delalloc-extent, > ... > > Secondly, I think we have the same expose-the-entire-delalloc-extent > > -to-stale-data-exposure problem in ->writepage. This onnne, however, > > is due to using BMAPI_ENTIRE to allocate the entire delalloc extent > > the first time any part of it is written to. Even if we are only > > writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc > > extent covers gigabytes. So, same problem when we crash. > > > > Finally, I think the extsize based problem exposed by test 229 is a > > also a result of allocating space we have no pages covering in the > > page cache (triggered by BMAPI_ENTIRE allocation) so the allocated > > space is never zeroed and hence exposes stale data. > > This is precisely the bug I was going after when I hit the > allocate-in-the-middle-of-a-delalloc-extent bug. This is a race between > block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state > convert. When xfs_page_state_convert allocates a real extent for a page > toward the beginning of a delalloc extent, XFS_BMAPI converts the entire > delalloc extent. Any subsequent writes into the page cache toward the > end of this freshly allocated extent will see a written extent instead > of delalloc and read the block from disk into the page before writing > over it. 
> If the write does not cover the entire page garbage from disk
> will be exposed into the page cache.

I see from later on you are getting this state from using a large
extsize. Perhaps this comes back to my comment that extsize
alignment might be better handled at the .aio_write level rather
than hiding inside the get_block() callback and attempting to handle
the mismatch at the .writepage level.

Worth noting, though, is that DIO handles extsize unaligned writes
by allocating unwritten extsize sized/aligned extents and then doing
conversion at IO completion time. So perhaps we should be following
this example for delalloc conversion....

I think, however, if we use delalloc->unwritten allocation, we will
need to stop trusting the state of buffer heads in .writepage. That
is because we'd then have blocks marked as buffer_delay() that
really cover unwritten extents and would need remapping. We're
already moving in the direction of not using the state in buffer
heads in ->writepage, so perhaps we need to speed up that
conversion as the first step. Christoph, what are your plans here?

> <snip>
>
> > I'm sure there are other ways to solve these problems, but these two
> > are the ones that come to mind for me. I'm open to other solutions
> > or ways to improve on these ones, especially if they are simpler. ;)
> > Anyone got any ideas or improvements?
>
> The direction I've been taking is to use XFS_BMAPI_EXACT in
> *xfs_iomap_write_allocate to ensure that an extent covering exactly the
> pages we're prepared to write out immediately is allocated and the rest
> of the delalloc extent is left as is. This exercises some of the btree
> code more heavily and led to the discovery of the
> allocate-in-the-middle-of-a-delalloc-extent bug.
Yup, that explains it ;)

> It also presents a
> performance issue which I've tried to resolve by extending
> xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
> to be converted before performing the allocation and hold those locks
> until they are submitted for writeback. It's not very pretty but it
> resolves the corruption.

If we zero the relevant range in the page cache at .aio_write level
like we do with xfs_zero_eof or allocate unwritten extents instead,
then I don't think that you need to make changes like this.

> There is still the issue of crashes... This could be solved by
> converting from delalloc to unwritten in xfs_page_state_convert in this
> very exact way and then to written in the io completion handler. Never
> go delalloc->written directly.
>
> I have not had luck reproducing this on TOT xfs and have come to realize
> that this is because it doesn't do speculative preallocation of larger
> delalloc extents unless you are using extsize... which I haven't tried.

Have a look at the dynamic speculative allocation patches that just
went into 2.6.38 - I'm very interested to know whether your tests
expose stale data now that it can do up to an entire extent (8GB on
4k block size) of speculative delalloc for writes that are extending
the file.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
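The range arithmetic behind the zero-at-.aio_write suggestion can be sketched in standalone C (the helper names and types here are invented for illustration and are not the XFS implementation): given a buffered write and an extsize, compute how much needs zeroing on each side so the whole extsize-aligned region is covered by the page cache.

```c
typedef unsigned long long xfs_off_sketch_t;

/* Round down/up to an extsize boundary (extsize in bytes; plain
 * modular arithmetic, no power-of-two assumption). */
static xfs_off_sketch_t
align_down(xfs_off_sketch_t off, xfs_off_sketch_t extsize)
{
	return off - (off % extsize);
}

static xfs_off_sketch_t
align_up(xfs_off_sketch_t off, xfs_off_sketch_t extsize)
{
	return align_down(off + extsize - 1, extsize);
}

/*
 * Given a buffered write of [pos, pos + count), compute the lengths of
 * the two ranges that would need zeroing so that the extsize-aligned
 * region is fully covered: [start, pos) before the write and
 * [pos + count, end) after it.
 */
static void
extsize_zero_ranges(xfs_off_sketch_t pos, xfs_off_sketch_t count,
		    xfs_off_sketch_t extsize,
		    xfs_off_sketch_t *zero_before_len,
		    xfs_off_sketch_t *zero_after_len)
{
	xfs_off_sketch_t start = align_down(pos, extsize);
	xfs_off_sketch_t end = align_up(pos + count, extsize);

	*zero_before_len = pos - start;
	*zero_after_len = end - (pos + count);
}
```

For the 3071-byte-offset, 50-byte writes of the test program above with a 64k extsize, this would zero 3071 bytes before and 62415 bytes after the data, leaving no uncovered blocks inside the aligned region.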
* Re: Issues with delalloc->real extent allocation 2011-01-14 23:55 ` Dave Chinner @ 2011-01-17 20:12 ` bpm 2011-01-18 1:44 ` Dave Chinner 2011-01-18 20:47 ` Christoph Hellwig 1 sibling, 1 reply; 30+ messages in thread From: bpm @ 2011-01-17 20:12 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Hey Dave, On Sat, Jan 15, 2011 at 10:55:49AM +1100, Dave Chinner wrote: > On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@sgi.com wrote: > > It also presents a > > performance issue which I've tried to resolve by extending > > xfs_probe_cluster to probe delalloc extents-- lock up all of the pages > > to be converted before performing the allocation and hold those locks > > until they are submitted for writeback. It's not very pretty but it > > resolves the corruption. > > If we zero the relevant range in the page cache at .aio_write level > like we do with xfs_zero_eof or allocate unwritten extents instead, > then I don't think that you need to make changes like this. Ganging up pages under lock in xfs_page_state_convert (along with exactness in xfs_iomap_write_allocate) was needed to provide exclusion with block_prepare_write because zeroing isn't done in the case of written extents. Converting from delalloc->unwritten has the advantage of __xfs_get_blocks setting each buffer 'new' when you write into the page, so the zeroing is done properly even if you convert the entire extent to unwritten in xfs_vm_writepage instead of just the part you're going to write out. However, when converting from unwritten->written in the completion handler you still need to convert only the part of the extent that was actually written. That might be a lot of transactions in xfs_end_io. > > There is still the issue of crashes... This could be solved by > > converting from delalloc to unwritten in xfs_page_state_convert in this > > very exact way and then to written in the io completion handler. Never > > go delalloc->written directly. 
> >
> > I have not had luck reproducing this on TOT xfs and have come to realize
> > that this is because it doesn't do speculative preallocation of larger
> > delalloc extents unless you are using extsize... which I haven't tried.
>
> Have a look at the dynamic speculative allocation patches that just
> went into 2.6.38 - I'm very interested to know whether your tests
> expose stale data now that it can do up to an entire extent (8GB on
> 4k block size) of speculative delalloc for writes that are extending
> the file.

8GB, Eek!

-Ben
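The trimming bpm describes for the completion handler - converting only the part of the unwritten extent that was actually written - is a range intersection. A minimal standalone sketch (invented types and names, not the XFS code):

```c
/*
 * Trim an unwritten extent to the portion an IO actually wrote; that
 * overlap is the only range a completion handler should convert to
 * written state. Units are filesystem blocks.
 */
struct ext_sketch {
	unsigned long long off;	/* start block */
	unsigned long long len;	/* length in blocks */
};

/* Returns 1 and fills *out with the overlap of extent and IO, or 0 if
 * the IO touched none of the extent. */
static int
trim_to_written(const struct ext_sketch *ext,
		unsigned long long io_off, unsigned long long io_len,
		struct ext_sketch *out)
{
	unsigned long long ext_end = ext->off + ext->len;
	unsigned long long io_end = io_off + io_len;
	unsigned long long start, end;

	start = io_off > ext->off ? io_off : ext->off;
	end = io_end < ext_end ? io_end : ext_end;
	if (start >= end)
		return 0;
	out->off = start;
	out->len = end - start;
	return 1;
}
```

Each completed IO would convert only its own overlap, which is why a large unwritten extent written in small chunks implies one conversion transaction per completion unless they are batched.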
* Re: Issues with delalloc->real extent allocation 2011-01-17 20:12 ` bpm @ 2011-01-18 1:44 ` Dave Chinner 0 siblings, 0 replies; 30+ messages in thread From: Dave Chinner @ 2011-01-18 1:44 UTC (permalink / raw) To: bpm; +Cc: xfs On Mon, Jan 17, 2011 at 02:12:40PM -0600, bpm@sgi.com wrote: > Hey Dave, > > On Sat, Jan 15, 2011 at 10:55:49AM +1100, Dave Chinner wrote: > > On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@sgi.com wrote: > > > It also presents a > > > performance issue which I've tried to resolve by extending > > > xfs_probe_cluster to probe delalloc extents-- lock up all of the pages > > > to be converted before performing the allocation and hold those locks > > > until they are submitted for writeback. It's not very pretty but it > > > resolves the corruption. > > > > If we zero the relevant range in the page cache at .aio_write level > > like we do with xfs_zero_eof or allocate unwritten extents instead, > > then I don't think that you need to make changes like this. > > Ganging up pages under lock in xfs_page_state_convert (along with > exactness in xfs_iomap_write_allocate) was needed to provide exclusion > with block_prepare_write because zeroing isn't done in the case of > written extents. > > Converting from delalloc->unwritten has the advantage of > __xfs_get_blocks setting each buffer 'new' when you write into the page, > so the zeroing is done properly even if you convert the entire extent to > unwritten in xfs_vm_writepage instead of just the part you're going to > write out. That's definitely the advantage to using unwritten extents in this case. However, it also means that all the buffers that were already set up at the time of delalloc->unwritten as buffer_delay() are now incorrect - the underlying extent is now unwritten, no delayed allocation. Hence when it comes to writing buffer_delay() buffers, we would also have to handle the case that they are really buffer_unwritten(). That could be very nasty.... 
> However, when converting from unwritten->written in the
> completion handler you still need to convert only the part of the extent
> that was actually written. That might be a lot of transactions in
> xfs_end_io.

Yes, that was my original concern for the general case and why I was
looking at an intent/done transaction arrangement rather than an
unwritten/conversion arrangement. The "done" part of the transaction
is much less overhead, and we can safely do the transaction
reservation (i.e. the blocking bits) before the IO is issued so that
we keep the overhead on the completion side down to an absolute
minimum....

However, when tracking down an assert failure reported by Christoph
with the new speculative delalloc code this morning, I've realised
that extent alignment for delayed allocation in xfs_bmapi() is badly
broken. It can result in attempting to set up a delalloc extent
larger than the maximum extent size because it aligns extents by
extending them. Hence if the extent being allocated is MAXEXTLEN in
length, the extent size alignment will extend it and we'll do nasty
stuff to the extent record we insert into the incore extent tree.

To make matters worse, extent size is not limited to the maximum
supported extent size in the filesystem. The extent size can be set
to (2^32 bytes - 1 fsblock), but for 512 byte block size filesystems
the maximum extent size is 2^21 * 2^9 = 2^30 bytes. You can't align
extents to a size larger than that maximum extent size, and if you
try:

# sudo mkfs.xfs -b size=512 -f /dev/vda
...
# xfs_io -f -c "extsize 2g" \
>	-c "ftruncate 4g" \
>	-c "pwrite 64k 512" \
>	-c "bmap -vp" /mnt/test/foo
XFS: Assertion failed: k < (1 << STARTBLOCKVALBITS), file: fs/xfs/xfs_bmap_btree.h, line: 83
....
Call Trace:
 [<ffffffff8145998c>] xfs_bmapi+0x167c/0x1a20
 [<ffffffff81488be2>] xfs_iomap_write_delay+0x1c2/0x340
 [<ffffffff814a7b62>] __xfs_get_blocks+0x402/0x4d0
 [<ffffffff814a7c61>] xfs_get_blocks+0x11/0x20
 [<ffffffff8118f1fb>] __block_write_begin+0x20b/0x5a0
 [<ffffffff8118f6f6>] block_write_begin+0x56/0x90
 [<ffffffff814a7541>] xfs_vm_write_begin+0x41/0x70
 [<ffffffff81118c06>] generic_file_buffered_write+0x106/0x270
 [<ffffffff814ae7a2>] xfs_file_buffered_aio_write+0xd2/0x190
 [<ffffffff814aece2>] xfs_file_aio_write+0x1c2/0x310
 [<ffffffff8116052a>] do_sync_write+0xda/0x120
 [<ffffffff81160850>] vfs_write+0xd0/0x190
 [<ffffffff81161262>] sys_pwrite64+0x82/0xa0
 [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

Which failed this check:

 81 static inline xfs_fsblock_t nullstartblock(int k)
 82 {
 83 >>>>>>	ASSERT(k < (1 << STARTBLOCKVALBITS));
 84 		return STARTBLOCKMASK | (k);
 85 }

You end up with a corrupt extent record in the incore extent list.

Along the same lines, extsize > ag size for a non-rt inode is also
invalid. It won't be handled correctly by the delalloc->unwritten
method because delalloc conversion is currently limited to a single
allocation per .writepage call to avoid races with truncate. Hence
when extsize is larger than an AG, we'll only allocate the part
within a single AG and hence leave part of the delalloc range in the
delalloc state. With no pages covering this range, it'll never get
written or allocated. Not to mention that it'll trip assert failures
in xfs_getbmap:

# sudo mkfs.xfs -b size=4k -f -d agsize=1g /dev/vda
...
# xfs_io -f -c "extsize 2g" \
>	-c "ftruncate 4g" \
>	-c "pwrite 64k 512" \
>	-c "bmap -vp" /mnt/test/foo
XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) || (map[i].br_startblock != DELAYSTARTBLOCK), file: fs/xfs/xfs_bmap.c, line: 5558
.....

These problems indicate to me that doing extent size alignment for
delayed allocation inside xfs_bmapi() is wrong on many levels.
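The limits quoted above can be checked with a couple of helpers. This is a standalone sketch with invented names: the 2^21 - 1 block cap stands in for MAXEXTLEN as described in the discussion, and it is not a complete model of the on-disk extent record format.

```c
/*
 * The 21-bit extent length field limits a single extent record to
 * (2^21 - 1) blocks, so with 512 byte (2^9) blocks the largest
 * extent is just under 2^30 bytes (1GB) -- a 2GB extsize can never
 * be honoured by alignment on such a filesystem.
 */
#define SKETCH_MAXEXTLEN	((1ULL << 21) - 1)	/* blocks */

/* blocklog is log2 of the filesystem block size, e.g. 9 or 12. */
static unsigned long long
max_extent_bytes(unsigned int blocklog)
{
	return SKETCH_MAXEXTLEN << blocklog;
}

/* An extsize larger than the biggest representable extent cannot be
 * used as an alignment target ("limit extsize values to sane
 * alignment boundaries"). */
static int
extsize_is_sane(unsigned long long extsize_bytes, unsigned int blocklog)
{
	return extsize_bytes <= max_extent_bytes(blocklog);
}
```

With 4k blocks (blocklog 12) the same 2GB extsize is fine, which matches the failure appearing only on the 512 byte block size filesystem in the transcript above.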
Moving the "alignment" to zeroing at the .aio_write level solves
these issues except for one detail - ensuring that speculative
delalloc beyond EOF is correctly aligned - and this can be handled
easily by the prealloc code.

Hence I'm thinking that the solution we should be implementing is:

	1. Limit extsize values to sane alignment boundaries.
	2. Do extsize alignment for delalloc by zeroing the page
	   cache at the .aio_write level for regions not covered by
	   the write.
	3. Ensure that speculative delalloc is extsize aligned.
	4. Rip the delalloc extent alignment code from xfs_bmapi().
	5. Implement alloc intent/done transactions to protect
	   against stale data exposure caused by crashing between
	   allocation and data write completion (*).

I think that covers all the issues we've discussed - have I missed
anything?

Cheers,

Dave.

(*) FWIW, since I proposed the intent/done method, I've realised it
also solves the main difficulty in implementing data-in-inode
safely: how to synchronise the unlogged data write when we move the
data out of the inode with the allocation transaction so we know
when we can allow the tail of the log to move past the allocation
transaction or whether we can replay the format change transaction
in log recovery.....

-- 
Dave Chinner
david@fromorbit.com
* Re: Issues with delalloc->real extent allocation 2011-01-14 23:55 ` Dave Chinner 2011-01-17 20:12 ` bpm @ 2011-01-18 20:47 ` Christoph Hellwig 2011-01-18 23:18 ` Dave Chinner 1 sibling, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2011-01-18 20:47 UTC (permalink / raw) To: Dave Chinner; +Cc: bpm, xfs On Sat, Jan 15, 2011 at 10:55:49AM +1100, Dave Chinner wrote: > I see from later on you are getting this state fom using a large > extsize. Perhaps this comes back to my comment about extsize > alignment might be better handled at the .aio_write level rather > than hiding inside the get_block() callback and attempting to handle > the mismatch at the .writepage level. > > Worth noting, though, is that DIO handles extsize unaligned writes > by allocating unwritten extsize sized/aligned extents an then doing > conversion at IO completion time. So perhaps we should be following > this example for delalloc conversion.... That's the other option for zeroing. In many ways this makes a lot more sense, also for the thin provisioned virtualization image file use case I'm looking into right now. > > I think, however, if we use delalloc->unwritten allocation, we will > need to stop trusting the state of buffer heads in .writepage. That > is because we'd then have blocks marked as buffer_delay() that > really cover unwritten extents and would need remapping. We're > already moving in the direction of not using the state in buffer > heads in ->writepage, so perhaps we need to speed up that > conversion as the first. Christoph, what are you plans here? We really don't do much with the flags anymore. We already treat overwritten (just buffer_uptodate) and delayed buffers are already exactly the same. Unwritten buffers are still slightly different, in that we add XFS_BMAPI_IGSTATE to the bmapi flags. This is just a leftover from the old code, and finding out why exactly we add it is still on my todo list. 
The page clustering code checks if the buffer still matches the type
in xfs_convert_page, though. And I'm not quite sure yet how we can
remove that - the easiest option would be to keep holding the ilock
until the whole clustered writeout is done, but we'd need to
investigate what effect that has on performance.
* Re: Issues with delalloc->real extent allocation 2011-01-18 20:47 ` Christoph Hellwig @ 2011-01-18 23:18 ` Dave Chinner 2011-01-19 12:03 ` Christoph Hellwig 0 siblings, 1 reply; 30+ messages in thread From: Dave Chinner @ 2011-01-18 23:18 UTC (permalink / raw) To: Christoph Hellwig; +Cc: bpm, xfs On Tue, Jan 18, 2011 at 03:47:52PM -0500, Christoph Hellwig wrote: > On Sat, Jan 15, 2011 at 10:55:49AM +1100, Dave Chinner wrote: > > I see from later on you are getting this state fom using a large > > extsize. Perhaps this comes back to my comment about extsize > > alignment might be better handled at the .aio_write level rather > > than hiding inside the get_block() callback and attempting to handle > > the mismatch at the .writepage level. > > > > Worth noting, though, is that DIO handles extsize unaligned writes > > by allocating unwritten extsize sized/aligned extents an then doing > > conversion at IO completion time. So perhaps we should be following > > this example for delalloc conversion.... > > That's the other option for zeroing. In many ways this makes a lot > more sense, also for the thin provisioned virtualization image file > use case I'm looking into right now. > > > > > I think, however, if we use delalloc->unwritten allocation, we will > > need to stop trusting the state of buffer heads in .writepage. That > > is because we'd then have blocks marked as buffer_delay() that > > really cover unwritten extents and would need remapping. We're > > already moving in the direction of not using the state in buffer > > heads in ->writepage, so perhaps we need to speed up that > > conversion as the first. Christoph, what are you plans here? > > We really don't do much with the flags anymore. We already treat > overwritten (just buffer_uptodate) and delayed buffers are already > exactly the same. Except for the fact we use the delalloc state from the buffer to trigger allocation after mapping. 
We could probably just use isnullstartblock() for this - only a mapping from a buffer over a delalloc range should return a null startblock. > Unwritten buffers are still slightly different, > in that we add XFS_BMAPI_IGSTATE to the bmapi flags. This is just > a leftover from the old code, and finding out why exactly we add > it is still on my todo list. XFS_BMAPI_IGSTATE does a couple of things. Firstly, it prevents xfs_bmapi() from allocating new extents (turns off write mode). This isn't an issue where it is used because neither of the call sites set XFS_BMAPI_WRITE. Secondly, it allows xfs_bmapi() to consider adjacent written and unwritten extents as contiguous but has no effect on delalloc extents. That is, if we have:

A               B                 C
+---------------+-----------------+
    unwritten        written

And we ask to map the range A-C with a single map, we'll get back A-B. If we add XFS_BMAPI_IGSTATE, we'll get back A-C instead. This means that when we map the range A-C for IO, we'll do it as a single IO. The unwritten extent conversion code handles ranges like this correctly - it will only convert the unwritten part of the range to written. It is arguable that this is unnecessary because the elevator will merge the two adjacent IOs anyway, but it does reduce the number of mapping calls and IOs issued... Hence it is effectively a mechanism for optimising IO mappings where we have contiguous extents with mixed written/unwritten state. In fact, we probably should be setting this for normal written extents as well, so that the case:

A               B                 C
+---------------+-----------------+
    written         unwritten

is also handled with the same optimisation. That makes handling unwritten and overwrites identical, with only delalloc being different. If we assume delalloc when we get a null startblock, then we don't need to look at the buffer state at all for the initial mapping. > The page clustering code checks if > the buffer still matches the type in xfs_convert_page, though. 
> And I'm not quite sure yet how we can remove that - the easiest > option would be to keep holding the ilock until the whole clustered > writeout is done, but we'd need to investigate what effect that > has on performance. Holding the ilock doesn't protect against page cache truncation, but we can still check for that via page->mapping. Otherwise this seems like it would work to me - I trust the extent list to reflect the correct state a lot more than I do the buffer heads. It seems to me that with such modifications, the only thing that we are using the bufferhead for is the buffer_uptodate() flag to determine if we should write the block or not. If we can find some other way of holding this state then we don't need bufferheads in the write path at all, right? Cheers, Dave. -- Dave Chinner david@fromorbit.com
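The XFS_BMAPI_IGSTATE merging behaviour described above can be sketched as a toy model. This is plain Python, not the kernel's xfs_bmapi() implementation; the extent layout and names are illustrative only. It also reports a merged extent as unwritten if either part was unwritten, which is the consistent-state behaviour the thread converges on:

```python
# Toy model of extent mapping with and without IGSTATE-style merging.
# Illustration only -- not the xfs_bmapi() implementation.

WRITTEN, UNWRITTEN = "written", "unwritten"

def map_range(extents, offset, length, ignore_state=False):
    """Return mappings covering [offset, offset+length), merging
    adjacent contiguous extents. With ignore_state, a difference in
    written/unwritten state does not break the merge."""
    out = []
    for start, count, state in extents:
        end = start + count
        if end <= offset or start >= offset + length:
            continue  # extent outside the requested range
        lo = max(start, offset)
        hi = min(end, offset + length)
        if out:
            pstart, pcount, pstate = out[-1]
            if pstart + pcount == lo and (ignore_state or pstate == state):
                # Merge. If either piece is unwritten, report unwritten
                # so the caller still converts after I/O completion.
                merged = UNWRITTEN if UNWRITTEN in (pstate, state) else pstate
                out[-1] = (pstart, pcount + (hi - lo), merged)
                continue
        out.append((lo, hi - lo, state))
    return out

# A..B unwritten, B..C written, as in the diagram above.
extents = [(0, 8, UNWRITTEN), (8, 8, WRITTEN)]
print(map_range(extents, 0, 16))                     # two mappings: A-B, B-C
print(map_range(extents, 0, 16, ignore_state=True))  # one merged mapping: A-C
```

With ignore_state the model returns a single A-C mapping, mirroring how IGSTATE lets the whole range go out as one IO while conversion later only touches the unwritten part.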
* Re: Issues with delalloc->real extent allocation 2011-01-18 23:18 ` Dave Chinner @ 2011-01-19 12:03 ` Christoph Hellwig 2011-01-19 13:31 ` Dave Chinner 0 siblings, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2011-01-19 12:03 UTC (permalink / raw) To: Dave Chinner; +Cc: bpm, xfs On Wed, Jan 19, 2011 at 10:18:32AM +1100, Dave Chinner wrote: > Except for the fact we use the delalloc state from the buffer to > trigger allocation after mapping. We could probably just use > isnullstartblock() for this - only a mapping from a buffer over a > delalloc range should return a null startblock. isnullstartblock returns true for both DELAYSTARTBLOCK and HOLESTARTBLOCK, so to be explicit we can check for br_startblock == DELAYSTARTBLOCK. Note that we also need to explicitly check for br_state == XFS_EXT_UNWRITTEN to set the correct type for the ioend structure. > XFS_BMAPI_IGSTATE does a couple of things. Firstly, it prevents > xfs_bmapi() from allocating new extents (turns off write mode). > This isn't an issue where it is used because neither of the call > sites set XFS_BMAPI_WRITE. I've dug in enough to understand what it does. But yes, the single large I/O mapping might explain why we want it. The next version of xfs_map_blocks will get a comment for it... > In fact, we probably should be setting this for normal written > extents as well, so that the case: > > A               B                 C > > +---------------+-----------------+ > >     written         unwritten > > is also handled with the same optimisation. That makes handling > unwritten and overwrites identical, with only delalloc being > different. If we assume delalloc when we get a null startblock, > then we don't need to look at the buffer state at all for the > initial mapping. With the current xfs_bmapi code that won't work. When merging a second extent into the first we only add up the br_blockcount. 
So for the case above we'd get an extent returned that's not marked as unwritten and we wouldn't mark the ioend as unwritten and thus not perform extent conversion after I/O completion. Just adding XFS_BMAPI_IGSTATE blindly for the delalloc case on the other hand is fine, as the merging of delayed extents is handled in a different if clause that totally ignores XFS_BMAPI_IGSTATE. The potential fix for this is to always set br_state to unwritten if one of the merged extents is an unwritten extent. The only other caller is xfs_getbmap which does report the unwritten state to userspace, but already is sloppy for merging the other way if BMV_IF_PREALLOC is not set, so I can't see how being sloppy this way makes any difference. > It seems to me that with such modifications, the only thing that we > are using the bufferhead for is the buffer_uptodate() flag to > determine if we should write the block or not. If we can find some > other way of holding this state then we don't need bufferheads in > the write path at all, right? There's really two steps. 
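The DELAYSTARTBLOCK/HOLESTARTBLOCK ambiguity mentioned above can be sketched with a toy classifier. This is a Python illustration, not kernel code: the constants are stand-ins (the real isnullstartblock() tests magic bit patterns, not negative values), but the point stands that a null-startblock check alone cannot distinguish delalloc from a hole:

```python
# Toy illustration of why an isnullstartblock()-style check is ambiguous:
# both delayed-allocation and hole mappings use magic startblock values,
# so a caller that must act differently has to compare explicitly.
# Constants are illustrative stand-ins, not the kernel's encodings.

DELAYSTARTBLOCK = -1
HOLESTARTBLOCK = -2
EXT_NORM, EXT_UNWRITTEN = "norm", "unwritten"

def isnullstartblock(startblock):
    # True for both delalloc and holes -- the ambiguous check.
    return startblock < 0

def classify(br_startblock, br_state):
    """Decide what writeback must do with one mapping."""
    if br_startblock == DELAYSTARTBLOCK:
        return "delalloc: needs real allocation before writeback"
    if br_startblock == HOLESTARTBLOCK:
        return "hole: nothing to write"
    if br_state == EXT_UNWRITTEN:
        return "unwritten: convert after I/O completion"
    return "written: plain overwrite"

print(classify(DELAYSTARTBLOCK, EXT_NORM))
print(classify(HOLESTARTBLOCK, EXT_NORM))
print(classify(1234, EXT_UNWRITTEN))   # real block, unwritten state
```

The last case is the one the message calls out: even with a real startblock, br_state must still be checked so the ioend gets the unwritten type.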
* Re: Issues with delalloc->real extent allocation 2011-01-19 12:03 ` Christoph Hellwig @ 2011-01-19 13:31 ` Dave Chinner 2011-01-19 13:55 ` Christoph Hellwig 0 siblings, 1 reply; 30+ messages in thread From: Dave Chinner @ 2011-01-19 13:31 UTC (permalink / raw) To: Christoph Hellwig; +Cc: bpm, xfs On Wed, Jan 19, 2011 at 07:03:21AM -0500, Christoph Hellwig wrote: > On Wed, Jan 19, 2011 at 10:18:32AM +1100, Dave Chinner wrote: > > Except for the fact we use the delalloc state from the buffer to > > trigger allocation after mapping. We could probably just use > > isnullstartblock() for this - only a mapping from a buffer over a > > delalloc range should return a null startblock. > > isnullstartblock returns true for both DELAYSTARTBLOCK and > HOLESTARTBLOCK, so we want to be explicit we can check for > br_startblock == DELAYSTARTBLOCK. True. > > Note that we also need to explicitly check for br_state == > XFS_EXT_UNWRITTEN to set the correct type for the ioend structure. Yes. As i mentioned on IRC I hacked a quick prototype together to test this out and did exactly this. ;) > > > XFS_BMAPI_IGSTATE does a couple of things. Firstly, It prevents > > xfs_bmapi() from allocating new extents (turns off write mode). > > This isn't an issue where it is used because neither of the call > > sites set XFS_BMAPI_WRITE. > > I've been down enough to understand what it does. But yes, the > single large I/O mapping might explain why we want it. The next > version of xfs_map_blocks will get a comment for it.. > > > In fact, we probably should be setting this for normal written > > extents as well, so that the case: > > > > A B C > > +---------------+-----------------+ > > written unwritten > > > > is also handled with the same optimisation. That makes handling > > unwritten and overwrites identical, with only delalloc being > > different. If we assume delalloc when we get a null startblock, > > then we don't need to look at the buffer state at all for the > > initial mapping. 
>
> With the current xfs_bmapi code that won't work. When merging a second
> extent into the first we only add up the br_blockcount. So for the
> case above we'd get an extent returned that's not marked as unwritten
> and we wouldn't mark the ioend as unwritten and thus not perform
> extent conversion after I/O completion. Just adding XFS_BMAPI_IGSTATE
> blindly for the delalloc case on the other hand is fine, as the
> merging of delayed extents is handled in a different if clause that
> totally ignores XFS_BMAPI_IGSTATE.
>
> The potential fix for this is to always set br_state if one of the
> merged extents is an unwritten extent. The only other caller is
> xfs_getbmap which does report the unwritten state to userspace,
> but already is sloppy for merging the other way if BMV_IF_PREALLOC
> is not set, so I can't see how being sloppy this way makes any
> difference.

Yup:

@@ -4827,6 +4827,18 @@ xfs_bmapi(
 				ASSERT(mval->br_startoff ==
 				       mval[-1].br_startoff + mval[-1].br_blockcount);
 				mval[-1].br_blockcount += mval->br_blockcount;
+				/*
+				 * if one of the extent types is unwritten, make sure
+				 * the extent is reported as unwritten so the caller
+				 * always takes the correct action for unwritten
+				 * extents. This means we always return consistent
+				 * state regardless of whether we find a written or
+				 * unwritten extent first.
+				 */
+				if (mval[-1].br_state != XFS_EXT_UNWRITTEN &&
+				    mval->br_state == XFS_EXT_UNWRITTEN) {
+					mval[-1].br_state = XFS_EXT_UNWRITTEN;
+				}
 			} else if (n > 0 &&
 				   mval->br_startblock == DELAYSTARTBLOCK &&
 				   mval[-1].br_startblock == DELAYSTARTBLOCK &&

> > It seems to me that with such modifications, the only thing that we
> > are using the bufferhead for is the buffer_uptodate() flag to
> > determine if we should write the block or not. If we can find some
> > other way of holding this state then we don't need bufferheads in
> > the write path at all, right?
>
> There's really two steps. 
First we can stop needing buffer heads > for the space allocation / mapping. This is doable with the slight > tweak of XFS_BMAPI_IGSTATE semantics. > > We still do need to set BH_Delay or BH_Unwritten for use in > __block_write_begin and block_truncate_page, but they become completely > interchangeable at that point. > > If we want to completely get rid of buffer heads things are a bit > more complicated. It's doable as shown by the _nobh aops, but we'll > use quite a bit of per-block state that needs to be replaced by per-page > state, Sure, or use a similar method to btrfs which stores dirty state bits in a separate extent tree. Worst case memory usage is still much less than a bufferhead per block... > and we'll lose the way to cache the block number in the buffer > head. While we don't make use of that in writepage we do so in > the write path, although I'm not sure how important it is. If we > get your multi-page write work in it probably won't matter any more. The only place we use bh->b_blocknr is for ioend manipulation. Am I missing something else? Cheers, Dave. -- Dave Chinner david@fromorbit.com
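The btrfs-style alternative mentioned above — keeping dirty state in a separate range structure rather than one bufferhead per block — can be sketched roughly as follows. This is a toy Python model only; btrfs actually implements this in C with an rbtree-backed extent_io tree, and the class name here is invented for illustration:

```python
# Toy per-file dirty-range tracker: dirty state kept as a sorted list of
# merged, disjoint [start, end) byte ranges instead of a bufferhead per
# block. Worst case this stores one small tuple per discontiguous dirty
# range -- far less than per-block metadata for a large file.

import bisect

class DirtyRanges:
    def __init__(self):
        self.ranges = []  # disjoint, sorted [start, end) tuples

    def mark_dirty(self, start, end):
        new = []
        for s, e in self.ranges:
            if e < start or s > end:       # disjoint and not touching
                new.append((s, e))
            else:                          # overlapping/adjacent: absorb
                start, end = min(s, start), max(e, end)
        bisect.insort(new, (start, end))
        self.ranges = new

    def is_dirty(self, offset):
        return any(s <= offset < e for s, e in self.ranges)

d = DirtyRanges()
d.mark_dirty(0, 4096)
d.mark_dirty(4096, 8192)       # adjacent: merged with the first range
d.mark_dirty(65536, 69632)
print(d.ranges)                # [(0, 8192), (65536, 69632)]
```

Two adjacent 4k dirties collapse into one 8k range, so sub-page dirty tracking costs memory only where the dirty ranges are actually discontiguous.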
* Re: Issues with delalloc->real extent allocation 2011-01-19 13:31 ` Dave Chinner @ 2011-01-19 13:55 ` Christoph Hellwig 2011-01-20 1:33 ` Dave Chinner 0 siblings, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2011-01-19 13:55 UTC (permalink / raw) To: Dave Chinner; +Cc: bpm, xfs On Thu, Jan 20, 2011 at 12:31:47AM +1100, Dave Chinner wrote: > > If we want to completely get rid of buffer heads things are a bit > > more complicated. It's doable as shown by the _nobh aops, but we'll > > use quite a bit of per-block state that needs to be replaced by per-page > > state, > > Sure, or use a similar method to btrfs which stores dirty state bits > in a separate extent tree. Worst case memory usage is still much > less than a bufferhead per block... I'm not sure we need to track sub-page dirty state. It only matters if we:

a) have a file fragmented enough that it has multiple extents allocated inside a single page
b) have enough small writes that just dirty parts of a page

with a good enough persistent preallocation a) should happen almost never, while b) might be an issue, especially with setups of 64k page size and 4k blocks (e.g. ppc64 enterprise distro configs) > > and we'll lose the way to cache the block number in the buffer > > head. While we don't make use of that in writepage we do so in > > the write path, although I'm not sure how important it is. If we > > get your multi-page write work in it probably won't matter any more. You're right. I thought we use it in the write path, but we only care about the buffer_mapped flag and never actually look at the block number.
* Re: Issues with delalloc->real extent allocation 2011-01-19 13:55 ` Christoph Hellwig @ 2011-01-20 1:33 ` Dave Chinner 2011-01-20 11:16 ` Christoph Hellwig 2011-01-20 14:45 ` Geoffrey Wehrman 0 siblings, 2 replies; 30+ messages in thread From: Dave Chinner @ 2011-01-20 1:33 UTC (permalink / raw) To: Christoph Hellwig; +Cc: bpm, xfs On Wed, Jan 19, 2011 at 08:55:48AM -0500, Christoph Hellwig wrote: > On Thu, Jan 20, 2011 at 12:31:47AM +1100, Dave Chinner wrote: > > > If we want to completely get rid of buffers heads things are a bit > > > more complicated. It's doable as shown by the _nobh aops, but we'll > > > use quite a bit of per-block state that needs to be replaced by per-page > > > state, > > > > Sure, or use a similar method to btrfs which stores dirty state bits > > in a separate extent tree. Worst case memory usage is still much > > less than a bufferhead per block... > > I'm not sure need to track sub-page dirty state. It only matters if we: > > a) have a file fragmented enough that it has multiple extents allocated > inside a single page > b) have enough small writes that just dirty parts of a page > > with a good enough persistant preallocation a) should happen almost > never, while b) might be an issue, specially with setups of 64k > page size and 4k blocks (e.g. ppc64 enterprise distro configs) Right - case a) could probably be handled by making the page size an implicit extsize hint so we always try to minimise sub-page fragmentation during allocation. It's case b) that I'm mainly worried about, esp. w.r.t the 64k page size on ia64/ppc. If we only track a single dirty bit in the page, then every sub-page, non-appending write to an uncached region of a file becomes a RMW cycle to initialise the areas around the write correctly. The question is whether we care about this enough given that we return at least PAGE_SIZE in stat() to tell applications the optimal IO size to avoid RMW cycles. 
Given that XFS is aimed towards optimising for the large file/large IO/high throughput type of application, I'm comfortable with saying that avoiding sub-page writes for optimal throughput IO is an application problem and going from there. Especially considering that stuff like rsync and untarring kernel tarballs are all appending writes so won't take any performance hit at all... And if we only do IO on whole pages (i.e. regardless of block size) .writepage suddenly becomes a lot simpler, as well as being trivial to implement our own .readpage/.readpages.... What do people think about this? Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: Issues with delalloc->real extent allocation 2011-01-20 1:33 ` Dave Chinner @ 2011-01-20 11:16 ` Christoph Hellwig 2011-01-21 1:59 ` Dave Chinner 2011-01-20 14:45 ` Geoffrey Wehrman 1 sibling, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2011-01-20 11:16 UTC (permalink / raw) To: Dave Chinner; +Cc: Christoph Hellwig, bpm, xfs On Thu, Jan 20, 2011 at 12:33:46PM +1100, Dave Chinner wrote: > It's case b) that I'm mainly worried about, esp. w.r.t the 64k page > size on ia64/ppc. If we only track a single dirty bit in the page, > then every sub-page, non-appending write to an uncached region of a > file becomes a RMW cycle to initialise the areas around the write > correctly. The question is whether we care about this enough given > that we return at least PAGE_SIZE in stat() to tell applications the > optimal IO size to avoid RMW cycles. Note that this generally is only true for the first write into the region - after that we'll have the rest read into the cache. But we also have the same issue for appending writes if they aren't page aligned. > And if we only do IO on whole pages (i.e. regardless of block size) > .writepage suddenly becomes a lot simpler, as well as being trivial > to implement our own .readpage/.readpages.... I don't think it simplifies writepage a lot. All the buffer head handling goes away, but we'll still need to do xfs_bmapi calls at block size granularity. Why would you want to replace the readpage/readpages code? The generic mpage helpers for it do just fine.
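The RMW cost under discussion can be made concrete with a little arithmetic. This is a toy sketch with illustrative numbers, not a measurement: with only page-granularity dirty tracking, a small write into an uncached page must read the whole page in, modify it, and write the whole page back, while a cached page (or a full-page write) avoids the read:

```python
# Toy illustration of the read-modify-write (RMW) cost of a sub-page
# write when dirty state is tracked per page. Numbers are illustrative.

PAGE_SIZE = 65536   # 64k pages (e.g. ia64/ppc64 configs in the thread)
BLOCK_SIZE = 4096   # 4k filesystem blocks

def io_bytes_for_write(write_size, page_cached):
    """Bytes moved for a write of write_size bytes within one page.

    With per-page dirty tracking only, a partial write to an uncached
    page needs the whole page read first (RMW), then written back."""
    if write_size >= PAGE_SIZE or page_cached:
        return PAGE_SIZE               # just write the page out
    return PAGE_SIZE + PAGE_SIZE       # read whole page + write whole page

# First 4k write into an uncached region: full RMW cycle.
print(io_bytes_for_write(BLOCK_SIZE, page_cached=False))  # 131072
# Subsequent writes to the now-cached page avoid the read.
print(io_bytes_for_write(BLOCK_SIZE, page_cached=True))   # 65536
```

This mirrors the exchange above: the penalty is a one-off per uncached region (and for unaligned appends), which is why it is argued to be an application-level optimisation concern.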
* Re: Issues with delalloc->real extent allocation 2011-01-20 11:16 ` Christoph Hellwig @ 2011-01-21 1:59 ` Dave Chinner 0 siblings, 0 replies; 30+ messages in thread From: Dave Chinner @ 2011-01-21 1:59 UTC (permalink / raw) To: Christoph Hellwig; +Cc: bpm, xfs On Thu, Jan 20, 2011 at 06:16:12AM -0500, Christoph Hellwig wrote: > On Thu, Jan 20, 2011 at 12:33:46PM +1100, Dave Chinner wrote: > > It's case b) that I'm mainly worried about, esp. w.r.t the 64k page > > size on ia64/ppc. If we only track a single dirty bit in the page, > > then every sub-page, non-appending write to an uncached region of a > > file becomes a RMW cycle to initialise the areas around the write > > correctly. The question is whether we care about this enough given > > that we return at least PAGE_SIZE in stat() to tell applications the > > optimal IO size to avoid RMW cycles. > > Note that this generally is only true for the first write into the > region - after that we'll have the rest read into the cache. But > we also have the same issue for appending writes if they aren't > page aligned. True - I kind of implied that by saying RMW cycles are limited to "uncached regions", but you've stated in a much clearer and easier to understand way. ;) > > And if we only do IO on whole pages (i.e regardless of block size) > > .writepage suddenly becomes a lot simpler, as well as being trivial > > to implement our own .readpage/.readpages.... > > I don't think it simplifies writepage a lot. All the buffer head > handling goes away, but we'll still need to do xfs_bmapi calls at > block size granularity. Why would you want to replaced the > readpage/readpages code? The generic mpage helpers for it do just fine. When I went through the mpage code I found there were cases where it would attach bufferheads to pages or assume PagePrivate() contains a bufferhead list. e.g. If there are multiple holes in the page, it will fall through to block_read_full_page() which makes this assumption. 
If we want/need to keep any of our own state on PagePrivate(), we cannot use any function that assumes PagePrivate() is used to hold bufferheads for the page. Quite frankly, a simple extent mapping loop like we do for .writepage is far simpler than what mpage_readpages does. This is what btrfs does (extent_readpages/__extent_read_full_page), and that is far easier to follow and understand than do_mpage_readpage().... Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: Issues with delalloc->real extent allocation 2011-01-20 1:33 ` Dave Chinner 2011-01-20 11:16 ` Christoph Hellwig @ 2011-01-20 14:45 ` Geoffrey Wehrman 2011-01-21 2:51 ` Dave Chinner 1 sibling, 1 reply; 30+ messages in thread From: Geoffrey Wehrman @ 2011-01-20 14:45 UTC (permalink / raw) To: Dave Chinner; +Cc: Christoph Hellwig, bpm, xfs On Thu, Jan 20, 2011 at 12:33:46PM +1100, Dave Chinner wrote: | Given that XFS is aimed towards optimising for the large file/large | IO/high throughput type of application, I'm comfortable with saying | that avoiding sub-page writes for optimal throughput IO is an | application problem and going from there. Especially considering | that stuff like rsync and untarring kernel tarballs are all | appending writes so won't take any performance hit at all... I agree. I do not expect systems with 64K pages to be used for single bit manipulations. However, I do see a couple of potential problems. The one place where page size I/O may not work though is for a DMAPI/HSM (DMF) managed filesystem where some blocks are managed on near-line media. The DMAPI needs to be able to remove and restore extents on a filesystem block boundary, not a page boundary. The other downside is that for sparse files, we could end up allocating space for zero filled blocks. There may be some workloads where significant quantities of space are wasted. -- Geoffrey Wehrman
* Re: Issues with delalloc->real extent allocation 2011-01-20 14:45 ` Geoffrey Wehrman @ 2011-01-21 2:51 ` Dave Chinner 2011-01-21 14:41 ` Geoffrey Wehrman 0 siblings, 1 reply; 30+ messages in thread From: Dave Chinner @ 2011-01-21 2:51 UTC (permalink / raw) To: Geoffrey Wehrman; +Cc: Christoph Hellwig, bpm, xfs On Thu, Jan 20, 2011 at 08:45:03AM -0600, Geoffrey Wehrman wrote: > On Thu, Jan 20, 2011 at 12:33:46PM +1100, Dave Chinner wrote: > | Given that XFS is aimed towards optimising for the large file/large > | IO/high throughput type of application, I'm comfortable with saying > | that avoiding sub-page writes for optimal throughput IO is an > | application problem and going from there. Especially considering > | that stuff like rsync and untarring kernel tarballs are all > | appending writes so won't take any performance hit at all... > > I agree. I do not expect systems with 64K pages to be used for single > bit manipulations. However, I do see a couple of potential problems. > > The one place where page size I/O may not work though is for a DMAPI/HSM > (DMF) managed filesystem where some blocks are managed on near-line media. > The DMAPI needs to be able to remove and restore extents on a filesystem > block boundary, not a page boundary. DMAPI is still free to remove extents at whatever boundary it wants. The only difference is that it would be asked to restore extents to the page boundary the write covers rather than a block boundary. The allocation and direct IO boundaries do not change, so the only thing that needs to change is the range that DMAPI is told that the read/write is going to cover.... > The other downside is that for sparse files, we could end up allocating > space for zero filled blocks. There may be some workloads where > significant quantities of space are wasted. Yes, that is possible, though on the other hand it will reduce worst case fragmentation of pathological sparse file filling applications. e.g. 
out-of-core solvers that do strided writes across the file to write the first column in the result matrix as it is calculated, then the second column, then the third ... until all columns are written. ---- Realistically, for every disadvantage or advantage we can enumerate for specific workloads, I think one of us will be able to come up with a counter example that shows the opposite of the original point. I don't think this sort of argument is particularly productive. :/ Instead, I look at it from the point of view that a 64k IO is only a little slower than a 4k IO, so such a change would not make much difference to performance. And given that terabytes of storage capacity is _cheap_ these days (and getting cheaper all the time), the extra space of using 64k instead of 4k for sparse blocks isn't a big deal. When I combine that with my experience from SGI where we always recommended using filesystems block size == page size for best IO performance on HPC setups, there's a fair argument that using page size extents for small sparse writes isn't a problem we really need to care about. I'd prefer to design for where we expect storage to be in the next few years e.g. 10TB spindles. Minimising space usage is not a big priority when we consider that in 2-3 years 100TB of storage will cost less than $5000 (it's about $15-20k right now). Even on desktops we're going to have more capacity than we know what to do with, so trading off storage space for lower memory overhead, lower metadata IO overhead and lower potential fragmentation seems like the right way to move forward to me. Does that seem like a reasonable position to take, or are there other factors that you think I should be considering? Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: Issues with delalloc->real extent allocation 2011-01-21 2:51 ` Dave Chinner @ 2011-01-21 14:41 ` Geoffrey Wehrman 2011-01-23 23:26 ` Dave Chinner 0 siblings, 1 reply; 30+ messages in thread From: Geoffrey Wehrman @ 2011-01-21 14:41 UTC (permalink / raw) To: Dave Chinner; +Cc: Christoph Hellwig, bpm, xfs On Fri, Jan 21, 2011 at 01:51:40PM +1100, Dave Chinner wrote: | Realistically, for every disadvantage or advantage we can enumerate | for specific workloads, I think one of us will be able to come up | with a counter example that shows the opposite of the original | point. I don't think this sort of argument is particularly | productive. :/ Sorry, I wasn't trying to be argumentative. Rather I was just documenting what I saw as potential issues. I'm not arguing against your proposed change. If you don't find my sharing of observations productive, I'm happy to keep my thoughts to myself in the future. | Instead, I look at it from the point of view that a 64k IO is only a | little slower than a 4k IO, so such a change would not make much | difference to performance. And given that terabytes of storage capacity | is _cheap_ these days (and getting cheaper all the time), the extra | space of using 64k instead of 4k for sparse blocks isn't a big deal. | | When I combine that with my experience from SGI where we always | recommended using filesystems block size == page size for best IO | performance on HPC setups, there's a fair argument that using page | size extents for small sparse writes isn't a problem we really need | to care about. | | I'd prefer to design for where we expect storage to be in the next | few years e.g. 10TB spindles. Minimising space usage is not a big | priority when we consider that in 2-3 years 100TB of storage will | cost less than $5000 (it's about $15-20k right now). 
Even on | desktops we're going to have more capacity than we know what to do | with, so trading off storage space for lower memory overhead, lower | metadata IO overhead and lower potential fragmentation seems like | the right way to move forward to me. | | Does that seem like a reasonable position to take, or are there | other factors that you think I should be considering? Keep in mind that storage of the future may not be on spindles, and fragmentation may not be an issue. Even so, with SSD 64K I/O is very reasonable as most flash memory implements at minimum a 64K page. I'm fully in favor of your proposal to require page sized I/O. -- Geoffrey Wehrman
* Re: Issues with delalloc->real extent allocation 2011-01-21 14:41 ` Geoffrey Wehrman @ 2011-01-23 23:26 ` Dave Chinner 0 siblings, 0 replies; 30+ messages in thread From: Dave Chinner @ 2011-01-23 23:26 UTC (permalink / raw) To: Geoffrey Wehrman; +Cc: Christoph Hellwig, bpm, xfs On Fri, Jan 21, 2011 at 08:41:52AM -0600, Geoffrey Wehrman wrote: > On Fri, Jan 21, 2011 at 01:51:40PM +1100, Dave Chinner wrote: > | Realistically, for every disadvantage or advantage we can enumerate > | for specific workloads, I think one of us will be able to come up > | with a counter example that shows the opposite of the original > | point. I don't think this sort of argument is particularly > | productive. :/ > > Sorry, I wasn't trying to be argumentative. Rather I was just documenting > what I saw as potential issues. I'm not arguing against your proposed > change. If you don't find my sharing of observations productive, I'm > happy to keep my thoughts to myself in the future. Ah, that's not what I meant, Geoffrey. Reading it back, I probably should have said "direction of discussion" rather than "sort of argument" to make it more obvious that I was trying not to get stuck with us going round and round trying to demonstrate the pros and cons of different approaches on a workload-by-workload basis. Basically all I was trying to do is move the discussion past a potential sticking point - I definitely value the input and insight you provide, and I'll try to write more clearly to hopefully avoid such misunderstandings in future discussions. > | Instead, I look at it from the point of view that a 64k IO is only a > | little slower than a 4k IO, so such a change would not make much > | difference to performance. And given that terabytes of storage capacity > | is _cheap_ these days (and getting cheaper all the time), the extra > | space of using 64k instead of 4k for sparse blocks isn't a big deal. 
> | > | When I combine that with my experience from SGI where we always > | recommended using filesystems block size == page size for best IO > | performance on HPC setups, there's a fair argument that using page > | size extents for small sparse writes isn't a problem we really need > | to care about. > | > | I'd prefer to design for where we expect storage to be in the next > | few years e.g. 10TB spindles. Minimising space usage is not a big > | priority when we consider that in 2-3 years 100TB of storage will > | cost less than $5000 (it's about $15-20k right now). Even on > | desktops we're going to have more capacity than we know what to do > | with, so trading off storage space for lower memory overhead, lower > | metadata IO overhead and lower potential fragmentation seems like > | the right way to move forward to me. > | > | Does that seem like a reasonable position to take, or are there > | other factors that you think I should be considering? > > Keep in mind that storage of the future may not be on spindles, and > fragmentation may not be an issue. Even so, with SSD 64K I/O is very > reasonable as most flash memory implements at minimum a 64K page. I'm > fully in favor of your proposal to require page sized I/O. With flash memory there is the potential that we don't even need to care. The trend is towards on-device compression (e.g. Sandforce controllers already do this) to reduce write amplification to values lower than one. Hence a 4k write surrounded by 60k of zeros is unlikely to be a major issue as it will compress really well.... :) Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: Issues with delalloc->real extent allocation
  2011-01-14 21:43 ` bpm
    ` (2 preceding siblings ...)
  2011-01-14 23:55 ` Dave Chinner
@ 2011-01-17  0:28 ` Lachlan McIlroy
  2011-01-17  4:37 ` Dave Chinner
  3 siblings, 1 reply; 30+ messages in thread
From: Lachlan McIlroy @ 2011-01-17 0:28 UTC (permalink / raw)
To: bpm; +Cc: xfs

----- Original Message -----
> Hi Dave,
>
> On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> > I've noticed a few suspicious things trying to reproduce the
> > allocate-in-the-middle-of-a-delalloc-extent,
> ...
> > Secondly, I think we have the same
> > expose-the-entire-delalloc-extent-to-stale-data-exposure problem in
> > ->writepage. This one, however, is due to using BMAPI_ENTIRE to
> > allocate the entire delalloc extent the first time any part of it
> > is written to. Even if we are only writing a single page (i.e.
> > wbc->nr_to_write = 1) and the delalloc extent covers gigabytes. So,
> > same problem when we crash.
> >
> > Finally, I think the extsize based problem exposed by test 229 is
> > also a result of allocating space we have no pages covering in the
> > page cache (triggered by BMAPI_ENTIRE allocation), so the allocated
> > space is never zeroed and hence exposes stale data.
>
> This is precisely the bug I was going after when I hit the
> allocate-in-the-middle-of-a-delalloc-extent bug. This is a race
> between block_prepare_write/__xfs_get_blocks and
> writepage/xfs_page_state_convert. When xfs_page_state_convert
> allocates a real extent for a page toward the beginning of a delalloc
> extent, XFS_BMAPI converts the entire delalloc extent. Any subsequent
> writes into the page cache toward the end of this freshly allocated
> extent will see a written extent instead of delalloc and read the
> block from disk into the page before writing over it. If the write
> does not cover the entire page, garbage from disk will be exposed in
> the page cache.
Ben, take a look at the XFS gap list code in IRIX - this code was
designed specifically to handle this problem. It's also implemented in
several of the CXFS clients too. On entry to a write() it will create
a list of all the holes that exist in the write range before any
delayed allocations are created. The first call to xfs_bmapi() sets up
the delayed allocation for the entire write (even if the bmaps
returned don't cover the full write range). If the write results in a
fault from disk, it checks the gap list to see if any sections of the
buffer used to cover a hole and, if so, it ignores the state of the
extent and zeroes those region(s) in the buffer that match the
pre-existing holes. If the buffer has multiple non-hole sections that
need to be read from disk, the whole buffer is read from disk and the
zeroing of the holes is done post-I/O - this reduces the number of
I/Os to be done. The whole delayed allocation can be safely converted
at any time without risk of reading exposed data (assuming no crash,
that is).

As the write progresses through the range it removes sections from the
front of the gap list, so by the time the write is complete the gap
list is empty. The gap list does not have a dedicated lock to protect
it but instead relies on the iolock to ensure that only one write
operation occurs at once (so it's not appropriate for direct I/O).

>
> <snip>
>
> > I'm sure there are other ways to solve these problems, but these
> > two are the ones that come to mind for me. I'm open to other
> > solutions or ways to improve on these ones, especially if they are
> > simpler. ;) Anyone got any ideas or improvements?
>
> The direction I've been taking is to use XFS_BMAPI_EXACT in
> *xfs_iomap_write_allocate to ensure that an extent covering exactly
> the pages we're prepared to write out immediately is allocated and
> the rest of the delalloc extent is left as is.
> This exercises some of the btree code more heavily and led to the
> discovery of the allocate-in-the-middle-of-a-delalloc-extent bug. It
> also presents a performance issue, which I've tried to resolve by
> extending xfs_probe_cluster to probe delalloc extents - lock up all
> of the pages to be converted before performing the allocation and
> hold those locks until they are submitted for writeback. It's not
> very pretty but it resolves the corruption.
>
> There is still the issue of crashes... This could be solved by
> converting from delalloc to unwritten in xfs_page_state_convert in
> this very exact way and then to written in the io completion handler.
> Never go delalloc->written directly.

That solution would probably make the gap list redundant too.

> I have not had luck reproducing this on TOT xfs and have come to
> realize that this is because it doesn't do speculative preallocation
> of larger delalloc extents unless you are using extsize... which I
> haven't tried.
>
> Regards,
> Ben
* Re: Issues with delalloc->real extent allocation
  2011-01-17  0:28 ` Lachlan McIlroy
@ 2011-01-17  4:37 ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2011-01-17 4:37 UTC (permalink / raw)
To: Lachlan McIlroy; +Cc: bpm, xfs

On Sun, Jan 16, 2011 at 07:28:27PM -0500, Lachlan McIlroy wrote:
> ----- Original Message -----
> > Hi Dave,
> >
> > On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> > > I've noticed a few suspicious things trying to reproduce the
> > > allocate-in-the-middle-of-a-delalloc-extent,
> > ...
> > > Secondly, I think we have the same
> > > expose-the-entire-delalloc-extent-to-stale-data-exposure problem
> > > in ->writepage. This one, however, is due to using BMAPI_ENTIRE
> > > to allocate the entire delalloc extent the first time any part of
> > > it is written to. Even if we are only writing a single page (i.e.
> > > wbc->nr_to_write = 1) and the delalloc extent covers gigabytes.
> > > So, same problem when we crash.
> > >
> > > Finally, I think the extsize based problem exposed by test 229 is
> > > also a result of allocating space we have no pages covering in
> > > the page cache (triggered by BMAPI_ENTIRE allocation), so the
> > > allocated space is never zeroed and hence exposes stale data.
> >
> > This is precisely the bug I was going after when I hit the
> > allocate-in-the-middle-of-a-delalloc-extent bug. This is a race
> > between block_prepare_write/__xfs_get_blocks and
> > writepage/xfs_page_state_convert. When xfs_page_state_convert
> > allocates a real extent for a page toward the beginning of a
> > delalloc extent, XFS_BMAPI converts the entire delalloc extent. Any
> > subsequent writes into the page cache toward the end of this
> > freshly allocated extent will see a written extent instead of
> > delalloc and read the block from disk into the page before writing
> > over it. If the write does not cover the entire page, garbage from
> > disk will be exposed in the page cache.
>
> Ben, take a look at the XFS gap list code in IRIX - this code was
> designed specifically to handle this problem. It's also implemented
> in several of the CXFS clients too. On entry to a write() it will
> create a list of all the holes that exist in the write range before
> any delayed allocations are created. The first call to xfs_bmapi()
> sets up the delayed allocation for the entire write (even if the
> bmaps returned don't cover the full write range). If the write
> results in a fault from disk, it checks the gap list to see if any
> sections of the buffer used to cover a hole and, if so, it ignores
> the state of the extent and zeroes those region(s) in the buffer that
> match the pre-existing holes. If the buffer has multiple non-hole
> sections that need to be read from disk, the whole buffer is read
> from disk and the zeroing of the holes is done post-I/O - this
> reduces the number of I/Os to be done. The whole delayed allocation
> can be safely converted at any time without risk of reading exposed
> data (assuming no crash, that is).

IMO, that caveat exposes the problem with this approach - it is not
persistent. It adds a lot of complexity but doesn't solve the
underlying problem: that we are exposing real extents via allocation
without having written any data to them.

> As the write progresses through the range it removes sections from
> the front of the gap list, so by the time the write is complete the
> gap list is empty. The gap list does not have a dedicated lock to
> protect it but instead relies on the iolock to ensure that only one
> write operation occurs at once (so it's not appropriate for direct
> I/O).
>
> >
> > <snip>
> >
> > > I'm sure there are other ways to solve these problems, but these
> > > two are the ones that come to mind for me. I'm open to other
> > > solutions or ways to improve on these ones, especially if they
> > > are simpler. ;) Anyone got any ideas or improvements?
> >
> > The direction I've been taking is to use XFS_BMAPI_EXACT in
> > *xfs_iomap_write_allocate to ensure that an extent covering exactly
> > the pages we're prepared to write out immediately is allocated and
> > the rest of the delalloc extent is left as is. This exercises some
> > of the btree code more heavily and led to the discovery of the
> > allocate-in-the-middle-of-a-delalloc-extent bug. It also presents a
> > performance issue, which I've tried to resolve by extending
> > xfs_probe_cluster to probe delalloc extents - lock up all of the
> > pages to be converted before performing the allocation and hold
> > those locks until they are submitted for writeback. It's not very
> > pretty but it resolves the corruption.
> >
> > There is still the issue of crashes... This could be solved by
> > converting from delalloc to unwritten in xfs_page_state_convert in
> > this very exact way and then to written in the io completion
> > handler. Never go delalloc->written directly.
>
> That solution would probably make the gap list redundant too.

Yes, I think you are right. The "gaps" would get mapped as unwritten
rather than as normal extents and hence get zeroed appropriately. It
would also prevent exposure after a crash due to being persistent.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads: [~2011-01-23 23:24 UTC | newest]

Thread overview: 30+ messages
2011-01-14  0:29 Issues with delalloc->real extent allocation Dave Chinner
2011-01-14 16:40 ` Geoffrey Wehrman
2011-01-14 22:59 ` Dave Chinner
2011-01-15  4:16 ` Geoffrey Wehrman
2011-01-17  5:18 ` Dave Chinner
2011-01-17 14:37 ` Geoffrey Wehrman
2011-01-18  0:24 ` Dave Chinner
2011-01-18 14:30 ` Geoffrey Wehrman
2011-01-18 20:40 ` Christoph Hellwig
2011-01-18 22:03 ` Dave Chinner
2011-01-14 21:43 ` bpm
2011-01-14 23:32 ` bpm
2011-01-14 23:50 ` bpm
2011-01-14 23:55 ` Dave Chinner
2011-01-17 20:12 ` bpm
2011-01-18  1:44 ` Dave Chinner
2011-01-18 20:47 ` Christoph Hellwig
2011-01-18 23:18 ` Dave Chinner
2011-01-19 12:03 ` Christoph Hellwig
2011-01-19 13:31 ` Dave Chinner
2011-01-19 13:55 ` Christoph Hellwig
2011-01-20  1:33 ` Dave Chinner
2011-01-20 11:16 ` Christoph Hellwig
2011-01-21  1:59 ` Dave Chinner
2011-01-20 14:45 ` Geoffrey Wehrman
2011-01-21  2:51 ` Dave Chinner
2011-01-21 14:41 ` Geoffrey Wehrman
2011-01-23 23:26 ` Dave Chinner
2011-01-17  0:28 ` Lachlan McIlroy
2011-01-17  4:37 ` Dave Chinner