All of lore.kernel.org
 help / color / mirror / Atom feed
* Issues with delalloc->real extent allocation
@ 2011-01-14  0:29 Dave Chinner
  2011-01-14 16:40 ` Geoffrey Wehrman
  2011-01-14 21:43 ` bpm
  0 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2011-01-14  0:29 UTC (permalink / raw)
  To: xfs

Folks,

I've noticed a few suspicious things trying to reproduce the
allocate-in-the-middle-of-a-delalloc-extent,

Firstly preallocation into delalloc ranges behaves in a unexpected
and dangerous way. Preallocation uses unwritten extents to avoid
stale data exposure, but when issued into a range that is covered by
a delalloc extent, it does not use unwritten extents.

The code that
causes this is in xfs_bmapi():

		/*
		 * Determine state of extent, and the filesystem.
		 * A wasdelay extent has been initialized, so
		 * shouldn't be flagged as unwritten.
		 */
		if (wr && xfs_sb_version_hasextflgbit(&mp->m_sb)) {
			if (!wasdelay && (flags & XFS_BMAPI_PREALLOC))
				got.br_state = XFS_EXT_UNWRITTEN;
		}

This seems to be incorrect to me - a "wasdelay" extent has not yet
been initialised - there's data in memory, but there is nothing on
disk and we may not write it for some time. If we crash after this
transaction is written but before any data is written, we expose
stale data.

Not only that, it allocates the _entire_ delalloc extent that spans
the preallocation range, even when the preallocation range is only 1
block and the delalloc extent covers gigabytes. hence we actually
expose a much greater range of the file to stale data exposure
during a crash than just eh preallocated range. Not good.

Secondly, I think we have the same expose-the-entire-delalloc-extent
-to-stale-data-exposure problem in ->writepage. This onnne, however,
is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
the first time any part of it is written to. Even if we are only
writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
extent covers gigabytes. So, same problem when we crash.

Finally, I think the extsize based problem exposed by test 229 is a
also a result of allocating space we have no pages covering in the
page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
space is never zeroed and hence exposes stale data.

A possible solution to all of these problems is to allocate delalloc
space as unwritten extents.  It's obvious to see how this solves the
first two cases, but the last one is a bit less obvious. That one is
solved by the fact that unwritten extent conversion does not get
extent size aligned, so any block we don't write data for will
remain in the unwritten state and therefore correctly return zeros
for any read...

The issues with this approach are really about the cost of unwritten
extent conversion and the impact on low memory operation. The cost
is mainly a CPU cost due to delayed logging, but that will be
significant if we have to do an unwritten conversion on every IO
completion. Batching is definitely possible and would mostly
alleviate the overhead, but does add complexity and especially
w.r.t. unwritten extent conversion sometimes requiring allocation to
complete.

The impact on low memory behaviour is less easy to define and
handle. Extent conversion can block needing memory, so if we need to
complete the extent conversion to free memory we deadlock. For
delalloc writes, we can probably mark IO complete and pages clean
before we do the conversion in a batched, async manner. That would
leaves us with the same structure as we currently have, but adds
more complexity because we now need to track extent ranges in
"conversion pending" state for subsequent reads and sync operations.
i.e. an extent in a pending state is treated like a real extent for
the purposes of reading, but conversion needs to be completed during
a sync operation.

Another possibility for just the ->writepage problem is that we
could make delalloc allocation in ->writepage more like and EFI/EFD
type operation. That is, we allocate the space and log all the
changes in an allocation intent transaction (basically the same as
the current allocation transaction) and then add an allocation done
transaction that is issued at IO completion that basically
manipulation won't require IO to complete. Effectively this would be
logging all the xfs_ioend structures. Then in log recovery we simply
convert any range that does not have a done transaction to
unwritten, thereby preventing stale data exposure. The down side to
this approach is that it needs new transactions and log recovery
code, so is not backwards compatible.

I think is probably a simpler solution with lower overhead compared
to allocating unwritten extents everywhere. It doesn't solve the
extsize problem, but that probably should be handled up at the
->aio_write level, not at the ->write_page level.  Similarly, the
preallocation issue could easily be handled with small changes to
xfs_bmapi()....

I'm sure there are other ways to solve these problems, but these two
are the ones that come to mind for me.  I'm open to other solutions
or ways to improve on these ones, especially if they are simpler. ;)
Anyone got any ideas or improvements?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-14  0:29 Issues with delalloc->real extent allocation Dave Chinner
@ 2011-01-14 16:40 ` Geoffrey Wehrman
  2011-01-14 22:59   ` Dave Chinner
  2011-01-14 21:43 ` bpm
  1 sibling, 1 reply; 30+ messages in thread
From: Geoffrey Wehrman @ 2011-01-14 16:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
| This seems to be incorrect to me - a "wasdelay" extent has not yet
| been initialised - there's data in memory, but there is nothing on
| disk and we may not write it for some time. If we crash after this
| transaction is written but before any data is written, we expose
| stale data.
| 
| Not only that, it allocates the _entire_ delalloc extent that spans
| the preallocation range, even when the preallocation range is only 1
| block and the delalloc extent covers gigabytes. hence we actually
| expose a much greater range of the file to stale data exposure
| during a crash than just eh preallocated range. Not good.
| 
| Secondly, I think we have the same expose-the-entire-delalloc-extent
| -to-stale-data-exposure problem in ->writepage. This onnne, however,
| is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
| the first time any part of it is written to. Even if we are only
| writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
| extent covers gigabytes. So, same problem when we crash.
| 
| Finally, I think the extsize based problem exposed by test 229 is a
| also a result of allocating space we have no pages covering in the
| page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
| space is never zeroed and hence exposes stale data.

There used to be an XFS_BMAPI_EXACT flag that wasn't ever used.  What
would be the effects of re-creating this flag and using it in writepage
to prevent the expose-the-entire-delalloc-extent-to-stale-data-exposure
problem?  This wouldn't solve the exposure of stale data for a crash
that occurs after the extent conversion but before the data is written
out.  The quantity of data exposed is potentially much smaller however.

Also, I'm not saying using XFS_BMAPI_EXACT is feasable.  I have a very
minimal understanding of the writepage code path.


-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-14  0:29 Issues with delalloc->real extent allocation Dave Chinner
  2011-01-14 16:40 ` Geoffrey Wehrman
@ 2011-01-14 21:43 ` bpm
  2011-01-14 23:32   ` bpm
                     ` (3 more replies)
  1 sibling, 4 replies; 30+ messages in thread
From: bpm @ 2011-01-14 21:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,

On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> I've noticed a few suspicious things trying to reproduce the
> allocate-in-the-middle-of-a-delalloc-extent,
...
> Secondly, I think we have the same expose-the-entire-delalloc-extent
> -to-stale-data-exposure problem in ->writepage. This onnne, however,
> is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
> the first time any part of it is written to. Even if we are only
> writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
> extent covers gigabytes. So, same problem when we crash.
>
> Finally, I think the extsize based problem exposed by test 229 is a
> also a result of allocating space we have no pages covering in the
> page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
> space is never zeroed and hence exposes stale data.

This is precisely the bug I was going after when I hit the
allocate-in-the-middle-of-a-delalloc-extent bug.  This is a race between
block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state
convert.  When xfs_page_state_convert allocates a real extent for a page
toward the beginning of a delalloc extent, XFS_BMAPI converts the entire
delalloc extent.  Any subsequent writes into the page cache toward the
end of this freshly allocated extent will see a written extent instead
of delalloc and read the block from disk into the page before writing
over it.  If the write does not cover the entire page garbage from disk
will be exposed into the page cache.

<snip>

> I'm sure there are other ways to solve these problems, but these two
> are the ones that come to mind for me.  I'm open to other solutions
> or ways to improve on these ones, especially if they are simpler. ;)
> Anyone got any ideas or improvements?

The direction I've been taking is to use XFS_BMAPI_EXACT in
*xfs_iomap_write_allocate to ensure that an extent covering exactly the
pages we're prepared to write out immediately is allocated and the rest
of the delalloc extent is left as is.  This exercises some of the btree
code more heavily and led to the discovery of the
allocate-in-the-middle-of-a-delalloc-extent bug.  It also presents a
performance issue which I've tried to resolve by extending
xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
to be converted before performing the allocation and hold those locks
until they are submitted for writeback.  It's not very pretty but it
resolves the corruption.

There is still the issue of crashes...  This could be solved by
converting from delalloc to unwritten in xfs_page_state_convert in this
very exact way and then to written in the io completion handler.  Never
go delalloc->written directly.

I have not had luck reproducing this on TOT xfs and have come to realize
that this is because it doesn't do speculative preallocation of larger
delalloc extents unless you are using extsize... which I haven't tried.

Regards,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-14 16:40 ` Geoffrey Wehrman
@ 2011-01-14 22:59   ` Dave Chinner
  2011-01-15  4:16     ` Geoffrey Wehrman
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2011-01-14 22:59 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: xfs

On Fri, Jan 14, 2011 at 10:40:16AM -0600, Geoffrey Wehrman wrote:
> On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> | This seems to be incorrect to me - a "wasdelay" extent has not yet
> | been initialised - there's data in memory, but there is nothing on
> | disk and we may not write it for some time. If we crash after this
> | transaction is written but before any data is written, we expose
> | stale data.
> | 
> | Not only that, it allocates the _entire_ delalloc extent that spans
> | the preallocation range, even when the preallocation range is only 1
> | block and the delalloc extent covers gigabytes. hence we actually
> | expose a much greater range of the file to stale data exposure
> | during a crash than just eh preallocated range. Not good.
> | 
> | Secondly, I think we have the same expose-the-entire-delalloc-extent
> | -to-stale-data-exposure problem in ->writepage. This onnne, however,
> | is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
> | the first time any part of it is written to. Even if we are only
> | writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
> | extent covers gigabytes. So, same problem when we crash.
> | 
> | Finally, I think the extsize based problem exposed by test 229 is a
> | also a result of allocating space we have no pages covering in the
> | page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
> | space is never zeroed and hence exposes stale data.
> 
> There used to be an XFS_BMAPI_EXACT flag that wasn't ever used.  What
> would be the effects of re-creating this flag and using it in writepage
> to prevent the expose-the-entire-delalloc-extent-to-stale-data-exposure
> problem?  This wouldn't solve the exposure of stale data for a crash
> that occurs after the extent conversion but before the data is written
> out.  The quantity of data exposed is potentially much smaller however.

Definitely a possibility, Geoffrey, and not one that I thought of.
I agree that it would significantly minimise the amount of stale data
exposed on a crash, but there does seem to be some down sides. Ben
has already pointed out the increased cost of allocations, and I can
also think of a couple of others:

	- I'm not sure it's the right solution to the extsize issue
	  because it will prevent extent size and aligned
	  allocations from ocurring, which is exactly what extsize
	  is supposed to provide.

	  I'm also not sure it would work with the realtime device
	  as all allocation has to be at least rtextent_size sized
	  and aligned rather than page boundary. I need to check how
	  the rt allocation code handles sub-rtextent size writes -
	  that may point to the solution for the general case here.

	- I think that using XFS_BMAPI_EXACT semantic will possibly
	  make the speculative EOF preallocation worthless. Delayed
	  allocation conversion will never convert the speculative
	  delalloc preallocation into real extents (beyond EOF) and
	  we'll see increased fragementation in many common
	  workloads due to that....

> Also, I'm not saying using XFS_BMAPI_EXACT is feasable.  I have a very
> minimal understanding of the writepage code path.

I think there are situations where this does make sense, but given
the potential issues I'm not sure it is a solution that can be
extended to the general case. A good discussion point on a different
angle, though. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-14 21:43 ` bpm
@ 2011-01-14 23:32   ` bpm
  2011-01-14 23:50   ` bpm
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: bpm @ 2011-01-14 23:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,

On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@sgi.com wrote:
> On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> > I'm sure there are other ways to solve these problems, but these two
> > are the ones that come to mind for me.  I'm open to other solutions
> > or ways to improve on these ones, especially if they are simpler. ;)
> > Anyone got any ideas or improvements?
> 
> The direction I've been taking is to use XFS_BMAPI_EXACT in
> *xfs_iomap_write_allocate to ensure that an extent covering exactly the
> pages we're prepared to write out immediately is allocated and the rest
> of the delalloc extent is left as is.  This exercises some of the btree
> code more heavily and led to the discovery of the
> allocate-in-the-middle-of-a-delalloc-extent bug.  It also presents a
> performance issue which I've tried to resolve by extending
> xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
> to be converted before performing the allocation and hold those locks
> until they are submitted for writeback.  It's not very pretty but it
> resolves the corruption.

Here's the xfs_page_state_convert side of the patch so you can get an
idea what am trying for, and how ugly it is.  ;^)

I have not ported the xfs_iomap_write_allocate bits yet.  It is against
an older version of xfs... but you get the idea.

-Ben

Index: sles11-src/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- sles11-src.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ sles11-src/fs/xfs/linux-2.6/xfs_aops.c
@@ -585,41 +585,61 @@ STATIC unsigned int
 xfs_probe_page(
 	struct page		*page,
 	unsigned int		pg_offset,
-	int			mapped)
+	int			mapped,
+	unsigned int		type,
+	int			*entire)
 {
+	struct buffer_head	*bh, *head;
 	int			ret = 0;
 
 	if (PageWriteback(page))
 		return 0;
+	if (!page->mapping)
+		return 0;
+	if (!PageDirty(page))
+		return 0;
+	if (!page_has_buffers(page))
+		return 0;
 
-	if (page->mapping && PageDirty(page)) {
-		if (page_has_buffers(page)) {
-			struct buffer_head	*bh, *head;
-
-			bh = head = page_buffers(page);
-			do {
-				if (!buffer_uptodate(bh))
-					break;
-				if (mapped != buffer_mapped(bh))
-					break;
-				ret += bh->b_size;
-				if (ret >= pg_offset)
-					break;
-			} while ((bh = bh->b_this_page) != head);
-		} else
-			ret = mapped ? 0 : PAGE_CACHE_SIZE;
-	}
+	*entire = 0;
+	bh = head = page_buffers(page);
+	do {
+		if (!buffer_uptodate(bh))
+			break;
+		if (buffer_mapped(bh) != mapped)
+			break;
+		if (type == IOMAP_UNWRITTEN && !buffer_unwritten(bh))
+			break;
+		if (type == IOMAP_DELAY && !buffer_delay(bh))
+			break;
+		if (type == IOMAP_NEW && !buffer_dirty(bh))
+			break;
+
+		ret += bh->b_size;
+
+		if (ret >= pg_offset)
+			break;
+	} while ((bh = bh->b_this_page) != head);
+
+	if (bh == head)
+		*entire = 1;
 
 	return ret;
 }
 
+#define MAX_WRITEBACK_PAGES	1024
+
 STATIC size_t
 xfs_probe_cluster(
 	struct inode		*inode,
 	struct page		*startpage,
 	struct buffer_head	*bh,
 	struct buffer_head	*head,
-	int			mapped)
+	int			mapped,
+	unsigned int		type,
+	struct page		**pages,
+	int			*pagecount,
+	struct writeback_control *wbc)
 {
 	struct pagevec		pvec;
 	pgoff_t			tindex, tlast, tloff;
@@ -628,8 +648,15 @@ xfs_probe_cluster(
 
 	/* First sum forwards in this page */
 	do {
-		if (!buffer_uptodate(bh) || (mapped != buffer_mapped(bh)))
+		if (!buffer_uptodate(bh) || (mapped != buffer_mapped(bh))) {
+			return total;
+		} else if (type == IOMAP_UNWRITTEN && !buffer_unwritten(bh)) {
 			return total;
+		} else if (type == IOMAP_DELAY && !buffer_delay(bh)) {
+			return total;
+		} else if (type == IOMAP_NEW && !buffer_dirty(bh)) {
+			return total;
+		}
 		total += bh->b_size;
 	} while ((bh = bh->b_this_page) != head);
 
@@ -637,8 +664,9 @@ xfs_probe_cluster(
 	tlast = i_size_read(inode) >> PAGE_CACHE_SHIFT;
 	tindex = startpage->index + 1;
 
-	/* Prune this back to avoid pathological behavior */
-	tloff = min(tlast, startpage->index + 64);
+	/* Prune this back to avoid pathological behavior, subtract 1 for the
+	 * first page. */
+	tloff = min(tlast, startpage->index + (pgoff_t)MAX_WRITEBACK_PAGES);
 
 	pagevec_init(&pvec, 0);
 	while (!done && tindex <= tloff) {
@@ -647,10 +675,10 @@ xfs_probe_cluster(
 		if (!pagevec_lookup(&pvec, inode->i_mapping, tindex, len))
 			break;
 
-		for (i = 0; i < pagevec_count(&pvec); i++) {
+		for (i = 0; i < pagevec_count(&pvec) && !done; i++) {
 			struct page *page = pvec.pages[i];
 			size_t pg_offset, pg_len = 0;
-
+			int	entire = 0;
 			if (tindex == tlast) {
 				pg_offset =
 				    i_size_read(inode) & (PAGE_CACHE_SIZE - 1);
@@ -658,20 +686,39 @@ xfs_probe_cluster(
 					done = 1;
 					break;
 				}
-			} else
+			} else {
 				pg_offset = PAGE_CACHE_SIZE;
-
-			if (page->index == tindex && trylock_page(page)) {
-				pg_len = xfs_probe_page(page, pg_offset, mapped);
-				unlock_page(page);
 			}
 
+			if (page->index == tindex &&
+			    *pagecount < MAX_WRITEBACK_PAGES - 1 &&
+			    trylock_page(page)) {
+	 			pg_len = xfs_probe_page(page, pg_offset,
+						mapped, type, &entire);
+				if (pg_len) {
+					pages[(*pagecount)++] = page;
+
+				} else {
+					unlock_page(page);
+				}
+			}
 			if (!pg_len) {
 				done = 1;
 				break;
 			}
 
 			total += pg_len;
+
+			/*
+			 * if probe did not succeed on all buffers in the page
+			 * we don't want to probe subsequent pages.  This
+			 * ensures that we don't have a mix of buffer types in
+			 * the iomap.
+			 */
+			if (!entire) {
+				done = 1;
+				break;
+			}
 			tindex++;
 		}
 
@@ -683,56 +730,19 @@ xfs_probe_cluster(
 }
 
 /*
- * Test if a given page is suitable for writing as part of an unwritten
- * or delayed allocate extent.
- */
-STATIC int
-xfs_is_delayed_page(
-	struct page		*page,
-	unsigned int		type)
-{
-	if (PageWriteback(page))
-		return 0;
-
-	if (page->mapping && page_has_buffers(page)) {
-		struct buffer_head	*bh, *head;
-		int			acceptable = 0;
-
-		bh = head = page_buffers(page);
-		do {
-			if (buffer_unwritten(bh))
-				acceptable = (type == IOMAP_UNWRITTEN);
-			else if (buffer_delay(bh))
-				acceptable = (type == IOMAP_DELAY);
-			else if (buffer_dirty(bh) && buffer_mapped(bh))
-				acceptable = (type == IOMAP_NEW);
-			else
-				break;
-		} while ((bh = bh->b_this_page) != head);
-
-		if (acceptable)
-			return 1;
-	}
-
-	return 0;
-}
-
-/*
  * Allocate & map buffers for page given the extent map. Write it out.
  * except for the original page of a writepage, this is called on
  * delalloc/unwritten pages only, for the original page it is possible
  * that the page has no mapping at all.
  */
-STATIC int
+STATIC void
 xfs_convert_page(
 	struct inode		*inode,
 	struct page		*page,
-	loff_t			tindex,
 	xfs_iomap_t		*mp,
 	xfs_ioend_t		**ioendp,
 	struct writeback_control *wbc,
-	int			startio,
-	int			all_bh)
+	int			startio)
 {
 	struct buffer_head	*bh, *head;
 	xfs_off_t		end_offset;
@@ -740,20 +750,9 @@ xfs_convert_page(
 	unsigned int		type;
 	int			bbits = inode->i_blkbits;
 	int			len, page_dirty;
-	int			count = 0, done = 0, uptodate = 1;
+	int			count = 0, uptodate = 1;
  	xfs_off_t		offset = page_offset(page);
 
-	if (page->index != tindex)
-		goto fail;
-	if (! trylock_page(page))
-		goto fail;
-	if (PageWriteback(page))
-		goto fail_unlock_page;
-	if (page->mapping != inode->i_mapping)
-		goto fail_unlock_page;
-	if (!xfs_is_delayed_page(page, (*ioendp)->io_type))
-		goto fail_unlock_page;
-
 	/*
 	 * page_dirty is initially a count of buffers on the page before
 	 * EOF and is decremented as we move each into a cleanable state.
@@ -770,6 +769,8 @@ xfs_convert_page(
 	end_offset = min_t(unsigned long long,
 			(xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT,
 			i_size_read(inode));
+	end_offset = min_t(unsigned long long, end_offset,
+			(mp->iomap_offset + mp->iomap_bsize));
 
 	len = 1 << inode->i_blkbits;
 	p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1),
@@ -779,12 +780,12 @@ xfs_convert_page(
 
 	bh = head = page_buffers(page);
 	do {
-		if (offset >= end_offset)
+		if (offset >= end_offset) {
 			break;
+		}
 		if (!buffer_uptodate(bh))
 			uptodate = 0;
 		if (!(PageUptodate(page) || buffer_uptodate(bh))) {
-			done = 1;
 			continue;
 		}
 
@@ -794,10 +795,7 @@ xfs_convert_page(
 			else
 				type = IOMAP_DELAY;
 
-			if (!xfs_iomap_valid(mp, offset)) {
-				done = 1;
-				continue;
-			}
+			BUG_ON(!xfs_iomap_valid(mp, offset));
 
 			ASSERT(!(mp->iomap_flags & IOMAP_HOLE));
 			ASSERT(!(mp->iomap_flags & IOMAP_DELAY));
@@ -805,7 +803,7 @@ xfs_convert_page(
 			xfs_map_at_offset(bh, offset, bbits, mp);
 			if (startio) {
 				xfs_add_to_ioend(inode, bh, offset,
-						type, ioendp, done);
+						type, ioendp, 0 /* !done */);
 			} else {
 				set_buffer_dirty(bh);
 				unlock_buffer(bh);
@@ -814,15 +812,14 @@ xfs_convert_page(
 			page_dirty--;
 			count++;
 		} else {
+			WARN_ON(!xfs_iomap_valid(mp, offset));
 			type = IOMAP_NEW;
-			if (buffer_mapped(bh) && all_bh && startio) {
+			if (buffer_mapped(bh) && buffer_dirty(bh) && startio) {
 				lock_buffer(bh);
 				xfs_add_to_ioend(inode, bh, offset,
-						type, ioendp, done);
+						type, ioendp, 0 /* !done */);
 				count++;
 				page_dirty--;
-			} else {
-				done = 1;
 			}
 		}
 	} while (offset += len, (bh = bh->b_this_page) != head);
@@ -838,19 +835,16 @@ xfs_convert_page(
 			wbc->nr_to_write--;
 			if (bdi_write_congested(bdi)) {
 				wbc->encountered_congestion = 1;
-				done = 1;
 			} else if (wbc->nr_to_write <= 0) {
-				done = 1;
+				/* XXX ignore nr_to_write
+				done = 1; */
 			}
 		}
+		/* unlocks page */
 		xfs_start_page_writeback(page, wbc, !page_dirty, count);
+	} else {
+		unlock_page(page);
 	}
-
-	return done;
- fail_unlock_page:
-	unlock_page(page);
- fail:
-	return 1;
 }
 
 /*
@@ -860,33 +854,17 @@ xfs_convert_page(
 STATIC void
 xfs_cluster_write(
 	struct inode		*inode,
-	pgoff_t			tindex,
+	struct page		**pages,
+	int			pagecount,
 	xfs_iomap_t		*iomapp,
 	xfs_ioend_t		**ioendp,
 	struct writeback_control *wbc,
-	int			startio,
-	int			all_bh,
-	pgoff_t			tlast)
+	int			startio)
 {
-	struct pagevec		pvec;
-	int			done = 0, i;
-
-	pagevec_init(&pvec, 0);
-	while (!done && tindex <= tlast) {
-		unsigned len = min_t(pgoff_t, PAGEVEC_SIZE, tlast - tindex + 1);
-
-		if (!pagevec_lookup(&pvec, inode->i_mapping, tindex, len))
-			break;
-
-		for (i = 0; i < pagevec_count(&pvec); i++) {
-			done = xfs_convert_page(inode, pvec.pages[i], tindex++,
-					iomapp, ioendp, wbc, startio, all_bh);
-			if (done)
-				break;
-		}
+	int			i;
 
-		pagevec_release(&pvec);
-		cond_resched();
+	for (i = 0; i < pagecount; i++) {
+		xfs_convert_page(inode, pages[i], iomapp, ioendp, wbc, startio);
 	}
 }
 
@@ -908,7 +886,6 @@ xfs_cluster_write(
  * bh->b_states's will not agree and only ones setup by BPW/BCW will have
  * valid state, thus the whole page must be written out thing.
  */
-
 STATIC int
 xfs_page_state_convert(
 	struct inode	*inode,
@@ -924,12 +901,13 @@ xfs_page_state_convert(
 	unsigned long           p_offset = 0;
 	unsigned int		type;
 	__uint64_t              end_offset;
-	pgoff_t                 end_index, last_index, tlast;
+	pgoff_t                 end_index, last_index;
 	ssize_t			size, len;
 	int			flags, err, iomap_valid = 0, uptodate = 1;
 	int			page_dirty, count = 0;
 	int			trylock = 0;
-	int			all_bh = unmapped;
+	struct page		**pages;
+	int			pagecount = 0, i;
 
 	if (startio) {
 		if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking)
@@ -949,6 +927,9 @@ xfs_page_state_convert(
 		}
 	}
 
+	pages = kmem_zalloc(sizeof(struct page*) * MAX_WRITEBACK_PAGES, KM_NOFS);
+
+
 	/*
 	 * page_dirty is initially a count of buffers on the page before
 	 * EOF and is decremented as we move each into a cleanable state.
@@ -1036,14 +1017,23 @@ xfs_page_state_convert(
 				 * for unwritten extent conversion.
 				 */
 				new_ioend = 1;
-				if (type == IOMAP_NEW) {
-					size = xfs_probe_cluster(inode,
-							page, bh, head, 0);
-				} else {
-					size = len;
+				size = 0;
+				if (type == IOMAP_NEW && !pagecount) {
+					size = xfs_probe_cluster(inode, page,
+							bh, head,
+							0 /* !mapped */, type,
+						       	pages, &pagecount, wbc);
+				} else if ((type == IOMAP_DELAY ||
+					    type == IOMAP_UNWRITTEN) &&
+						!pagecount) {
+					size = xfs_probe_cluster(inode, page,
+							bh, head,
+							1 /* mapped */, type,
+							pages, &pagecount, wbc);
 				}
 
-				err = xfs_map_blocks(inode, offset, size,
+				err = xfs_map_blocks(inode, offset,
+						size ? size : len,
 						&iomap, flags);
 				if (err)
 					goto error;
@@ -1072,9 +1062,16 @@ xfs_page_state_convert(
 			 */
 			if (!iomap_valid || flags != BMAPI_READ) {
 				flags = BMAPI_READ;
-				size = xfs_probe_cluster(inode, page, bh,
-								head, 1);
-				err = xfs_map_blocks(inode, offset, size,
+				size = 0;
+				if (!pagecount) {
+					size = xfs_probe_cluster(inode, page,
+							bh, head,
+							1 /* mapped */,
+							IOMAP_NEW,
+							pages, &pagecount, wbc);
+				}
+				err = xfs_map_blocks(inode, offset,
+						size ? size : len,
 						&iomap, flags);
 				if (err)
 					goto error;
@@ -1092,8 +1089,6 @@ xfs_page_state_convert(
 			type = IOMAP_NEW;
 			if (!test_and_set_bit(BH_Lock, &bh->b_state)) {
 				ASSERT(buffer_mapped(bh));
-				if (iomap_valid)
-					all_bh = 1;
 				xfs_add_to_ioend(inode, bh, offset, type,
 						&ioend, !iomap_valid);
 				page_dirty--;
@@ -1104,6 +1099,8 @@ xfs_page_state_convert(
 		} else if ((buffer_uptodate(bh) || PageUptodate(page)) &&
 			   (unmapped || startio)) {
 			iomap_valid = 0;
+		} else {
+			WARN_ON(1);
 		}
 
 		if (!iohead)
@@ -1117,13 +1114,11 @@ xfs_page_state_convert(
 	if (startio)
 		xfs_start_page_writeback(page, wbc, 1, count);
 
-	if (ioend && iomap_valid) {
-		offset = (iomap.iomap_offset + iomap.iomap_bsize - 1) >>
-					PAGE_CACHE_SHIFT;
-		tlast = min_t(pgoff_t, offset, last_index);
-		xfs_cluster_write(inode, page->index + 1, &iomap, &ioend,
-					wbc, startio, all_bh, tlast);
+	if (ioend && iomap_valid && pagecount) {
+		xfs_cluster_write(inode, pages, pagecount, &iomap, &ioend,
+					wbc, startio);
 	}
+	kmem_free(pages, sizeof(struct page *) * MAX_WRITEBACK_PAGES);
 
 	if (iohead)
 		xfs_submit_ioend(iohead);
@@ -1133,7 +1128,12 @@ xfs_page_state_convert(
 error:
 	if (iohead)
 		xfs_cancel_ioend(iohead);
-
+	if (pages) {
+		for (i = 0; i < pagecount; i++) {
+			unlock_page(pages[i]);
+		}
+		kmem_free(pages, sizeof(struct page *) * MAX_WRITEBACK_PAGES);
+	}
 	return err;
 }
 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-14 21:43 ` bpm
  2011-01-14 23:32   ` bpm
@ 2011-01-14 23:50   ` bpm
  2011-01-14 23:55   ` Dave Chinner
  2011-01-17  0:28   ` Lachlan McIlroy
  3 siblings, 0 replies; 30+ messages in thread
From: bpm @ 2011-01-14 23:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@sgi.com wrote:
> On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> > I've noticed a few suspicious things trying to reproduce the
> > allocate-in-the-middle-of-a-delalloc-extent,
> ...
> > Secondly, I think we have the same expose-the-entire-delalloc-extent
> > -to-stale-data-exposure problem in ->writepage. This onnne, however,
> > is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
> > the first time any part of it is written to. Even if we are only
> > writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
> > extent covers gigabytes. So, same problem when we crash.
> >
> > Finally, I think the extsize based problem exposed by test 229 is a
> > also a result of allocating space we have no pages covering in the
> > page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
> > space is never zeroed and hence exposes stale data.
> 
> This is precisely the bug I was going after when I hit the
> allocate-in-the-middle-of-a-delalloc-extent bug.  This is a race between
> block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state
> convert.  When xfs_page_state_convert allocates a real extent for a page
> toward the beginning of a delalloc extent, XFS_BMAPI converts the entire
> delalloc extent.  Any subsequent writes into the page cache toward the
> end of this freshly allocated extent will see a written extent instead
> of delalloc and read the block from disk into the page before writing
> over it.  If the write does not cover the entire page garbage from disk
> will be exposed into the page cache.

Here is a test case to reproduce the corruption.  I have only been able
to reproduce it by writing the file on an nfs client served from xfs
that is allocating large delalloc extents.

-Ben

*** the writer

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[]) {

	char *filename = argv[1];
	off_t	seekdist = 3071;	/* less than a page, nice and odd */
	off_t	max_offset = 1024 * 1024 * 1024; /* 1 gig */
	off_t 	current_offset = 0;
	char	buf[] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n";
	int	fd;

	printf("writing to %s\n", filename);
	printf("strlen is %d\n", strlen(buf));

	fd = open(filename, O_RDWR|O_CREAT, 0644);
	if (fd == -1) {
		perror(filename);
		return -1;
	}

	while ((current_offset = lseek(fd, seekdist, SEEK_END)) > 0
			&& current_offset < max_offset) {
		if (write(fd, &buf, strlen(buf)) < strlen(buf)) {
			perror("write 'a'");
			return -1;
		}
	}

	close(fd);
}

*** the reader

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[]) {

	char *filename = argv[1];
	off_t	seekdist = 3071;	/* less than a page, nice and odd */
	off_t	max_offset = 1024 * 1024 * 1024; /* 1 gig */
	off_t 	current_offset = 0;
	char	buf[] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n";
	char	readbuf[4096];
	int	fd, i;

	printf("reading from %s\n", filename);

	fd = open(filename, O_RDONLY, 0644);
	if (fd == -1) {
		perror(filename);
		return -1;
	}
	
	while (current_offset < max_offset) {
		ssize_t nread = read(fd, &readbuf, seekdist);
		if (nread != seekdist) {
			perror("read nulls");
			return -1;
		}
		for (i=0; i < seekdist; i++) {
			if (readbuf[i] != '\0') {
				printf("foudn non-null at %d\n%s\n",
						current_offset + i,
						&readbuf[i]);
				break;
//				return -1;
			}
		}
		
		current_offset += nread;

		nread = read(fd, &readbuf, strlen(buf));
		if (nread != strlen(buf)) {
			perror("read a");
			return -1;
		}

		if (strncmp(readbuf, buf, strlen(buf))) {
			printf("didn't match at %d\n%s\n",
					current_offset + nread,
					readbuf);
//			return -1;
		}

		current_offset += nread;
	}	

	close(fd);
}

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-14 21:43 ` bpm
  2011-01-14 23:32   ` bpm
  2011-01-14 23:50   ` bpm
@ 2011-01-14 23:55   ` Dave Chinner
  2011-01-17 20:12     ` bpm
  2011-01-18 20:47     ` Christoph Hellwig
  2011-01-17  0:28   ` Lachlan McIlroy
  3 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2011-01-14 23:55 UTC (permalink / raw)
  To: bpm; +Cc: xfs

On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@sgi.com wrote:
> Hi Dave,
> 
> On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> > I've noticed a few suspicious things trying to reproduce the
> > allocate-in-the-middle-of-a-delalloc-extent,
> ...
> > Secondly, I think we have the same expose-the-entire-delalloc-extent
> > -to-stale-data-exposure problem in ->writepage. This onnne, however,
> > is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
> > the first time any part of it is written to. Even if we are only
> > writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
> > extent covers gigabytes. So, same problem when we crash.
> >
> > Finally, I think the extsize based problem exposed by test 229 is a
> > also a result of allocating space we have no pages covering in the
> > page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
> > space is never zeroed and hence exposes stale data.
> 
> This is precisely the bug I was going after when I hit the
> allocate-in-the-middle-of-a-delalloc-extent bug.  This is a race between
> block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state
> convert.  When xfs_page_state_convert allocates a real extent for a page
> toward the beginning of a delalloc extent, XFS_BMAPI converts the entire
> delalloc extent.  Any subsequent writes into the page cache toward the
> end of this freshly allocated extent will see a written extent instead
> of delalloc and read the block from disk into the page before writing
> over it.  If the write does not cover the entire page garbage from disk
> will be exposed into the page cache.

I see from later on you are getting this state fom using a large
extsize. Perhaps this comes back to my comment about extsize
alignment might be better handled at the .aio_write level rather
than hiding inside the get_block() callback and attempting to handle
the mismatch at the .writepage level.

Worth noting, though, is that DIO handles extsize unaligned writes
by allocating unwritten extsize sized/aligned extents an then doing
conversion at IO completion time. So perhaps we should be following
this example for delalloc conversion....

I think, however, if we use delalloc->unwritten allocation, we will
need to stop trusting the state of buffer heads in .writepage.  That
is because we'd then have blocks marked as buffer_delay() that
really cover unwritten extents and would need remapping. We're
already moving in the direction of not using the state in buffer
heads in ->writepage, so perhaps we need to speed up that
conversion as the first. Christoph, what are you plans here?

> <snip>
> 
> > I'm sure there are other ways to solve these problems, but these two
> > are the ones that come to mind for me.  I'm open to other solutions
> > or ways to improve on these ones, especially if they are simpler. ;)
> > Anyone got any ideas or improvements?
> 
> The direction I've been taking is to use XFS_BMAPI_EXACT in
> *xfs_iomap_write_allocate to ensure that an extent covering exactly the
> pages we're prepared to write out immediately is allocated and the rest
> of the delalloc extent is left as is.  This exercises some of the btree
> code more heavily and led to the discovery of the
> allocate-in-the-middle-of-a-delalloc-extent bug.

Yup, that explains it ;)

> It also presents a
> performance issue which I've tried to resolve by extending
> xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
> to be converted before performing the allocation and hold those locks
> until they are submitted for writeback.  It's not very pretty but it
> resolves the corruption.

If we zero the relevant range in the page cache at .aio_write level
like we do with xfs_zero_eof or allocate unwritten extents instead,
then I don't think that you need to make changes like this. 

> There is still the issue of crashes...  This could be solved by
> converting from delalloc to unwritten in xfs_page_state_convert in this
> very exact way and then to written in the io completion handler.  Never
> go delalloc->written directly.
> 
> I have not had luck reproducing this on TOT xfs and have come to realize
> that this is because it doesn't do speculative preallocation of larger
> delalloc extents unless you are using extsize... which I haven't tried.

Have a look at the dynamic speculative allocation patches that just
went into 2.6.38 - I'm very interested to know whether your tests
expose stale data now that it can do up to an entire extent (8GB on
4k block size) of speculative delalloc for writes that are extending
the file.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-14 22:59   ` Dave Chinner
@ 2011-01-15  4:16     ` Geoffrey Wehrman
  2011-01-17  5:18       ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Geoffrey Wehrman @ 2011-01-15  4:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, Jan 15, 2011 at 09:59:07AM +1100, Dave Chinner wrote:
| On Fri, Jan 14, 2011 at 10:40:16AM -0600, Geoffrey Wehrman wrote:
| > Also, I'm not saying using XFS_BMAPI_EXACT is feasible.  I have a very
| > minimal understanding of the writepage code path.
| 
| I think there are situations where this does make sense, but given
| the potential issues I'm not sure it is a solution that can be
| extended to the general case. A good discussion point on a different
| angle, though. ;)

You've convinced me that XFS_BMAPI_EXACT is not the optimal solution.

Upon further consideration, I do like your proposal to make delalloc
allocation more like an intent/done type operation.  The compatibility
issues aren't all that bad.  As long as the filesystem is unmounted
clean, there is no need for the next mount do log recovery and therefore
no need to have any knowledge of the new transactions.


-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-14 21:43 ` bpm
                     ` (2 preceding siblings ...)
  2011-01-14 23:55   ` Dave Chinner
@ 2011-01-17  0:28   ` Lachlan McIlroy
  2011-01-17  4:37     ` Dave Chinner
  3 siblings, 1 reply; 30+ messages in thread
From: Lachlan McIlroy @ 2011-01-17  0:28 UTC (permalink / raw)
  To: bpm; +Cc: xfs

----- Original Message -----
> Hi Dave,
> 
> On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> > I've noticed a few suspicious things trying to reproduce the
> > allocate-in-the-middle-of-a-delalloc-extent,
> ...
> > Secondly, I think we have the same expose-the-entire-delalloc-extent
> > -to-stale-data-exposure problem in ->writepage. This onnne, however,
> > is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
> > the first time any part of it is written to. Even if we are only
> > writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
> > extent covers gigabytes. So, same problem when we crash.
> >
> > Finally, I think the extsize based problem exposed by test 229 is a
> > also a result of allocating space we have no pages covering in the
> > page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
> > space is never zeroed and hence exposes stale data.
> 
> This is precisely the bug I was going after when I hit the
> allocate-in-the-middle-of-a-delalloc-extent bug. This is a race
> between
> block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state
> convert. When xfs_page_state_convert allocates a real extent for a
> page
> toward the beginning of a delalloc extent, XFS_BMAPI converts the
> entire
> delalloc extent. Any subsequent writes into the page cache toward the
> end of this freshly allocated extent will see a written extent instead
> of delalloc and read the block from disk into the page before writing
> over it. If the write does not cover the entire page garbage from disk
> will be exposed into the page cache.

Ben, take a look at the XFS gap list code in IRIX - this code was
designed specifically to handle this problem.  It's also implemented in
several of the cxfs clients too.  On entry to a write() it will create a
list of all the holes that exist in the write range before any delayed
allocations are created.  The first call to xfs_bmapi() sets up the delayed
allocation for the entire write (even if the bmaps returned don't cover the
full write range).  If the write results in a fault from disk then it checks
the gap list to see if any sections of the buffer used to cover a hole and
if so it ignores the state of the extent and zeroes those region(s) in the
buffer that match the pre-existing holes.  If the buffer has multiple
non-hole sections that need to be read from disk the whole buffer will be
read from disk and the zeroing of the holes is done post I/O - this reduces
the number of I/Os to be done.  The whole delayed allocation can be safely
converted at any time without risk of reading exposed data (assuming no
crash that is).  As the write progresses through the range it removes
sections from the front of the gap list so by the time the write is complete
the gap list is empty.  The gap list does not have a dedicated lock to
protect it but instead relies on the iolock to ensure that only one write
operation occurs at once (so it's not appropriate for direct I/O).

> 
> <snip>
> 
> > I'm sure there are other ways to solve these problems, but these two
> > are the ones that come to mind for me. I'm open to other solutions
> > or ways to improve on these ones, especially if they are simpler. ;)
> > Anyone got any ideas or improvements?
> 
> The direction I've been taking is to use XFS_BMAPI_EXACT in
> *xfs_iomap_write_allocate to ensure that an extent covering exactly
> the
> pages we're prepared to write out immediately is allocated and the
> rest
> of the delalloc extent is left as is. This exercises some of the btree
> code more heavily and led to the discovery of the
> allocate-in-the-middle-of-a-delalloc-extent bug. It also presents a
> performance issue which I've tried to resolve by extending
> xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
> to be converted before performing the allocation and hold those locks
> until they are submitted for writeback. It's not very pretty but it
> resolves the corruption.
> 
> There is still the issue of crashes... This could be solved by
> converting from delalloc to unwritten in xfs_page_state_convert in
> this
> very exact way and then to written in the io completion handler. Never
> go delalloc->written directly.

That solution would probably make the gap list redundant too.

> 
> I have not had luck reproducing this on TOT xfs and have come to
> realize
> that this is because it doesn't do speculative preallocation of larger
> delalloc extents unless you are using extsize... which I haven't
> tried.
> 
> Regards,
> Ben
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-17  0:28   ` Lachlan McIlroy
@ 2011-01-17  4:37     ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2011-01-17  4:37 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: bpm, xfs

On Sun, Jan 16, 2011 at 07:28:27PM -0500, Lachlan McIlroy wrote:
> ----- Original Message -----
> > Hi Dave,
> > 
> > On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> > > I've noticed a few suspicious things trying to reproduce the
> > > allocate-in-the-middle-of-a-delalloc-extent,
> > ...
> > > Secondly, I think we have the same expose-the-entire-delalloc-extent
> > > -to-stale-data-exposure problem in ->writepage. This onnne, however,
> > > is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
> > > the first time any part of it is written to. Even if we are only
> > > writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
> > > extent covers gigabytes. So, same problem when we crash.
> > >
> > > Finally, I think the extsize based problem exposed by test 229 is a
> > > also a result of allocating space we have no pages covering in the
> > > page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
> > > space is never zeroed and hence exposes stale data.
> > 
> > This is precisely the bug I was going after when I hit the
> > allocate-in-the-middle-of-a-delalloc-extent bug. This is a race
> > between
> > block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state
> > convert. When xfs_page_state_convert allocates a real extent for a
> > page
> > toward the beginning of a delalloc extent, XFS_BMAPI converts the
> > entire
> > delalloc extent. Any subsequent writes into the page cache toward the
> > end of this freshly allocated extent will see a written extent instead
> > of delalloc and read the block from disk into the page before writing
> > over it. If the write does not cover the entire page garbage from disk
> > will be exposed into the page cache.
> 
> Ben, take a look at the XFS gap list code in IRIX - this code was
> designed specifically to handle this problem.  It's also implemented in
> several of the cxfs clients too.  On entry to a write() it will create a
> list of all the holes that exist in the write range before any delayed
> allocations are created.  The first call to xfs_bmapi() sets up the delayed
> allocation for the entire write (even if the bmaps returned don't cover the
> full write range).  If the write results in a fault from disk then it checks
> the gap list to see if any sections of the buffer used to cover a hole and
> if so it ignores the state of the extent and zeroes those region(s) in the
> buffer that match the pre-existing holes.  If the buffer has multiple
> non-hole sections that need to be read from disk the whole buffer will be
> read from disk and the zeroing of the holes is done post I/O - this reduces
> the number of I/Os to be done.  The whole delayed allocation can be safely
> converted at any time without risk of reading exposed data (assuming no
> crash that is). 

IMO, that caveat exposes the problem with this approach - it is not
persistent. It adds a lot of complexity but doesn't solve the
underlying problem: that we are exposing real extents via allocation
without having written any data to them.

> As the write progresses through the range it removes
> sections from the front of the gap list so by the time the write is complete
> the gap list is empty.  The gap list does not have a dedicated lock to
> protect it but instead relies on the iolock to ensure that only one write
> operation occurs at once (so it's not appropriate for direct I/O).
> 
> > 
> > <snip>
> > 
> > > I'm sure there are other ways to solve these problems, but these two
> > > are the ones that come to mind for me. I'm open to other solutions
> > > or ways to improve on these ones, especially if they are simpler. ;)
> > > Anyone got any ideas or improvements?
> > 
> > The direction I've been taking is to use XFS_BMAPI_EXACT in
> > *xfs_iomap_write_allocate to ensure that an extent covering exactly
> > the
> > pages we're prepared to write out immediately is allocated and the
> > rest
> > of the delalloc extent is left as is. This exercises some of the btree
> > code more heavily and led to the discovery of the
> > allocate-in-the-middle-of-a-delalloc-extent bug. It also presents a
> > performance issue which I've tried to resolve by extending
> > xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
> > to be converted before performing the allocation and hold those locks
> > until they are submitted for writeback. It's not very pretty but it
> > resolves the corruption.
> > 
> > There is still the issue of crashes... This could be solved by
> > converting from delalloc to unwritten in xfs_page_state_convert in
> > this
> > very exact way and then to written in the io completion handler. Never
> > go delalloc->written directly.
> 
> That solution would probably make the gap list redundant too.

Yes, I think you are right. The "gaps" would get mapped as unwritten
rather than as normal extents and hence get zereod appropriately. It
would also preventing exposure after a crash due to being
persistent.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-15  4:16     ` Geoffrey Wehrman
@ 2011-01-17  5:18       ` Dave Chinner
  2011-01-17 14:37         ` Geoffrey Wehrman
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2011-01-17  5:18 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: xfs

On Fri, Jan 14, 2011 at 10:16:29PM -0600, Geoffrey Wehrman wrote:
> On Sat, Jan 15, 2011 at 09:59:07AM +1100, Dave Chinner wrote:
> | On Fri, Jan 14, 2011 at 10:40:16AM -0600, Geoffrey Wehrman wrote:
> | > Also, I'm not saying using XFS_BMAPI_EXACT is feasible.  I have a very
> | > minimal understanding of the writepage code path.
> | 
> | I think there are situations where this does make sense, but given
> | the potential issues I'm not sure it is a solution that can be
> | extended to the general case. A good discussion point on a different
> | angle, though. ;)
> 
> You've convinced me that XFS_BMAPI_EXACT is not the optimal solution.
> 
> Upon further consideration, I do like your proposal to make delalloc
> allocation more like an intent/done type operation.  The compatibility
> issues aren't all that bad.  As long as the filesystem is unmounted
> clean, there is no need for the next mount do log recovery and therefore
> no need to have any knowledge of the new transactions.

That is a good observation. If there is agreement that this a strong
enough backwards compatibility guarantee (it's good enough for me),
then I think that I will start to prototype this approach.

However, this does not solve the extsize allocation issues where we
don't have dirty pages in the page cache covering parts of the
delayed allocation extent so we still need a solution for that. I'm
tending towards zeroing in .aio_write as the simplest solution
because it doesn't cause buffer head/extent tree mapping mismatches,
and it would use the above intent/done operations for crash
resilience so there's no additional, rarely used code path to test
through .writepage. Does that sound reasonable?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-17  5:18       ` Dave Chinner
@ 2011-01-17 14:37         ` Geoffrey Wehrman
  2011-01-18  0:24           ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Geoffrey Wehrman @ 2011-01-17 14:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Jan 17, 2011 at 04:18:28PM +1100, Dave Chinner wrote:
| On Fri, Jan 14, 2011 at 10:16:29PM -0600, Geoffrey Wehrman wrote:
| > On Sat, Jan 15, 2011 at 09:59:07AM +1100, Dave Chinner wrote:
| > | On Fri, Jan 14, 2011 at 10:40:16AM -0600, Geoffrey Wehrman wrote:
| > | > Also, I'm not saying using XFS_BMAPI_EXACT is feasible.  I have a very
| > | > minimal understanding of the writepage code path.
| > | 
| > | I think there are situations where this does make sense, but given
| > | the potential issues I'm not sure it is a solution that can be
| > | extended to the general case. A good discussion point on a different
| > | angle, though. ;)
| > 
| > You've convinced me that XFS_BMAPI_EXACT is not the optimal solution.
| > 
| > Upon further consideration, I do like your proposal to make delalloc
| > allocation more like an intent/done type operation.  The compatibility
| > issues aren't all that bad.  As long as the filesystem is unmounted
| > clean, there is no need for the next mount do log recovery and therefore
| > no need to have any knowledge of the new transactions.
| 
| That is a good observation. If there is agreement that this a strong
| enough backwards compatibility guarantee (it's good enough for me),
| then I think that I will start to prototype this approach.

I'm not sure how a version of XFS without the new log recovery code will
behave if it encounters a log with the new transactions.  I assume it
will gracefully abort log recovery and fail the mount with the report of
a corrupt log.  I have no objection with this compatibility guarantee.

| However, this does not solve the extsize allocation issues where we
| don't have dirty pages in the page cache covering parts of the
| delayed allocation extent so we still need a solution for that. I'm
| tending towards zeroing in .aio_write as the simplest solution
| because it doesn't cause buffer head/extent tree mapping mismatches,
| and it would use the above intent/done operations for crash
| resilience so there's no additional, rarely used code path to test
| through .writepage. Does that sound reasonable?

Zeroing in .aio_write will create zeroed pages covering the entire
allocation, correct?  This seems like a reasonable and straightforward
approach.  I wish I had thought of it myself!


-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-14 23:55   ` Dave Chinner
@ 2011-01-17 20:12     ` bpm
  2011-01-18  1:44       ` Dave Chinner
  2011-01-18 20:47     ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: bpm @ 2011-01-17 20:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hey Dave,

On Sat, Jan 15, 2011 at 10:55:49AM +1100, Dave Chinner wrote:
> On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@sgi.com wrote:
> > It also presents a
> > performance issue which I've tried to resolve by extending
> > xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
> > to be converted before performing the allocation and hold those locks
> > until they are submitted for writeback.  It's not very pretty but it
> > resolves the corruption.
> 
> If we zero the relevant range in the page cache at .aio_write level
> like we do with xfs_zero_eof or allocate unwritten extents instead,
> then I don't think that you need to make changes like this. 

Ganging up pages under lock in xfs_page_state_convert (along with
exactness in xfs_iomap_write_allocate) was needed to provide exclusion
with block_prepare_write because zeroing isn't done in the case of
written extents.

Converting from delalloc->unwritten has the advantage of
__xfs_get_blocks setting each buffer 'new' when you write into the page,
so the zeroing is done properly even if you convert the entire extent to
unwritten in xfs_vm_writepage instead of just the part you're going to
write out.  However, when converting from unwritten->written in the
completion handler you still need to convert only the part of the extent
that was actually written.  That might be a lot of transactions in
xfs_end_io.

> > There is still the issue of crashes...  This could be solved by
> > converting from delalloc to unwritten in xfs_page_state_convert in this
> > very exact way and then to written in the io completion handler.  Never
> > go delalloc->written directly.
> > 
> > I have not had luck reproducing this on TOT xfs and have come to realize
> > that this is because it doesn't do speculative preallocation of larger
> > delalloc extents unless you are using extsize... which I haven't tried.
> 
> Have a look at the dynamic speculative allocation patches that just
> went into 2.6.38 - I'm very interested to know whether your tests
> expose stale data now that it can do up to an entire extent (8GB on
> 4k block size) of speculative delalloc for writes that are extending
> the file.

8GB, Eek!

-Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-17 14:37         ` Geoffrey Wehrman
@ 2011-01-18  0:24           ` Dave Chinner
  2011-01-18 14:30             ` Geoffrey Wehrman
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2011-01-18  0:24 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: xfs

On Mon, Jan 17, 2011 at 08:37:08AM -0600, Geoffrey Wehrman wrote:
> On Mon, Jan 17, 2011 at 04:18:28PM +1100, Dave Chinner wrote:
> | On Fri, Jan 14, 2011 at 10:16:29PM -0600, Geoffrey Wehrman wrote:
> | > On Sat, Jan 15, 2011 at 09:59:07AM +1100, Dave Chinner wrote:
> | > | On Fri, Jan 14, 2011 at 10:40:16AM -0600, Geoffrey Wehrman wrote:
> | > | > Also, I'm not saying using XFS_BMAPI_EXACT is feasible.  I have a very
> | > | > minimal understanding of the writepage code path.
> | > | 
> | > | I think there are situations where this does make sense, but given
> | > | the potential issues I'm not sure it is a solution that can be
> | > | extended to the general case. A good discussion point on a different
> | > | angle, though. ;)
> | > 
> | > You've convinced me that XFS_BMAPI_EXACT is not the optimal solution.
> | > 
> | > Upon further consideration, I do like your proposal to make delalloc
> | > allocation more like an intent/done type operation.  The compatibility
> | > issues aren't all that bad.  As long as the filesystem is unmounted
> | > clean, there is no need for the next mount do log recovery and therefore
> | > no need to have any knowledge of the new transactions.
> | 
> | That is a good observation. If there is agreement that this a strong
> | enough backwards compatibility guarantee (it's good enough for me),
> | then I think that I will start to prototype this approach.
> 
> I'm not sure how a version of XFS without the new log recovery code will
> behave if it encounters a log with the new transactions.  I assume it
> will gracefully abort log recovery and fail the mount with the report of
> a corrupt log.  I have no objection with this compatibility guarantee.

It will do the same as you describe for the old log recovery code,
so there should be no new problems there.

> | However, this does not solve the extsize allocation issues where we
> | don't have dirty pages in the page cache covering parts of the
> | delayed allocation extent so we still need a solution for that. I'm
> | tending towards zeroing in .aio_write as the simplest solution
> | because it doesn't cause buffer head/extent tree mapping mismatches,
> | and it would use the above intent/done operations for crash
> | resilience so there's no additional, rarely used code path to test
> | through .writepage. Does that sound reasonable?
> 
> Zeroing in .aio_write will create zeroed pages covering the entire
> allocation, correct?

Yes, though it only needs to zero the regions that the write does
not cover itself - no need to zero what we're about to put data
into. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-17 20:12     ` bpm
@ 2011-01-18  1:44       ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2011-01-18  1:44 UTC (permalink / raw)
  To: bpm; +Cc: xfs

On Mon, Jan 17, 2011 at 02:12:40PM -0600, bpm@sgi.com wrote:
> Hey Dave,
> 
> On Sat, Jan 15, 2011 at 10:55:49AM +1100, Dave Chinner wrote:
> > On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@sgi.com wrote:
> > > It also presents a
> > > performance issue which I've tried to resolve by extending
> > > xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
> > > to be converted before performing the allocation and hold those locks
> > > until they are submitted for writeback.  It's not very pretty but it
> > > resolves the corruption.
> > 
> > If we zero the relevant range in the page cache at .aio_write level
> > like we do with xfs_zero_eof or allocate unwritten extents instead,
> > then I don't think that you need to make changes like this. 
> 
> Ganging up pages under lock in xfs_page_state_convert (along with
> exactness in xfs_iomap_write_allocate) was needed to provide exclusion
> with block_prepare_write because zeroing isn't done in the case of
> written extents.
> 
> Converting from delalloc->unwritten has the advantage of
> __xfs_get_blocks setting each buffer 'new' when you write into the page,
> so the zeroing is done properly even if you convert the entire extent to
> unwritten in xfs_vm_writepage instead of just the part you're going to
> write out. 

That's definitely the advantage to using unwritten extents in this
case. However, it also means that all the buffers that were
already set up at the time of delalloc->unwritten as buffer_delay()
are now incorrect - the underlying extent is now unwritten, no
delayed allocation. Hence when it comes to writing buffer_delay()
buffers, we would also have to handle the case that they are really
buffer_unwritten(). That could be very nasty....

> However, when converting from unwritten->written in the
> completion handler you still need to convert only the part of the extent
> that was actually written.  That might be a lot of transactions in
> xfs_end_io.

Yes, that was my original concern fo the general case and
why I was looking at an intent/done transaction arrangement rather
than unwritten/conversion arrangement. The "done" part of the
transaction is much less overhead, and we can safely do the
transaction reservation (i.e. the blocking bits) before the IO is
issued so that we keep the loverhead on the completion side down to
an absolute minimum....

However, when tracking down an assert failure reported by Christoph
with the new speculative delalloc code this morning, I've realised
that extent alignment for delayed allocation in xfs_bmapi() is badly
broken. It can result in attempting to set up a delalloc extent
larger than the maximum extent size because it aligns extents by
extending them.  Hence if the extent being allocated is MAXEXTLEN in
length, the extent size alignment will extend it and we'll do nasty
stuff to the extent record we insert into the incore extent tree.

To make matters worse, extent size is not limited to the maximum
supported extent size in the filesystem. The extent size can be
set to (2^32 bytes - 1 fsblock), but for 512 byte block size
filesystems the maximum extent size is 2^21 * 2^9 = 2^30. You can't
align extents to a size larger than that maximum extent size, and if
you try:

# sudo mkfs.xfs -b size=512 -f /dev/vda
...
# xfs_io -f -c "extsize 2g" \
> -c "ftruncate 4g" \
> -c "pwrite 64k 512" \
> -c "bmap -vp" /mnt/test/foo
XFS: Assertion failed: k < (1 << STARTBLOCKVALBITS), file: fs/xfs/xfs_bmap_btree.h, line: 83
....
Call Trace:
  [<ffffffff8145998c>] xfs_bmapi+0x167c/0x1a20
  [<ffffffff81488be2>] xfs_iomap_write_delay+0x1c2/0x340
  [<ffffffff814a7b62>] __xfs_get_blocks+0x402/0x4d0
  [<ffffffff814a7c61>] xfs_get_blocks+0x11/0x20
  [<ffffffff8118f1fb>] __block_write_begin+0x20b/0x5a0
  [<ffffffff8118f6f6>] block_write_begin+0x56/0x90
  [<ffffffff814a7541>] xfs_vm_write_begin+0x41/0x70
  [<ffffffff81118c06>] generic_file_buffered_write+0x106/0x270
  [<ffffffff814ae7a2>] xfs_file_buffered_aio_write+0xd2/0x190
  [<ffffffff814aece2>] xfs_file_aio_write+0x1c2/0x310
  [<ffffffff8116052a>] do_sync_write+0xda/0x120
  [<ffffffff81160850>] vfs_write+0xd0/0x190
  [<ffffffff81161262>] sys_pwrite64+0x82/0xa0
  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

Which failed this check:

81      static inline xfs_fsblock_t nullstartblock(int k)
82      {
83  >>>>>>      ASSERT(k < (1 << STARTBLOCKVALBITS));
84              return STARTBLOCKMASK | (k);
85      }

You end up with a corrupt extent record in the inncore extent list.

Along the same lines, extsize > ag size for non-rt inode is also
invalid. It won't be handled correctly by the delalloc->unwritten
method because delalloc conversion is currently limited to a single
allocation per .writepage call to avoid races with truncate. Hence
when extsize is larger than an AG, we'll only allocate the part
within an a single AG and hence leave part of the delalloc range in
the delalloc state. With no pages covering this range, it'll never
get written or allocated. Not to mention that it'll  trip assert
failures in xfs_getbmap:

# sudo mkfs.xfs -b size=4k -f-dagsize=1g /dev/vda
...
# xfs_io -f -c "extsize 2g" \
> -c "ftruncate 4g" \
> -c "pwrite 64k 512" \
> -c "bmap -vp" /mnt/test/foo
XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) || (map[i].br_startblock != DELAYSTARTBLOCK) file: fs/xfs/xfs_bmap.c, line: 5558
.....

These problems indicate to me that doing extent size alignment for
delayed allocation inside xfs_bmapi() is wrong on many levels.
Moving the "alignment" to zeroing at the .aio_write level solves
these issues except for one detail - ensuring that speculative
delalloc beyond EOF is correctly aligned - this can be handled easily
by the prealloc code.

Hence I'm thinking that the solution we should be implementing is:

	1. Limit extsize values to sane alignment boundaries.

	2. Do extsize alignment for delalloc by zeroing the page
	   cache at the .aio_write level for regions not covered by
	   the write.

	3. Ensure that speculative delalloc is extsize aligned

	4. Rip the delalloc extent alignment code from xfs_bmapi()

	5. Implement alloc intent/done transactions to protect
	   against stale data exposure caused by crashing between
	   allocation and data write completion (*).

I think that covers all the issues we've discussed - have I missed
anything?

Cheers,

Dave.

(*) FWIW, since I proposed the intent/done method, I've realised it
also solves the main difficulty in implementing data-in-inode
safely: how to sycnhronise the unlogged data write when we move the
data out of the inode with the allocation transaction so we know
when we can allow the tail of the log to move past the allocation
transaction or whether we can replay the format change transaction
in log recovery.....

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-18  0:24           ` Dave Chinner
@ 2011-01-18 14:30             ` Geoffrey Wehrman
  2011-01-18 20:40               ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Geoffrey Wehrman @ 2011-01-18 14:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Jan 18, 2011 at 11:24:37AM +1100, Dave Chinner wrote:
| On Mon, Jan 17, 2011 at 08:37:08AM -0600, Geoffrey Wehrman wrote:
| > On Mon, Jan 17, 2011 at 04:18:28PM +1100, Dave Chinner wrote:
| > | However, this does not solve the extsize allocation issues where we
| > | don't have dirty pages in the page cache covering parts of the
| > | delayed allocation extent so we still need a solution for that. I'm
| > | tending towards zeroing in .aio_write as the simplest solution
| > | because it doesn't cause buffer head/extent tree mapping mismatches,
| > | and it would use the above intent/done operations for crash
| > | resilience so there's no additional, rarely used code path to test
| > | through .writepage. Does that sound reasonable?
| > 
| > Zeroing in .aio_write will create zeroed pages covering the entire
| > allocation, correct?
| 
| Yes, though it only needs to zero the regions that the write does
| not cover itself - no need to zero what we're about to put data
| into. ;)

Glad you were able to understand what I meant.  Something I didn't think
of earlier though:  What happens when I try to use an 8 GB on a system
with only 4 GB of memory?  I'm not really worried about this pathological
case, but I do wonder what the effects will be of allocating what could
be significant quantities of memory in .aio_write.


-- 
Geoffrey Wehrman  651-683-5496  gwehrman@sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-18 14:30             ` Geoffrey Wehrman
@ 2011-01-18 20:40               ` Christoph Hellwig
  2011-01-18 22:03                 ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2011-01-18 20:40 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: xfs

On Tue, Jan 18, 2011 at 08:30:00AM -0600, Geoffrey Wehrman wrote:
> Glad you were able to understand what I meant.  Something I didn't think
> of earlier though:  What happens when I try to use an 8 GB on a system
> with only 4 GB of memory?  I'm not really worried about this pathological
> case, but I do wonder what the effects will be of allocating what could
> be significant quantities of memory in .aio_write.

I think for large regions we'd be much better off to only zero the
blocks on disk, not in-memory - for example like the code in
xfs_zero_remaining_bytes does.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-14 23:55   ` Dave Chinner
  2011-01-17 20:12     ` bpm
@ 2011-01-18 20:47     ` Christoph Hellwig
  2011-01-18 23:18       ` Dave Chinner
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2011-01-18 20:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bpm, xfs

On Sat, Jan 15, 2011 at 10:55:49AM +1100, Dave Chinner wrote:
> I see from later on you are getting this state fom using a large
> extsize. Perhaps this comes back to my comment about extsize
> alignment might be better handled at the .aio_write level rather
> than hiding inside the get_block() callback and attempting to handle
> the mismatch at the .writepage level.
>
> Worth noting, though, is that DIO handles extsize unaligned writes
> by allocating unwritten extsize sized/aligned extents an then doing
> conversion at IO completion time. So perhaps we should be following
> this example for delalloc conversion....

That's the other option for zeroing.  In many ways this makes a lot
more sense, also for the thin provisioned virtualization image file
use case I'm looking into right now.

> 
> I think, however, if we use delalloc->unwritten allocation, we will
> need to stop trusting the state of buffer heads in .writepage.  That
> is because we'd then have blocks marked as buffer_delay() that
> really cover unwritten extents and would need remapping. We're
> already moving in the direction of not using the state in buffer
> heads in ->writepage, so perhaps we need to speed up that
> conversion as the first. Christoph, what are you plans here?

We really don't do much with the flags anymore.  We already treat
overwritten (just buffer_uptodate) and delayed buffers are already
exactly the same.   Unwritten buffers are still slightly different,
in that we add XFS_BMAPI_IGSTATE to the bmapi flags.  This is just
a leftover from the old code, and finding out why exactly we add
it is still on my todo list.  The page clustering code checks if
the buffer still matches the type in xfs_convert_page, though.
And I'm not quite sure yet how we can remove that - the easiest
option would be to keep holding the ilock until the whole clustered
writeout is done, but we'd need to investigate what effect that
has on performance.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-18 20:40               ` Christoph Hellwig
@ 2011-01-18 22:03                 ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2011-01-18 22:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Jan 18, 2011 at 03:40:09PM -0500, Christoph Hellwig wrote:
> On Tue, Jan 18, 2011 at 08:30:00AM -0600, Geoffrey Wehrman wrote:
> > Glad you were able to understand what I meant.  Something I didn't think
> > of earlier though:  What happens when I try to use an 8 GB on a system
> > with only 4 GB of memory?  I'm not really worried about this pathological
> > case, but I do wonder what the effects will be of allocating what could
> > be significant quantities of memory in .aio_write.
> 
> I think for large regions we'd be much better off to only zero the
> blocks on disk, not in-memory - for example like the code in
> xfs_zero_remaining_bytes does.

That doesn't help us, because the .writepage allocation code will
still allocate extsize aligned extents and expose the problem that
we have blocks on disk with no data in the page cache covering
them. The point of zeroing at .aio_write is that it avoids the
problem of .writepage allocating blocks we don't have dirty pages
for.

My preferred optimisation for this problem is that once we get above
a certain extsize threshold we simply preallocate extsized and
aligned chunks that cover the entire range for the write instead of
writing zeros. That preserves the extsize allocation alignment, and
unaligned writes and future IO see the space as unwritten and hence
get zeroed correctly...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-18 20:47     ` Christoph Hellwig
@ 2011-01-18 23:18       ` Dave Chinner
  2011-01-19 12:03         ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2011-01-18 23:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bpm, xfs

On Tue, Jan 18, 2011 at 03:47:52PM -0500, Christoph Hellwig wrote:
> On Sat, Jan 15, 2011 at 10:55:49AM +1100, Dave Chinner wrote:
> > I see from later on you are getting this state fom using a large
> > extsize. Perhaps this comes back to my comment about extsize
> > alignment might be better handled at the .aio_write level rather
> > than hiding inside the get_block() callback and attempting to handle
> > the mismatch at the .writepage level.
> >
> > Worth noting, though, is that DIO handles extsize unaligned writes
> > by allocating unwritten extsize sized/aligned extents an then doing
> > conversion at IO completion time. So perhaps we should be following
> > this example for delalloc conversion....
> 
> That's the other option for zeroing.  In many ways this makes a lot
> more sense, also for the thin provisioned virtualization image file
> use case I'm looking into right now.
> 
> > 
> > I think, however, if we use delalloc->unwritten allocation, we will
> > need to stop trusting the state of buffer heads in .writepage.  That
> > is because we'd then have blocks marked as buffer_delay() that
> > really cover unwritten extents and would need remapping. We're
> > already moving in the direction of not using the state in buffer
> > heads in ->writepage, so perhaps we need to speed up that
> > conversion as the first. Christoph, what are you plans here?
> 
> We really don't do much with the flags anymore.  We already treat
> overwritten (just buffer_uptodate) and delayed buffers are already
> exactly the same.

Except for the fact we use the delalloc state from the buffer to
trigger allocation after mapping. We could probably just use
isnullstartblock() for this - only a mapping from a buffer over a
delalloc range should return a null startblock.

> Unwritten buffers are still slightly different,
> in that we add XFS_BMAPI_IGSTATE to the bmapi flags.  This is just
> a leftover from the old code, and finding out why exactly we add
> it is still on my todo list.

XFS_BMAPI_IGSTATE does a couple of things. Firstly, It prevents
xfs_bmapi() from allocating new extents (turns off write mode).
This isn't an issue where it is used because neither of the call
sites set XFS_BMAPI_WRITE.

Secondly, it allows xfs_bmapi() to consider adjacent written
and unwritten extents as contiguous but has no effect on delalloc
extents. That is, if we have:

	A               B                 C
	+---------------+-----------------+
	    unwritten          written

And we ask to map the range A-C with a single map, we'll get back
A-B.  If we add XFS_BMAPI_IGSTATE, we'll get back A-C instead. This
means that when we map the range A-C for IO, we'll do it as a single
IO. The unwritten extent conversion code handles ranges like this
correctly - it will only convert the unwritten part of the range to
written. It is arguable that this is unnecessary because the
elevator will merge the two adjacent IOs anyway, but it does reduce
the number of mapping calls and IOs issued...

Hence it is effectively mechanism for optimising IO mappings where
we have contiguous extents with mixed written/unwritten state.

In fact, we probably should be setting this for normal written
extents as well, so that the case:

	A               B                 C
	+---------------+-----------------+
	    written          unwritten

is also handled with the same optimisation. That makes handling
unwritten and overwrites identical, with only delalloc being
different. If we assume delalloc when we get a null startblock,
then we don't need to look at the buffer state at all for the
initial mapping.

> The page clustering code checks if
> the buffer still matches the type in xfs_convert_page, though.
> And I'm not quite sure yet how we can remove that - the easiest
> option would be to keep holding the ilock until the whole clustered
> writeout is done, but we'd need to investigate what effect that
> has on performance.

Holding the ilock doesn't protect against page cache truncation, but
we can still check for that via page->mapping. Otherwise this seems
like it would work to me - I trust the extent list to reflect the
correct state a lot more than I do the buffer heads.

It seems to me that with such modifications, the only thing that we
are using the bufferhead for is the buffer_uptodate() flag to
determine if we should write the block or not. If we can find some
other way of holding this state then we don't need bufferheads in
the write path at all, right?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-18 23:18       ` Dave Chinner
@ 2011-01-19 12:03         ` Christoph Hellwig
  2011-01-19 13:31           ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2011-01-19 12:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bpm, xfs

On Wed, Jan 19, 2011 at 10:18:32AM +1100, Dave Chinner wrote:
> Except for the fact we use the delalloc state from the buffer to
> trigger allocation after mapping. We could probably just use
> isnullstartblock() for this - only a mapping from a buffer over a
> delalloc range should return a null startblock.

isnullstartblock returns true for both DELAYSTARTBLOCK and
HOLESTARTBLOCK, so we want to be explicit we can check for
br_startblock == DELAYSTARTBLOCK.

Note that we also need to explicitly check for br_state == 
XFS_EXT_UNWRITTEN to set the correct type for the ioend structure.

> XFS_BMAPI_IGSTATE does a couple of things. Firstly, It prevents
> xfs_bmapi() from allocating new extents (turns off write mode).
> This isn't an issue where it is used because neither of the call
> sites set XFS_BMAPI_WRITE.

I've been down enough to understand what it does.  But yes, the
single large I/O mapping might explain why we want it.  The next
version of xfs_map_blocks will get a comment for it..

> In fact, we probably should be setting this for normal written
> extents as well, so that the case:
> 
> 	A               B                 C
> 	+---------------+-----------------+
> 	    written          unwritten
> 
> is also handled with the same optimisation. That makes handling
> unwritten and overwrites identical, with only delalloc being
> different. If we assume delalloc when we get a null startblock,
> then we don't need to look at the buffer state at all for the
> initial mapping.

With the current xfs_bmapi code that won't work.  When merging a second
extent into the first we only add up the br_blockcount.  So for the
case above we'd get an extent returned that's not marked as unwrittent
and we wouldn't mark the ioend as unwrittent and thus perform not
extent conversion after I/O completion.  Just adding XFS_BMAPI_IGSTATE
blindly for the delalloc case on the other hand is fine, as the
merging of delayed extents is handled in a different if clause that
totally ignores XFS_BMAPI_IGSTATE.

The potention fix for this is to always set br_state if one of the
merged extents is an unwrittent extent. The only other caller is
xfs_getbmap which does report the unwrittent state to userspace,
but already is sloppy for merging the other way if BMV_IF_PREALLOC
is not set, so I can't see how beening sloppy this way to makes any
difference.

> It seems to me that with such modifications, the only thing that we
> are using the bufferhead for is the buffer_uptodate() flag to
> determine if we should write the block or not. If we can find some
> other way of holding this state then we don't need bufferheads in
> the write path at all, right?

There's really two steps.  First we can stop needing buffers headers
for the space allocation / mapping.  This is doable with the slight
tweak of XFS_BMAPI_IGSTATE semantics.

We still do need to set BH_Delay or BH_Unwrittent for use in
__block_write_begin and block_truncate_page, but they become completely
interchangeable at that point.

If we want to completely get rid of buffers heads things are a bit
more complicated.  It's doable as shown by the _nobh aops, but we'll
use quite a bit of per-block state that needs to be replaced by per-page
state, and we'll lose the way to cache the block number in the buffer
head.  While we don't make use of that in writepage we do so in
the write path, although I'm not sure how important it is.  If we
get your multi-page write work in it probably won't matter any more.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-19 12:03         ` Christoph Hellwig
@ 2011-01-19 13:31           ` Dave Chinner
  2011-01-19 13:55             ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2011-01-19 13:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bpm, xfs

On Wed, Jan 19, 2011 at 07:03:21AM -0500, Christoph Hellwig wrote:
> On Wed, Jan 19, 2011 at 10:18:32AM +1100, Dave Chinner wrote:
> > Except for the fact we use the delalloc state from the buffer to
> > trigger allocation after mapping. We could probably just use
> > isnullstartblock() for this - only a mapping from a buffer over a
> > delalloc range should return a null startblock.
> 
> isnullstartblock returns true for both DELAYSTARTBLOCK and
> HOLESTARTBLOCK, so we want to be explicit we can check for
> br_startblock == DELAYSTARTBLOCK.

True.

> 
> Note that we also need to explicitly check for br_state == 
> XFS_EXT_UNWRITTEN to set the correct type for the ioend structure.

Yes. As i mentioned on IRC I hacked a  quick prototype together to
test this out and did exactly this. ;)

> 
> > XFS_BMAPI_IGSTATE does a couple of things. Firstly, It prevents
> > xfs_bmapi() from allocating new extents (turns off write mode).
> > This isn't an issue where it is used because neither of the call
> > sites set XFS_BMAPI_WRITE.
> 
> I've been down enough to understand what it does.  But yes, the
> single large I/O mapping might explain why we want it.  The next
> version of xfs_map_blocks will get a comment for it..
> 
> > In fact, we probably should be setting this for normal written
> > extents as well, so that the case:
> > 
> > 	A               B                 C
> > 	+---------------+-----------------+
> > 	    written          unwritten
> > 
> > is also handled with the same optimisation. That makes handling
> > unwritten and overwrites identical, with only delalloc being
> > different. If we assume delalloc when we get a null startblock,
> > then we don't need to look at the buffer state at all for the
> > initial mapping.
> 
> With the current xfs_bmapi code that won't work.  When merging a second
> extent into the first we only add up the br_blockcount.  So for the
> case above we'd get an extent returned that's not marked as unwrittent
> and we wouldn't mark the ioend as unwrittent and thus perform not
> extent conversion after I/O completion.  Just adding XFS_BMAPI_IGSTATE
> blindly for the delalloc case on the other hand is fine, as the
> merging of delayed extents is handled in a different if clause that
> totally ignores XFS_BMAPI_IGSTATE.
>
> The potention fix for this is to always set br_state if one of the
> merged extents is an unwrittent extent. The only other caller is
> xfs_getbmap which does report the unwrittent state to userspace,
> but already is sloppy for merging the other way if BMV_IF_PREALLOC
> is not set, so I can't see how beening sloppy this way to makes any
> difference.

Yup:


@@ -4827,6 +4827,18 @@ xfs_bmapi(
 			ASSERT(mval->br_startoff ==
 			       mval[-1].br_startoff + mval[-1].br_blockcount);
 			mval[-1].br_blockcount += mval->br_blockcount;
+			/*
+			 * if one of the extent types is unwritten, make sure
+			 * the extent is reported as unwritten so the caller
+			 * always takes the correct action for unwritten
+			 * extents. This means we always return consistent
+			 * state regardless of whether we find a written or
+			 * unwritten extent first.
+			 */
+			if (mval[-1].br_state != XFS_EXT_UNWRITTEN &&
+			    mval->br_state == XFS_EXT_UNWRITTEN) {
+				mval[-1].br_state = XFS_EXT_UNWRITTEN;
+			}
 		} else if (n > 0 &&
 			   mval->br_startblock == DELAYSTARTBLOCK &&
 			   mval[-1].br_startblock == DELAYSTARTBLOCK &&

> > It seems to me that with such modifications, the only thing that we
> > are using the bufferhead for is the buffer_uptodate() flag to
> > determine if we should write the block or not. If we can find some
> > other way of holding this state then we don't need bufferheads in
> > the write path at all, right?
> 
> There's really two steps.  First we can stop needing buffers headers
> for the space allocation / mapping.  This is doable with the slight
> tweak of XFS_BMAPI_IGSTATE semantics.
> 
> We still do need to set BH_Delay or BH_Unwrittent for use in
> __block_write_begin and block_truncate_page, but they become completely
> interchangeable at that point.
> 
> If we want to completely get rid of buffers heads things are a bit
> more complicated.  It's doable as shown by the _nobh aops, but we'll
> use quite a bit of per-block state that needs to be replaced by per-page
> state,

Sure, or use a similar method to btrfs which stores dirty state bits
in a separate extent tree. Worst case memory usage is still much
less than a bufferhead per block...

> and we'll lose the way to cache the block number in the buffer
> head.  While we don't make use of that in writepage we do so in
> the write path, although I'm not sure how important it is.  If we
> get your multi-page write work in it probably won't matter any more.

The only place we use bh->b_blocknr is for ioend manipulation. Am I
missing something else?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-19 13:31           ` Dave Chinner
@ 2011-01-19 13:55             ` Christoph Hellwig
  2011-01-20  1:33               ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2011-01-19 13:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bpm, xfs

On Thu, Jan 20, 2011 at 12:31:47AM +1100, Dave Chinner wrote:
> > If we want to completely get rid of buffers heads things are a bit
> > more complicated.  It's doable as shown by the _nobh aops, but we'll
> > use quite a bit of per-block state that needs to be replaced by per-page
> > state,
> 
> Sure, or use a similar method to btrfs which stores dirty state bits
> in a separate extent tree. Worst case memory usage is still much
> less than a bufferhead per block...

I'm not sure need to track sub-page dirty state.  It only matters if we:

 a) have a file fragmented enough that it has multiple extents allocated
    inside a single page
 b) have enough small writes that just dirty parts of a page

with a good enough persistant preallocation a) should happen almost
never, while b) might be an issue, specially with setups of 64k
page size and 4k blocks (e.g. ppc64 enterprise distro configs)

> > and we'll lose the way to cache the block number in the buffer
> > head.  While we don't make use of that in writepage we do so in
> > the write path, although I'm not sure how important it is.  If we
> > get your multi-page write work in it probably won't matter any more.
> 
> The only place we use bh->b_blocknr is for ioend manipulation. Am I
> missing something else?

You're right.  I thought we use it in the write path, but we only
care about the buffer_mapped flag, but never actually look at the
block number.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-19 13:55             ` Christoph Hellwig
@ 2011-01-20  1:33               ` Dave Chinner
  2011-01-20 11:16                 ` Christoph Hellwig
  2011-01-20 14:45                 ` Geoffrey Wehrman
  0 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2011-01-20  1:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bpm, xfs

On Wed, Jan 19, 2011 at 08:55:48AM -0500, Christoph Hellwig wrote:
> On Thu, Jan 20, 2011 at 12:31:47AM +1100, Dave Chinner wrote:
> > > If we want to completely get rid of buffers heads things are a bit
> > > more complicated.  It's doable as shown by the _nobh aops, but we'll
> > > use quite a bit of per-block state that needs to be replaced by per-page
> > > state,
> > 
> > Sure, or use a similar method to btrfs which stores dirty state bits
> > in a separate extent tree. Worst case memory usage is still much
> > less than a bufferhead per block...
> 
> I'm not sure need to track sub-page dirty state.  It only matters if we:
> 
>  a) have a file fragmented enough that it has multiple extents allocated
>     inside a single page
>  b) have enough small writes that just dirty parts of a page
> 
> with a good enough persistant preallocation a) should happen almost
> never, while b) might be an issue, specially with setups of 64k
> page size and 4k blocks (e.g. ppc64 enterprise distro configs)

Right - case a) could probably be handled by making the page size
an implicit extsize hint so we always try to minimise sub-page
fragmentation during allocation.

It's case b) that I'm mainly worried about, esp. w.r.t the 64k page
size on ia64/ppc. If we only track a single dirty bit in the page,
then every sub-page, non-appending write to an uncached region of a
file becomes a RMW cycle to initialise the areas around the write
correctly. The question is whether we care about this enough given
that we return at least PAGE_SIZE in stat() to tell applications the
optimal IO size to avoid RMW cycles.

Given that XFS is aimed towards optimising for the large file/large
IO/high throughput type of application, I'm comfortable with saying
that avoiding sub-page writes for optimal throughput IO is an
application problem and going from there. Especially considering
that stuff like rsync and untarring kernel tarballs are all
appending writes so won't take any performance hit at all...

And if we only do IO on whole pages (i.e regardless of block size)
.writepage suddenly becomes a lot simpler, as well as being trivial
to implement our own .readpage/.readpages....

What do people think about this?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-20  1:33               ` Dave Chinner
@ 2011-01-20 11:16                 ` Christoph Hellwig
  2011-01-21  1:59                   ` Dave Chinner
  2011-01-20 14:45                 ` Geoffrey Wehrman
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2011-01-20 11:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, bpm, xfs

On Thu, Jan 20, 2011 at 12:33:46PM +1100, Dave Chinner wrote:
> It's case b) that I'm mainly worried about, esp. w.r.t the 64k page
> size on ia64/ppc. If we only track a single dirty bit in the page,
> then every sub-page, non-appending write to an uncached region of a
> file becomes a RMW cycle to initialise the areas around the write
> correctly. The question is whether we care about this enough given
> that we return at least PAGE_SIZE in stat() to tell applications the
> optimal IO size to avoid RMW cycles.

Note that this generally is only true for the first write into the
region - after that we'll have the rest read into the cache.  But
we also have the same issue for appending writes if they aren't
page aligned.

> And if we only do IO on whole pages (i.e regardless of block size)
> .writepage suddenly becomes a lot simpler, as well as being trivial
> to implement our own .readpage/.readpages....

I don't think it simplifies writepage a lot.  All the buffer head
handling goes away, but we'll still need to do xfs_bmapi calls at
block size granularity.  Why would you want to replaced the
readpage/readpages code?  The generic mpage helpers for it do just fine.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-20  1:33               ` Dave Chinner
  2011-01-20 11:16                 ` Christoph Hellwig
@ 2011-01-20 14:45                 ` Geoffrey Wehrman
  2011-01-21  2:51                   ` Dave Chinner
  1 sibling, 1 reply; 30+ messages in thread
From: Geoffrey Wehrman @ 2011-01-20 14:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, bpm, xfs

On Thu, Jan 20, 2011 at 12:33:46PM +1100, Dave Chinner wrote:
| Given that XFS is aimed towards optimising for the large file/large
| IO/high throughput type of application, I'm comfortable with saying
| that avoiding sub-page writes for optimal throughput IO is an
| application problem and going from there. Especially considering
| that stuff like rsync and untarring kernel tarballs are all
| appending writes so won't take any performance hit at all...

I agree.  I do not expect systems with 64K pages to be used for single
bit manipulations.  However, I do see a couple of potential problems.

The one place where page size I/O may not work though is for an DMAPI/HSM
(DMF) managed filesystem where some blocks are managed on near-line media.
The DMAPI needs to be able to remove and restore extents on a filesystem
block boundary, not a page boundary.

The other downside is that for sparse files, we could end up allocating
space for zero filled blocks.  There may be some workloads where
significant quantities of space are wasted.


-- 
Geoffrey Wehrman

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-20 11:16                 ` Christoph Hellwig
@ 2011-01-21  1:59                   ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2011-01-21  1:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bpm, xfs

On Thu, Jan 20, 2011 at 06:16:12AM -0500, Christoph Hellwig wrote:
> On Thu, Jan 20, 2011 at 12:33:46PM +1100, Dave Chinner wrote:
> > It's case b) that I'm mainly worried about, esp. w.r.t the 64k page
> > size on ia64/ppc. If we only track a single dirty bit in the page,
> > then every sub-page, non-appending write to an uncached region of a
> > file becomes a RMW cycle to initialise the areas around the write
> > correctly. The question is whether we care about this enough given
> > that we return at least PAGE_SIZE in stat() to tell applications the
> > optimal IO size to avoid RMW cycles.
> 
> Note that this generally is only true for the first write into the
> region - after that we'll have the rest read into the cache.  But
> we also have the same issue for appending writes if they aren't
> page aligned.

True - I kind of implied that by saying RMW cycles are limited to
"uncached regions", but you've stated in a much clearer and easier
to understand way. ;)

> > And if we only do IO on whole pages (i.e regardless of block size)
> > .writepage suddenly becomes a lot simpler, as well as being trivial
> > to implement our own .readpage/.readpages....
> 
> I don't think it simplifies writepage a lot.  All the buffer head
> handling goes away, but we'll still need to do xfs_bmapi calls at
> block size granularity.  Why would you want to replaced the
> readpage/readpages code?  The generic mpage helpers for it do just fine.

When I went through the mpage code I found there were cases that it
would attached bufferheads to pages or assume PagePrivate() contains
a bufferhead list. e.g. If there are multiple holes in the page, it
will fall through to block_read_full_page() which makes this
assumption.  If we want/need to keep any of our own state on
PagePrivate(), we cannot use any function that assumes PagePrivate()
is used to hold bufferheads for the page.

Quite frankly, a simple extent mapping loop like we do for
.writepage is far simpler than what mpage_readpages does. This is
what btrfs does (extent_readpages/__extent_read_full_page), and that
is far easier to follow and understand than mpage_do_readpage()....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-20 14:45                 ` Geoffrey Wehrman
@ 2011-01-21  2:51                   ` Dave Chinner
  2011-01-21 14:41                     ` Geoffrey Wehrman
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2011-01-21  2:51 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: Christoph Hellwig, bpm, xfs

On Thu, Jan 20, 2011 at 08:45:03AM -0600, Geoffrey Wehrman wrote:
> On Thu, Jan 20, 2011 at 12:33:46PM +1100, Dave Chinner wrote:
> | Given that XFS is aimed towards optimising for the large file/large
> | IO/high throughput type of application, I'm comfortable with saying
> | that avoiding sub-page writes for optimal throughput IO is an
> | application problem and going from there. Especially considering
> | that stuff like rsync and untarring kernel tarballs are all
> | appending writes so won't take any performance hit at all...
> 
> I agree.  I do not expect systems with 64K pages to be used for single
> bit manipulations.  However, I do see a couple of potential problems.
> 
> The one place where page size I/O may not work though is for an DMAPI/HSM
> (DMF) managed filesystem where some blocks are managed on near-line media.
> The DMAPI needs to be able to remove and restore extents on a filesystem
> block boundary, not a page boundary.

DMAPI is still free to remove extents at whatever boundary it wants.
The only difference is that it would be asked to restore extents to
the page boundary the write covers rather than a block boundary. The
allocation and direct IO boundaries do not change, so the only thing
that needs to change is the range that DMAPI is told that the
read/write is going to cover....

> The other downside is that for sparse files, we could end up allocating
> space for zero filled blocks.  There may be some workloads where
> significant quantities of space are wasted.

Yes, that is possible, though on the other hand is will reduce worst
case fragmentation of pathological sparse file filling applications.
e.g. out-of-core solvers that do strided writes across the file to
write the first column in the result matrix as it is calculated,
then the second column, then the third ...  until all columns are
written.

----

Realistically, for every disadvantage or advantage we can enumerate
for specific workloads, I think one of us will be able to come up
with a counter example that shows the opposite of the original
point. I don't think this sort of argument is particularly
productive. :/

Instead, I look at it from the point of view that a 64k IO is little
slower than a 4k IO so such a change would not make much difference
to performance. And given that terabytes of storage capacity is
_cheap_ these days (and getting cheaper all the time), the extra
space of using 64k instead of 4k for sparse blocks isn't a big deal.

When I combine that with my experience from SGI where we always
recommended using filesystems block size == page size for best IO
performance on HPC setups, there's a fair argument that using page
size extents for small sparse writes isn't a problem we really need
to care about.

І'd prefer to design for where we expect storage to be in the next
few years e.g. 10TB spindles. Minimising space usage is not a big
priority when we consider that in 2-3 years 100TB of storage will
cost less than $5000 (it's about $15-20k right now).  Even on
desktops we're going to have more capacity that we know what to do
with, so trading off storage space for lower memory overhead, lower
metadata IO overhead and lower potential fragmentation seems like
the right way to move forward to me.

Does that seem like a reasonable position to take, or are there
other factors that you think I should be considering?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-21  2:51                   ` Dave Chinner
@ 2011-01-21 14:41                     ` Geoffrey Wehrman
  2011-01-23 23:26                       ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Geoffrey Wehrman @ 2011-01-21 14:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, bpm, xfs

On Fri, Jan 21, 2011 at 01:51:40PM +1100, Dave Chinner wrote:
| Realistically, for every disadvantage or advantage we can enumerate
| for specific workloads, I think one of us will be able to come up
| with a counter example that shows the opposite of the original
| point. I don't think this sort of argument is particularly
| productive. :/

Sorry, I wasn't trying to be argumentative.  Rather I was just documenting
what I saw as potential issues.  I'm not arguing against your proposed
change.  If you don't find my sharing of observations productive, I'm
happy to keep my thoughts to my self in the future.

| Instead, I look at it from the point of view that a 64k IO is little
| slower than a 4k IO so such a change would not make much difference
| to performance. And given that terabytes of storage capacity is
| _cheap_ these days (and getting cheaper all the time), the extra
| space of using 64k instead of 4k for sparse blocks isn't a big deal.
| 
| When I combine that with my experience from SGI where we always
| recommended using filesystems block size == page size for best IO
| performance on HPC setups, there's a fair argument that using page
| size extents for small sparse writes isn't a problem we really need
| to care about.
| 
| І'd prefer to design for where we expect storage to be in the next
| few years e.g. 10TB spindles. Minimising space usage is not a big
| priority when we consider that in 2-3 years 100TB of storage will
| cost less than $5000 (it's about $15-20k right now).  Even on
| desktops we're going to have more capacity that we know what to do
| with, so trading off storage space for lower memory overhead, lower
| metadata IO overhead and lower potential fragmentation seems like
| the right way to move forward to me.
| 
| Does that seem like a reasonable position to take, or are there
| other factors that you think I should be considering?

Keep in mind that storage of the future may not be on spindles, and
fragmentation may not be an issue.  Even so, with SSD 64K I/O is very
reasonable as most flash memory implements at a minimum 64K page.  I'm
fully in favor your proposal to require page sized I/O.


-- 
Geoffrey Wehrman

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Issues with delalloc->real extent allocation
  2011-01-21 14:41                     ` Geoffrey Wehrman
@ 2011-01-23 23:26                       ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2011-01-23 23:26 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: Christoph Hellwig, bpm, xfs

On Fri, Jan 21, 2011 at 08:41:52AM -0600, Geoffrey Wehrman wrote:
> On Fri, Jan 21, 2011 at 01:51:40PM +1100, Dave Chinner wrote:
> | Realistically, for every disadvantage or advantage we can enumerate
> | for specific workloads, I think one of us will be able to come up
> | with a counter example that shows the opposite of the original
> | point. I don't think this sort of argument is particularly
> | productive. :/
> 
> Sorry, I wasn't trying to be argumentative.  Rather I was just documenting
> what I saw as potential issues.  I'm not arguing against your proposed
> change.  If you don't find my sharing of observations productive, I'm
> happy to keep my thoughts to my self in the future.

Ah, that's not what I meant, Geoffrey. Reading it back, I probably
should have said "direction of discussion" rather than "sort of
argument" to make it more obvious I trying not to get stuck with us
goign round and round trying to demonstrate the pros and cons of
different approaches on a workload-by-workload basis.

Basically all I was trying to do is move the discusion past a
potential sticking point - I definitely value the input and insight
you provide, and I'll try to write more clearly to hopefully avoid
such misunderstandings in future discussions.

> | Instead, I look at it from the point of view that a 64k IO is little
> | slower than a 4k IO so such a change would not make much difference
> | to performance. And given that terabytes of storage capacity is
> | _cheap_ these days (and getting cheaper all the time), the extra
> | space of using 64k instead of 4k for sparse blocks isn't a big deal.
> | 
> | When I combine that with my experience from SGI where we always
> | recommended using filesystems block size == page size for best IO
> | performance on HPC setups, there's a fair argument that using page
> | size extents for small sparse writes isn't a problem we really need
> | to care about.
> | 
> | І'd prefer to design for where we expect storage to be in the next
> | few years e.g. 10TB spindles. Minimising space usage is not a big
> | priority when we consider that in 2-3 years 100TB of storage will
> | cost less than $5000 (it's about $15-20k right now).  Even on
> | desktops we're going to have more capacity that we know what to do
> | with, so trading off storage space for lower memory overhead, lower
> | metadata IO overhead and lower potential fragmentation seems like
> | the right way to move forward to me.
> | 
> | Does that seem like a reasonable position to take, or are there
> | other factors that you think I should be considering?
> 
> Keep in mind that storage of the future may not be on spindles, and
> fragmentation may not be an issue.  Even so, with SSD 64K I/O is very
> reasonable as most flash memory implements at a minimum 64K page.  I'm
> fully in favor your proposal to require page sized I/O.

With flash memory there is the potential that we don't even need to
care. The trend is towards on-device compression (e.g. Sandforce
controllers already do this) to reduce write amplification to values
lower than one. Hence a 4k write surrounded by 60k of zeros is
unlikely to be a major issue as it will compress really well.... :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2011-01-23 23:24 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-14  0:29 Issues with delalloc->real extent allocation Dave Chinner
2011-01-14 16:40 ` Geoffrey Wehrman
2011-01-14 22:59   ` Dave Chinner
2011-01-15  4:16     ` Geoffrey Wehrman
2011-01-17  5:18       ` Dave Chinner
2011-01-17 14:37         ` Geoffrey Wehrman
2011-01-18  0:24           ` Dave Chinner
2011-01-18 14:30             ` Geoffrey Wehrman
2011-01-18 20:40               ` Christoph Hellwig
2011-01-18 22:03                 ` Dave Chinner
2011-01-14 21:43 ` bpm
2011-01-14 23:32   ` bpm
2011-01-14 23:50   ` bpm
2011-01-14 23:55   ` Dave Chinner
2011-01-17 20:12     ` bpm
2011-01-18  1:44       ` Dave Chinner
2011-01-18 20:47     ` Christoph Hellwig
2011-01-18 23:18       ` Dave Chinner
2011-01-19 12:03         ` Christoph Hellwig
2011-01-19 13:31           ` Dave Chinner
2011-01-19 13:55             ` Christoph Hellwig
2011-01-20  1:33               ` Dave Chinner
2011-01-20 11:16                 ` Christoph Hellwig
2011-01-21  1:59                   ` Dave Chinner
2011-01-20 14:45                 ` Geoffrey Wehrman
2011-01-21  2:51                   ` Dave Chinner
2011-01-21 14:41                     ` Geoffrey Wehrman
2011-01-23 23:26                       ` Dave Chinner
2011-01-17  0:28   ` Lachlan McIlroy
2011-01-17  4:37     ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.