* [PATCH 00/27] patch queue for Linux 3.1
@ 2011-06-29 14:01 Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 01/27] xfs: PF_FSTRANS should never be set in ->writepage Christoph Hellwig
                   ` (26 more replies)
  0 siblings, 27 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

This is my current patch queue for Linux 3.1.  It includes all previously
sent patches I'm planning for Linux 3.1 inclusion through the XFS tree and
a few new ones.  The most important new bit is a cleanup of the structures
describing the dir2 on-disk format, which has become a bit more urgent now
that recent gcc versions complain about the hacks used in the current
version.

The sync livelock fix is included only in a minimal version that removes
the data syncs.  I plan to sort out the iocount waiting via the i_alloc_sem
removal patches that have been sent for inclusion in the VFS tree.  I'll
cc the XFS list on the updated version with the XFS changes.


* [PATCH 01/27] xfs: PF_FSTRANS should never be set in ->writepage
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  1:34   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 02/27] xfs: remove the unused ilock_nowait codepath in writepage Christoph Hellwig
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-writepage-simplify-fstrans-check --]
[-- Type: text/plain, Size: 1688 bytes --]

Now that we reject direct reclaim, in addition to always using GFP_NOFS
allocations, there is no chance we will ever end up in ->writepage with
PF_FSTRANS set.  Add a WARN_ON for this case, and stop checking whether
we would actually need to start a transaction.
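
For context (not part of this patch): PF_FSTRANS is the per-task flag XFS
uses to mark "we are inside a filesystem transaction".  A much simplified
sketch of the idea follows; the demo_ helpers are hypothetical, and the real
code saves and restores the previous flag state rather than clearing it
unconditionally:

	/* set when a transaction is allocated; memory allocations made
	 * while this is set must not recurse back into the filesystem */
	static inline void demo_fstrans_enter(void)
	{
		current->flags |= PF_FSTRANS;
	}

	/* cleared again when the transaction commits or is cancelled */
	static inline void demo_fstrans_exit(void)
	{
		current->flags &= ~PF_FSTRANS;
	}

Seeing the flag in ->writepage would therefore mean writeback recursed out
of a transaction (e.g. via direct reclaim), which the changes named above
(rejecting direct reclaim, GFP_NOFS allocations) rule out.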

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-04-27 20:51:57.503817127 +0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c	2011-04-27 20:53:02.186800044 +0200
@@ -906,7 +906,6 @@ xfs_vm_writepage(
 	struct writeback_control *wbc)
 {
 	struct inode		*inode = page->mapping->host;
-	int			delalloc, unwritten;
 	struct buffer_head	*bh, *head;
 	struct xfs_bmbt_irec	imap;
 	xfs_ioend_t		*ioend = NULL, *iohead = NULL;
@@ -938,15 +937,10 @@ xfs_vm_writepage(
 		goto redirty;
 
 	/*
-	 * We need a transaction if there are delalloc or unwritten buffers
-	 * on the page.
-	 *
-	 * If we need a transaction and the process flags say we are already
-	 * in a transaction, or no IO is allowed then mark the page dirty
-	 * again and leave the page as is.
+	 * Given that we do not allow direct reclaim to call us we should
+	 * never be called while in a filesystem transaction.
 	 */
-	xfs_count_page_state(page, &delalloc, &unwritten);
-	if ((current->flags & PF_FSTRANS) && (delalloc || unwritten))
+	if (WARN_ON(current->flags & PF_FSTRANS))
 		goto redirty;
 
 	/* Is this page beyond the end of the file? */


* [PATCH 02/27] xfs: remove the unused ilock_nowait codepath in writepage
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 01/27] xfs: PF_FSTRANS should never be set in ->writepage Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  0:15   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 03/27] xfs: use write_cache_pages for writeback clustering Christoph Hellwig
                   ` (24 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-writepage-remove-nonblock --]
[-- Type: text/plain, Size: 2033 bytes --]

wbc->nonblocking is never set, so all of this code has been unreachable
for a long time.  I'm also not sure it would make much sense: we would
rather finish our writeout after a short wait for the ilock than cancel
the whole ioend.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-04-27 20:54:19.763046444 +0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c	2011-04-27 20:54:41.922926393 +0200
@@ -305,8 +305,7 @@ xfs_map_blocks(
 	struct inode		*inode,
 	loff_t			offset,
 	struct xfs_bmbt_irec	*imap,
-	int			type,
-	int			nonblocking)
+	int			type)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -322,11 +321,7 @@ xfs_map_blocks(
 	if (type == IO_UNWRITTEN)
 		bmapi_flags |= XFS_BMAPI_IGSTATE;
 
-	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
-		if (nonblocking)
-			return -XFS_ERROR(EAGAIN);
-		xfs_ilock(ip, XFS_ILOCK_SHARED);
-	}
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
 
 	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
 	       (ip->i_df.if_flags & XFS_IFEXTENTS));
@@ -916,7 +911,6 @@ xfs_vm_writepage(
 	ssize_t			len;
 	int			err, imap_valid = 0, uptodate = 1;
 	int			count = 0;
-	int			nonblocking = 0;
 
 	trace_xfs_writepage(inode, page, 0);
 
@@ -964,9 +958,6 @@ xfs_vm_writepage(
 	offset = page_offset(page);
 	type = IO_OVERWRITE;
 
-	if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking)
-		nonblocking = 1;
-
 	do {
 		int new_ioend = 0;
 
@@ -1021,8 +1012,7 @@ xfs_vm_writepage(
 			 * time.
 			 */
 			new_ioend = 1;
-			err = xfs_map_blocks(inode, offset, &imap, type,
-					     nonblocking);
+			err = xfs_map_blocks(inode, offset, &imap, type);
 			if (err)
 				goto error;
 			imap_valid = xfs_imap_valid(inode, &imap, offset);


* [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 01/27] xfs: PF_FSTRANS should never be set in ->writepage Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 02/27] xfs: remove the unused ilock_nowait codepath in writepage Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  2:00   ` Dave Chinner
  2011-07-01  2:22   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 04/27] xfs: cleanup xfs_add_to_ioend Christoph Hellwig
                   ` (23 subsequent siblings)
  26 siblings, 2 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-implement-writepages --]
[-- Type: text/plain, Size: 11739 bytes --]

Instead of implementing our own writeback clustering, use write_cache_pages
to do it for us.  This means the guts of the current writepage implementation
become a new helper used both for implementing ->writepage and as a callback
to write_cache_pages for ->writepages.  A new struct xfs_writeback_ctx
is used to track the block mapping state and the ioend chain over multiple
invocations of it.

The advantages over the old code are that we avoid a double pagevec lookup,
handle extent boundaries inside a page more efficiently for small blocksize
filesystems, and carry less XFS-specific code.

The downside is that we no longer do writeback clustering when called from
kswapd, but that is a case that should be avoided anyway.  Note that we
still convert the whole delalloc range from ->writepage, so
the on-disk allocation pattern is not affected.
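
As an illustration of the callback pattern this relies on (not part of the
patch; the demo_ names are hypothetical), write_cache_pages() walks the
dirty pages of a mapping and hands each locked page to a callback together
with an opaque data pointer, which is what lets struct xfs_writeback_ctx
carry the mapping and ioend state across all invocations:

	struct demo_ctx {
		int	pages_seen;	/* state shared across all pages */
	};

	static int demo_writepage(struct page *page,
				  struct writeback_control *wbc, void *data)
	{
		struct demo_ctx *ctx = data;

		ctx->pages_seen++;
		/* real code would map buffers and start writeback here */
		unlock_page(page);
		return 0;
	}

	static int demo_writepages(struct address_space *mapping,
				   struct writeback_control *wbc)
	{
		struct demo_ctx ctx = { };

		return write_cache_pages(mapping, wbc, demo_writepage, &ctx);
	}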

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-04-27 20:55:01.482820427 +0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c	2011-04-28 11:22:42.747447011 +0200
@@ -38,6 +38,12 @@
 #include <linux/pagevec.h>
 #include <linux/writeback.h>
 
+struct xfs_writeback_ctx {
+	unsigned int		imap_valid;
+	struct xfs_bmbt_irec	imap;
+	struct xfs_ioend	*iohead;
+	struct xfs_ioend	*ioend;
+};
 
 /*
  * Prime number of hash buckets since address is used as the key.
@@ -487,6 +493,7 @@ xfs_submit_ioend(
 	struct buffer_head	*bh;
 	struct bio		*bio;
 	sector_t		lastblock = 0;
+	struct blk_plug		plug;
 
 	/* Pass 1 - start writeback */
 	do {
@@ -496,6 +503,7 @@ xfs_submit_ioend(
 	} while ((ioend = next) != NULL);
 
 	/* Pass 2 - submit I/O */
+	blk_start_plug(&plug);
 	ioend = head;
 	do {
 		next = ioend->io_list;
@@ -522,6 +530,7 @@ xfs_submit_ioend(
 			xfs_submit_ioend_bio(wbc, ioend, bio);
 		xfs_finish_ioend(ioend);
 	} while ((ioend = next) != NULL);
+	blk_finish_plug(&plug);
 }
 
 /*
@@ -661,153 +670,6 @@ xfs_is_delayed_page(
 	return 0;
 }
 
-/*
- * Allocate & map buffers for page given the extent map. Write it out.
- * except for the original page of a writepage, this is called on
- * delalloc/unwritten pages only, for the original page it is possible
- * that the page has no mapping at all.
- */
-STATIC int
-xfs_convert_page(
-	struct inode		*inode,
-	struct page		*page,
-	loff_t			tindex,
-	struct xfs_bmbt_irec	*imap,
-	xfs_ioend_t		**ioendp,
-	struct writeback_control *wbc)
-{
-	struct buffer_head	*bh, *head;
-	xfs_off_t		end_offset;
-	unsigned long		p_offset;
-	unsigned int		type;
-	int			len, page_dirty;
-	int			count = 0, done = 0, uptodate = 1;
- 	xfs_off_t		offset = page_offset(page);
-
-	if (page->index != tindex)
-		goto fail;
-	if (!trylock_page(page))
-		goto fail;
-	if (PageWriteback(page))
-		goto fail_unlock_page;
-	if (page->mapping != inode->i_mapping)
-		goto fail_unlock_page;
-	if (!xfs_is_delayed_page(page, (*ioendp)->io_type))
-		goto fail_unlock_page;
-
-	/*
-	 * page_dirty is initially a count of buffers on the page before
-	 * EOF and is decremented as we move each into a cleanable state.
-	 *
-	 * Derivation:
-	 *
-	 * End offset is the highest offset that this page should represent.
-	 * If we are on the last page, (end_offset & (PAGE_CACHE_SIZE - 1))
-	 * will evaluate non-zero and be less than PAGE_CACHE_SIZE and
-	 * hence give us the correct page_dirty count. On any other page,
-	 * it will be zero and in that case we need page_dirty to be the
-	 * count of buffers on the page.
-	 */
-	end_offset = min_t(unsigned long long,
-			(xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT,
-			i_size_read(inode));
-
-	len = 1 << inode->i_blkbits;
-	p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1),
-					PAGE_CACHE_SIZE);
-	p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE;
-	page_dirty = p_offset / len;
-
-	bh = head = page_buffers(page);
-	do {
-		if (offset >= end_offset)
-			break;
-		if (!buffer_uptodate(bh))
-			uptodate = 0;
-		if (!(PageUptodate(page) || buffer_uptodate(bh))) {
-			done = 1;
-			continue;
-		}
-
-		if (buffer_unwritten(bh) || buffer_delay(bh) ||
-		    buffer_mapped(bh)) {
-			if (buffer_unwritten(bh))
-				type = IO_UNWRITTEN;
-			else if (buffer_delay(bh))
-				type = IO_DELALLOC;
-			else
-				type = IO_OVERWRITE;
-
-			if (!xfs_imap_valid(inode, imap, offset)) {
-				done = 1;
-				continue;
-			}
-
-			lock_buffer(bh);
-			if (type != IO_OVERWRITE)
-				xfs_map_at_offset(inode, bh, imap, offset);
-			xfs_add_to_ioend(inode, bh, offset, type,
-					 ioendp, done);
-
-			page_dirty--;
-			count++;
-		} else {
-			done = 1;
-		}
-	} while (offset += len, (bh = bh->b_this_page) != head);
-
-	if (uptodate && bh == head)
-		SetPageUptodate(page);
-
-	if (count) {
-		if (--wbc->nr_to_write <= 0 &&
-		    wbc->sync_mode == WB_SYNC_NONE)
-			done = 1;
-	}
-	xfs_start_page_writeback(page, !page_dirty, count);
-
-	return done;
- fail_unlock_page:
-	unlock_page(page);
- fail:
-	return 1;
-}
-
-/*
- * Convert & write out a cluster of pages in the same extent as defined
- * by mp and following the start page.
- */
-STATIC void
-xfs_cluster_write(
-	struct inode		*inode,
-	pgoff_t			tindex,
-	struct xfs_bmbt_irec	*imap,
-	xfs_ioend_t		**ioendp,
-	struct writeback_control *wbc,
-	pgoff_t			tlast)
-{
-	struct pagevec		pvec;
-	int			done = 0, i;
-
-	pagevec_init(&pvec, 0);
-	while (!done && tindex <= tlast) {
-		unsigned len = min_t(pgoff_t, PAGEVEC_SIZE, tlast - tindex + 1);
-
-		if (!pagevec_lookup(&pvec, inode->i_mapping, tindex, len))
-			break;
-
-		for (i = 0; i < pagevec_count(&pvec); i++) {
-			done = xfs_convert_page(inode, pvec.pages[i], tindex++,
-					imap, ioendp, wbc);
-			if (done)
-				break;
-		}
-
-		pagevec_release(&pvec);
-		cond_resched();
-	}
-}
-
 STATIC void
 xfs_vm_invalidatepage(
 	struct page		*page,
@@ -896,20 +758,20 @@ out_invalidate:
  * redirty the page.
  */
 STATIC int
-xfs_vm_writepage(
+__xfs_vm_writepage(
 	struct page		*page,
-	struct writeback_control *wbc)
+	struct writeback_control *wbc,
+	void			*data)
 {
+	struct xfs_writeback_ctx *ctx = data;
 	struct inode		*inode = page->mapping->host;
 	struct buffer_head	*bh, *head;
-	struct xfs_bmbt_irec	imap;
-	xfs_ioend_t		*ioend = NULL, *iohead = NULL;
 	loff_t			offset;
 	unsigned int		type;
 	__uint64_t              end_offset;
 	pgoff_t                 end_index, last_index;
 	ssize_t			len;
-	int			err, imap_valid = 0, uptodate = 1;
+	int			err, uptodate = 1;
 	int			count = 0;
 
 	trace_xfs_writepage(inode, page, 0);
@@ -917,20 +779,6 @@ xfs_vm_writepage(
 	ASSERT(page_has_buffers(page));
 
 	/*
-	 * Refuse to write the page out if we are called from reclaim context.
-	 *
-	 * This avoids stack overflows when called from deeply used stacks in
-	 * random callers for direct reclaim or memcg reclaim.  We explicitly
-	 * allow reclaim from kswapd as the stack usage there is relatively low.
-	 *
-	 * This should really be done by the core VM, but until that happens
-	 * filesystems like XFS, btrfs and ext4 have to take care of this
-	 * by themselves.
-	 */
-	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)
-		goto redirty;
-
-	/*
 	 * Given that we do not allow direct reclaim to call us we should
 	 * never be called while in a filesystem transaction.
 	 */
@@ -973,36 +821,38 @@ xfs_vm_writepage(
 		 * buffers covering holes here.
 		 */
 		if (!buffer_mapped(bh) && buffer_uptodate(bh)) {
-			imap_valid = 0;
+			ctx->imap_valid = 0;
 			continue;
 		}
 
 		if (buffer_unwritten(bh)) {
 			if (type != IO_UNWRITTEN) {
 				type = IO_UNWRITTEN;
-				imap_valid = 0;
+				ctx->imap_valid = 0;
 			}
 		} else if (buffer_delay(bh)) {
 			if (type != IO_DELALLOC) {
 				type = IO_DELALLOC;
-				imap_valid = 0;
+				ctx->imap_valid = 0;
 			}
 		} else if (buffer_uptodate(bh)) {
 			if (type != IO_OVERWRITE) {
 				type = IO_OVERWRITE;
-				imap_valid = 0;
+				ctx->imap_valid = 0;
 			}
 		} else {
 			if (PageUptodate(page)) {
 				ASSERT(buffer_mapped(bh));
-				imap_valid = 0;
+				ctx->imap_valid = 0;
 			}
 			continue;
 		}
 
-		if (imap_valid)
-			imap_valid = xfs_imap_valid(inode, &imap, offset);
-		if (!imap_valid) {
+		if (ctx->imap_valid) {
+			ctx->imap_valid =
+				xfs_imap_valid(inode, &ctx->imap, offset);
+		}
+		if (!ctx->imap_valid) {
 			/*
 			 * If we didn't have a valid mapping then we need to
 			 * put the new mapping into a separate ioend structure.
@@ -1012,22 +862,25 @@ xfs_vm_writepage(
 			 * time.
 			 */
 			new_ioend = 1;
-			err = xfs_map_blocks(inode, offset, &imap, type);
+			err = xfs_map_blocks(inode, offset, &ctx->imap, type);
 			if (err)
 				goto error;
-			imap_valid = xfs_imap_valid(inode, &imap, offset);
+			ctx->imap_valid =
+				xfs_imap_valid(inode, &ctx->imap, offset);
 		}
-		if (imap_valid) {
+		if (ctx->imap_valid) {
 			lock_buffer(bh);
-			if (type != IO_OVERWRITE)
-				xfs_map_at_offset(inode, bh, &imap, offset);
-			xfs_add_to_ioend(inode, bh, offset, type, &ioend,
+			if (type != IO_OVERWRITE) {
+				xfs_map_at_offset(inode, bh, &ctx->imap,
+						  offset);
+			}
+			xfs_add_to_ioend(inode, bh, offset, type, &ctx->ioend,
 					 new_ioend);
 			count++;
 		}
 
-		if (!iohead)
-			iohead = ioend;
+		if (!ctx->iohead)
+			ctx->iohead = ctx->ioend;
 
 	} while (offset += len, ((bh = bh->b_this_page) != head));
 
@@ -1035,38 +888,9 @@ xfs_vm_writepage(
 		SetPageUptodate(page);
 
 	xfs_start_page_writeback(page, 1, count);
-
-	if (ioend && imap_valid) {
-		xfs_off_t		end_index;
-
-		end_index = imap.br_startoff + imap.br_blockcount;
-
-		/* to bytes */
-		end_index <<= inode->i_blkbits;
-
-		/* to pages */
-		end_index = (end_index - 1) >> PAGE_CACHE_SHIFT;
-
-		/* check against file size */
-		if (end_index > last_index)
-			end_index = last_index;
-
-		xfs_cluster_write(inode, page->index + 1, &imap, &ioend,
-				  wbc, end_index);
-	}
-
-	if (iohead)
-		xfs_submit_ioend(wbc, iohead);
-
 	return 0;
 
 error:
-	if (iohead)
-		xfs_cancel_ioend(iohead);
-
-	if (err == -EAGAIN)
-		goto redirty;
-
 	xfs_aops_discard_page(page);
 	ClearPageUptodate(page);
 	unlock_page(page);
@@ -1079,12 +903,62 @@ redirty:
 }
 
 STATIC int
+xfs_vm_writepage(
+	struct page		*page,
+	struct writeback_control *wbc)
+{
+	struct xfs_writeback_ctx ctx = { };
+	int ret;
+
+	/*
+	 * Refuse to write the page out if we are called from reclaim context.
+	 *
+	 * This avoids stack overflows when called from deeply used stacks in
+	 * random callers for direct reclaim or memcg reclaim.  We explicitly
+	 * allow reclaim from kswapd as the stack usage there is relatively low.
+	 *
+	 * This should really be done by the core VM, but until that happens
+	 * filesystems like XFS, btrfs and ext4 have to take care of this
+	 * by themselves.
+	 */
+	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC) {
+		redirty_page_for_writepage(wbc, page);
+		unlock_page(page);
+		return 0;
+	}
+
+	ret = __xfs_vm_writepage(page, wbc, &ctx);
+
+	if (ctx.iohead) {
+		if (ret)
+			xfs_cancel_ioend(ctx.iohead);
+		else
+			xfs_submit_ioend(wbc, ctx.iohead);
+	}
+
+	return ret;
+}
+
+STATIC int
 xfs_vm_writepages(
 	struct address_space	*mapping,
 	struct writeback_control *wbc)
 {
+	struct xfs_writeback_ctx ctx = { };
+	int ret;
+
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
-	return generic_writepages(mapping, wbc);
+
+	ret = write_cache_pages(mapping, wbc, __xfs_vm_writepage, &ctx);
+
+	if (ctx.iohead) {
+		if (ret)
+			xfs_cancel_ioend(ctx.iohead);
+		else
+			xfs_submit_ioend(wbc, ctx.iohead);
+	}
+
+	return ret;
 }
 
 /*


* [PATCH 04/27] xfs: cleanup xfs_add_to_ioend
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 03/27] xfs: use write_cache_pages for writeback clustering Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 22:13   ` Alex Elder
  2011-06-30  2:00   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 05/27] xfs: work around bogus gcc warning in xfs_allocbt_init_cursor Christoph Hellwig
                   ` (22 subsequent siblings)
  26 siblings, 2 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-cleanup-xfs_add_to_ioend --]
[-- Type: text/plain, Size: 2651 bytes --]

Pass the writeback context to xfs_add_to_ioend to make the ioend
chain manipulations self-contained in this function.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-04-28 11:22:42.747447011 +0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c	2011-04-28 11:22:46.124095385 +0200
@@ -560,39 +560,39 @@ xfs_cancel_ioend(
 }
 
 /*
- * Test to see if we've been building up a completion structure for
- * earlier buffers -- if so, we try to append to this ioend if we
- * can, otherwise we finish off any current ioend and start another.
- * Return true if we've finished the given ioend.
+ * Test to see if we've been building up a completion structure for earlier
+ * buffers -- if so, we try to append to this ioend if we can, otherwise we
+ * finish off any current ioend and start another.
  */
 STATIC void
 xfs_add_to_ioend(
+	struct xfs_writeback_ctx *ctx,
 	struct inode		*inode,
 	struct buffer_head	*bh,
 	xfs_off_t		offset,
 	unsigned int		type,
-	xfs_ioend_t		**result,
 	int			need_ioend)
 {
-	xfs_ioend_t		*ioend = *result;
+	if (!ctx->ioend || need_ioend || type != ctx->ioend->io_type) {
+		struct xfs_ioend	*new;
 
-	if (!ioend || need_ioend || type != ioend->io_type) {
-		xfs_ioend_t	*previous = *result;
-
-		ioend = xfs_alloc_ioend(inode, type);
-		ioend->io_offset = offset;
-		ioend->io_buffer_head = bh;
-		ioend->io_buffer_tail = bh;
-		if (previous)
-			previous->io_list = ioend;
-		*result = ioend;
+		new = xfs_alloc_ioend(inode, type);
+		new->io_offset = offset;
+		new->io_buffer_head = bh;
+		new->io_buffer_tail = bh;
+
+		if (ctx->ioend)
+			ctx->ioend->io_list = new;
+		ctx->ioend = new;
+		if (!ctx->iohead)
+			ctx->iohead = new;
 	} else {
-		ioend->io_buffer_tail->b_private = bh;
-		ioend->io_buffer_tail = bh;
+		ctx->ioend->io_buffer_tail->b_private = bh;
+		ctx->ioend->io_buffer_tail = bh;
 	}
 
 	bh->b_private = NULL;
-	ioend->io_size += bh->b_size;
+	ctx->ioend->io_size += bh->b_size;
 }
 
 STATIC void
@@ -874,14 +874,9 @@ __xfs_vm_writepage(
 				xfs_map_at_offset(inode, bh, &ctx->imap,
 						  offset);
 			}
-			xfs_add_to_ioend(inode, bh, offset, type, &ctx->ioend,
-					 new_ioend);
+			xfs_add_to_ioend(ctx, inode, bh, offset, type, new_ioend);
 			count++;
 		}
-
-		if (!ctx->iohead)
-			ctx->iohead = ctx->ioend;
-
 	} while (offset += len, ((bh = bh->b_this_page) != head));
 
 	if (uptodate && bh == head)


* [PATCH 05/27] xfs: work around bogus gcc warning in xfs_allocbt_init_cursor
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 04/27] xfs: cleanup xfs_add_to_ioend Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 22:13   ` Alex Elder
  2011-06-29 14:01 ` [PATCH 06/27] xfs: split xfs_setattr Christoph Hellwig
                   ` (21 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-fix-xfs_allocbt_init_cursor-warning --]
[-- Type: text/plain, Size: 1401 bytes --]

GCC 4.6 warns that an array subscript is above array bounds when the btree
index is used to index into the agf_levels array.  The only two indices
passed in are 0 and 1, and we have an assert ensuring that.

Replace the trick of using the array index directly with constant indices
in the already existing branch that assigns the XFS_BTREE_LASTREC_UPDATE
flag.
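
The same trick in isolation (a sketch, not the XFS code; whether gcc warns
also depends on the compiler version): when the compiler cannot prove that
a variable subscript stays within the array bounds it may emit the warning,
while constant subscripts in the branches that already distinguish the two
cases keep it quiet:

	enum { DEMO_BNO = 0, DEMO_CNT = 1 };

	struct demo_agf {
		unsigned int	levels[2];
	};

	/* may trigger "array subscript is above array bounds" */
	static unsigned int demo_levels_old(struct demo_agf *agf, int btnum)
	{
		return agf->levels[btnum];
	}

	/* constant subscripts per branch avoid the bogus warning */
	static unsigned int demo_levels_new(struct demo_agf *agf, int btnum)
	{
		if (btnum == DEMO_CNT)
			return agf->levels[DEMO_CNT];
		return agf->levels[DEMO_BNO];
	}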

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

Index: xfs/fs/xfs/xfs_alloc_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc_btree.c	2011-06-17 14:16:27.929065669 +0200
+++ xfs/fs/xfs/xfs_alloc_btree.c	2011-06-17 14:17:22.145729599 +0200
@@ -427,13 +427,16 @@ xfs_allocbt_init_cursor(
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
-	cur->bc_nlevels = be32_to_cpu(agf->agf_levels[btnum]);
 	cur->bc_btnum = btnum;
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
-
 	cur->bc_ops = &xfs_allocbt_ops;
-	if (btnum == XFS_BTNUM_CNT)
+
+	if (btnum == XFS_BTNUM_CNT) {
+		cur->bc_nlevels = be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]);
 		cur->bc_flags = XFS_BTREE_LASTREC_UPDATE;
+	} else {
+		cur->bc_nlevels = be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]);
+	}
 
 	cur->bc_private.a.agbp = agbp;
 	cur->bc_private.a.agno = agno;


* [PATCH 06/27] xfs: split xfs_setattr
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 05/27] xfs: work around bogus gcc warning in xfs_allocbt_init_cursor Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 22:13   ` Alex Elder
  2011-06-30  2:11   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 08/27] xfs: kill xfs_itruncate_start Christoph Hellwig
                   ` (20 subsequent siblings)
  26 siblings, 2 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-split-setattr --]
[-- Type: text/plain, Size: 27977 bytes --]

Split up xfs_setattr into two functions, one for the complex truncate
handling, and one for the trivial attribute updates.  Also move both
new routines to xfs_iops.c as they are fairly Linux-specific.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_iops.c	2011-06-29 11:29:02.684972774 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_iops.c	2011-06-29 11:29:07.154948558 +0200
@@ -39,6 +39,7 @@
 #include "xfs_buf_item.h"
 #include "xfs_utils.h"
 #include "xfs_vnodeops.h"
+#include "xfs_inode_item.h"
 #include "xfs_trace.h"
 
 #include <linux/capability.h>
@@ -497,12 +498,449 @@ xfs_vn_getattr(
 	return 0;
 }
 
+int
+xfs_setattr_nonsize(
+	struct xfs_inode	*ip,
+	struct iattr		*iattr,
+	int			flags)
+{
+	xfs_mount_t		*mp = ip->i_mount;
+	struct inode		*inode = VFS_I(ip);
+	int			mask = iattr->ia_valid;
+	xfs_trans_t		*tp;
+	int			error;
+	uid_t			uid = 0, iuid = 0;
+	gid_t			gid = 0, igid = 0;
+	struct xfs_dquot	*udqp = NULL, *gdqp = NULL;
+	struct xfs_dquot	*olddquot1 = NULL, *olddquot2 = NULL;
+
+	trace_xfs_setattr(ip);
+
+	if (mp->m_flags & XFS_MOUNT_RDONLY)
+		return XFS_ERROR(EROFS);
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return XFS_ERROR(EIO);
+
+	error = -inode_change_ok(inode, iattr);
+	if (error)
+		return XFS_ERROR(error);
+
+	ASSERT((mask & ATTR_SIZE) == 0);
+
+	/*
+	 * If disk quotas is on, we make sure that the dquots do exist on disk,
+	 * before we start any other transactions. Trying to do this later
+	 * is messy. We don't care to take a readlock to look at the ids
+	 * in inode here, because we can't hold it across the trans_reserve.
+	 * If the IDs do change before we take the ilock, we're covered
+	 * because the i_*dquot fields will get updated anyway.
+	 */
+	if (XFS_IS_QUOTA_ON(mp) && (mask & (ATTR_UID|ATTR_GID))) {
+		uint	qflags = 0;
+
+		if ((mask & ATTR_UID) && XFS_IS_UQUOTA_ON(mp)) {
+			uid = iattr->ia_uid;
+			qflags |= XFS_QMOPT_UQUOTA;
+		} else {
+			uid = ip->i_d.di_uid;
+		}
+		if ((mask & ATTR_GID) && XFS_IS_GQUOTA_ON(mp)) {
+			gid = iattr->ia_gid;
+			qflags |= XFS_QMOPT_GQUOTA;
+		}  else {
+			gid = ip->i_d.di_gid;
+		}
+
+		/*
+		 * We take a reference when we initialize udqp and gdqp,
+		 * so it is important that we never blindly double trip on
+		 * the same variable. See xfs_create() for an example.
+		 */
+		ASSERT(udqp == NULL);
+		ASSERT(gdqp == NULL);
+		error = xfs_qm_vop_dqalloc(ip, uid, gid, xfs_get_projid(ip),
+					 qflags, &udqp, &gdqp);
+		if (error)
+			return error;
+	}
+
+	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_NOT_SIZE);
+	error = xfs_trans_reserve(tp, 0, XFS_ICHANGE_LOG_RES(mp), 0, 0, 0);
+	if (error)
+		goto out_dqrele;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+	/*
+	 * Change file ownership.  Must be the owner or privileged.
+	 */
+	if (mask & (ATTR_UID|ATTR_GID)) {
+		/*
+		 * These IDs could have changed since we last looked at them.
+		 * But, we're assured that if the ownership did change
+		 * while we didn't have the inode locked, inode's dquot(s)
+		 * would have changed also.
+		 */
+		iuid = ip->i_d.di_uid;
+		igid = ip->i_d.di_gid;
+		gid = (mask & ATTR_GID) ? iattr->ia_gid : igid;
+		uid = (mask & ATTR_UID) ? iattr->ia_uid : iuid;
+
+		/*
+		 * Do a quota reservation only if uid/gid is actually
+		 * going to change.
+		 */
+		if (XFS_IS_QUOTA_RUNNING(mp) &&
+		    ((XFS_IS_UQUOTA_ON(mp) && iuid != uid) ||
+		     (XFS_IS_GQUOTA_ON(mp) && igid != gid))) {
+			ASSERT(tp);
+			error = xfs_qm_vop_chown_reserve(tp, ip, udqp, gdqp,
+						capable(CAP_FOWNER) ?
+						XFS_QMOPT_FORCE_RES : 0);
+			if (error)	/* out of quota */
+				goto out_trans_cancel;
+		}
+	}
+
+	xfs_trans_ijoin(tp, ip);
+
+	/*
+	 * Change file ownership.  Must be the owner or privileged.
+	 */
+	if (mask & (ATTR_UID|ATTR_GID)) {
+		/*
+		 * CAP_FSETID overrides the following restrictions:
+		 *
+		 * The set-user-ID and set-group-ID bits of a file will be
+		 * cleared upon successful return from chown()
+		 */
+		if ((ip->i_d.di_mode & (S_ISUID|S_ISGID)) &&
+		    !capable(CAP_FSETID))
+			ip->i_d.di_mode &= ~(S_ISUID|S_ISGID);
+
+		/*
+		 * Change the ownerships and register quota modifications
+		 * in the transaction.
+		 */
+		if (iuid != uid) {
+			if (XFS_IS_QUOTA_RUNNING(mp) && XFS_IS_UQUOTA_ON(mp)) {
+				ASSERT(mask & ATTR_UID);
+				ASSERT(udqp);
+				olddquot1 = xfs_qm_vop_chown(tp, ip,
+							&ip->i_udquot, udqp);
+			}
+			ip->i_d.di_uid = uid;
+			inode->i_uid = uid;
+		}
+		if (igid != gid) {
+			if (XFS_IS_QUOTA_RUNNING(mp) && XFS_IS_GQUOTA_ON(mp)) {
+				ASSERT(!XFS_IS_PQUOTA_ON(mp));
+				ASSERT(mask & ATTR_GID);
+				ASSERT(gdqp);
+				olddquot2 = xfs_qm_vop_chown(tp, ip,
+							&ip->i_gdquot, gdqp);
+			}
+			ip->i_d.di_gid = gid;
+			inode->i_gid = gid;
+		}
+	}
+
+	/*
+	 * Change file access modes.
+	 */
+	if (mask & ATTR_MODE) {
+		umode_t mode = iattr->ia_mode;
+
+		if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+			mode &= ~S_ISGID;
+
+		ip->i_d.di_mode &= S_IFMT;
+		ip->i_d.di_mode |= mode & ~S_IFMT;
+
+		inode->i_mode &= S_IFMT;
+		inode->i_mode |= mode & ~S_IFMT;
+	}
+
+	/*
+	 * Change file access or modified times.
+	 */
+	if (mask & ATTR_ATIME) {
+		inode->i_atime = iattr->ia_atime;
+		ip->i_d.di_atime.t_sec = iattr->ia_atime.tv_sec;
+		ip->i_d.di_atime.t_nsec = iattr->ia_atime.tv_nsec;
+		ip->i_update_core = 1;
+	}
+	if (mask & ATTR_CTIME) {
+		inode->i_ctime = iattr->ia_ctime;
+		ip->i_d.di_ctime.t_sec = iattr->ia_ctime.tv_sec;
+		ip->i_d.di_ctime.t_nsec = iattr->ia_ctime.tv_nsec;
+		ip->i_update_core = 1;
+	}
+	if (mask & ATTR_MTIME) {
+		inode->i_mtime = iattr->ia_mtime;
+		ip->i_d.di_mtime.t_sec = iattr->ia_mtime.tv_sec;
+		ip->i_d.di_mtime.t_nsec = iattr->ia_mtime.tv_nsec;
+		ip->i_update_core = 1;
+	}
+
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+	XFS_STATS_INC(xs_ig_attrchg);
+
+	if (mp->m_flags & XFS_MOUNT_WSYNC)
+		xfs_trans_set_sync(tp);
+	error = xfs_trans_commit(tp, 0);
+
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	/*
+	 * Release any dquot(s) the inode had kept before chown.
+	 */
+	xfs_qm_dqrele(olddquot1);
+	xfs_qm_dqrele(olddquot2);
+	xfs_qm_dqrele(udqp);
+	xfs_qm_dqrele(gdqp);
+
+	if (error)
+		return XFS_ERROR(error);
+
+	/*
+	 * XXX(hch): Updating the ACL entries is not atomic vs the i_mode
+	 * 	     update.  We could avoid this with linked transactions
+	 * 	     and passing down the transaction pointer all the way
+	 *	     to attr_set.  No previous user of the generic
+	 * 	     Posix ACL code seems to care about this issue either.
+	 */
+	if ((mask & ATTR_MODE) && !(flags & XFS_ATTR_NOACL)) {
+		error = -xfs_acl_chmod(inode);
+		if (error)
+			return XFS_ERROR(error);
+	}
+
+	return 0;
+
+out_trans_cancel:
+	xfs_trans_cancel(tp, 0);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out_dqrele:
+	xfs_qm_dqrele(udqp);
+	xfs_qm_dqrele(gdqp);
+	return error;
+}
+
+/*
+ * Truncate file.  Must have write permission and not be a directory.
+ */
+int
+xfs_setattr_size(
+	struct xfs_inode	*ip,
+	struct iattr		*iattr,
+	int			flags)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct inode		*inode = VFS_I(ip);
+	int			mask = iattr->ia_valid;
+	struct xfs_trans	*tp;
+	int			error;
+	uint			lock_flags;
+	uint			commit_flags = 0;
+
+	trace_xfs_setattr(ip);
+
+	if (mp->m_flags & XFS_MOUNT_RDONLY)
+		return XFS_ERROR(EROFS);
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return XFS_ERROR(EIO);
+
+	error = -inode_change_ok(inode, iattr);
+	if (error)
+		return XFS_ERROR(error);
+
+	ASSERT(S_ISREG(ip->i_d.di_mode));
+	ASSERT((mask & (ATTR_MODE|ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET|
+			ATTR_MTIME_SET|ATTR_KILL_SUID|ATTR_KILL_SGID|
+			ATTR_KILL_PRIV|ATTR_TIMES_SET)) == 0);
+
+	lock_flags = XFS_ILOCK_EXCL;
+	if (!(flags & XFS_ATTR_NOLOCK))
+		lock_flags |= XFS_IOLOCK_EXCL;
+	xfs_ilock(ip, lock_flags);
+
+	/*
+	 * Short circuit the truncate case for zero length files.
+	 */
+	if (iattr->ia_size == 0 &&
+	    ip->i_size == 0 && ip->i_d.di_nextents == 0) {
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		lock_flags &= ~XFS_ILOCK_EXCL;
+		if (mask & ATTR_CTIME) {
+			inode->i_mtime = inode->i_ctime =
+					current_fs_time(inode->i_sb);
+			xfs_mark_inode_dirty_sync(ip);
+		}
+		goto out_unlock;
+	}
+
+	/*
+	 * Make sure that the dquots are attached to the inode.
+	 */
+	error = xfs_qm_dqattach_locked(ip, 0);
+	if (error)
+		goto out_unlock;
+
+	/*
+	 * Now we can make the changes.  Before we join the inode to the
+	 * transaction, take care of the part of the truncation that must be
+	 * done without the inode lock.  This needs to be done before joining
+	 * the inode to the transaction, because the inode cannot be unlocked
+	 * once it is a part of the transaction.
+	 */
+	if (iattr->ia_size > ip->i_size) {
+		/*
+		 * Do the first part of growing a file: zero any data in the
+		 * last block that is beyond the old EOF.  We need to do this
+		 * before the inode is joined to the transaction to modify
+		 * i_size.
+		 */
+		error = xfs_zero_eof(ip, iattr->ia_size, ip->i_size);
+		if (error)
+			goto out_unlock;
+	}
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	lock_flags &= ~XFS_ILOCK_EXCL;
+
+	/*
+	 * We are going to log the inode size change in this transaction so
+	 * any previous writes that are beyond the on disk EOF and the new
+	 * EOF that have not been written out need to be written here.  If we
+	 * do not write the data out, we expose ourselves to the null files
+	 * problem.
+	 *
+	 * Only flush from the on disk size to the smaller of the in memory
+	 * file size or the new size as that's the range we really care about
+	 * here and prevents waiting for other data not within the range we
+	 * care about here.
+	 */
+	if (ip->i_size != ip->i_d.di_size && iattr->ia_size > ip->i_d.di_size) {
+		error = xfs_flush_pages(ip, ip->i_d.di_size, iattr->ia_size,
+					XBF_ASYNC, FI_NONE);
+		if (error)
+			goto out_unlock;
+	}
+
+	/*
+	 * Wait for all I/O to complete.
+	 */
+	xfs_ioend_wait(ip);
+
+	error = -block_truncate_page(inode->i_mapping, iattr->ia_size,
+				     xfs_get_blocks);
+	if (error)
+		goto out_unlock;
+
+	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
+	error = xfs_trans_reserve(tp, 0, XFS_ITRUNCATE_LOG_RES(mp), 0,
+				 XFS_TRANS_PERM_LOG_RES,
+				 XFS_ITRUNCATE_LOG_COUNT);
+	if (error)
+		goto out_trans_cancel;
+
+	truncate_setsize(inode, iattr->ia_size);
+
+	commit_flags = XFS_TRANS_RELEASE_LOG_RES;
+	lock_flags |= XFS_ILOCK_EXCL;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+	xfs_trans_ijoin(tp, ip);
+
+	/*
+	 * Only change the c/mtime if we are changing the size or we are
+	 * explicitly asked to change it.  This handles the semantic difference
+	 * between truncate() and ftruncate() as implemented in the VFS.
+	 *
+	 * The regular truncate() case without ATTR_CTIME and ATTR_MTIME is a
+	 * special case where we need to update the times despite not having
+	 * these flags set.  For all other operations the VFS set these flags
+	 * explicitly if it wants a timestamp update.
+	 */
+	if (iattr->ia_size != ip->i_size &&
+	    (!(mask & (ATTR_CTIME | ATTR_MTIME)))) {
+		iattr->ia_ctime = iattr->ia_mtime =
+			current_fs_time(inode->i_sb);
+		mask |= ATTR_CTIME | ATTR_MTIME;
+	}
+
+	if (iattr->ia_size > ip->i_size) {
+		ip->i_d.di_size = iattr->ia_size;
+		ip->i_size = iattr->ia_size;
+	} else if (iattr->ia_size <= ip->i_size ||
+		   (iattr->ia_size == 0 && ip->i_d.di_nextents)) {
+		/*
+		 * Signal a sync transaction unless we are truncating an
+		 * already unlinked file on a wsync filesystem.
+		 */
+		error = xfs_itruncate_finish(&tp, ip, iattr->ia_size,
+				    XFS_DATA_FORK,
+				    ((ip->i_d.di_nlink != 0 ||
+				      !(mp->m_flags & XFS_MOUNT_WSYNC))
+				     ? 1 : 0));
+		if (error)
+			goto out_trans_abort;
+
+		/*
+		 * Truncated "down", so we're removing references to old data
+		 * here - if we delay flushing for a long time, we expose
+		 * ourselves unduly to the notorious NULL files problem.  So,
+		 * we mark this inode and flush it when the file is closed,
+		 * and do not wait the usual (long) time for writeout.
+		 */
+		xfs_iflags_set(ip, XFS_ITRUNCATED);
+	}
+
+	if (mask & ATTR_CTIME) {
+		inode->i_ctime = iattr->ia_ctime;
+		ip->i_d.di_ctime.t_sec = iattr->ia_ctime.tv_sec;
+		ip->i_d.di_ctime.t_nsec = iattr->ia_ctime.tv_nsec;
+		ip->i_update_core = 1;
+	}
+	if (mask & ATTR_MTIME) {
+		inode->i_mtime = iattr->ia_mtime;
+		ip->i_d.di_mtime.t_sec = iattr->ia_mtime.tv_sec;
+		ip->i_d.di_mtime.t_nsec = iattr->ia_mtime.tv_nsec;
+		ip->i_update_core = 1;
+	}
+
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+	XFS_STATS_INC(xs_ig_attrchg);
+
+	if (mp->m_flags & XFS_MOUNT_WSYNC)
+		xfs_trans_set_sync(tp);
+
+	error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES);
+out_unlock:
+	if (lock_flags)
+		xfs_iunlock(ip, lock_flags);
+	return error;
+
+out_trans_abort:
+	commit_flags |= XFS_TRANS_ABORT;
+out_trans_cancel:
+	xfs_trans_cancel(tp, commit_flags);
+	goto out_unlock;
+}
+
 STATIC int
 xfs_vn_setattr(
 	struct dentry	*dentry,
 	struct iattr	*iattr)
 {
-	return -xfs_setattr(XFS_I(dentry->d_inode), iattr, 0);
+	if (iattr->ia_valid & ATTR_SIZE)
+		return -xfs_setattr_size(XFS_I(dentry->d_inode), iattr, 0);
+	return -xfs_setattr_nonsize(XFS_I(dentry->d_inode), iattr, 0);
 }
 
 #define XFS_FIEMAP_FLAGS	(FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR)
Index: xfs/fs/xfs/linux-2.6/xfs_acl.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_acl.c	2011-06-29 11:29:02.698306035 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_acl.c	2011-06-29 11:29:07.154948558 +0200
@@ -264,7 +264,7 @@ xfs_set_mode(struct inode *inode, mode_t
 		iattr.ia_mode = mode;
 		iattr.ia_ctime = current_fs_time(inode->i_sb);
 
-		error = -xfs_setattr(XFS_I(inode), &iattr, XFS_ATTR_NOACL);
+		error = -xfs_setattr_nonsize(XFS_I(inode), &iattr, XFS_ATTR_NOACL);
 	}
 
 	return error;
Index: xfs/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_file.c	2011-06-29 11:29:02.711639297 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_file.c	2011-06-29 11:29:07.158281874 +0200
@@ -944,7 +944,7 @@ xfs_file_fallocate(
 
 		iattr.ia_valid = ATTR_SIZE;
 		iattr.ia_size = new_size;
-		error = -xfs_setattr(ip, &iattr, XFS_ATTR_NOLOCK);
+		error = -xfs_setattr_size(ip, &iattr, XFS_ATTR_NOLOCK);
 	}
 
 out_unlock:
Index: xfs/fs/xfs/xfs_vnodeops.c
===================================================================
--- xfs.orig/fs/xfs/xfs_vnodeops.c	2011-06-29 11:29:02.721639242 +0200
+++ xfs/fs/xfs/xfs_vnodeops.c	2011-06-29 11:29:07.158281874 +0200
@@ -50,430 +50,6 @@
 #include "xfs_vnodeops.h"
 #include "xfs_trace.h"
 
-int
-xfs_setattr(
-	struct xfs_inode	*ip,
-	struct iattr		*iattr,
-	int			flags)
-{
-	xfs_mount_t		*mp = ip->i_mount;
-	struct inode		*inode = VFS_I(ip);
-	int			mask = iattr->ia_valid;
-	xfs_trans_t		*tp;
-	int			code;
-	uint			lock_flags;
-	uint			commit_flags=0;
-	uid_t			uid=0, iuid=0;
-	gid_t			gid=0, igid=0;
-	struct xfs_dquot	*udqp, *gdqp, *olddquot1, *olddquot2;
-	int			need_iolock = 1;
-
-	trace_xfs_setattr(ip);
-
-	if (mp->m_flags & XFS_MOUNT_RDONLY)
-		return XFS_ERROR(EROFS);
-
-	if (XFS_FORCED_SHUTDOWN(mp))
-		return XFS_ERROR(EIO);
-
-	code = -inode_change_ok(inode, iattr);
-	if (code)
-		return code;
-
-	olddquot1 = olddquot2 = NULL;
-	udqp = gdqp = NULL;
-
-	/*
-	 * If disk quotas is on, we make sure that the dquots do exist on disk,
-	 * before we start any other transactions. Trying to do this later
-	 * is messy. We don't care to take a readlock to look at the ids
-	 * in inode here, because we can't hold it across the trans_reserve.
-	 * If the IDs do change before we take the ilock, we're covered
-	 * because the i_*dquot fields will get updated anyway.
-	 */
-	if (XFS_IS_QUOTA_ON(mp) && (mask & (ATTR_UID|ATTR_GID))) {
-		uint	qflags = 0;
-
-		if ((mask & ATTR_UID) && XFS_IS_UQUOTA_ON(mp)) {
-			uid = iattr->ia_uid;
-			qflags |= XFS_QMOPT_UQUOTA;
-		} else {
-			uid = ip->i_d.di_uid;
-		}
-		if ((mask & ATTR_GID) && XFS_IS_GQUOTA_ON(mp)) {
-			gid = iattr->ia_gid;
-			qflags |= XFS_QMOPT_GQUOTA;
-		}  else {
-			gid = ip->i_d.di_gid;
-		}
-
-		/*
-		 * We take a reference when we initialize udqp and gdqp,
-		 * so it is important that we never blindly double trip on
-		 * the same variable. See xfs_create() for an example.
-		 */
-		ASSERT(udqp == NULL);
-		ASSERT(gdqp == NULL);
-		code = xfs_qm_vop_dqalloc(ip, uid, gid, xfs_get_projid(ip),
-					 qflags, &udqp, &gdqp);
-		if (code)
-			return code;
-	}
-
-	/*
-	 * For the other attributes, we acquire the inode lock and
-	 * first do an error checking pass.
-	 */
-	tp = NULL;
-	lock_flags = XFS_ILOCK_EXCL;
-	if (flags & XFS_ATTR_NOLOCK)
-		need_iolock = 0;
-	if (!(mask & ATTR_SIZE)) {
-		tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_NOT_SIZE);
-		commit_flags = 0;
-		code = xfs_trans_reserve(tp, 0, XFS_ICHANGE_LOG_RES(mp),
-					 0, 0, 0);
-		if (code) {
-			lock_flags = 0;
-			goto error_return;
-		}
-	} else {
-		if (need_iolock)
-			lock_flags |= XFS_IOLOCK_EXCL;
-	}
-
-	xfs_ilock(ip, lock_flags);
-
-	/*
-	 * Change file ownership.  Must be the owner or privileged.
-	 */
-	if (mask & (ATTR_UID|ATTR_GID)) {
-		/*
-		 * These IDs could have changed since we last looked at them.
-		 * But, we're assured that if the ownership did change
-		 * while we didn't have the inode locked, inode's dquot(s)
-		 * would have changed also.
-		 */
-		iuid = ip->i_d.di_uid;
-		igid = ip->i_d.di_gid;
-		gid = (mask & ATTR_GID) ? iattr->ia_gid : igid;
-		uid = (mask & ATTR_UID) ? iattr->ia_uid : iuid;
-
-		/*
-		 * Do a quota reservation only if uid/gid is actually
-		 * going to change.
-		 */
-		if (XFS_IS_QUOTA_RUNNING(mp) &&
-		    ((XFS_IS_UQUOTA_ON(mp) && iuid != uid) ||
-		     (XFS_IS_GQUOTA_ON(mp) && igid != gid))) {
-			ASSERT(tp);
-			code = xfs_qm_vop_chown_reserve(tp, ip, udqp, gdqp,
-						capable(CAP_FOWNER) ?
-						XFS_QMOPT_FORCE_RES : 0);
-			if (code)	/* out of quota */
-				goto error_return;
-		}
-	}
-
-	/*
-	 * Truncate file.  Must have write permission and not be a directory.
-	 */
-	if (mask & ATTR_SIZE) {
-		/* Short circuit the truncate case for zero length files */
-		if (iattr->ia_size == 0 &&
-		    ip->i_size == 0 && ip->i_d.di_nextents == 0) {
-			xfs_iunlock(ip, XFS_ILOCK_EXCL);
-			lock_flags &= ~XFS_ILOCK_EXCL;
-			if (mask & ATTR_CTIME) {
-				inode->i_mtime = inode->i_ctime =
-						current_fs_time(inode->i_sb);
-				xfs_mark_inode_dirty_sync(ip);
-			}
-			code = 0;
-			goto error_return;
-		}
-
-		if (S_ISDIR(ip->i_d.di_mode)) {
-			code = XFS_ERROR(EISDIR);
-			goto error_return;
-		} else if (!S_ISREG(ip->i_d.di_mode)) {
-			code = XFS_ERROR(EINVAL);
-			goto error_return;
-		}
-
-		/*
-		 * Make sure that the dquots are attached to the inode.
-		 */
-		code = xfs_qm_dqattach_locked(ip, 0);
-		if (code)
-			goto error_return;
-
-		/*
-		 * Now we can make the changes.  Before we join the inode
-		 * to the transaction, if ATTR_SIZE is set then take care of
-		 * the part of the truncation that must be done without the
-		 * inode lock.  This needs to be done before joining the inode
-		 * to the transaction, because the inode cannot be unlocked
-		 * once it is a part of the transaction.
-		 */
-		if (iattr->ia_size > ip->i_size) {
-			/*
-			 * Do the first part of growing a file: zero any data
-			 * in the last block that is beyond the old EOF.  We
-			 * need to do this before the inode is joined to the
-			 * transaction to modify the i_size.
-			 */
-			code = xfs_zero_eof(ip, iattr->ia_size, ip->i_size);
-			if (code)
-				goto error_return;
-		}
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		lock_flags &= ~XFS_ILOCK_EXCL;
-
-		/*
-		 * We are going to log the inode size change in this
-		 * transaction so any previous writes that are beyond the on
-		 * disk EOF and the new EOF that have not been written out need
-		 * to be written here. If we do not write the data out, we
-		 * expose ourselves to the null files problem.
-		 *
-		 * Only flush from the on disk size to the smaller of the in
-		 * memory file size or the new size as that's the range we
-		 * really care about here and prevents waiting for other data
-		 * not within the range we care about here.
-		 */
-		if (ip->i_size != ip->i_d.di_size &&
-		    iattr->ia_size > ip->i_d.di_size) {
-			code = xfs_flush_pages(ip,
-					ip->i_d.di_size, iattr->ia_size,
-					XBF_ASYNC, FI_NONE);
-			if (code)
-				goto error_return;
-		}
-
-		/* wait for all I/O to complete */
-		xfs_ioend_wait(ip);
-
-		code = -block_truncate_page(inode->i_mapping, iattr->ia_size,
-					    xfs_get_blocks);
-		if (code)
-			goto error_return;
-
-		tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
-		code = xfs_trans_reserve(tp, 0, XFS_ITRUNCATE_LOG_RES(mp), 0,
-					 XFS_TRANS_PERM_LOG_RES,
-					 XFS_ITRUNCATE_LOG_COUNT);
-		if (code)
-			goto error_return;
-
-		truncate_setsize(inode, iattr->ia_size);
-
-		commit_flags = XFS_TRANS_RELEASE_LOG_RES;
-		lock_flags |= XFS_ILOCK_EXCL;
-
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-
-		xfs_trans_ijoin(tp, ip);
-
-		/*
-		 * Only change the c/mtime if we are changing the size
-		 * or we are explicitly asked to change it. This handles
-		 * the semantic difference between truncate() and ftruncate()
-		 * as implemented in the VFS.
-		 *
-		 * The regular truncate() case without ATTR_CTIME and ATTR_MTIME
-		 * is a special case where we need to update the times despite
-		 * not having these flags set.  For all other operations the
-		 * VFS set these flags explicitly if it wants a timestamp
-		 * update.
-		 */
-		if (iattr->ia_size != ip->i_size &&
-		    (!(mask & (ATTR_CTIME | ATTR_MTIME)))) {
-			iattr->ia_ctime = iattr->ia_mtime =
-				current_fs_time(inode->i_sb);
-			mask |= ATTR_CTIME | ATTR_MTIME;
-		}
-
-		if (iattr->ia_size > ip->i_size) {
-			ip->i_d.di_size = iattr->ia_size;
-			ip->i_size = iattr->ia_size;
-			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-		} else if (iattr->ia_size <= ip->i_size ||
-			   (iattr->ia_size == 0 && ip->i_d.di_nextents)) {
-			/*
-			 * signal a sync transaction unless
-			 * we're truncating an already unlinked
-			 * file on a wsync filesystem
-			 */
-			code = xfs_itruncate_finish(&tp, ip, iattr->ia_size,
-					    XFS_DATA_FORK,
-					    ((ip->i_d.di_nlink != 0 ||
-					      !(mp->m_flags & XFS_MOUNT_WSYNC))
-					     ? 1 : 0));
-			if (code)
-				goto abort_return;
-			/*
-			 * Truncated "down", so we're removing references
-			 * to old data here - if we now delay flushing for
-			 * a long time, we expose ourselves unduly to the
-			 * notorious NULL files problem.  So, we mark this
-			 * vnode and flush it when the file is closed, and
-			 * do not wait the usual (long) time for writeout.
-			 */
-			xfs_iflags_set(ip, XFS_ITRUNCATED);
-		}
-	} else if (tp) {
-		xfs_trans_ijoin(tp, ip);
-	}
-
-	/*
-	 * Change file ownership.  Must be the owner or privileged.
-	 */
-	if (mask & (ATTR_UID|ATTR_GID)) {
-		/*
-		 * CAP_FSETID overrides the following restrictions:
-		 *
-		 * The set-user-ID and set-group-ID bits of a file will be
-		 * cleared upon successful return from chown()
-		 */
-		if ((ip->i_d.di_mode & (S_ISUID|S_ISGID)) &&
-		    !capable(CAP_FSETID)) {
-			ip->i_d.di_mode &= ~(S_ISUID|S_ISGID);
-		}
-
-		/*
-		 * Change the ownerships and register quota modifications
-		 * in the transaction.
-		 */
-		if (iuid != uid) {
-			if (XFS_IS_QUOTA_RUNNING(mp) && XFS_IS_UQUOTA_ON(mp)) {
-				ASSERT(mask & ATTR_UID);
-				ASSERT(udqp);
-				olddquot1 = xfs_qm_vop_chown(tp, ip,
-							&ip->i_udquot, udqp);
-			}
-			ip->i_d.di_uid = uid;
-			inode->i_uid = uid;
-		}
-		if (igid != gid) {
-			if (XFS_IS_QUOTA_RUNNING(mp) && XFS_IS_GQUOTA_ON(mp)) {
-				ASSERT(!XFS_IS_PQUOTA_ON(mp));
-				ASSERT(mask & ATTR_GID);
-				ASSERT(gdqp);
-				olddquot2 = xfs_qm_vop_chown(tp, ip,
-							&ip->i_gdquot, gdqp);
-			}
-			ip->i_d.di_gid = gid;
-			inode->i_gid = gid;
-		}
-	}
-
-	/*
-	 * Change file access modes.
-	 */
-	if (mask & ATTR_MODE) {
-		umode_t mode = iattr->ia_mode;
-
-		if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
-			mode &= ~S_ISGID;
-
-		ip->i_d.di_mode &= S_IFMT;
-		ip->i_d.di_mode |= mode & ~S_IFMT;
-
-		inode->i_mode &= S_IFMT;
-		inode->i_mode |= mode & ~S_IFMT;
-	}
-
-	/*
-	 * Change file access or modified times.
-	 */
-	if (mask & ATTR_ATIME) {
-		inode->i_atime = iattr->ia_atime;
-		ip->i_d.di_atime.t_sec = iattr->ia_atime.tv_sec;
-		ip->i_d.di_atime.t_nsec = iattr->ia_atime.tv_nsec;
-		ip->i_update_core = 1;
-	}
-	if (mask & ATTR_CTIME) {
-		inode->i_ctime = iattr->ia_ctime;
-		ip->i_d.di_ctime.t_sec = iattr->ia_ctime.tv_sec;
-		ip->i_d.di_ctime.t_nsec = iattr->ia_ctime.tv_nsec;
-		ip->i_update_core = 1;
-	}
-	if (mask & ATTR_MTIME) {
-		inode->i_mtime = iattr->ia_mtime;
-		ip->i_d.di_mtime.t_sec = iattr->ia_mtime.tv_sec;
-		ip->i_d.di_mtime.t_nsec = iattr->ia_mtime.tv_nsec;
-		ip->i_update_core = 1;
-	}
-
-	/*
-	 * And finally, log the inode core if any attribute in it
-	 * has been changed.
-	 */
-	if (mask & (ATTR_UID|ATTR_GID|ATTR_MODE|
-		    ATTR_ATIME|ATTR_CTIME|ATTR_MTIME))
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-
-	XFS_STATS_INC(xs_ig_attrchg);
-
-	/*
-	 * If this is a synchronous mount, make sure that the
-	 * transaction goes to disk before returning to the user.
-	 * This is slightly sub-optimal in that truncates require
-	 * two sync transactions instead of one for wsync filesystems.
-	 * One for the truncate and one for the timestamps since we
-	 * don't want to change the timestamps unless we're sure the
-	 * truncate worked.  Truncates are less than 1% of the laddis
-	 * mix so this probably isn't worth the trouble to optimize.
-	 */
-	code = 0;
-	if (mp->m_flags & XFS_MOUNT_WSYNC)
-		xfs_trans_set_sync(tp);
-
-	code = xfs_trans_commit(tp, commit_flags);
-
-	xfs_iunlock(ip, lock_flags);
-
-	/*
-	 * Release any dquot(s) the inode had kept before chown.
-	 */
-	xfs_qm_dqrele(olddquot1);
-	xfs_qm_dqrele(olddquot2);
-	xfs_qm_dqrele(udqp);
-	xfs_qm_dqrele(gdqp);
-
-	if (code)
-		return code;
-
-	/*
-	 * XXX(hch): Updating the ACL entries is not atomic vs the i_mode
-	 * 	     update.  We could avoid this with linked transactions
-	 * 	     and passing down the transaction pointer all the way
-	 *	     to attr_set.  No previous user of the generic
-	 * 	     Posix ACL code seems to care about this issue either.
-	 */
-	if ((mask & ATTR_MODE) && !(flags & XFS_ATTR_NOACL)) {
-		code = -xfs_acl_chmod(inode);
-		if (code)
-			return XFS_ERROR(code);
-	}
-
-	return 0;
-
- abort_return:
-	commit_flags |= XFS_TRANS_ABORT;
- error_return:
-	xfs_qm_dqrele(udqp);
-	xfs_qm_dqrele(gdqp);
-	if (tp) {
-		xfs_trans_cancel(tp, commit_flags);
-	}
-	if (lock_flags != 0) {
-		xfs_iunlock(ip, lock_flags);
-	}
-	return code;
-}
-
 /*
  * The maximum pathlen is 1024 bytes. Since the minimum file system
  * blocksize is 512 bytes, we can get a max of 2 extents back from
@@ -2784,7 +2360,7 @@ xfs_change_file_space(
 		iattr.ia_valid = ATTR_SIZE;
 		iattr.ia_size = startoffset;
 
-		error = xfs_setattr(ip, &iattr, attr_flags);
+		error = xfs_setattr_size(ip, &iattr, attr_flags);
 
 		if (error)
 			return error;
Index: xfs/fs/xfs/xfs_vnodeops.h
===================================================================
--- xfs.orig/fs/xfs/xfs_vnodeops.h	2011-06-29 11:29:02.734972504 +0200
+++ xfs/fs/xfs/xfs_vnodeops.h	2011-06-29 11:29:07.161615190 +0200
@@ -13,7 +13,8 @@ struct xfs_inode;
 struct xfs_iomap;
 
 
-int xfs_setattr(struct xfs_inode *ip, struct iattr *vap, int flags);
+int xfs_setattr_nonsize(struct xfs_inode *ip, struct iattr *vap, int flags);
+int xfs_setattr_size(struct xfs_inode *ip, struct iattr *vap, int flags);
 #define	XFS_ATTR_DMI		0x01	/* invocation from a DMI function */
 #define	XFS_ATTR_NONBLOCK	0x02	/* return EAGAIN if operation would block */
 #define XFS_ATTR_NOLOCK		0x04	/* Don't grab any conflicting locks */


* [PATCH 08/27] xfs: kill xfs_itruncate_start
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 06/27] xfs: split xfs_setattr Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 22:13   ` Alex Elder
  2011-06-29 14:01 ` [PATCH 09/27] xfs: split xfs_itruncate_finish Christoph Hellwig
                   ` (19 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-kill-xfs_itruncate_start --]
[-- Type: text/plain, Size: 11644 bytes --]

xfs_itruncate_start is a rather lengthy wrapper that boils down to a call
to xfs_ioend_wait and xfs_tosspages, and it only has two callers.

Instead of using the complicated checks left over from IRIX to decide where
we need to truncate the pagecache, just call xfs_tosspages
(aka truncate_inode_pages) directly, as we want to get rid of all data
after i_size, and truncate_inode_pages handles incorrect alignments and
too-large offsets just fine.
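
In other words, the remaining callers can do something along these lines
(a sketch of the intent only; the actual replacement hunks are not part of
this excerpt, and the exact range and FI_ flag may differ):

	/* throw away all cached pages from new_size onwards;
	 * truncate_inode_pages copes with unaligned and overly
	 * large offsets by itself */
	xfs_tosspages(ip, new_size, -1, FI_REMAPF);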

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

Index: xfs/fs/xfs/xfs_inode.c
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.c	2011-06-29 11:29:02.494973804 +0200
+++ xfs/fs/xfs/xfs_inode.c	2011-06-29 11:29:11.888256249 +0200
@@ -1217,165 +1217,8 @@ xfs_isize_check(
 #endif	/* DEBUG */
 
 /*
- * Calculate the last possible buffered byte in a file.  This must
- * include data that was buffered beyond the EOF by the write code.
- * This also needs to deal with overflowing the xfs_fsize_t type
- * which can happen for sizes near the limit.
- *
- * We also need to take into account any blocks beyond the EOF.  It
- * may be the case that they were buffered by a write which failed.
- * In that case the pages will still be in memory, but the inode size
- * will never have been updated.
- */
-STATIC xfs_fsize_t
-xfs_file_last_byte(
-	xfs_inode_t	*ip)
-{
-	xfs_mount_t	*mp;
-	xfs_fsize_t	last_byte;
-	xfs_fileoff_t	last_block;
-	xfs_fileoff_t	size_last_block;
-	int		error;
-
-	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED));
-
-	mp = ip->i_mount;
-	/*
-	 * Only check for blocks beyond the EOF if the extents have
-	 * been read in.  This eliminates the need for the inode lock,
-	 * and it also saves us from looking when it really isn't
-	 * necessary.
-	 */
-	if (ip->i_df.if_flags & XFS_IFEXTENTS) {
-		xfs_ilock(ip, XFS_ILOCK_SHARED);
-		error = xfs_bmap_last_offset(NULL, ip, &last_block,
-			XFS_DATA_FORK);
-		xfs_iunlock(ip, XFS_ILOCK_SHARED);
-		if (error) {
-			last_block = 0;
-		}
-	} else {
-		last_block = 0;
-	}
-	size_last_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)ip->i_size);
-	last_block = XFS_FILEOFF_MAX(last_block, size_last_block);
-
-	last_byte = XFS_FSB_TO_B(mp, last_block);
-	if (last_byte < 0) {
-		return XFS_MAXIOFFSET(mp);
-	}
-	last_byte += (1 << mp->m_writeio_log);
-	if (last_byte < 0) {
-		return XFS_MAXIOFFSET(mp);
-	}
-	return last_byte;
-}
-
-/*
- * Start the truncation of the file to new_size.  The new size
- * must be smaller than the current size.  This routine will
- * clear the buffer and page caches of file data in the removed
- * range, and xfs_itruncate_finish() will remove the underlying
- * disk blocks.
- *
- * The inode must have its I/O lock locked EXCLUSIVELY, and it
- * must NOT have the inode lock held at all.  This is because we're
- * calling into the buffer/page cache code and we can't hold the
- * inode lock when we do so.
- *
- * We need to wait for any direct I/Os in flight to complete before we
- * proceed with the truncate. This is needed to prevent the extents
- * being read or written by the direct I/Os from being removed while the
- * I/O is in flight as there is no other method of synchronising
- * direct I/O with the truncate operation.  Also, because we hold
- * the IOLOCK in exclusive mode, we prevent new direct I/Os from being
- * started until the truncate completes and drops the lock. Essentially,
- * the xfs_ioend_wait() call forms an I/O barrier that provides strict
- * ordering between direct I/Os and the truncate operation.
- *
- * The flags parameter can have either the value XFS_ITRUNC_DEFINITE
- * or XFS_ITRUNC_MAYBE.  The XFS_ITRUNC_MAYBE value should be used
- * in the case that the caller is locking things out of order and
- * may not be able to call xfs_itruncate_finish() with the inode lock
- * held without dropping the I/O lock.  If the caller must drop the
- * I/O lock before calling xfs_itruncate_finish(), then xfs_itruncate_start()
- * must be called again with all the same restrictions as the initial
- * call.
- */
-int
-xfs_itruncate_start(
-	xfs_inode_t	*ip,
-	uint		flags,
-	xfs_fsize_t	new_size)
-{
-	xfs_fsize_t	last_byte;
-	xfs_off_t	toss_start;
-	xfs_mount_t	*mp;
-	int		error = 0;
-
-	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
-	ASSERT((new_size == 0) || (new_size <= ip->i_size));
-	ASSERT((flags == XFS_ITRUNC_DEFINITE) ||
-	       (flags == XFS_ITRUNC_MAYBE));
-
-	mp = ip->i_mount;
-
-	/* wait for the completion of any pending DIOs */
-	if (new_size == 0 || new_size < ip->i_size)
-		xfs_ioend_wait(ip);
-
-	/*
-	 * Call toss_pages or flushinval_pages to get rid of pages
-	 * overlapping the region being removed.  We have to use
-	 * the less efficient flushinval_pages in the case that the
-	 * caller may not be able to finish the truncate without
-	 * dropping the inode's I/O lock.  Make sure
-	 * to catch any pages brought in by buffers overlapping
-	 * the EOF by searching out beyond the isize by our
-	 * block size. We round new_size up to a block boundary
-	 * so that we don't toss things on the same block as
-	 * new_size but before it.
-	 *
-	 * Before calling toss_page or flushinval_pages, make sure to
-	 * call remapf() over the same region if the file is mapped.
-	 * This frees up mapped file references to the pages in the
-	 * given range and for the flushinval_pages case it ensures
-	 * that we get the latest mapped changes flushed out.
-	 */
-	toss_start = XFS_B_TO_FSB(mp, (xfs_ufsize_t)new_size);
-	toss_start = XFS_FSB_TO_B(mp, toss_start);
-	if (toss_start < 0) {
-		/*
-		 * The place to start tossing is beyond our maximum
-		 * file size, so there is no way that the data extended
-		 * out there.
-		 */
-		return 0;
-	}
-	last_byte = xfs_file_last_byte(ip);
-	trace_xfs_itruncate_start(ip, new_size, flags, toss_start, last_byte);
-	if (last_byte > toss_start) {
-		if (flags & XFS_ITRUNC_DEFINITE) {
-			xfs_tosspages(ip, toss_start,
-					-1, FI_REMAPF_LOCKED);
-		} else {
-			error = xfs_flushinval_pages(ip, toss_start,
-					-1, FI_REMAPF_LOCKED);
-		}
-	}
-
-#ifdef DEBUG
-	if (new_size == 0) {
-		ASSERT(VN_CACHED(VFS_I(ip)) == 0);
-	}
-#endif
-	return error;
-}
-
-/*
- * Shrink the file to the given new_size.  The new size must be smaller than
- * the current size.  This will free up the underlying blocks in the removed
- * range after a call to xfs_itruncate_start() or xfs_atruncate_start().
+ * Free up the underlying blocks past new_size.  The new size must be
+ * smaller than the current size.
  *
  * The transaction passed to this routine must have made a permanent log
  * reservation of at least XFS_ITRUNCATE_LOG_RES.  This routine may commit the
@@ -1387,7 +1230,7 @@ xfs_itruncate_start(
  * will be "held" within the returned transaction.  This routine does NOT
  * require any disk space to be reserved for it within the transaction.
  *
- * The fork parameter must be either xfs_attr_fork or xfs_data_fork, and it
+ * The fork parameter must be either XFS_ATTR_FORK or XFS_DATA_FORK, and it
  * indicates the fork which is to be truncated.  For the attribute fork we only
  * support truncation to size 0.
  *
Index: xfs/fs/xfs/xfs_vnodeops.c
===================================================================
--- xfs.orig/fs/xfs/xfs_vnodeops.c	2011-06-29 11:29:07.158281874 +0200
+++ xfs/fs/xfs/xfs_vnodeops.c	2011-06-29 11:29:11.888256249 +0200
@@ -197,13 +197,6 @@ xfs_free_eofblocks(
 		 */
 		tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
 
-		/*
-		 * Do the xfs_itruncate_start() call before
-		 * reserving any log space because
-		 * itruncate_start will call into the buffer
-		 * cache and we can't
-		 * do that within a transaction.
-		 */
 		if (flags & XFS_FREE_EOF_TRYLOCK) {
 			if (!xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
 				xfs_trans_cancel(tp, 0);
@@ -212,13 +205,6 @@ xfs_free_eofblocks(
 		} else {
 			xfs_ilock(ip, XFS_IOLOCK_EXCL);
 		}
-		error = xfs_itruncate_start(ip, XFS_ITRUNC_DEFINITE,
-				    ip->i_size);
-		if (error) {
-			xfs_trans_cancel(tp, 0);
-			xfs_iunlock(ip, XFS_IOLOCK_EXCL);
-			return error;
-		}
 
 		error = xfs_trans_reserve(tp, 0,
 					  XFS_ITRUNCATE_LOG_RES(mp),
@@ -660,20 +646,9 @@ xfs_inactive(
 
 	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
 	if (truncate) {
-		/*
-		 * Do the xfs_itruncate_start() call before
-		 * reserving any log space because itruncate_start
-		 * will call into the buffer cache and we can't
-		 * do that within a transaction.
-		 */
 		xfs_ilock(ip, XFS_IOLOCK_EXCL);
 
-		error = xfs_itruncate_start(ip, XFS_ITRUNC_DEFINITE, 0);
-		if (error) {
-			xfs_trans_cancel(tp, 0);
-			xfs_iunlock(ip, XFS_IOLOCK_EXCL);
-			return VN_INACTIVE_CACHE;
-		}
+		xfs_ioend_wait(ip);
 
 		error = xfs_trans_reserve(tp, 0,
 					  XFS_ITRUNCATE_LOG_RES(mp),
Index: xfs/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_trace.h	2011-06-29 11:29:02.518307010 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_trace.h	2011-06-29 11:29:11.891589564 +0200
@@ -1029,40 +1029,6 @@ DEFINE_SIMPLE_IO_EVENT(xfs_delalloc_enos
 DEFINE_SIMPLE_IO_EVENT(xfs_unwritten_convert);
 DEFINE_SIMPLE_IO_EVENT(xfs_get_blocks_notfound);
 
-
-TRACE_EVENT(xfs_itruncate_start,
-	TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size, int flag,
-		 xfs_off_t toss_start, xfs_off_t toss_finish),
-	TP_ARGS(ip, new_size, flag, toss_start, toss_finish),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_ino_t, ino)
-		__field(xfs_fsize_t, size)
-		__field(xfs_fsize_t, new_size)
-		__field(xfs_off_t, toss_start)
-		__field(xfs_off_t, toss_finish)
-		__field(int, flag)
-	),
-	TP_fast_assign(
-		__entry->dev = VFS_I(ip)->i_sb->s_dev;
-		__entry->ino = ip->i_ino;
-		__entry->size = ip->i_d.di_size;
-		__entry->new_size = new_size;
-		__entry->toss_start = toss_start;
-		__entry->toss_finish = toss_finish;
-		__entry->flag = flag;
-	),
-	TP_printk("dev %d:%d ino 0x%llx %s size 0x%llx new_size 0x%llx "
-		  "toss start 0x%llx toss finish 0x%llx",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino,
-		  __print_flags(__entry->flag, "|", XFS_ITRUNC_FLAGS),
-		  __entry->size,
-		  __entry->new_size,
-		  __entry->toss_start,
-		  __entry->toss_finish)
-);
-
 DECLARE_EVENT_CLASS(xfs_itrunc_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size),
 	TP_ARGS(ip, new_size),
Index: xfs/fs/xfs/xfs_inode.h
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.h	2011-06-29 11:29:02.531640272 +0200
+++ xfs/fs/xfs/xfs_inode.h	2011-06-29 11:29:11.891589564 +0200
@@ -458,16 +458,6 @@ static inline void xfs_ifunlock(xfs_inod
 extern struct lock_class_key xfs_iolock_reclaimable;
 
 /*
- * Flags for xfs_itruncate_start().
- */
-#define	XFS_ITRUNC_DEFINITE	0x1
-#define	XFS_ITRUNC_MAYBE	0x2
-
-#define XFS_ITRUNC_FLAGS \
-	{ XFS_ITRUNC_DEFINITE,	"DEFINITE" }, \
-	{ XFS_ITRUNC_MAYBE,	"MAYBE" }
-
-/*
  * For multiple groups support: if S_ISGID bit is set in the parent
  * directory, group of new file is set to that of the parent, and
  * new subdirectory gets S_ISGID bit from parent.
@@ -501,7 +491,6 @@ uint		xfs_ip2xflags(struct xfs_inode *);
 uint		xfs_dic2xflags(struct xfs_dinode *);
 int		xfs_ifree(struct xfs_trans *, xfs_inode_t *,
 			   struct xfs_bmap_free *);
-int		xfs_itruncate_start(xfs_inode_t *, uint, xfs_fsize_t);
 int		xfs_itruncate_finish(struct xfs_trans **, xfs_inode_t *,
 				     xfs_fsize_t, int, int);
 int		xfs_iunlink(struct xfs_trans *, xfs_inode_t *);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 09/27] xfs: split xfs_itruncate_finish
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 08/27] xfs: kill xfs_itruncate_start Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  2:44   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 10/27] xfs: improve sync behaviour in the face of aggressive dirtying Christoph Hellwig
                   ` (18 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-split-xfs_itruncate_finish --]
[-- Type: text/plain, Size: 21705 bytes --]

Split the guts of xfs_itruncate_finish, the loop over the existing extents
that calls xfs_bunmapi on them, into a new helper, xfs_itruncate_extents.
Make xfs_attr_inactive call it directly instead of xfs_itruncate_finish,
which allows us to simplify the latter a lot by only letting it deal with
the data fork.  As a result xfs_itruncate_finish is renamed to
xfs_itruncate_data to make its use case more obvious.

Also remove the sync parameter from xfs_itruncate_data, which has been
unnecessary since the introduction of the busy extent list in 2002, and
completely dead code since 2003 when the XFS_BMAPI_ASYNC parameter was
made a no-op.

I can't actually see why xfs_attr_inactive needs to set the transaction
sync, but let's keep this patch simple and without changes in behaviour.
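
The resulting calling convention looks roughly like this (a sketch; the
full prototypes are in the xfs_inode.h hunk below):

	int	xfs_itruncate_extents(struct xfs_trans **tpp, struct xfs_inode *ip,
				      int whichfork, xfs_fsize_t new_size);
	int	xfs_itruncate_data(struct xfs_trans **tpp, struct xfs_inode *ip,
				   xfs_fsize_t new_size);

	/* data fork truncate, e.g. from xfs_setattr_size: */
	error = xfs_itruncate_data(&tp, ip, iattr->ia_size);

	/* attr fork removal, e.g. from xfs_attr_inactive: */
	error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK, 0);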

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_iops.c	2011-06-29 11:35:39.086158618 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_iops.c	2011-06-29 11:35:45.779455690 +0200
@@ -880,15 +880,7 @@ xfs_setattr_size(
 		ip->i_size = iattr->ia_size;
 	} else if (iattr->ia_size <= ip->i_size ||
 		   (iattr->ia_size == 0 && ip->i_d.di_nextents)) {
-		/*
-		 * Signal a sync transaction unless we are truncating an
-		 * already unlinked file on a wsync filesystem.
-		 */
-		error = xfs_itruncate_finish(&tp, ip, iattr->ia_size,
-				    XFS_DATA_FORK,
-				    ((ip->i_d.di_nlink != 0 ||
-				      !(mp->m_flags & XFS_MOUNT_WSYNC))
-				     ? 1 : 0));
+		error = xfs_itruncate_data(&tp, ip, iattr->ia_size);
 		if (error)
 			goto out_trans_abort;
 
Index: xfs/fs/xfs/quota/xfs_qm_syscalls.c
===================================================================
--- xfs.orig/fs/xfs/quota/xfs_qm_syscalls.c	2011-06-29 11:35:39.112825141 +0200
+++ xfs/fs/xfs/quota/xfs_qm_syscalls.c	2011-06-29 11:35:45.782789005 +0200
@@ -263,7 +263,7 @@ xfs_qm_scall_trunc_qfile(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip);
 
-	error = xfs_itruncate_finish(&tp, ip, 0, XFS_DATA_FORK, 1);
+	error = xfs_itruncate_data(&tp, ip, 0);
 	if (error) {
 		xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES |
 				     XFS_TRANS_ABORT);
Index: xfs/fs/xfs/xfs_attr.c
===================================================================
--- xfs.orig/fs/xfs/xfs_attr.c	2011-06-29 11:35:39.126158401 +0200
+++ xfs/fs/xfs/xfs_attr.c	2011-06-29 11:35:45.782789005 +0200
@@ -822,17 +822,21 @@ xfs_attr_inactive(xfs_inode_t *dp)
 	error = xfs_attr_root_inactive(&trans, dp);
 	if (error)
 		goto out;
+
 	/*
-	 * signal synchronous inactive transactions unless this
-	 * is a synchronous mount filesystem in which case we
-	 * know that we're here because we've been called out of
-	 * xfs_inactive which means that the last reference is gone
-	 * and the unlink transaction has already hit the disk so
-	 * async inactive transactions are safe.
+	 * Signal synchronous inactive transactions unless this is a
+	 * synchronous mount filesystem in which case we know that we're here
+	 * because we've been called out of xfs_inactive which means that the
+	 * last reference is gone and the unlink transaction has already hit
+	 * the disk so async inactive transactions are safe.
 	 */
-	if ((error = xfs_itruncate_finish(&trans, dp, 0LL, XFS_ATTR_FORK,
-				(!(mp->m_flags & XFS_MOUNT_WSYNC)
-				 ? 1 : 0))))
+	if (!(mp->m_flags & XFS_MOUNT_WSYNC)) {
+		if (dp->i_d.di_anextents > 0)
+			xfs_trans_set_sync(trans);
+	}
+
+	error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK, 0);
+	if (error)
 		goto out;
 
 	/*
Index: xfs/fs/xfs/xfs_inode.c
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.c	2011-06-29 11:35:39.136158346 +0200
+++ xfs/fs/xfs/xfs_inode.c	2011-06-29 11:38:24.515262411 +0200
@@ -52,7 +52,7 @@ kmem_zone_t *xfs_ifork_zone;
 kmem_zone_t *xfs_inode_zone;
 
 /*
- * Used in xfs_itruncate().  This is the maximum number of extents
+ * Used in xfs_itruncate_extents().  This is the maximum number of extents
  * freed from a file in a single transaction.
  */
 #define	XFS_ITRUNC_MAX_EXTENTS	2
@@ -1218,7 +1218,9 @@ xfs_isize_check(
 
 /*
  * Free up the underlying blocks past new_size.  The new size must be
- * smaller than the current size.
+ * smaller than the current size.  This routine can be used both for
+ * the attribute and data fork, and does not modify the inode size,
+ * which is left to the caller.
  *
  * The transaction passed to this routine must have made a permanent log
  * reservation of at least XFS_ITRUNCATE_LOG_RES.  This routine may commit the
@@ -1230,31 +1232,6 @@ xfs_isize_check(
  * will be "held" within the returned transaction.  This routine does NOT
  * require any disk space to be reserved for it within the transaction.
  *
- * The fork parameter must be either XFS_ATTR_FORK or XFS_DATA_FORK, and it
- * indicates the fork which is to be truncated.  For the attribute fork we only
- * support truncation to size 0.
- *
- * We use the sync parameter to indicate whether or not the first transaction
- * we perform might have to be synchronous.  For the attr fork, it needs to be
- * so if the unlink of the inode is not yet known to be permanent in the log.
- * This keeps us from freeing and reusing the blocks of the attribute fork
- * before the unlink of the inode becomes permanent.
- *
- * For the data fork, we normally have to run synchronously if we're being
- * called out of the inactive path or we're being called out of the create path
- * where we're truncating an existing file.  Either way, the truncate needs to
- * be sync so blocks don't reappear in the file with altered data in case of a
- * crash.  wsync filesystems can run the first case async because anything that
- * shrinks the inode has to run sync so by the time we're called here from
- * inactive, the inode size is permanently set to 0.
- *
- * Calls from the truncate path always need to be sync unless we're in a wsync
- * filesystem and the file has already been unlinked.
- *
- * The caller is responsible for correctly setting the sync parameter.  It gets
- * too hard for us to guess here which path we're being called out of just
- * based on inode state.
- *
  * If we get an error, we must return with the inode locked and linked into the
  * current transaction. This keeps things simple for the higher level code,
  * because it always knows that the inode is locked and held in the transaction
@@ -1262,124 +1239,30 @@ xfs_isize_check(
  * dirty on error so that transactions can be easily aborted if possible.
  */
 int
-xfs_itruncate_finish(
-	xfs_trans_t	**tp,
-	xfs_inode_t	*ip,
-	xfs_fsize_t	new_size,
-	int		fork,
-	int		sync)
+xfs_itruncate_extents(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	xfs_fsize_t		new_size)
 {
-	xfs_fsblock_t	first_block;
-	xfs_fileoff_t	first_unmap_block;
-	xfs_fileoff_t	last_block;
-	xfs_filblks_t	unmap_len=0;
-	xfs_mount_t	*mp;
-	xfs_trans_t	*ntp;
-	int		done;
-	int		committed;
-	xfs_bmap_free_t	free_list;
-	int		error;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*ntp = *tpp;
+	xfs_bmap_free_t		free_list;
+	xfs_fsblock_t		first_block;
+	xfs_fileoff_t		first_unmap_block;
+	xfs_fileoff_t		last_block;
+	xfs_filblks_t		unmap_len;
+	int			committed;
+	int			error = 0;
+	int			done = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_IOLOCK_EXCL));
-	ASSERT((new_size == 0) || (new_size <= ip->i_size));
-	ASSERT(*tp != NULL);
-	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
-	ASSERT(ip->i_transp == *tp);
+	ASSERT(new_size == 0 || new_size <= ip->i_size);
+	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
+	ASSERT(ip->i_transp == *tpp);
 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(ip->i_itemp->ili_lock_flags == 0);
-
-
-	ntp = *tp;
-	mp = (ntp)->t_mountp;
-	ASSERT(! XFS_NOT_DQATTACHED(mp, ip));
-
-	/*
-	 * We only support truncating the entire attribute fork.
-	 */
-	if (fork == XFS_ATTR_FORK) {
-		new_size = 0LL;
-	}
-	first_unmap_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)new_size);
-	trace_xfs_itruncate_finish_start(ip, new_size);
-
-	/*
-	 * The first thing we do is set the size to new_size permanently
-	 * on disk.  This way we don't have to worry about anyone ever
-	 * being able to look at the data being freed even in the face
-	 * of a crash.  What we're getting around here is the case where
-	 * we free a block, it is allocated to another file, it is written
-	 * to, and then we crash.  If the new data gets written to the
-	 * file but the log buffers containing the free and reallocation
-	 * don't, then we'd end up with garbage in the blocks being freed.
-	 * As long as we make the new_size permanent before actually
-	 * freeing any blocks it doesn't matter if they get written to.
-	 *
-	 * The callers must signal into us whether or not the size
-	 * setting here must be synchronous.  There are a few cases
-	 * where it doesn't have to be synchronous.  Those cases
-	 * occur if the file is unlinked and we know the unlink is
-	 * permanent or if the blocks being truncated are guaranteed
-	 * to be beyond the inode eof (regardless of the link count)
-	 * and the eof value is permanent.  Both of these cases occur
-	 * only on wsync-mounted filesystems.  In those cases, we're
-	 * guaranteed that no user will ever see the data in the blocks
-	 * that are being truncated so the truncate can run async.
-	 * In the free beyond eof case, the file may wind up with
-	 * more blocks allocated to it than it needs if we crash
-	 * and that won't get fixed until the next time the file
-	 * is re-opened and closed but that's ok as that shouldn't
-	 * be too many blocks.
-	 *
-	 * However, we can't just make all wsync xactions run async
-	 * because there's one call out of the create path that needs
-	 * to run sync where it's truncating an existing file to size
-	 * 0 whose size is > 0.
-	 *
-	 * It's probably possible to come up with a test in this
-	 * routine that would correctly distinguish all the above
-	 * cases from the values of the function parameters and the
-	 * inode state but for sanity's sake, I've decided to let the
-	 * layers above just tell us.  It's simpler to correctly figure
-	 * out in the layer above exactly under what conditions we
-	 * can run async and I think it's easier for others read and
-	 * follow the logic in case something has to be changed.
-	 * cscope is your friend -- rcc.
-	 *
-	 * The attribute fork is much simpler.
-	 *
-	 * For the attribute fork we allow the caller to tell us whether
-	 * the unlink of the inode that led to this call is yet permanent
-	 * in the on disk log.  If it is not and we will be freeing extents
-	 * in this inode then we make the first transaction synchronous
-	 * to make sure that the unlink is permanent by the time we free
-	 * the blocks.
-	 */
-	if (fork == XFS_DATA_FORK) {
-		if (ip->i_d.di_nextents > 0) {
-			/*
-			 * If we are not changing the file size then do
-			 * not update the on-disk file size - we may be
-			 * called from xfs_inactive_free_eofblocks().  If we
-			 * update the on-disk file size and then the system
-			 * crashes before the contents of the file are
-			 * flushed to disk then the files may be full of
-			 * holes (ie NULL files bug).
-			 */
-			if (ip->i_size != new_size) {
-				ip->i_d.di_size = new_size;
-				ip->i_size = new_size;
-				xfs_trans_log_inode(ntp, ip, XFS_ILOG_CORE);
-			}
-		}
-	} else if (sync) {
-		ASSERT(!(mp->m_flags & XFS_MOUNT_WSYNC));
-		if (ip->i_d.di_anextents > 0)
-			xfs_trans_set_sync(ntp);
-	}
-	ASSERT(fork == XFS_DATA_FORK ||
-		(fork == XFS_ATTR_FORK &&
-			((sync && !(mp->m_flags & XFS_MOUNT_WSYNC)) ||
-			 (sync == 0 && (mp->m_flags & XFS_MOUNT_WSYNC)))));
+	ASSERT(!XFS_NOT_DQATTACHED(mp, ip));
 
 	/*
 	 * Since it is possible for space to become allocated beyond
@@ -1390,70 +1273,34 @@ xfs_itruncate_finish(
 	 * beyond the maximum file size (ie it is the same as last_block),
 	 * then there is nothing to do.
 	 */
+	first_unmap_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)new_size);
 	last_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_MAXIOFFSET(mp));
-	ASSERT(first_unmap_block <= last_block);
-	done = 0;
-	if (last_block == first_unmap_block) {
-		done = 1;
-	} else {
-		unmap_len = last_block - first_unmap_block + 1;
-	}
+	if (first_unmap_block == last_block)
+		return 0;
+
+	ASSERT(first_unmap_block < last_block);
+	unmap_len = last_block - first_unmap_block + 1;
 	while (!done) {
-		/*
-		 * Free up up to XFS_ITRUNC_MAX_EXTENTS.  xfs_bunmapi()
-		 * will tell us whether it freed the entire range or
-		 * not.  If this is a synchronous mount (wsync),
-		 * then we can tell bunmapi to keep all the
-		 * transactions asynchronous since the unlink
-		 * transaction that made this inode inactive has
-		 * already hit the disk.  There's no danger of
-		 * the freed blocks being reused, there being a
-		 * crash, and the reused blocks suddenly reappearing
-		 * in this file with garbage in them once recovery
-		 * runs.
-		 */
 		xfs_bmap_init(&free_list, &first_block);
 		error = xfs_bunmapi(ntp, ip,
 				    first_unmap_block, unmap_len,
-				    xfs_bmapi_aflag(fork),
+				    xfs_bmapi_aflag(whichfork),
 				    XFS_ITRUNC_MAX_EXTENTS,
 				    &first_block, &free_list,
 				    &done);
-		if (error) {
-			/*
-			 * If the bunmapi call encounters an error,
-			 * return to the caller where the transaction
-			 * can be properly aborted.  We just need to
-			 * make sure we're not holding any resources
-			 * that we were not when we came in.
-			 */
-			xfs_bmap_cancel(&free_list);
-			return error;
-		}
+		if (error)
+			goto out_bmap_cancel;
 
 		/*
 		 * Duplicate the transaction that has the permanent
 		 * reservation and commit the old transaction.
 		 */
-		error = xfs_bmap_finish(tp, &free_list, &committed);
-		ntp = *tp;
+		error = xfs_bmap_finish(tpp, &free_list, &committed);
+		ntp = *tpp;
 		if (committed)
 			xfs_trans_ijoin(ntp, ip);
-
-		if (error) {
-			/*
-			 * If the bmap finish call encounters an error, return
-			 * to the caller where the transaction can be properly
-			 * aborted.  We just need to make sure we're not
-			 * holding any resources that we were not when we came
-			 * in.
-			 *
-			 * Aborting from this point might lose some blocks in
-			 * the file system, but oh well.
-			 */
-			xfs_bmap_cancel(&free_list);
-			return error;
-		}
+		if (error)
+			goto out_bmap_cancel;
 
 		if (committed) {
 			/*
@@ -1464,15 +1311,16 @@ xfs_itruncate_finish(
 		}
 
 		ntp = xfs_trans_dup(ntp);
-		error = xfs_trans_commit(*tp, 0);
-		*tp = ntp;
+		error = xfs_trans_commit(*tpp, 0);
+		*tpp = ntp;
 
 		xfs_trans_ijoin(ntp, ip);
 
 		if (error)
 			return error;
+
 		/*
-		 * transaction commit worked ok so we can drop the extra ticket
+		 * Transaction commit worked ok so we can drop the extra ticket
 		 * reference that we gained in xfs_trans_dup()
 		 */
 		xfs_log_ticket_put(ntp->t_ticket);
@@ -1483,35 +1331,85 @@ xfs_itruncate_finish(
 		if (error)
 			return error;
 	}
+
+	return 0;
+
+out_bmap_cancel:
+	/*
+	 * If the bunmapi call encounters an error, return to the caller where
+	 * the transaction can be properly aborted.  We just need to make sure
+	 * we're not holding any resources that we were not when we came in.
+	 */
+	xfs_bmap_cancel(&free_list);
+	return error;
+}
+
+int
+xfs_itruncate_data(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	xfs_fsize_t		new_size)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
+
+	trace_xfs_itruncate_data_start(ip, new_size);
+
 	/*
-	 * Only update the size in the case of the data fork, but
-	 * always re-log the inode so that our permanent transaction
-	 * can keep on rolling it forward in the log.
+	 * The first thing we do is set the size to new_size permanently on
+	 * disk.  This way we don't have to worry about anyone ever being able
+	 * to look at the data being freed even in the face of a crash.
+	 * What we're getting around here is the case where we free a block, it
+	 * is allocated to another file, it is written to, and then we crash.
+	 * If the new data gets written to the file but the log buffers
+	 * containing the free and reallocation don't, then we'd end up with
+	 * garbage in the blocks being freed.  As long as we make the new_size
+	 * permanent before actually freeing any blocks it doesn't matter if
+	 * they get written to.
 	 */
-	if (fork == XFS_DATA_FORK) {
-		xfs_isize_check(mp, ip, new_size);
+	if (ip->i_d.di_nextents > 0) {
 		/*
-		 * If we are not changing the file size then do
-		 * not update the on-disk file size - we may be
-		 * called from xfs_inactive_free_eofblocks().  If we
-		 * update the on-disk file size and then the system
-		 * crashes before the contents of the file are
-		 * flushed to disk then the files may be full of
-		 * holes (ie NULL files bug).
+		 * If we are not changing the file size then do not update
+		 * the on-disk file size - we may be called from
+		 * xfs_inactive_free_eofblocks().  If we update the on-disk
+		 * file size and then the system crashes before the contents
+		 * of the file are flushed to disk then the files may be
+		 * full of holes (ie NULL files bug).
 		 */
 		if (ip->i_size != new_size) {
 			ip->i_d.di_size = new_size;
 			ip->i_size = new_size;
+			xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
 		}
 	}
-	xfs_trans_log_inode(ntp, ip, XFS_ILOG_CORE);
-	ASSERT((new_size != 0) ||
-	       (fork == XFS_ATTR_FORK) ||
-	       (ip->i_delayed_blks == 0));
-	ASSERT((new_size != 0) ||
-	       (fork == XFS_ATTR_FORK) ||
-	       (ip->i_d.di_nextents == 0));
-	trace_xfs_itruncate_finish_end(ip, new_size);
+
+	error = xfs_itruncate_extents(tpp, ip, XFS_DATA_FORK, new_size);
+	if (error)
+		return error;
+
+	/*
+	 * If we are not changing the file size then do not update the on-disk
+	 * file size - we may be called from xfs_inactive_free_eofblocks().
+	 * If we update the on-disk file size and then the system crashes
+	 * before the contents of the file are flushed to disk then the files
+	 * may be full of holes (ie NULL files bug).
+	 */
+	xfs_isize_check(mp, ip, new_size);
+	if (ip->i_size != new_size) {
+		ip->i_d.di_size = new_size;
+		ip->i_size = new_size;
+	}
+
+	ASSERT(new_size != 0 || ip->i_delayed_blks == 0);
+	ASSERT(new_size != 0 || ip->i_d.di_nextents == 0);
+
+	/*
+	 * Always re-log the inode so that our permanent transaction can keep
+	 * on rolling it forward in the log.
+	 */
+	xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
+
+	trace_xfs_itruncate_data_end(ip, new_size);
 	return 0;
 }
 
Index: xfs/fs/xfs/xfs_inode.h
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.h	2011-06-29 11:35:39.146158294 +0200
+++ xfs/fs/xfs/xfs_inode.h	2011-06-29 11:35:45.789455635 +0200
@@ -491,8 +491,10 @@ uint		xfs_ip2xflags(struct xfs_inode *);
 uint		xfs_dic2xflags(struct xfs_dinode *);
 int		xfs_ifree(struct xfs_trans *, xfs_inode_t *,
 			   struct xfs_bmap_free *);
-int		xfs_itruncate_finish(struct xfs_trans **, xfs_inode_t *,
-				     xfs_fsize_t, int, int);
+int		xfs_itruncate_extents(struct xfs_trans **, struct xfs_inode *,
+				      int, xfs_fsize_t);
+int		xfs_itruncate_data(struct xfs_trans **, struct xfs_inode *,
+				   xfs_fsize_t);
 int		xfs_iunlink(struct xfs_trans *, xfs_inode_t *);
 
 void		xfs_iext_realloc(xfs_inode_t *, int, int);
Index: xfs/fs/xfs/xfs_vnodeops.c
===================================================================
--- xfs.orig/fs/xfs/xfs_vnodeops.c	2011-06-29 11:35:39.162824869 +0200
+++ xfs/fs/xfs/xfs_vnodeops.c	2011-06-29 11:35:45.789455635 +0200
@@ -220,15 +220,12 @@ xfs_free_eofblocks(
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
 		xfs_trans_ijoin(tp, ip);
 
-		error = xfs_itruncate_finish(&tp, ip,
-					     ip->i_size,
-					     XFS_DATA_FORK,
-					     0);
-		/*
-		 * If we get an error at this point we
-		 * simply don't bother truncating the file.
-		 */
+		error = xfs_itruncate_data(&tp, ip, ip->i_size);
 		if (error) {
+			/*
+			 * If we get an error at this point we simply don't
+			 * bother truncating the file.
+			 */
 			xfs_trans_cancel(tp,
 					 (XFS_TRANS_RELEASE_LOG_RES |
 					  XFS_TRANS_ABORT));
@@ -665,16 +662,7 @@ xfs_inactive(
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
 		xfs_trans_ijoin(tp, ip);
 
-		/*
-		 * normally, we have to run xfs_itruncate_finish sync.
-		 * But if filesystem is wsync and we're in the inactive
-		 * path, then we know that nlink == 0, and that the
-		 * xaction that made nlink == 0 is permanently committed
-		 * since xfs_remove runs as a synchronous transaction.
-		 */
-		error = xfs_itruncate_finish(&tp, ip, 0, XFS_DATA_FORK,
-				(!(mp->m_flags & XFS_MOUNT_WSYNC) ? 1 : 0));
-
+		error = xfs_itruncate_data(&tp, ip, 0);
 		if (error) {
 			xfs_trans_cancel(tp,
 				XFS_TRANS_RELEASE_LOG_RES | XFS_TRANS_ABORT);
Index: xfs/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_trace.h	2011-06-29 11:35:39.099491878 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_trace.h	2011-06-29 11:35:45.792788951 +0200
@@ -1055,8 +1055,8 @@ DECLARE_EVENT_CLASS(xfs_itrunc_class,
 DEFINE_EVENT(xfs_itrunc_class, name, \
 	TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size), \
 	TP_ARGS(ip, new_size))
-DEFINE_ITRUNC_EVENT(xfs_itruncate_finish_start);
-DEFINE_ITRUNC_EVENT(xfs_itruncate_finish_end);
+DEFINE_ITRUNC_EVENT(xfs_itruncate_data_start);
+DEFINE_ITRUNC_EVENT(xfs_itruncate_data_end);
 
 TRACE_EVENT(xfs_pagecache_inval,
 	TP_PROTO(struct xfs_inode *ip, xfs_off_t start, xfs_off_t finish),

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 10/27] xfs: improve sync behaviour in the face of aggressive dirtying
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 09/27] xfs: split xfs_itruncate_finish Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  2:52   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 11/27] xfs: fix filesystem freeze race in xfs_trans_alloc Christoph Hellwig
                   ` (17 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-simplify-sync --]
[-- Type: text/plain, Size: 2375 bytes --]

The following script from Wu Fengguang shows very bad behaviour in XFS
when aggressively dirtying data during a sync, with sync times up to
almost 10 times as long as on ext4.

A large part of the issue is that XFS writes data out itself two times
in the ->sync_fs method, overriding the livelock protection in the core
writeback code, and another issue is the lock-less xfs_ioend_wait call,
which doesn't prevent new ioends from being queued up while waiting for
the count to reach zero.

This patch removes the XFS-internal sync calls and relies on the VFS
to do its work just like all other filesystems do.  Note that the
rather suboptimal i_iocount wait is simply removed here.  We already
do it in ->write_inode, which keeps the current suboptimal behaviour.
We'll eventually need to remove that as well, but that's material for
a separate commit.

------------------------------ snip ------------------------------
#!/bin/sh

umount /dev/sda7
mkfs.xfs -f /dev/sda7
# mkfs.ext4 /dev/sda7
# mkfs.btrfs /dev/sda7
mount /dev/sda7 /fs

echo $((50<<20)) > /proc/sys/vm/dirty_bytes

pid=
for i in `seq 10`
do
	dd if=/dev/zero of=/fs/zero-$i bs=1M count=1000 &
	pid="$pid $!"
done

sleep 1

tic=$(date +'%s')
sync
tac=$(date +'%s')

echo
echo sync time: $((tac-tic))
egrep '(Dirty|Writeback|NFS_Unstable)' /proc/meminfo

pidof dd > /dev/null && { kill -9 $pid; echo sync NOT livelocked; }
------------------------------ snip ------------------------------
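
With the internal data writeback gone, the data quiesce path in the diff
below reduces to roughly the following (a sketch of the resulting
xfs_quiesce_data):

	xfs_qm_sync(mp, SYNC_TRYLOCK);
	xfs_qm_sync(mp, SYNC_WAIT);

	/* force out the newly dirtied log buffers */
	xfs_log_force(mp, XFS_LOG_SYNC);

	/* write superblock and hoover up shutdown errors */
	error = xfs_sync_fsdata(mp);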

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_sync.c	2011-06-29 11:26:14.109219361 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_sync.c	2011-06-29 11:37:20.642275110 +0200
@@ -359,14 +359,12 @@ xfs_quiesce_data(
 {
 	int			error, error2 = 0;
 
-	/* push non-blocking */
-	xfs_sync_data(mp, 0);
 	xfs_qm_sync(mp, SYNC_TRYLOCK);
-
-	/* push and block till complete */
-	xfs_sync_data(mp, SYNC_WAIT);
 	xfs_qm_sync(mp, SYNC_WAIT);
 
+	/* force out the newly dirtied log buffers */
+	xfs_log_force(mp, XFS_LOG_SYNC);
+
 	/* write superblock and hoover up shutdown errors */
 	error = xfs_sync_fsdata(mp);
 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 11/27] xfs: fix filesystem freeze race in xfs_trans_alloc
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 10/27] xfs: improve sync behaviour in the face of aggressive dirtying Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  2:59   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 12/27] xfs: remove i_transp Christoph Hellwig
                   ` (16 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-fix-freeze-race --]
[-- Type: text/plain, Size: 5040 bytes --]

As pointed out by Jan, xfs_trans_alloc can race with a concurrent
filesystem freeze when it sleeps during the memory allocation.  Fix this
by moving the wait_for_freeze call after the memory allocation.  This
means moving the freeze wait into the low-level _xfs_trans_alloc helper,
which thus grows a new argument.  Also fix up some comments in that area
while at it.
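
The resulting calling convention (a sketch; the actual call sites are in
the diff below): normal callers keep using the xfs_trans_alloc wrapper,
which still waits for a pending freeze, while the few callers that must
make progress while the filesystem is frozen for new transactions pass
false to _xfs_trans_alloc:

	/* normal path: blocks until a pending freeze is done */
	tp = xfs_trans_alloc(mp, type);

	/* freeze/log-covering path: must not block on the freeze */
	tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT, KM_SLEEP, false);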

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_fsops.c
===================================================================
--- xfs.orig/fs/xfs/xfs_fsops.c	2011-06-18 17:50:43.477373715 +0200
+++ xfs/fs/xfs/xfs_fsops.c	2011-06-20 09:17:00.933518761 +0200
@@ -626,7 +626,7 @@ xfs_fs_log_dummy(
 	xfs_trans_t	*tp;
 	int		error;
 
-	tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1, KM_SLEEP);
+	tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1, KM_SLEEP, false);
 	error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
 					XFS_DEFAULT_LOG_COUNT);
 	if (error) {
Index: xfs/fs/xfs/xfs_iomap.c
===================================================================
--- xfs.orig/fs/xfs/xfs_iomap.c	2011-06-18 17:50:43.487373714 +0200
+++ xfs/fs/xfs/xfs_iomap.c	2011-06-20 09:17:00.933518761 +0200
@@ -688,8 +688,7 @@ xfs_iomap_write_unwritten(
 		 * the same inode that we complete here and might deadlock
 		 * on the iolock.
 		 */
-		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
-		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
+		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS, true);
 		tp->t_flags |= XFS_TRANS_RESERVE;
 		error = xfs_trans_reserve(tp, resblks,
 				XFS_WRITE_LOG_RES(mp), 0,
Index: xfs/fs/xfs/xfs_trans.h
===================================================================
--- xfs.orig/fs/xfs/xfs_trans.h	2011-06-18 17:50:43.497373713 +0200
+++ xfs/fs/xfs/xfs_trans.h	2011-06-21 10:57:04.908840421 +0200
@@ -447,8 +447,14 @@ typedef struct xfs_trans {
 /*
  * XFS transaction mechanism exported interfaces.
  */
-xfs_trans_t	*xfs_trans_alloc(struct xfs_mount *, uint);
-xfs_trans_t	*_xfs_trans_alloc(struct xfs_mount *, uint, uint);
+xfs_trans_t	*_xfs_trans_alloc(struct xfs_mount *, uint, uint, bool);
+
+static inline struct xfs_trans *
+xfs_trans_alloc(struct xfs_mount *mp, uint type)
+{
+	return _xfs_trans_alloc(mp, type, KM_SLEEP, true);
+}
+
 xfs_trans_t	*xfs_trans_dup(xfs_trans_t *);
 int		xfs_trans_reserve(xfs_trans_t *, uint, uint, uint,
 				  uint, uint);
Index: xfs/fs/xfs/xfs_mount.c
===================================================================
--- xfs.orig/fs/xfs/xfs_mount.c	2011-06-18 17:50:43.510707047 +0200
+++ xfs/fs/xfs/xfs_mount.c	2011-06-20 09:17:00.936852094 +0200
@@ -1566,15 +1566,9 @@ xfs_fs_writable(xfs_mount_t *mp)
 }
 
 /*
- * xfs_log_sbcount
- *
  * Called either periodically to keep the on disk superblock values
  * roughly up to date or from unmount to make sure the values are
  * correct on a clean unmount.
- *
- * Note this code can be called during the process of freezing, so
- * we may need to use the transaction allocator which does not not
- * block when the transaction subsystem is in its frozen state.
  */
 int
 xfs_log_sbcount(
@@ -1596,7 +1590,13 @@ xfs_log_sbcount(
 	if (!xfs_sb_version_haslazysbcount(&mp->m_sb))
 		return 0;
 
-	tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT, KM_SLEEP);
+	/*
+	 * We can be called during the process of freezing, so make sure
+	 * we go ahead even if the frozen for new transactions.  We will
+	 * always use a sync transaction in the freeze path to make sure
+	 * the transaction has completed by the time we return.
+	 */
+	tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT, KM_SLEEP, false);
 	error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
 					XFS_DEFAULT_LOG_COUNT);
 	if (error) {
Index: xfs/fs/xfs/xfs_trans.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans.c	2011-06-18 17:50:43.524040379 +0200
+++ xfs/fs/xfs/xfs_trans.c	2011-06-21 10:56:25.305509042 +0200
@@ -566,31 +566,24 @@ xfs_trans_init(
 
 /*
  * This routine is called to allocate a transaction structure.
+ *
  * The type parameter indicates the type of the transaction.  These
  * are enumerated in xfs_trans.h.
- *
- * Dynamically allocate the transaction structure from the transaction
- * zone, initialize it, and return it to the caller.
  */
-xfs_trans_t *
-xfs_trans_alloc(
-	xfs_mount_t	*mp,
-	uint		type)
-{
-	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
-	return _xfs_trans_alloc(mp, type, KM_SLEEP);
-}
-
-xfs_trans_t *
+struct xfs_trans *
 _xfs_trans_alloc(
-	xfs_mount_t	*mp,
-	uint		type,
-	uint		memflags)
+	struct xfs_mount	*mp,
+	uint			type,
+	uint			memflags,
+	bool			wait_for_freeze)
 {
-	xfs_trans_t	*tp;
+	struct xfs_trans	*tp;
 
 	atomic_inc(&mp->m_active_trans);
 
+	if (wait_for_freeze)
+		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
+
 	tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
 	tp->t_magic = XFS_TRANS_MAGIC;
 	tp->t_type = type;

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 12/27] xfs: remove i_transp
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 11/27] xfs: fix filesystem freeze race in xfs_trans_alloc Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  3:00   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 13/27] xfs: factor out xfs_dir2_leaf_find_entry Christoph Hellwig
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-kill-i_transp --]
[-- Type: text/plain, Size: 9068 bytes --]

Remove the transaction pointer in the inode.  It's only used to avoid
passing down an argument in the bmap code, and for a few asserts in
the transaction code right now.
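
In practice this just means the bmap helpers take the transaction as an
explicit argument instead of reading it back from ip->i_transp, e.g.
(a sketch; the full set of changed call sites is in the diff below):

	/* before: the transaction was fished out of the inode */
	error = xfs_bmap_extents_to_btree(ip->i_transp, ip, first, flist,
					  &cur, 1, &tmp_rval, XFS_DATA_FORK);

	/* after: it is passed down from the caller */
	error = xfs_bmap_extents_to_btree(tp, ip, first, flist,
					  &cur, 1, &tmp_rval, XFS_DATA_FORK);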

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/quota/xfs_trans_dquot.c
===================================================================
--- xfs.orig/fs/xfs/quota/xfs_trans_dquot.c	2011-06-29 11:26:14.000000000 +0200
+++ xfs/fs/xfs/quota/xfs_trans_dquot.c	2011-06-29 11:38:58.095080494 +0200
@@ -59,7 +59,7 @@ xfs_trans_dqjoin(
 	xfs_trans_add_item(tp, &dqp->q_logitem.qli_item);
 
 	/*
-	 * Initialize i_transp so we can later determine if this dquot is
+	 * Initialize d_transp so we can later determine if this dquot is
 	 * associated with this transaction.
 	 */
 	dqp->q_transp = tp;
Index: xfs/fs/xfs/xfs_bmap.c
===================================================================
--- xfs.orig/fs/xfs/xfs_bmap.c	2011-06-29 11:26:14.000000000 +0200
+++ xfs/fs/xfs/xfs_bmap.c	2011-06-29 11:38:58.098413810 +0200
@@ -94,6 +94,7 @@ xfs_bmap_add_attrfork_local(
  */
 STATIC int				/* error */
 xfs_bmap_add_extent_delay_real(
+	struct xfs_trans	*tp,	/* transaction pointer */
 	xfs_inode_t		*ip,	/* incore inode pointer */
 	xfs_extnum_t		*idx,	/* extent number to update/insert */
 	xfs_btree_cur_t		**curp,	/* if *curp is null, not a btree */
@@ -439,6 +440,7 @@ xfs_bmap_add_attrfork_local(
  */
 STATIC int				/* error */
 xfs_bmap_add_extent(
+	struct xfs_trans	*tp,	/* transaction pointer */
 	xfs_inode_t		*ip,	/* incore inode pointer */
 	xfs_extnum_t		*idx,	/* extent number to update/insert */
 	xfs_btree_cur_t		**curp,	/* if *curp is null, not a btree */
@@ -524,7 +526,7 @@ xfs_bmap_add_extent(
 				if (cur)
 					ASSERT(cur->bc_private.b.flags &
 						XFS_BTCUR_BPRV_WASDEL);
-				error = xfs_bmap_add_extent_delay_real(ip,
+				error = xfs_bmap_add_extent_delay_real(tp, ip,
 						idx, &cur, new, &da_new,
 						first, flist, &logflags);
 			} else {
@@ -561,7 +563,7 @@ xfs_bmap_add_extent(
 		int	tmp_logflags;	/* partial log flag return val */
 
 		ASSERT(cur == NULL);
-		error = xfs_bmap_extents_to_btree(ip->i_transp, ip, first,
+		error = xfs_bmap_extents_to_btree(tp, ip, first,
 			flist, &cur, da_old > 0, &tmp_logflags, whichfork);
 		logflags |= tmp_logflags;
 		if (error)
@@ -604,6 +606,7 @@ done:
  */
 STATIC int				/* error */
 xfs_bmap_add_extent_delay_real(
+	struct xfs_trans	*tp,	/* transaction pointer */
 	xfs_inode_t		*ip,	/* incore inode pointer */
 	xfs_extnum_t		*idx,	/* extent number to update/insert */
 	xfs_btree_cur_t		**curp,	/* if *curp is null, not a btree */
@@ -901,7 +904,7 @@ xfs_bmap_add_extent_delay_real(
 		}
 		if (ip->i_d.di_format == XFS_DINODE_FMT_EXTENTS &&
 		    ip->i_d.di_nextents > ip->i_df.if_ext_max) {
-			error = xfs_bmap_extents_to_btree(ip->i_transp, ip,
+			error = xfs_bmap_extents_to_btree(tp, ip,
 					first, flist, &cur, 1, &tmp_rval,
 					XFS_DATA_FORK);
 			rval |= tmp_rval;
@@ -984,7 +987,7 @@ xfs_bmap_add_extent_delay_real(
 		}
 		if (ip->i_d.di_format == XFS_DINODE_FMT_EXTENTS &&
 		    ip->i_d.di_nextents > ip->i_df.if_ext_max) {
-			error = xfs_bmap_extents_to_btree(ip->i_transp, ip,
+			error = xfs_bmap_extents_to_btree(tp, ip,
 				first, flist, &cur, 1, &tmp_rval,
 				XFS_DATA_FORK);
 			rval |= tmp_rval;
@@ -1052,7 +1055,7 @@ xfs_bmap_add_extent_delay_real(
 		}
 		if (ip->i_d.di_format == XFS_DINODE_FMT_EXTENTS &&
 		    ip->i_d.di_nextents > ip->i_df.if_ext_max) {
-			error = xfs_bmap_extents_to_btree(ip->i_transp, ip,
+			error = xfs_bmap_extents_to_btree(tp, ip,
 					first, flist, &cur, 1, &tmp_rval,
 					XFS_DATA_FORK);
 			rval |= tmp_rval;
@@ -2871,8 +2874,8 @@ xfs_bmap_del_extent(
 			len = del->br_blockcount;
 			do_div(bno, mp->m_sb.sb_rextsize);
 			do_div(len, mp->m_sb.sb_rextsize);
-			if ((error = xfs_rtfree_extent(ip->i_transp, bno,
-					(xfs_extlen_t)len)))
+			error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
+			if (error)
 				goto done;
 			do_fx = 0;
 			nblks = len * mp->m_sb.sb_rextsize;
@@ -4662,7 +4665,7 @@ xfs_bmapi(
 				if (!wasdelay && (flags & XFS_BMAPI_PREALLOC))
 					got.br_state = XFS_EXT_UNWRITTEN;
 			}
-			error = xfs_bmap_add_extent(ip, &lastx, &cur, &got,
+			error = xfs_bmap_add_extent(tp, ip, &lastx, &cur, &got,
 				firstblock, flist, &tmp_logflags,
 				whichfork);
 			logflags |= tmp_logflags;
@@ -4763,7 +4766,7 @@ xfs_bmapi(
 			mval->br_state = (mval->br_state == XFS_EXT_UNWRITTEN)
 						? XFS_EXT_NORM
 						: XFS_EXT_UNWRITTEN;
-			error = xfs_bmap_add_extent(ip, &lastx, &cur, mval,
+			error = xfs_bmap_add_extent(tp, ip, &lastx, &cur, mval,
 				firstblock, flist, &tmp_logflags,
 				whichfork);
 			logflags |= tmp_logflags;
@@ -5117,7 +5120,7 @@ xfs_bunmapi(
 				del.br_blockcount = mod;
 			}
 			del.br_state = XFS_EXT_UNWRITTEN;
-			error = xfs_bmap_add_extent(ip, &lastx, &cur, &del,
+			error = xfs_bmap_add_extent(tp, ip, &lastx, &cur, &del,
 				firstblock, flist, &logflags,
 				XFS_DATA_FORK);
 			if (error)
@@ -5175,18 +5178,18 @@ xfs_bunmapi(
 				}
 				prev.br_state = XFS_EXT_UNWRITTEN;
 				lastx--;
-				error = xfs_bmap_add_extent(ip, &lastx, &cur,
-					&prev, firstblock, flist, &logflags,
-					XFS_DATA_FORK);
+				error = xfs_bmap_add_extent(tp, ip, &lastx,
+						&cur, &prev, firstblock, flist,
+						&logflags, XFS_DATA_FORK);
 				if (error)
 					goto error0;
 				goto nodelete;
 			} else {
 				ASSERT(del.br_state == XFS_EXT_NORM);
 				del.br_state = XFS_EXT_UNWRITTEN;
-				error = xfs_bmap_add_extent(ip, &lastx, &cur,
-					&del, firstblock, flist, &logflags,
-					XFS_DATA_FORK);
+				error = xfs_bmap_add_extent(tp, ip, &lastx,
+						&cur, &del, firstblock, flist,
+						&logflags, XFS_DATA_FORK);
 				if (error)
 					goto error0;
 				goto nodelete;
Index: xfs/fs/xfs/xfs_inode.c
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.c	2011-06-29 11:38:24.000000000 +0200
+++ xfs/fs/xfs/xfs_inode.c	2011-06-29 11:39:10.101682115 +0200
@@ -1259,7 +1259,6 @@ xfs_itruncate_extents(
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_IOLOCK_EXCL));
 	ASSERT(new_size == 0 || new_size <= ip->i_size);
 	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
-	ASSERT(ip->i_transp == *tpp);
 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(ip->i_itemp->ili_lock_flags == 0);
 	ASSERT(!XFS_NOT_DQATTACHED(mp, ip));
@@ -1435,7 +1434,6 @@ xfs_iunlink(
 
 	ASSERT(ip->i_d.di_nlink == 0);
 	ASSERT(ip->i_d.di_mode != 0);
-	ASSERT(ip->i_transp == tp);
 
 	mp = tp->t_mountp;
 
@@ -1827,7 +1825,6 @@ xfs_ifree(
 	xfs_buf_t       	*ibp;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
-	ASSERT(ip->i_transp == tp);
 	ASSERT(ip->i_d.di_nlink == 0);
 	ASSERT(ip->i_d.di_nextents == 0);
 	ASSERT(ip->i_d.di_anextents == 0);
Index: xfs/fs/xfs/xfs_inode.h
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.h	2011-06-29 11:35:45.000000000 +0200
+++ xfs/fs/xfs/xfs_inode.h	2011-06-29 11:38:58.105080440 +0200
@@ -241,7 +241,6 @@ typedef struct xfs_inode {
 	xfs_ifork_t		i_df;		/* data fork */
 
 	/* Transaction and locking information. */
-	struct xfs_trans	*i_transp;	/* ptr to owning transaction*/
 	struct xfs_inode_log_item *i_itemp;	/* logging information */
 	mrlock_t		i_lock;		/* inode lock */
 	mrlock_t		i_iolock;	/* inode IO lock */
Index: xfs/fs/xfs/xfs_inode_item.c
===================================================================
--- xfs.orig/fs/xfs/xfs_inode_item.c	2011-06-29 11:26:14.729216003 +0200
+++ xfs/fs/xfs/xfs_inode_item.c	2011-06-29 11:38:58.108413755 +0200
@@ -636,11 +636,6 @@ xfs_inode_item_unlock(
 	ASSERT(xfs_isilocked(iip->ili_inode, XFS_ILOCK_EXCL));
 
 	/*
-	 * Clear the transaction pointer in the inode.
-	 */
-	ip->i_transp = NULL;
-
-	/*
 	 * If the inode needed a separate buffer with which to log
 	 * its extents, then free it now.
 	 */
Index: xfs/fs/xfs/xfs_trans_inode.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_inode.c	2011-06-29 11:26:14.745882578 +0200
+++ xfs/fs/xfs/xfs_trans_inode.c	2011-06-29 11:38:58.108413755 +0200
@@ -55,7 +55,6 @@ xfs_trans_ijoin(
 {
 	xfs_inode_log_item_t	*iip;
 
-	ASSERT(ip->i_transp == NULL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	if (ip->i_itemp == NULL)
 		xfs_inode_item_init(ip, ip->i_mount);
@@ -68,12 +67,6 @@ xfs_trans_ijoin(
 	xfs_trans_add_item(tp, &iip->ili_item);
 
 	xfs_trans_inode_broot_debug(ip);
-
-	/*
-	 * Initialize i_transp so we can find it with xfs_inode_incore()
-	 * in xfs_trans_iget() above.
-	 */
-	ip->i_transp = tp;
 }
 
 /*
@@ -111,7 +104,6 @@ xfs_trans_ichgtime(
 
 	ASSERT(tp);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
-	ASSERT(ip->i_transp == tp);
 
 	tv = current_fs_time(inode->i_sb);
 
@@ -140,7 +132,6 @@ xfs_trans_log_inode(
 	xfs_inode_t	*ip,
 	uint		flags)
 {
-	ASSERT(ip->i_transp == tp);
 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 13/27] xfs: factor out xfs_dir2_leaf_find_entry
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 12/27] xfs: remove i_transp Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  6:11   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 14/27] xfs: cleanup shortform directory inode number handling Christoph Hellwig
                   ` (14 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-factor-dir2-leaf-code --]
[-- Type: text/plain, Size: 10975 bytes --]

Add a new xfs_dir2_leaf_find_entry helper to factor out some duplicate code
from xfs_dir2_leaf_addname and xfs_dir2_leafn_add.  Found by Eric Sandeen
using an automated code duplication checker.
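
The factored-out helper and its identical use at both call sites look
roughly like this (a sketch; the prototype added to xfs_dir2_leaf.h and
the call in xfs_dir2_leaf_addname are in the diff below):

	xfs_dir2_leaf_entry_t *
	xfs_dir2_leaf_find_entry(xfs_dir2_leaf_t *leaf, int index, int compact,
				 int lowstale, int highstale,
				 int *lfloglow, int *lfloghigh);

	/* in xfs_dir2_leaf_addname() and xfs_dir2_leafn_add(): */
	lep = xfs_dir2_leaf_find_entry(leaf, index, compact, lowstale,
				       highstale, &lfloglow, &lfloghigh);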

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_dir2_leaf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_leaf.c	2011-06-22 21:56:26.102462981 +0200
+++ xfs/fs/xfs/xfs_dir2_leaf.c	2011-06-23 12:41:51.716439911 +0200
@@ -152,6 +152,118 @@ xfs_dir2_block_to_leaf(
 	return 0;
 }
 
+xfs_dir2_leaf_entry_t *
+xfs_dir2_leaf_find_entry(
+	xfs_dir2_leaf_t		*leaf,		/* leaf structure */
+	int			index,		/* leaf table position */
+	int			compact,	/* need to compact leaves */
+	int			lowstale,	/* index of prev stale leaf */
+	int			highstale,	/* index of next stale leaf */
+	int			*lfloglow,	/* low leaf logging index */
+	int			*lfloghigh)	/* high leaf logging index */
+{
+	xfs_dir2_leaf_entry_t	*lep;		/* leaf entry table pointer */
+
+	if (!leaf->hdr.stale) {
+		/*
+		 * Now we need to make room to insert the leaf entry.
+		 *
+		 * If there are no stale entries, just insert a hole at index.
+		 */
+		lep = &leaf->ents[index];
+		if (index < be16_to_cpu(leaf->hdr.count))
+			memmove(lep + 1, lep,
+				(be16_to_cpu(leaf->hdr.count) - index) *
+				 sizeof(*lep));
+
+		/*
+		 * Record low and high logging indices for the leaf.
+		 */
+		*lfloglow = index;
+		*lfloghigh = be16_to_cpu(leaf->hdr.count);
+		be16_add_cpu(&leaf->hdr.count, 1);
+	} else {
+		/*
+		 * There are stale entries.
+		 *
+		 * We will use one of them for the new entry.  It's probably
+		 * not at the right location, so we'll have to shift some up
+		 * or down first.
+		 *
+		 * If we didn't compact before, we need to find the nearest
+		 * stale entries before and after our insertion point.
+		 */
+		if (compact == 0) {
+			/*
+			 * Find the first stale entry before the insertion
+			 * point, if any.
+			 */
+			for (lowstale = index - 1;
+			     lowstale >= 0 &&
+				be32_to_cpu(leaf->ents[lowstale].address) !=
+				XFS_DIR2_NULL_DATAPTR;
+			     lowstale--)
+				continue;
+			/*
+			 * Find the next stale entry at or after the insertion
+			 * point, if any.   Stop if we go so far that the
+			 * lowstale entry would be better.
+			 */
+			for (highstale = index;
+			     highstale < be16_to_cpu(leaf->hdr.count) &&
+				be32_to_cpu(leaf->ents[highstale].address) !=
+				XFS_DIR2_NULL_DATAPTR &&
+				(lowstale < 0 ||
+				 index - lowstale - 1 >= highstale - index);
+			     highstale++)
+				continue;
+		}
+		/*
+		 * If the low one is better, use it.
+		 */
+		if (lowstale >= 0 &&
+		    (highstale == be16_to_cpu(leaf->hdr.count) ||
+		     index - lowstale - 1 < highstale - index)) {
+			ASSERT(index - lowstale - 1 >= 0);
+			ASSERT(be32_to_cpu(leaf->ents[lowstale].address) ==
+			       XFS_DIR2_NULL_DATAPTR);
+			/*
+			 * Copy entries up to cover the stale entry
+			 * and make room for the new entry.
+			 */
+			if (index - lowstale - 1 > 0)
+				memmove(&leaf->ents[lowstale],
+					&leaf->ents[lowstale + 1],
+					(index - lowstale - 1) * sizeof(*lep));
+			lep = &leaf->ents[index - 1];
+			*lfloglow = MIN(lowstale, *lfloglow);
+			*lfloghigh = MAX(index - 1, *lfloghigh);
+
+		/*
+		 * The high one is better, so use that one.
+		 */
+		} else {
+			ASSERT(highstale - index >= 0);
+			ASSERT(be32_to_cpu(leaf->ents[highstale].address) ==
+			       XFS_DIR2_NULL_DATAPTR);
+			/*
+			 * Copy entries down to cover the stale entry
+			 * and make room for the new entry.
+			 */
+			if (highstale - index > 0)
+				memmove(&leaf->ents[index + 1],
+					&leaf->ents[index],
+					(highstale - index) * sizeof(*lep));
+			lep = &leaf->ents[index];
+			*lfloglow = MIN(index, *lfloglow);
+			*lfloghigh = MAX(highstale, *lfloghigh);
+		}
+		be16_add_cpu(&leaf->hdr.stale, -1);
+	}
+
+	return lep;
+}
+
 /*
  * Add an entry to a leaf form directory.
  */
@@ -430,102 +542,11 @@ xfs_dir2_leaf_addname(
 		if (!grown)
 			xfs_dir2_leaf_log_bests(tp, lbp, use_block, use_block);
 	}
-	/*
-	 * Now we need to make room to insert the leaf entry.
-	 * If there are no stale entries, we just insert a hole at index.
-	 */
-	if (!leaf->hdr.stale) {
-		/*
-		 * lep is still good as the index leaf entry.
-		 */
-		if (index < be16_to_cpu(leaf->hdr.count))
-			memmove(lep + 1, lep,
-				(be16_to_cpu(leaf->hdr.count) - index) * sizeof(*lep));
-		/*
-		 * Record low and high logging indices for the leaf.
-		 */
-		lfloglow = index;
-		lfloghigh = be16_to_cpu(leaf->hdr.count);
-		be16_add_cpu(&leaf->hdr.count, 1);
-	}
-	/*
-	 * There are stale entries.
-	 * We will use one of them for the new entry.
-	 * It's probably not at the right location, so we'll have to
-	 * shift some up or down first.
-	 */
-	else {
-		/*
-		 * If we didn't compact before, we need to find the nearest
-		 * stale entries before and after our insertion point.
-		 */
-		if (compact == 0) {
-			/*
-			 * Find the first stale entry before the insertion
-			 * point, if any.
-			 */
-			for (lowstale = index - 1;
-			     lowstale >= 0 &&
-				be32_to_cpu(leaf->ents[lowstale].address) !=
-				XFS_DIR2_NULL_DATAPTR;
-			     lowstale--)
-				continue;
-			/*
-			 * Find the next stale entry at or after the insertion
-			 * point, if any.   Stop if we go so far that the
-			 * lowstale entry would be better.
-			 */
-			for (highstale = index;
-			     highstale < be16_to_cpu(leaf->hdr.count) &&
-				be32_to_cpu(leaf->ents[highstale].address) !=
-				XFS_DIR2_NULL_DATAPTR &&
-				(lowstale < 0 ||
-				 index - lowstale - 1 >= highstale - index);
-			     highstale++)
-				continue;
-		}
-		/*
-		 * If the low one is better, use it.
-		 */
-		if (lowstale >= 0 &&
-		    (highstale == be16_to_cpu(leaf->hdr.count) ||
-		     index - lowstale - 1 < highstale - index)) {
-			ASSERT(index - lowstale - 1 >= 0);
-			ASSERT(be32_to_cpu(leaf->ents[lowstale].address) ==
-			       XFS_DIR2_NULL_DATAPTR);
-			/*
-			 * Copy entries up to cover the stale entry
-			 * and make room for the new entry.
-			 */
-			if (index - lowstale - 1 > 0)
-				memmove(&leaf->ents[lowstale],
-					&leaf->ents[lowstale + 1],
-					(index - lowstale - 1) * sizeof(*lep));
-			lep = &leaf->ents[index - 1];
-			lfloglow = MIN(lowstale, lfloglow);
-			lfloghigh = MAX(index - 1, lfloghigh);
-		}
-		/*
-		 * The high one is better, so use that one.
-		 */
-		else {
-			ASSERT(highstale - index >= 0);
-			ASSERT(be32_to_cpu(leaf->ents[highstale].address) ==
-			       XFS_DIR2_NULL_DATAPTR);
-			/*
-			 * Copy entries down to cover the stale entry
-			 * and make room for the new entry.
-			 */
-			if (highstale - index > 0)
-				memmove(&leaf->ents[index + 1],
-					&leaf->ents[index],
-					(highstale - index) * sizeof(*lep));
-			lep = &leaf->ents[index];
-			lfloglow = MIN(index, lfloglow);
-			lfloghigh = MAX(highstale, lfloghigh);
-		}
-		be16_add_cpu(&leaf->hdr.stale, -1);
-	}
+
+
+	lep = xfs_dir2_leaf_find_entry(leaf, index, compact, lowstale,
+				       highstale, &lfloglow, &lfloghigh);
+
 	/*
 	 * Fill in the new leaf entry.
 	 */
Index: xfs/fs/xfs/xfs_dir2_leaf.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_leaf.h	2011-06-22 21:56:26.119129649 +0200
+++ xfs/fs/xfs/xfs_dir2_leaf.h	2011-06-23 12:40:42.889776728 +0200
@@ -248,6 +248,9 @@ extern int xfs_dir2_leaf_search_hash(str
 				     struct xfs_dabuf *lbp);
 extern int xfs_dir2_leaf_trim_data(struct xfs_da_args *args,
 				   struct xfs_dabuf *lbp, xfs_dir2_db_t db);
+extern xfs_dir2_leaf_entry_t *xfs_dir2_leaf_find_entry(xfs_dir2_leaf_t *, int,
+						       int, int, int,
+						       int *, int *);
 extern int xfs_dir2_node_to_leaf(struct xfs_da_state *state);
 
 #endif	/* __XFS_DIR2_LEAF_H__ */
Index: xfs/fs/xfs/xfs_dir2_node.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_node.c	2011-06-22 21:56:26.089129649 +0200
+++ xfs/fs/xfs/xfs_dir2_node.c	2011-06-23 12:40:42.919776727 +0200
@@ -244,89 +244,14 @@ xfs_dir2_leafn_add(
 		lfloglow = be16_to_cpu(leaf->hdr.count);
 		lfloghigh = -1;
 	}
-	/*
-	 * No stale entries, just insert a space for the new entry.
-	 */
-	if (!leaf->hdr.stale) {
-		lep = &leaf->ents[index];
-		if (index < be16_to_cpu(leaf->hdr.count))
-			memmove(lep + 1, lep,
-				(be16_to_cpu(leaf->hdr.count) - index) * sizeof(*lep));
-		lfloglow = index;
-		lfloghigh = be16_to_cpu(leaf->hdr.count);
-		be16_add_cpu(&leaf->hdr.count, 1);
-	}
-	/*
-	 * There are stale entries.  We'll use one for the new entry.
-	 */
-	else {
-		/*
-		 * If we didn't do a compact then we need to figure out
-		 * which stale entry will be used.
-		 */
-		if (compact == 0) {
-			/*
-			 * Find first stale entry before our insertion point.
-			 */
-			for (lowstale = index - 1;
-			     lowstale >= 0 &&
-				be32_to_cpu(leaf->ents[lowstale].address) !=
-				XFS_DIR2_NULL_DATAPTR;
-			     lowstale--)
-				continue;
-			/*
-			 * Find next stale entry after insertion point.
-			 * Stop looking if the answer would be worse than
-			 * lowstale already found.
-			 */
-			for (highstale = index;
-			     highstale < be16_to_cpu(leaf->hdr.count) &&
-				be32_to_cpu(leaf->ents[highstale].address) !=
-				XFS_DIR2_NULL_DATAPTR &&
-				(lowstale < 0 ||
-				 index - lowstale - 1 >= highstale - index);
-			     highstale++)
-				continue;
-		}
-		/*
-		 * Using the low stale entry.
-		 * Shift entries up toward the stale slot.
-		 */
-		if (lowstale >= 0 &&
-		    (highstale == be16_to_cpu(leaf->hdr.count) ||
-		     index - lowstale - 1 < highstale - index)) {
-			ASSERT(be32_to_cpu(leaf->ents[lowstale].address) ==
-			       XFS_DIR2_NULL_DATAPTR);
-			ASSERT(index - lowstale - 1 >= 0);
-			if (index - lowstale - 1 > 0)
-				memmove(&leaf->ents[lowstale],
-					&leaf->ents[lowstale + 1],
-					(index - lowstale - 1) * sizeof(*lep));
-			lep = &leaf->ents[index - 1];
-			lfloglow = MIN(lowstale, lfloglow);
-			lfloghigh = MAX(index - 1, lfloghigh);
-		}
-		/*
-		 * Using the high stale entry.
-		 * Shift entries down toward the stale slot.
-		 */
-		else {
-			ASSERT(be32_to_cpu(leaf->ents[highstale].address) ==
-			       XFS_DIR2_NULL_DATAPTR);
-			ASSERT(highstale - index >= 0);
-			if (highstale - index > 0)
-				memmove(&leaf->ents[index + 1],
-					&leaf->ents[index],
-					(highstale - index) * sizeof(*lep));
-			lep = &leaf->ents[index];
-			lfloglow = MIN(index, lfloglow);
-			lfloghigh = MAX(highstale, lfloghigh);
-		}
-		be16_add_cpu(&leaf->hdr.stale, -1);
-	}
+
+
 	/*
 	 * Insert the new entry, log everything.
 	 */
+	lep = xfs_dir2_leaf_find_entry(leaf, index, compact, lowstale,
+				       highstale, &lfloglow, &lfloghigh);
+
 	lep->hashval = cpu_to_be32(args->hashval);
 	lep->address = cpu_to_be32(xfs_dir2_db_off_to_dataptr(mp,
 				args->blkno, args->index));

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 14/27] xfs: cleanup shortform directory inode number handling
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (11 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 13/27] xfs: factor out xfs_dir2_leaf_find_entry Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  6:35   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 15/27] xfs: kill struct xfs_dir2_sf Christoph Hellwig
                   ` (13 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-dir2_sf-cleanup-inum-handling --]
[-- Type: text/plain, Size: 13518 bytes --]

Refactor the shortform directory helpers that deal with the 32-bit vs
64-bit wide inode numbers into more sensible helpers, and kill the
xfs_intino_t typedef that is now superfluous.

Signed-off-by: Christoph Hellwig <hch@lst.de>
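
To illustrate the helper pattern outside the kernel tree, here is a minimal
standalone sketch.  It is hypothetical code, not the XFS implementation: the
struct, function names and the open-coded big-endian packing are simplified
stand-ins for the real xfs_dir2 types and the XFS_GET_DIR_INO4/8 macros.  The
point is only that a single get/put pair can hide whether 4- or 8-byte inode
numbers are stored by checking the header's i8count field.

#include <stdint.h>
#include <stdio.h>

struct sf_hdr {
	uint8_t		count;		/* number of entries */
	uint8_t		i8count;	/* non-zero: 8-byte inode numbers */
};

/* read an unaligned big-endian inode number, 4 or 8 bytes wide */
static uint64_t sf_get_ino(const struct sf_hdr *hdr, const uint8_t *from)
{
	int i, n = hdr->i8count ? 8 : 4;
	uint64_t ino = 0;

	for (i = 0; i < n; i++)
		ino = (ino << 8) | from[i];
	return ino;
}

/* store an inode number in the same variable-width big-endian format */
static void sf_put_ino(const struct sf_hdr *hdr, uint8_t *to, uint64_t ino)
{
	int i, n = hdr->i8count ? 8 : 4;

	for (i = 0; i < n; i++)
		to[i] = ino >> (8 * (n - 1 - i));
}

int main(void)
{
	struct sf_hdr hdr = { .count = 1, .i8count = 0 };
	uint8_t parent[8];

	sf_put_ino(&hdr, parent, 12345);
	printf("parent ino = %llu\n",
	       (unsigned long long)sf_get_ino(&hdr, parent));
	return 0;
}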

Index: xfs/fs/xfs/xfs_dir2_sf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_sf.c	2010-05-25 11:40:59.357006075 +0200
+++ xfs/fs/xfs/xfs_dir2_sf.c	2010-05-27 14:48:16.709004470 +0200
@@ -59,6 +59,83 @@ static void xfs_dir2_sf_toino4(xfs_da_ar
 static void xfs_dir2_sf_toino8(xfs_da_args_t *args);
 #endif /* XFS_BIG_INUMS */
 
+
+/*
+ * Inode numbers in short-form directories can come in two versions,
+ * either 4 bytes or 8 bytes wide.  These helpers deal with the
+ * two forms transparently by looking at the header's i8count field.
+ */
+
+static xfs_ino_t
+xfs_dir2_sf_get_ino(
+	struct xfs_dir2_sf	*sfp,
+	xfs_dir2_inou_t		*from)
+{
+	if (sfp->hdr.i8count)
+		return XFS_GET_DIR_INO8(from->i8);
+	else
+		return XFS_GET_DIR_INO4(from->i4);
+}
+static void
+xfs_dir2_sf_put_inumber(
+	struct xfs_dir2_sf	*sfp,
+	xfs_dir2_inou_t		*to,
+	xfs_ino_t		ino)
+{
+	if (sfp->hdr.i8count)
+		XFS_PUT_DIR_INO8(ino, to->i8);
+	else
+		XFS_PUT_DIR_INO4(ino, to->i4);
+}
+
+xfs_ino_t
+xfs_dir2_sf_get_parent_ino(
+	struct xfs_dir2_sf	*sfp)
+{
+	return xfs_dir2_sf_get_ino(sfp, &sfp->hdr.parent);
+}
+
+
+static void
+xfs_dir2_sf_put_parent_ino(
+	struct xfs_dir2_sf	*sfp,
+	xfs_ino_t		ino)
+{
+	xfs_dir2_sf_put_inumber(sfp, &sfp->hdr.parent, ino);
+}
+
+
+/*
+ * In short-form directory entries the inode numbers are stored at variable
+ * offset behind the entry name.  The inode numbers may only be accessed
+ * through the helpers below.
+ */
+
+static xfs_dir2_inou_t *
+xfs_dir2_sf_inop(
+	struct xfs_dir2_sf_entry *sfep)
+{
+	return (xfs_dir2_inou_t *)&sfep->name[sfep->namelen];
+}
+
+xfs_ino_t
+xfs_dir2_sfe_get_ino(
+	struct xfs_dir2_sf	*sfp,
+	struct xfs_dir2_sf_entry *sfep)
+{
+	return xfs_dir2_sf_get_ino(sfp, xfs_dir2_sf_inop(sfep));
+}
+
+static void
+xfs_dir2_sfe_put_ino(
+	struct xfs_dir2_sf	*sfp,
+	struct xfs_dir2_sf_entry *sfep,
+	xfs_ino_t		ino)
+{
+	xfs_dir2_sf_put_inumber(sfp, xfs_dir2_sf_inop(sfep), ino);
+}
+
+
 /*
  * Given a block directory (dp/block), calculate its size as a shortform (sf)
  * directory and a header for the sf directory, if it will fit it the
@@ -138,7 +215,7 @@ xfs_dir2_block_sfsize(
 	 */
 	sfhp->count = count;
 	sfhp->i8count = i8count;
-	xfs_dir2_sf_put_inumber((xfs_dir2_sf_t *)sfhp, &parent, &sfhp->parent);
+	xfs_dir2_sf_put_parent_ino((xfs_dir2_sf_t *)sfhp, parent);
 	return size;
 }
 
@@ -165,7 +242,6 @@ xfs_dir2_block_to_sf(
 	char			*ptr;		/* current data pointer */
 	xfs_dir2_sf_entry_t	*sfep;		/* shortform entry */
 	xfs_dir2_sf_t		*sfp;		/* shortform structure */
-	xfs_ino_t               temp;
 
 	trace_xfs_dir2_block_to_sf(args);
 
@@ -233,7 +309,7 @@ xfs_dir2_block_to_sf(
 		else if (dep->namelen == 2 &&
 			 dep->name[0] == '.' && dep->name[1] == '.')
 			ASSERT(be64_to_cpu(dep->inumber) ==
-			       xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent));
+			       xfs_dir2_sf_get_parent_ino(sfp));
 		/*
 		 * Normal entry, copy it into shortform.
 		 */
@@ -243,9 +319,9 @@ xfs_dir2_block_to_sf(
 				(xfs_dir2_data_aoff_t)
 				((char *)dep - (char *)block));
 			memcpy(sfep->name, dep->name, dep->namelen);
-			temp = be64_to_cpu(dep->inumber);
-			xfs_dir2_sf_put_inumber(sfp, &temp,
-				xfs_dir2_sf_inumberp(sfep));
+			xfs_dir2_sfe_put_ino(sfp, sfep,
+					     be64_to_cpu(dep->inumber));
+
 			sfep = xfs_dir2_sf_nextentry(sfp, sfep);
 		}
 		ptr += xfs_dir2_data_entsize(dep->namelen);
@@ -406,8 +482,7 @@ xfs_dir2_sf_addname_easy(
 	sfep->namelen = args->namelen;
 	xfs_dir2_sf_put_offset(sfep, offset);
 	memcpy(sfep->name, args->name, sfep->namelen);
-	xfs_dir2_sf_put_inumber(sfp, &args->inumber,
-		xfs_dir2_sf_inumberp(sfep));
+	xfs_dir2_sfe_put_ino(sfp, sfep, args->inumber);
 	/*
 	 * Update the header and inode.
 	 */
@@ -498,8 +573,7 @@ xfs_dir2_sf_addname_hard(
 	sfep->namelen = args->namelen;
 	xfs_dir2_sf_put_offset(sfep, offset);
 	memcpy(sfep->name, args->name, sfep->namelen);
-	xfs_dir2_sf_put_inumber(sfp, &args->inumber,
-		xfs_dir2_sf_inumberp(sfep));
+	xfs_dir2_sfe_put_ino(sfp, sfep, args->inumber);
 	sfp->hdr.count++;
 #if XFS_BIG_INUMS
 	if (args->inumber > XFS_DIR2_MAX_SHORT_INUM && !objchange)
@@ -618,14 +692,14 @@ xfs_dir2_sf_check(
 
 	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
 	offset = XFS_DIR2_DATA_FIRST_OFFSET;
-	ino = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent);
+	ino = xfs_dir2_sf_get_parent_ino(sfp);
 	i8count = ino > XFS_DIR2_MAX_SHORT_INUM;
 
 	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp);
 	     i < sfp->hdr.count;
 	     i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) {
 		ASSERT(xfs_dir2_sf_get_offset(sfep) >= offset);
-		ino = xfs_dir2_sf_get_inumber(sfp, xfs_dir2_sf_inumberp(sfep));
+		ino = xfs_dir2_sfe_get_ino(sfp, sfep);
 		i8count += ino > XFS_DIR2_MAX_SHORT_INUM;
 		offset =
 			xfs_dir2_sf_get_offset(sfep) +
@@ -686,7 +760,7 @@ xfs_dir2_sf_create(
 	/*
 	 * Now can put in the inode number, since i8count is set.
 	 */
-	xfs_dir2_sf_put_inumber(sfp, &pino, &sfp->hdr.parent);
+	xfs_dir2_sf_put_parent_ino(sfp, pino);
 	sfp->hdr.count = 0;
 	dp->i_d.di_size = size;
 	xfs_dir2_sf_check(args);
@@ -759,7 +833,7 @@ xfs_dir2_sf_getdents(
 	 * Put .. entry unless we're starting past it.
 	 */
 	if (*offset <= dotdot_offset) {
-		ino = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent);
+		ino = xfs_dir2_sf_get_parent_ino(sfp);
 		if (filldir(dirent, "..", 2, dotdot_offset & 0x7fffffff, ino, DT_DIR)) {
 			*offset = dotdot_offset & 0x7fffffff;
 			return 0;
@@ -779,7 +853,7 @@ xfs_dir2_sf_getdents(
 			continue;
 		}
 
-		ino = xfs_dir2_sf_get_inumber(sfp, xfs_dir2_sf_inumberp(sfep));
+		ino = xfs_dir2_sfe_get_ino(sfp, sfep);
 		if (filldir(dirent, (char *)sfep->name, sfep->namelen,
 			    off & 0x7fffffff, ino, DT_UNKNOWN)) {
 			*offset = off & 0x7fffffff;
@@ -839,7 +913,7 @@ xfs_dir2_sf_lookup(
 	 */
 	if (args->namelen == 2 &&
 	    args->name[0] == '.' && args->name[1] == '.') {
-		args->inumber = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent);
+		args->inumber = xfs_dir2_sf_get_parent_ino(sfp);
 		args->cmpresult = XFS_CMP_EXACT;
 		return XFS_ERROR(EEXIST);
 	}
@@ -858,8 +932,7 @@ xfs_dir2_sf_lookup(
 								sfep->namelen);
 		if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
 			args->cmpresult = cmp;
-			args->inumber = xfs_dir2_sf_get_inumber(sfp,
-						xfs_dir2_sf_inumberp(sfep));
+			args->inumber = xfs_dir2_sfe_get_ino(sfp, sfep);
 			if (cmp == XFS_CMP_EXACT)
 				return XFS_ERROR(EEXIST);
 			ci_sfep = sfep;
@@ -918,9 +991,8 @@ xfs_dir2_sf_removename(
 				i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) {
 		if (xfs_da_compname(args, sfep->name, sfep->namelen) ==
 								XFS_CMP_EXACT) {
-			ASSERT(xfs_dir2_sf_get_inumber(sfp,
-						xfs_dir2_sf_inumberp(sfep)) ==
-								args->inumber);
+			ASSERT(xfs_dir2_sfe_get_ino(sfp, sfep) ==
+			       args->inumber);
 			break;
 		}
 	}
@@ -1040,10 +1112,10 @@ xfs_dir2_sf_replace(
 	if (args->namelen == 2 &&
 	    args->name[0] == '.' && args->name[1] == '.') {
 #if XFS_BIG_INUMS || defined(DEBUG)
-		ino = xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent);
+		ino = xfs_dir2_sf_get_parent_ino(sfp);
 		ASSERT(args->inumber != ino);
 #endif
-		xfs_dir2_sf_put_inumber(sfp, &args->inumber, &sfp->hdr.parent);
+		xfs_dir2_sf_put_parent_ino(sfp, args->inumber);
 	}
 	/*
 	 * Normal entry, look for the name.
@@ -1055,12 +1127,10 @@ xfs_dir2_sf_replace(
 			if (xfs_da_compname(args, sfep->name, sfep->namelen) ==
 								XFS_CMP_EXACT) {
 #if XFS_BIG_INUMS || defined(DEBUG)
-				ino = xfs_dir2_sf_get_inumber(sfp,
-					xfs_dir2_sf_inumberp(sfep));
+				ino = xfs_dir2_sfe_get_ino(sfp, sfep);
 				ASSERT(args->inumber != ino);
 #endif
-				xfs_dir2_sf_put_inumber(sfp, &args->inumber,
-					xfs_dir2_sf_inumberp(sfep));
+				xfs_dir2_sfe_put_ino(sfp, sfep, args->inumber);
 				break;
 			}
 		}
@@ -1121,7 +1191,6 @@ xfs_dir2_sf_toino4(
 	char			*buf;		/* old dir's buffer */
 	xfs_inode_t		*dp;		/* incore directory inode */
 	int			i;		/* entry index */
-	xfs_ino_t		ino;		/* entry inode number */
 	int			newsize;	/* new inode size */
 	xfs_dir2_sf_entry_t	*oldsfep;	/* old sf entry */
 	xfs_dir2_sf_t		*oldsfp;	/* old sf directory */
@@ -1162,8 +1231,7 @@ xfs_dir2_sf_toino4(
 	 */
 	sfp->hdr.count = oldsfp->hdr.count;
 	sfp->hdr.i8count = 0;
-	ino = xfs_dir2_sf_get_inumber(oldsfp, &oldsfp->hdr.parent);
-	xfs_dir2_sf_put_inumber(sfp, &ino, &sfp->hdr.parent);
+	xfs_dir2_sf_put_parent_ino(sfp, xfs_dir2_sf_get_parent_ino(oldsfp));
 	/*
 	 * Copy the entries field by field.
 	 */
@@ -1175,9 +1243,8 @@ xfs_dir2_sf_toino4(
 		sfep->namelen = oldsfep->namelen;
 		sfep->offset = oldsfep->offset;
 		memcpy(sfep->name, oldsfep->name, sfep->namelen);
-		ino = xfs_dir2_sf_get_inumber(oldsfp,
-			xfs_dir2_sf_inumberp(oldsfep));
-		xfs_dir2_sf_put_inumber(sfp, &ino, xfs_dir2_sf_inumberp(sfep));
+		xfs_dir2_sfe_put_ino(sfp, sfep,
+			xfs_dir2_sfe_get_ino(oldsfp, oldsfep));
 	}
 	/*
 	 * Clean up the inode.
@@ -1199,7 +1266,6 @@ xfs_dir2_sf_toino8(
 	char			*buf;		/* old dir's buffer */
 	xfs_inode_t		*dp;		/* incore directory inode */
 	int			i;		/* entry index */
-	xfs_ino_t		ino;		/* entry inode number */
 	int			newsize;	/* new inode size */
 	xfs_dir2_sf_entry_t	*oldsfep;	/* old sf entry */
 	xfs_dir2_sf_t		*oldsfp;	/* old sf directory */
@@ -1240,8 +1306,7 @@ xfs_dir2_sf_toino8(
 	 */
 	sfp->hdr.count = oldsfp->hdr.count;
 	sfp->hdr.i8count = 1;
-	ino = xfs_dir2_sf_get_inumber(oldsfp, &oldsfp->hdr.parent);
-	xfs_dir2_sf_put_inumber(sfp, &ino, &sfp->hdr.parent);
+	xfs_dir2_sf_put_parent_ino(sfp, xfs_dir2_sf_get_parent_ino(oldsfp));
 	/*
 	 * Copy the entries field by field.
 	 */
@@ -1253,9 +1318,8 @@ xfs_dir2_sf_toino8(
 		sfep->namelen = oldsfep->namelen;
 		sfep->offset = oldsfep->offset;
 		memcpy(sfep->name, oldsfep->name, sfep->namelen);
-		ino = xfs_dir2_sf_get_inumber(oldsfp,
-			xfs_dir2_sf_inumberp(oldsfep));
-		xfs_dir2_sf_put_inumber(sfp, &ino, xfs_dir2_sf_inumberp(sfep));
+		xfs_dir2_sfe_put_ino(sfp, sfep,
+			xfs_dir2_sfe_get_ino(oldsfp, oldsfep));
 	}
 	/*
 	 * Clean up the inode.
Index: xfs/fs/xfs/xfs_dir2_sf.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_sf.h	2010-05-18 13:04:38.259255428 +0200
+++ xfs/fs/xfs/xfs_dir2_sf.h	2010-05-27 14:48:16.710004889 +0200
@@ -90,28 +90,6 @@ static inline int xfs_dir2_sf_hdr_size(i
 		((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t)));
 }
 
-static inline xfs_dir2_inou_t *xfs_dir2_sf_inumberp(xfs_dir2_sf_entry_t *sfep)
-{
-	return (xfs_dir2_inou_t *)&(sfep)->name[(sfep)->namelen];
-}
-
-static inline xfs_intino_t
-xfs_dir2_sf_get_inumber(xfs_dir2_sf_t *sfp, xfs_dir2_inou_t *from)
-{
-	return ((sfp)->hdr.i8count == 0 ? \
-		(xfs_intino_t)XFS_GET_DIR_INO4((from)->i4) : \
-		(xfs_intino_t)XFS_GET_DIR_INO8((from)->i8));
-}
-
-static inline void xfs_dir2_sf_put_inumber(xfs_dir2_sf_t *sfp, xfs_ino_t *from,
-						xfs_dir2_inou_t *to)
-{
-	if ((sfp)->hdr.i8count == 0)
-		XFS_PUT_DIR_INO4(*(from), (to)->i4);
-	else
-		XFS_PUT_DIR_INO8(*(from), (to)->i8);
-}
-
 static inline xfs_dir2_data_aoff_t
 xfs_dir2_sf_get_offset(xfs_dir2_sf_entry_t *sfep)
 {
@@ -155,6 +133,9 @@ xfs_dir2_sf_nextentry(xfs_dir2_sf_t *sfp
 /*
  * Functions.
  */
+extern xfs_ino_t xfs_dir2_sf_get_parent_ino(struct xfs_dir2_sf *sfp);
+extern xfs_ino_t xfs_dir2_sfe_get_ino(struct xfs_dir2_sf *sfp,
+				      struct xfs_dir2_sf_entry *sfep);
 extern int xfs_dir2_block_sfsize(struct xfs_inode *dp,
 				 struct xfs_dir2_block *block,
 				 xfs_dir2_sf_hdr_t *sfhp);
Index: xfs/fs/xfs/xfs_inum.h
===================================================================
--- xfs.orig/fs/xfs/xfs_inum.h	2010-05-18 13:04:38.266255847 +0200
+++ xfs/fs/xfs/xfs_inum.h	2010-05-27 14:48:16.712037505 +0200
@@ -28,17 +28,6 @@
 
 typedef	__uint32_t	xfs_agino_t;	/* within allocation grp inode number */
 
-/*
- * Useful inode bits for this kernel.
- * Used in some places where having 64-bits in the 32-bit kernels
- * costs too much.
- */
-#if XFS_BIG_INUMS
-typedef	xfs_ino_t	xfs_intino_t;
-#else
-typedef	__uint32_t	xfs_intino_t;
-#endif
-
 #define	NULLFSINO	((xfs_ino_t)-1)
 #define	NULLAGINO	((xfs_agino_t)-1)
 
Index: xfs/fs/xfs/xfs_dir2_block.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_block.c	2010-05-25 11:40:59.336004958 +0200
+++ xfs/fs/xfs/xfs_dir2_block.c	2010-05-27 14:48:16.718003421 +0200
@@ -1146,7 +1146,7 @@ xfs_dir2_sf_to_block(
 	 */
 	dep = (xfs_dir2_data_entry_t *)
 		((char *)block + XFS_DIR2_DATA_DOTDOT_OFFSET);
-	dep->inumber = cpu_to_be64(xfs_dir2_sf_get_inumber(sfp, &sfp->hdr.parent));
+	dep->inumber = cpu_to_be64(xfs_dir2_sf_get_parent_ino(sfp));
 	dep->namelen = 2;
 	dep->name[0] = dep->name[1] = '.';
 	tagp = xfs_dir2_data_entry_tag_p(dep);
@@ -1195,8 +1195,7 @@ xfs_dir2_sf_to_block(
 		 * Copy a real entry.
 		 */
 		dep = (xfs_dir2_data_entry_t *)((char *)block + newoffset);
-		dep->inumber = cpu_to_be64(xfs_dir2_sf_get_inumber(sfp,
-				xfs_dir2_sf_inumberp(sfep)));
+		dep->inumber = cpu_to_be64(xfs_dir2_sfe_get_ino(sfp, sfep));
 		dep->namelen = sfep->namelen;
 		memcpy(dep->name, sfep->name, dep->namelen);
 		tagp = xfs_dir2_data_entry_tag_p(dep);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 15/27] xfs: kill struct xfs_dir2_sf
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (12 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 14/27] xfs: cleanup shortform directory inode number handling Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  7:04   ` Dave Chinner
  2011-06-29 14:01 ` [PATCH 16/27] xfs: cleanup the definition of struct xfs_dir2_sf_entry Christoph Hellwig
                   ` (12 subsequent siblings)
  26 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-kill-xfs_dir2_sf_t --]
[-- Type: text/plain, Size: 29703 bytes --]

Its list field is never actually used, so all uses can simply be replaced
with the xfs_dir2_sf_hdr_t type that it has as its first member.

Signed-off-by: Christoph Hellwig <hch@lst.de>
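
A rough standalone sketch of why the wrapper type is unnecessary (hypothetical
simplified types, not the real kernel structures): since the header sits at the
start of the in-core data and the entries are found by walking past it, callers
can work with the header type alone and never need a list[1] member.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct sf_hdr {				/* stand-in for xfs_dir2_sf_hdr_t */
	uint8_t	count;
	uint8_t	i8count;
	uint8_t	parent[4];
};

struct sf_entry {			/* stand-in for xfs_dir2_sf_entry_t */
	uint8_t	namelen;
	char	name[];
};

/* the first entry simply follows the header - no wrapper struct needed */
static struct sf_entry *sf_firstentry(struct sf_hdr *hdr)
{
	return (struct sf_entry *)((char *)hdr + sizeof(*hdr));
}

int main(void)
{
	/* flat buffer standing in for the inode's literal area (if_data) */
	uint8_t buf[sizeof(struct sf_hdr) + sizeof(struct sf_entry) + 3] = { 0 };
	struct sf_hdr *hdr = (struct sf_hdr *)buf;
	struct sf_entry *sfep;

	hdr->count = 1;
	sfep = sf_firstentry(hdr);
	sfep->namelen = 3;
	memcpy(sfep->name, "foo", 3);

	printf("dir %s, first entry: %.*s\n",
	       hdr->count ? "is not empty" : "is empty",
	       sfep->namelen, sfep->name);
	return 0;
}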

Index: xfs/fs/xfs/xfs_dir2.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2.c	2011-06-29 11:26:13.842554139 +0200
+++ xfs/fs/xfs/xfs_dir2.c	2011-06-29 13:06:35.083267610 +0200
@@ -122,15 +122,15 @@ int
 xfs_dir_isempty(
 	xfs_inode_t	*dp)
 {
-	xfs_dir2_sf_t	*sfp;
+	xfs_dir2_sf_hdr_t	*sfp;
 
 	ASSERT((dp->i_d.di_mode & S_IFMT) == S_IFDIR);
 	if (dp->i_d.di_size == 0)	/* might happen during shutdown. */
 		return 1;
 	if (dp->i_d.di_size > XFS_IFORK_DSIZE(dp))
 		return 0;
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
-	return !sfp->hdr.count;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+	return !sfp->count;
 }
 
 /*
Index: xfs/fs/xfs/xfs_dir2_block.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_block.c	2011-06-29 13:03:56.590792904 +0200
+++ xfs/fs/xfs/xfs_dir2_block.c	2011-06-29 13:06:35.083267610 +0200
@@ -1028,8 +1028,6 @@ xfs_dir2_sf_to_block(
 	xfs_dir2_leaf_entry_t	*blp;		/* block leaf entries */
 	xfs_dabuf_t		*bp;		/* block buffer */
 	xfs_dir2_block_tail_t	*btp;		/* block tail pointer */
-	char			*buf;		/* sf buffer */
-	int			buf_len;
 	xfs_dir2_data_entry_t	*dep;		/* data entry pointer */
 	xfs_inode_t		*dp;		/* incore directory inode */
 	int			dummy;		/* trash */
@@ -1043,7 +1041,8 @@ xfs_dir2_sf_to_block(
 	int			newoffset;	/* offset from current entry */
 	int			offset;		/* target block offset */
 	xfs_dir2_sf_entry_t	*sfep;		/* sf entry pointer */
-	xfs_dir2_sf_t		*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t	*oldsfp;	/* old shortform header  */
+	xfs_dir2_sf_hdr_t	*sfp;		/* shortform header  */
 	__be16			*tagp;		/* end of data entry */
 	xfs_trans_t		*tp;		/* transaction pointer */
 	struct xfs_name		name;
@@ -1061,32 +1060,30 @@ xfs_dir2_sf_to_block(
 		ASSERT(XFS_FORCED_SHUTDOWN(mp));
 		return XFS_ERROR(EIO);
 	}
+
+	oldsfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+
 	ASSERT(dp->i_df.if_bytes == dp->i_d.di_size);
 	ASSERT(dp->i_df.if_u1.if_data != NULL);
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
-	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count));
+	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(oldsfp->i8count));
+
 	/*
-	 * Copy the directory into the stack buffer.
+	 * Copy the directory into a temporary buffer.
 	 * Then pitch the incore inode data so we can make extents.
 	 */
+	sfp = kmem_alloc(dp->i_df.if_bytes, KM_SLEEP);
+	memcpy(sfp, oldsfp, dp->i_df.if_bytes);
 
-	buf_len = dp->i_df.if_bytes;
-	buf = kmem_alloc(buf_len, KM_SLEEP);
-
-	memcpy(buf, sfp, buf_len);
-	xfs_idata_realloc(dp, -buf_len, XFS_DATA_FORK);
+	xfs_idata_realloc(dp, -dp->i_df.if_bytes, XFS_DATA_FORK);
 	dp->i_d.di_size = 0;
 	xfs_trans_log_inode(tp, dp, XFS_ILOG_CORE);
-	/*
-	 * Reset pointer - old sfp is gone.
-	 */
-	sfp = (xfs_dir2_sf_t *)buf;
+
 	/*
 	 * Add block 0 to the inode.
 	 */
 	error = xfs_dir2_grow_inode(args, XFS_DIR2_DATA_SPACE, &blkno);
 	if (error) {
-		kmem_free(buf);
+		kmem_free(sfp);
 		return error;
 	}
 	/*
@@ -1094,7 +1091,7 @@ xfs_dir2_sf_to_block(
 	 */
 	error = xfs_dir2_data_init(args, blkno, &bp);
 	if (error) {
-		kmem_free(buf);
+		kmem_free(sfp);
 		return error;
 	}
 	block = bp->data;
@@ -1103,7 +1100,7 @@ xfs_dir2_sf_to_block(
 	 * Compute size of block "tail" area.
 	 */
 	i = (uint)sizeof(*btp) +
-	    (sfp->hdr.count + 2) * (uint)sizeof(xfs_dir2_leaf_entry_t);
+	    (sfp->count + 2) * (uint)sizeof(xfs_dir2_leaf_entry_t);
 	/*
 	 * The whole thing is initialized to free by the init routine.
 	 * Say we're using the leaf and tail area.
@@ -1117,7 +1114,7 @@ xfs_dir2_sf_to_block(
 	 * Fill in the tail.
 	 */
 	btp = xfs_dir2_block_tail_p(mp, block);
-	btp->count = cpu_to_be32(sfp->hdr.count + 2);	/* ., .. */
+	btp->count = cpu_to_be32(sfp->count + 2);	/* ., .. */
 	btp->stale = 0;
 	blp = xfs_dir2_block_leaf_p(btp);
 	endoffset = (uint)((char *)blp - (char *)block);
@@ -1159,7 +1156,8 @@ xfs_dir2_sf_to_block(
 	/*
 	 * Loop over existing entries, stuff them in.
 	 */
-	if ((i = 0) == sfp->hdr.count)
+	i = 0;
+	if (!sfp->count)
 		sfep = NULL;
 	else
 		sfep = xfs_dir2_sf_firstentry(sfp);
@@ -1208,13 +1206,13 @@ xfs_dir2_sf_to_block(
 		blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
 						 (char *)dep - (char *)block));
 		offset = (int)((char *)(tagp + 1) - (char *)block);
-		if (++i == sfp->hdr.count)
+		if (++i == sfp->count)
 			sfep = NULL;
 		else
 			sfep = xfs_dir2_sf_nextentry(sfp, sfep);
 	}
 	/* Done with the temporary buffer */
-	kmem_free(buf);
+	kmem_free(sfp);
 	/*
 	 * Sort the leaf entries by hash value.
 	 */
Index: xfs/fs/xfs/xfs_dir2_sf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_sf.c	2011-06-29 13:03:56.587459589 +0200
+++ xfs/fs/xfs/xfs_dir2_sf.c	2011-06-29 13:06:35.086600926 +0200
@@ -68,21 +68,21 @@ static void xfs_dir2_sf_toino8(xfs_da_ar
 
 static xfs_ino_t
 xfs_dir2_sf_get_ino(
-	struct xfs_dir2_sf	*sfp,
+	struct xfs_dir2_sf_hdr	*hdr,
 	xfs_dir2_inou_t		*from)
 {
-	if (sfp->hdr.i8count)
+	if (hdr->i8count)
 		return XFS_GET_DIR_INO8(from->i8);
 	else
 		return XFS_GET_DIR_INO4(from->i4);
 }
 static void
 xfs_dir2_sf_put_inumber(
-	struct xfs_dir2_sf	*sfp,
+	struct xfs_dir2_sf_hdr	*hdr,
 	xfs_dir2_inou_t		*to,
 	xfs_ino_t		ino)
 {
-	if (sfp->hdr.i8count)
+	if (hdr->i8count)
 		XFS_PUT_DIR_INO8(ino, to->i8);
 	else
 		XFS_PUT_DIR_INO4(ino, to->i4);
@@ -90,18 +90,18 @@ xfs_dir2_sf_put_inumber(
 
 xfs_ino_t
 xfs_dir2_sf_get_parent_ino(
-	struct xfs_dir2_sf	*sfp)
+	struct xfs_dir2_sf_hdr	*hdr)
 {
-	return xfs_dir2_sf_get_ino(sfp, &sfp->hdr.parent);
+	return xfs_dir2_sf_get_ino(hdr, &hdr->parent);
 }
 
 
 static void
 xfs_dir2_sf_put_parent_ino(
-	struct xfs_dir2_sf	*sfp,
+	struct xfs_dir2_sf_hdr	*hdr,
 	xfs_ino_t		ino)
 {
-	xfs_dir2_sf_put_inumber(sfp, &sfp->hdr.parent, ino);
+	xfs_dir2_sf_put_inumber(hdr, &hdr->parent, ino);
 }
 
 
@@ -120,19 +120,19 @@ xfs_dir2_sf_inop(
 
 xfs_ino_t
 xfs_dir2_sfe_get_ino(
-	struct xfs_dir2_sf	*sfp,
+	struct xfs_dir2_sf_hdr	*hdr,
 	struct xfs_dir2_sf_entry *sfep)
 {
-	return xfs_dir2_sf_get_ino(sfp, xfs_dir2_sf_inop(sfep));
+	return xfs_dir2_sf_get_ino(hdr, xfs_dir2_sf_inop(sfep));
 }
 
 static void
 xfs_dir2_sfe_put_ino(
-	struct xfs_dir2_sf	*sfp,
+	struct xfs_dir2_sf_hdr	*hdr,
 	struct xfs_dir2_sf_entry *sfep,
 	xfs_ino_t		ino)
 {
-	xfs_dir2_sf_put_inumber(sfp, xfs_dir2_sf_inop(sfep), ino);
+	xfs_dir2_sf_put_inumber(hdr, xfs_dir2_sf_inop(sfep), ino);
 }
 
 
@@ -215,7 +215,7 @@ xfs_dir2_block_sfsize(
 	 */
 	sfhp->count = count;
 	sfhp->i8count = i8count;
-	xfs_dir2_sf_put_parent_ino((xfs_dir2_sf_t *)sfhp, parent);
+	xfs_dir2_sf_put_parent_ino(sfhp, parent);
 	return size;
 }
 
@@ -241,7 +241,7 @@ xfs_dir2_block_to_sf(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	char			*ptr;		/* current data pointer */
 	xfs_dir2_sf_entry_t	*sfep;		/* shortform entry */
-	xfs_dir2_sf_t		*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t	*sfp;		/* shortform structure */
 
 	trace_xfs_dir2_block_to_sf(args);
 
@@ -274,7 +274,7 @@ xfs_dir2_block_to_sf(
 	/*
 	 * Copy the header into the newly allocate local space.
 	 */
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 	memcpy(sfp, sfhp, xfs_dir2_sf_hdr_size(sfhp->i8count));
 	dp->i_d.di_size = size;
 	/*
@@ -353,7 +353,7 @@ xfs_dir2_sf_addname(
 	xfs_dir2_data_aoff_t	offset = 0;	/* offset for new entry */
 	int			old_isize;	/* di_size before adding name */
 	int			pick;		/* which algorithm to use */
-	xfs_dir2_sf_t		*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t	*sfp;		/* shortform structure */
 	xfs_dir2_sf_entry_t	*sfep = NULL;	/* shortform entry */
 
 	trace_xfs_dir2_sf_addname(args);
@@ -370,8 +370,8 @@ xfs_dir2_sf_addname(
 	}
 	ASSERT(dp->i_df.if_bytes == dp->i_d.di_size);
 	ASSERT(dp->i_df.if_u1.if_data != NULL);
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
-	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count));
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->i8count));
 	/*
 	 * Compute entry (and change in) size.
 	 */
@@ -382,7 +382,7 @@ xfs_dir2_sf_addname(
 	/*
 	 * Do we have to change to 8 byte inodes?
 	 */
-	if (args->inumber > XFS_DIR2_MAX_SHORT_INUM && sfp->hdr.i8count == 0) {
+	if (args->inumber > XFS_DIR2_MAX_SHORT_INUM && sfp->i8count == 0) {
 		/*
 		 * Yes, adjust the entry size and the total size.
 		 */
@@ -390,7 +390,7 @@ xfs_dir2_sf_addname(
 			(uint)sizeof(xfs_dir2_ino8_t) -
 			(uint)sizeof(xfs_dir2_ino4_t);
 		incr_isize +=
-			(sfp->hdr.count + 2) *
+			(sfp->count + 2) *
 			((uint)sizeof(xfs_dir2_ino8_t) -
 			 (uint)sizeof(xfs_dir2_ino4_t));
 		objchange = 1;
@@ -460,11 +460,11 @@ xfs_dir2_sf_addname_easy(
 {
 	int			byteoff;	/* byte offset in sf dir */
 	xfs_inode_t		*dp;		/* incore directory inode */
-	xfs_dir2_sf_t		*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t	*sfp;		/* shortform structure */
 
 	dp = args->dp;
 
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 	byteoff = (int)((char *)sfep - (char *)sfp);
 	/*
 	 * Grow the in-inode space.
@@ -474,7 +474,7 @@ xfs_dir2_sf_addname_easy(
 	/*
 	 * Need to set up again due to realloc of the inode data.
 	 */
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 	sfep = (xfs_dir2_sf_entry_t *)((char *)sfp + byteoff);
 	/*
 	 * Fill in the new entry.
@@ -486,10 +486,10 @@ xfs_dir2_sf_addname_easy(
 	/*
 	 * Update the header and inode.
 	 */
-	sfp->hdr.count++;
+	sfp->count++;
 #if XFS_BIG_INUMS
 	if (args->inumber > XFS_DIR2_MAX_SHORT_INUM)
-		sfp->hdr.i8count++;
+		sfp->i8count++;
 #endif
 	dp->i_d.di_size = new_isize;
 	xfs_dir2_sf_check(args);
@@ -519,19 +519,19 @@ xfs_dir2_sf_addname_hard(
 	xfs_dir2_data_aoff_t	offset;		/* current offset value */
 	int			old_isize;	/* previous di_size */
 	xfs_dir2_sf_entry_t	*oldsfep;	/* entry in original dir */
-	xfs_dir2_sf_t		*oldsfp;	/* original shortform dir */
+	xfs_dir2_sf_hdr_t	*oldsfp;	/* original shortform dir */
 	xfs_dir2_sf_entry_t	*sfep;		/* entry in new dir */
-	xfs_dir2_sf_t		*sfp;		/* new shortform dir */
+	xfs_dir2_sf_hdr_t	*sfp;		/* new shortform dir */
 
 	/*
 	 * Copy the old directory to the stack buffer.
 	 */
 	dp = args->dp;
 
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 	old_isize = (int)dp->i_d.di_size;
 	buf = kmem_alloc(old_isize, KM_SLEEP);
-	oldsfp = (xfs_dir2_sf_t *)buf;
+	oldsfp = (xfs_dir2_sf_hdr_t *)buf;
 	memcpy(oldsfp, sfp, old_isize);
 	/*
 	 * Loop over the old directory finding the place we're going
@@ -560,7 +560,7 @@ xfs_dir2_sf_addname_hard(
 	/*
 	 * Reset the pointer since the buffer was reallocated.
 	 */
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 	/*
 	 * Copy the first part of the directory, including the header.
 	 */
@@ -574,10 +574,10 @@ xfs_dir2_sf_addname_hard(
 	xfs_dir2_sf_put_offset(sfep, offset);
 	memcpy(sfep->name, args->name, sfep->namelen);
 	xfs_dir2_sfe_put_ino(sfp, sfep, args->inumber);
-	sfp->hdr.count++;
+	sfp->count++;
 #if XFS_BIG_INUMS
 	if (args->inumber > XFS_DIR2_MAX_SHORT_INUM && !objchange)
-		sfp->hdr.i8count++;
+		sfp->i8count++;
 #endif
 	/*
 	 * If there's more left to copy, do that.
@@ -611,14 +611,14 @@ xfs_dir2_sf_addname_pick(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	xfs_dir2_data_aoff_t	offset;		/* data block offset */
 	xfs_dir2_sf_entry_t	*sfep;		/* shortform entry */
-	xfs_dir2_sf_t		*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t	*sfp;		/* shortform structure */
 	int			size;		/* entry's data size */
 	int			used;		/* data bytes used */
 
 	dp = args->dp;
 	mp = dp->i_mount;
 
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 	size = xfs_dir2_data_entsize(args->namelen);
 	offset = XFS_DIR2_DATA_FIRST_OFFSET;
 	sfep = xfs_dir2_sf_firstentry(sfp);
@@ -628,7 +628,7 @@ xfs_dir2_sf_addname_pick(
 	 * Keep track of data offset and whether we've seen a place
 	 * to insert the new entry.
 	 */
-	for (i = 0; i < sfp->hdr.count; i++) {
+	for (i = 0; i < sfp->count; i++) {
 		if (!holefit)
 			holefit = offset + size <= xfs_dir2_sf_get_offset(sfep);
 		offset = xfs_dir2_sf_get_offset(sfep) +
@@ -640,7 +640,7 @@ xfs_dir2_sf_addname_pick(
 	 * was a data block (block form directory).
 	 */
 	used = offset +
-	       (sfp->hdr.count + 3) * (uint)sizeof(xfs_dir2_leaf_entry_t) +
+	       (sfp->count + 3) * (uint)sizeof(xfs_dir2_leaf_entry_t) +
 	       (uint)sizeof(xfs_dir2_block_tail_t);
 	/*
 	 * If it won't fit in a block form then we can't insert it,
@@ -686,17 +686,17 @@ xfs_dir2_sf_check(
 	xfs_ino_t		ino;		/* entry inode number */
 	int			offset;		/* data offset */
 	xfs_dir2_sf_entry_t	*sfep;		/* shortform dir entry */
-	xfs_dir2_sf_t		*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t	*sfp;		/* shortform structure */
 
 	dp = args->dp;
 
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 	offset = XFS_DIR2_DATA_FIRST_OFFSET;
 	ino = xfs_dir2_sf_get_parent_ino(sfp);
 	i8count = ino > XFS_DIR2_MAX_SHORT_INUM;
 
 	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp);
-	     i < sfp->hdr.count;
+	     i < sfp->count;
 	     i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) {
 		ASSERT(xfs_dir2_sf_get_offset(sfep) >= offset);
 		ino = xfs_dir2_sfe_get_ino(sfp, sfep);
@@ -705,11 +705,11 @@ xfs_dir2_sf_check(
 			xfs_dir2_sf_get_offset(sfep) +
 			xfs_dir2_data_entsize(sfep->namelen);
 	}
-	ASSERT(i8count == sfp->hdr.i8count);
+	ASSERT(i8count == sfp->i8count);
 	ASSERT(XFS_BIG_INUMS || i8count == 0);
 	ASSERT((char *)sfep - (char *)sfp == dp->i_d.di_size);
 	ASSERT(offset +
-	       (sfp->hdr.count + 2) * (uint)sizeof(xfs_dir2_leaf_entry_t) +
+	       (sfp->count + 2) * (uint)sizeof(xfs_dir2_leaf_entry_t) +
 	       (uint)sizeof(xfs_dir2_block_tail_t) <=
 	       dp->i_mount->m_dirblksize);
 }
@@ -725,7 +725,7 @@ xfs_dir2_sf_create(
 {
 	xfs_inode_t	*dp;		/* incore directory inode */
 	int		i8count;	/* parent inode is an 8-byte number */
-	xfs_dir2_sf_t	*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t *sfp;		/* shortform structure */
 	int		size;		/* directory size */
 
 	trace_xfs_dir2_sf_create(args);
@@ -755,13 +755,13 @@ xfs_dir2_sf_create(
 	/*
 	 * Fill in the header,
 	 */
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
-	sfp->hdr.i8count = i8count;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+	sfp->i8count = i8count;
 	/*
 	 * Now can put in the inode number, since i8count is set.
 	 */
 	xfs_dir2_sf_put_parent_ino(sfp, pino);
-	sfp->hdr.count = 0;
+	sfp->count = 0;
 	dp->i_d.di_size = size;
 	xfs_dir2_sf_check(args);
 	xfs_trans_log_inode(args->trans, dp, XFS_ILOG_CORE | XFS_ILOG_DDATA);
@@ -779,7 +779,7 @@ xfs_dir2_sf_getdents(
 	xfs_mount_t		*mp;		/* filesystem mount point */
 	xfs_dir2_dataptr_t	off;		/* current entry's offset */
 	xfs_dir2_sf_entry_t	*sfep;		/* shortform directory entry */
-	xfs_dir2_sf_t		*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t	*sfp;		/* shortform structure */
 	xfs_dir2_dataptr_t	dot_offset;
 	xfs_dir2_dataptr_t	dotdot_offset;
 	xfs_ino_t		ino;
@@ -798,9 +798,9 @@ xfs_dir2_sf_getdents(
 	ASSERT(dp->i_df.if_bytes == dp->i_d.di_size);
 	ASSERT(dp->i_df.if_u1.if_data != NULL);
 
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 
-	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count));
+	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->i8count));
 
 	/*
 	 * If the block number in the offset is out of range, we're done.
@@ -844,7 +844,7 @@ xfs_dir2_sf_getdents(
 	 * Loop while there are more entries and put'ing works.
 	 */
 	sfep = xfs_dir2_sf_firstentry(sfp);
-	for (i = 0; i < sfp->hdr.count; i++) {
+	for (i = 0; i < sfp->count; i++) {
 		off = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk,
 				xfs_dir2_sf_get_offset(sfep));
 
@@ -879,7 +879,7 @@ xfs_dir2_sf_lookup(
 	int			i;		/* entry index */
 	int			error;
 	xfs_dir2_sf_entry_t	*sfep;		/* shortform directory entry */
-	xfs_dir2_sf_t		*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t	*sfp;		/* shortform structure */
 	enum xfs_dacmp		cmp;		/* comparison result */
 	xfs_dir2_sf_entry_t	*ci_sfep;	/* case-insens. entry */
 
@@ -898,8 +898,8 @@ xfs_dir2_sf_lookup(
 	}
 	ASSERT(dp->i_df.if_bytes == dp->i_d.di_size);
 	ASSERT(dp->i_df.if_u1.if_data != NULL);
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
-	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count));
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->i8count));
 	/*
 	 * Special case for .
 	 */
@@ -921,7 +921,7 @@ xfs_dir2_sf_lookup(
 	 * Loop over all the entries trying to match ours.
 	 */
 	ci_sfep = NULL;
-	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count;
+	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->count;
 				i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) {
 		/*
 		 * Compare name and if it's an exact match, return the inode
@@ -964,7 +964,7 @@ xfs_dir2_sf_removename(
 	int			newsize;	/* new inode size */
 	int			oldsize;	/* old inode size */
 	xfs_dir2_sf_entry_t	*sfep;		/* shortform directory entry */
-	xfs_dir2_sf_t		*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t	*sfp;		/* shortform structure */
 
 	trace_xfs_dir2_sf_removename(args);
 
@@ -981,13 +981,13 @@ xfs_dir2_sf_removename(
 	}
 	ASSERT(dp->i_df.if_bytes == oldsize);
 	ASSERT(dp->i_df.if_u1.if_data != NULL);
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
-	ASSERT(oldsize >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count));
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+	ASSERT(oldsize >= xfs_dir2_sf_hdr_size(sfp->i8count));
 	/*
 	 * Loop over the old directory entries.
 	 * Find the one we're deleting.
 	 */
-	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->hdr.count;
+	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->count;
 				i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) {
 		if (xfs_da_compname(args, sfep->name, sfep->namelen) ==
 								XFS_CMP_EXACT) {
@@ -999,7 +999,7 @@ xfs_dir2_sf_removename(
 	/*
 	 * Didn't find it.
 	 */
-	if (i == sfp->hdr.count)
+	if (i == sfp->count)
 		return XFS_ERROR(ENOENT);
 	/*
 	 * Calculate sizes.
@@ -1016,22 +1016,22 @@ xfs_dir2_sf_removename(
 	/*
 	 * Fix up the header and file size.
 	 */
-	sfp->hdr.count--;
+	sfp->count--;
 	dp->i_d.di_size = newsize;
 	/*
 	 * Reallocate, making it smaller.
 	 */
 	xfs_idata_realloc(dp, newsize - oldsize, XFS_DATA_FORK);
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 #if XFS_BIG_INUMS
 	/*
 	 * Are we changing inode number size?
 	 */
 	if (args->inumber > XFS_DIR2_MAX_SHORT_INUM) {
-		if (sfp->hdr.i8count == 1)
+		if (sfp->i8count == 1)
 			xfs_dir2_sf_toino4(args);
 		else
-			sfp->hdr.i8count--;
+			sfp->i8count--;
 	}
 #endif
 	xfs_dir2_sf_check(args);
@@ -1055,7 +1055,7 @@ xfs_dir2_sf_replace(
 	int			i8elevated;	/* sf_toino8 set i8count=1 */
 #endif
 	xfs_dir2_sf_entry_t	*sfep;		/* shortform directory entry */
-	xfs_dir2_sf_t		*sfp;		/* shortform structure */
+	xfs_dir2_sf_hdr_t	*sfp;		/* shortform structure */
 
 	trace_xfs_dir2_sf_replace(args);
 
@@ -1071,19 +1071,19 @@ xfs_dir2_sf_replace(
 	}
 	ASSERT(dp->i_df.if_bytes == dp->i_d.di_size);
 	ASSERT(dp->i_df.if_u1.if_data != NULL);
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
-	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->hdr.i8count));
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->i8count));
 #if XFS_BIG_INUMS
 	/*
 	 * New inode number is large, and need to convert to 8-byte inodes.
 	 */
-	if (args->inumber > XFS_DIR2_MAX_SHORT_INUM && sfp->hdr.i8count == 0) {
+	if (args->inumber > XFS_DIR2_MAX_SHORT_INUM && sfp->i8count == 0) {
 		int	error;			/* error return value */
 		int	newsize;		/* new inode size */
 
 		newsize =
 			dp->i_df.if_bytes +
-			(sfp->hdr.count + 1) *
+			(sfp->count + 1) *
 			((uint)sizeof(xfs_dir2_ino8_t) -
 			 (uint)sizeof(xfs_dir2_ino4_t));
 		/*
@@ -1101,7 +1101,7 @@ xfs_dir2_sf_replace(
 		 */
 		xfs_dir2_sf_toino8(args);
 		i8elevated = 1;
-		sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+		sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 	} else
 		i8elevated = 0;
 #endif
@@ -1122,7 +1122,7 @@ xfs_dir2_sf_replace(
 	 */
 	else {
 		for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp);
-				i < sfp->hdr.count;
+				i < sfp->count;
 				i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep)) {
 			if (xfs_da_compname(args, sfep->name, sfep->namelen) ==
 								XFS_CMP_EXACT) {
@@ -1137,7 +1137,7 @@ xfs_dir2_sf_replace(
 		/*
 		 * Didn't find it.
 		 */
-		if (i == sfp->hdr.count) {
+		if (i == sfp->count) {
 			ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
 #if XFS_BIG_INUMS
 			if (i8elevated)
@@ -1155,10 +1155,10 @@ xfs_dir2_sf_replace(
 		/*
 		 * And the old count was one, so need to convert to small.
 		 */
-		if (sfp->hdr.i8count == 1)
+		if (sfp->i8count == 1)
 			xfs_dir2_sf_toino4(args);
 		else
-			sfp->hdr.i8count--;
+			sfp->i8count--;
 	}
 	/*
 	 * See if the old number was small, the new number is large.
@@ -1169,9 +1169,9 @@ xfs_dir2_sf_replace(
 		 * add to the i8count unless we just converted to 8-byte
 		 * inodes (which does an implied i8count = 1)
 		 */
-		ASSERT(sfp->hdr.i8count != 0);
+		ASSERT(sfp->i8count != 0);
 		if (!i8elevated)
-			sfp->hdr.i8count++;
+			sfp->i8count++;
 	}
 #endif
 	xfs_dir2_sf_check(args);
@@ -1193,10 +1193,10 @@ xfs_dir2_sf_toino4(
 	int			i;		/* entry index */
 	int			newsize;	/* new inode size */
 	xfs_dir2_sf_entry_t	*oldsfep;	/* old sf entry */
-	xfs_dir2_sf_t		*oldsfp;	/* old sf directory */
+	xfs_dir2_sf_hdr_t	*oldsfp;	/* old sf directory */
 	int			oldsize;	/* old inode size */
 	xfs_dir2_sf_entry_t	*sfep;		/* new sf entry */
-	xfs_dir2_sf_t		*sfp;		/* new sf directory */
+	xfs_dir2_sf_hdr_t	*sfp;		/* new sf directory */
 
 	trace_xfs_dir2_sf_toino4(args);
 
@@ -1209,35 +1209,35 @@ xfs_dir2_sf_toino4(
 	 */
 	oldsize = dp->i_df.if_bytes;
 	buf = kmem_alloc(oldsize, KM_SLEEP);
-	oldsfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
-	ASSERT(oldsfp->hdr.i8count == 1);
+	oldsfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+	ASSERT(oldsfp->i8count == 1);
 	memcpy(buf, oldsfp, oldsize);
 	/*
 	 * Compute the new inode size.
 	 */
 	newsize =
 		oldsize -
-		(oldsfp->hdr.count + 1) *
+		(oldsfp->count + 1) *
 		((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t));
 	xfs_idata_realloc(dp, -oldsize, XFS_DATA_FORK);
 	xfs_idata_realloc(dp, newsize, XFS_DATA_FORK);
 	/*
 	 * Reset our pointers, the data has moved.
 	 */
-	oldsfp = (xfs_dir2_sf_t *)buf;
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	oldsfp = (xfs_dir2_sf_hdr_t *)buf;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 	/*
 	 * Fill in the new header.
 	 */
-	sfp->hdr.count = oldsfp->hdr.count;
-	sfp->hdr.i8count = 0;
+	sfp->count = oldsfp->count;
+	sfp->i8count = 0;
 	xfs_dir2_sf_put_parent_ino(sfp, xfs_dir2_sf_get_parent_ino(oldsfp));
 	/*
 	 * Copy the entries field by field.
 	 */
 	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp),
 		    oldsfep = xfs_dir2_sf_firstentry(oldsfp);
-	     i < sfp->hdr.count;
+	     i < sfp->count;
 	     i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep),
 		  oldsfep = xfs_dir2_sf_nextentry(oldsfp, oldsfep)) {
 		sfep->namelen = oldsfep->namelen;
@@ -1268,10 +1268,10 @@ xfs_dir2_sf_toino8(
 	int			i;		/* entry index */
 	int			newsize;	/* new inode size */
 	xfs_dir2_sf_entry_t	*oldsfep;	/* old sf entry */
-	xfs_dir2_sf_t		*oldsfp;	/* old sf directory */
+	xfs_dir2_sf_hdr_t	*oldsfp;	/* old sf directory */
 	int			oldsize;	/* old inode size */
 	xfs_dir2_sf_entry_t	*sfep;		/* new sf entry */
-	xfs_dir2_sf_t		*sfp;		/* new sf directory */
+	xfs_dir2_sf_hdr_t	*sfp;		/* new sf directory */
 
 	trace_xfs_dir2_sf_toino8(args);
 
@@ -1284,35 +1284,35 @@ xfs_dir2_sf_toino8(
 	 */
 	oldsize = dp->i_df.if_bytes;
 	buf = kmem_alloc(oldsize, KM_SLEEP);
-	oldsfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
-	ASSERT(oldsfp->hdr.i8count == 0);
+	oldsfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+	ASSERT(oldsfp->i8count == 0);
 	memcpy(buf, oldsfp, oldsize);
 	/*
 	 * Compute the new inode size.
 	 */
 	newsize =
 		oldsize +
-		(oldsfp->hdr.count + 1) *
+		(oldsfp->count + 1) *
 		((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t));
 	xfs_idata_realloc(dp, -oldsize, XFS_DATA_FORK);
 	xfs_idata_realloc(dp, newsize, XFS_DATA_FORK);
 	/*
 	 * Reset our pointers, the data has moved.
 	 */
-	oldsfp = (xfs_dir2_sf_t *)buf;
-	sfp = (xfs_dir2_sf_t *)dp->i_df.if_u1.if_data;
+	oldsfp = (xfs_dir2_sf_hdr_t *)buf;
+	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 	/*
 	 * Fill in the new header.
 	 */
-	sfp->hdr.count = oldsfp->hdr.count;
-	sfp->hdr.i8count = 1;
+	sfp->count = oldsfp->count;
+	sfp->i8count = 1;
 	xfs_dir2_sf_put_parent_ino(sfp, xfs_dir2_sf_get_parent_ino(oldsfp));
 	/*
 	 * Copy the entries field by field.
 	 */
 	for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp),
 		    oldsfep = xfs_dir2_sf_firstentry(oldsfp);
-	     i < sfp->hdr.count;
+	     i < sfp->count;
 	     i++, sfep = xfs_dir2_sf_nextentry(sfp, sfep),
 		  oldsfep = xfs_dir2_sf_nextentry(oldsfp, oldsfep)) {
 		sfep->namelen = oldsfep->namelen;
Index: xfs/fs/xfs/xfs_dir2_sf.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_sf.h	2011-06-29 13:03:56.587459589 +0200
+++ xfs/fs/xfs/xfs_dir2_sf.h	2011-06-29 13:10:20.955377290 +0200
@@ -21,8 +21,12 @@
 /*
  * Directory layout when stored internal to an inode.
  *
- * Small directories are packed as tightly as possible so as to
- * fit into the literal area of the inode.
+ * Small directories are packed as tightly as possible so as to fit into the
+ * literal area of the inode.  They consist of a single xfs_dir2_sf_hdr header
+ * followed by zero or more xfs_dir2_sf_entry structures.  Due to the
+ * different inode number storage sizes and the variable length name field
+ * in the xfs_dir2_sf_entry all these structures are variable length, and the
+ * accessors in this file need to be used to iterate over them.
  */
 
 struct uio;
@@ -61,9 +65,9 @@ typedef struct { __uint8_t i[2]; } __arc
  * The parent directory has a dedicated field, and the self-pointer must
  * be calculated on the fly.
  *
- * Entries are packed toward the top as tightly as possible.  The header
- * and the elements must be memcpy'd out into a work area to get correct
- * alignment for the inode number fields.
+ * Entries are packed toward the top as tightly as possible, and thus may
+ * be misaligned.  Care needs to be taken to access them through special
+ * helpers or copy them into aligned variables first.
  */
 typedef struct xfs_dir2_sf_hdr {
 	__uint8_t		count;		/* count of entries */
@@ -78,11 +82,6 @@ typedef struct xfs_dir2_sf_entry {
 	xfs_dir2_inou_t		inumber;	/* inode number, var. offset */
 } __arch_pack xfs_dir2_sf_entry_t; 
 
-typedef struct xfs_dir2_sf {
-	xfs_dir2_sf_hdr_t	hdr;		/* shortform header */
-	xfs_dir2_sf_entry_t	list[1];	/* shortform entries */
-} xfs_dir2_sf_t;
-
 static inline int xfs_dir2_sf_hdr_size(int i8count)
 {
 	return ((uint)sizeof(xfs_dir2_sf_hdr_t) - \
@@ -102,29 +101,29 @@ xfs_dir2_sf_put_offset(xfs_dir2_sf_entry
 	INT_SET_UNALIGNED_16_BE(&(sfep)->offset.i, off);
 }
 
-static inline int xfs_dir2_sf_entsize_byname(xfs_dir2_sf_t *sfp, int len)
+static inline int xfs_dir2_sf_entsize_byname(xfs_dir2_sf_hdr_t *sfp, int len)
 {
 	return ((uint)sizeof(xfs_dir2_sf_entry_t) - 1 + (len) - \
-		((sfp)->hdr.i8count == 0) * \
+		((sfp)->i8count == 0) * \
 		((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t)));
 }
 
 static inline int
-xfs_dir2_sf_entsize_byentry(xfs_dir2_sf_t *sfp, xfs_dir2_sf_entry_t *sfep)
+xfs_dir2_sf_entsize_byentry(xfs_dir2_sf_hdr_t *sfp, xfs_dir2_sf_entry_t *sfep)
 {
 	return ((uint)sizeof(xfs_dir2_sf_entry_t) - 1 + (sfep)->namelen - \
-		((sfp)->hdr.i8count == 0) * \
+		((sfp)->i8count == 0) * \
 		((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t)));
 }
 
-static inline xfs_dir2_sf_entry_t *xfs_dir2_sf_firstentry(xfs_dir2_sf_t *sfp)
+static inline xfs_dir2_sf_entry_t *xfs_dir2_sf_firstentry(xfs_dir2_sf_hdr_t *sfp)
 {
 	return ((xfs_dir2_sf_entry_t *) \
-		((char *)(sfp) + xfs_dir2_sf_hdr_size(sfp->hdr.i8count)));
+		((char *)(sfp) + xfs_dir2_sf_hdr_size(sfp->i8count)));
 }
 
 static inline xfs_dir2_sf_entry_t *
-xfs_dir2_sf_nextentry(xfs_dir2_sf_t *sfp, xfs_dir2_sf_entry_t *sfep)
+xfs_dir2_sf_nextentry(xfs_dir2_sf_hdr_t *sfp, xfs_dir2_sf_entry_t *sfep)
 {
 	return ((xfs_dir2_sf_entry_t *) \
 		((char *)(sfep) + xfs_dir2_sf_entsize_byentry(sfp,sfep)));
@@ -133,8 +132,8 @@ xfs_dir2_sf_nextentry(xfs_dir2_sf_t *sfp
 /*
  * Functions.
  */
-extern xfs_ino_t xfs_dir2_sf_get_parent_ino(struct xfs_dir2_sf *sfp);
-extern xfs_ino_t xfs_dir2_sfe_get_ino(struct xfs_dir2_sf *sfp,
+extern xfs_ino_t xfs_dir2_sf_get_parent_ino(struct xfs_dir2_sf_hdr *sfp);
+extern xfs_ino_t xfs_dir2_sfe_get_ino(struct xfs_dir2_sf_hdr *sfp,
 				      struct xfs_dir2_sf_entry *sfep);
 extern int xfs_dir2_block_sfsize(struct xfs_inode *dp,
 				 struct xfs_dir2_block *block,

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 16/27] xfs: cleanup the definition of struct xfs_dir2_sf_entry
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (13 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 15/27] xfs: kill struct xfs_dir2_sf Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 17/27] xfs: avoid usage of struct xfs_dir2_block Christoph Hellwig
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-cleanup-xfs_dir2_sf_entry --]
[-- Type: text/plain, Size: 3899 bytes --]

Remove the inumber member, which is at a variable offset after the actual
name, and make name a real variable-sized C99 array instead of the incorrect
one-sized array that confuses (not only) gcc.  Based on this, clean up the
helpers that calculate the entry size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
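
A minimal standalone model of the cleaned-up layout (hypothetical simplified
stand-ins, not the real kernel types): the name becomes a true C99 flexible
array and the inode number lives at a variable offset behind it, so the entry
size is computed from the header's i8count rather than read off a member.

#include <stdint.h>
#include <stdio.h>

struct sf_hdr {
	uint8_t	count;
	uint8_t	i8count;	/* non-zero: 8-byte inode numbers in use */
};

struct sf_entry {
	uint8_t	namelen;	/* actual name length */
	uint8_t	offset[2];	/* saved offset */
	uint8_t	name[];		/* name; a 4- or 8-byte ino follows it */
};

/* size of one entry: fixed head + name + variable-width inode number */
static int sf_entsize(const struct sf_hdr *hdr, int namelen)
{
	return sizeof(struct sf_entry) + namelen + (hdr->i8count ? 8 : 4);
}

int main(void)
{
	struct sf_hdr small = { .i8count = 0 };
	struct sf_hdr large = { .i8count = 1 };

	printf("entry size for a 3-char name: %d (4-byte inos), %d (8-byte inos)\n",
	       sf_entsize(&small, 3), sf_entsize(&large, 3));
	return 0;
}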

Index: xfs/fs/xfs/xfs_dir2_sf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_sf.c	2011-06-29 13:15:18.790430446 +0200
+++ xfs/fs/xfs/xfs_dir2_sf.c	2011-06-29 13:15:26.800387051 +0200
@@ -375,7 +375,7 @@ xfs_dir2_sf_addname(
 	/*
 	 * Compute entry (and change in) size.
 	 */
-	add_entsize = xfs_dir2_sf_entsize_byname(sfp, args->namelen);
+	add_entsize = xfs_dir2_sf_entsize(sfp, args->namelen);
 	incr_isize = add_entsize;
 	objchange = 0;
 #if XFS_BIG_INUMS
@@ -469,7 +469,7 @@ xfs_dir2_sf_addname_easy(
 	/*
 	 * Grow the in-inode space.
 	 */
-	xfs_idata_realloc(dp, xfs_dir2_sf_entsize_byname(sfp, args->namelen),
+	xfs_idata_realloc(dp, xfs_dir2_sf_entsize(sfp, args->namelen),
 		XFS_DATA_FORK);
 	/*
 	 * Need to set up again due to realloc of the inode data.
@@ -1005,7 +1005,7 @@ xfs_dir2_sf_removename(
 	 * Calculate sizes.
 	 */
 	byteoff = (int)((char *)sfep - (char *)sfp);
-	entsize = xfs_dir2_sf_entsize_byname(sfp, args->namelen);
+	entsize = xfs_dir2_sf_entsize(sfp, args->namelen);
 	newsize = oldsize - entsize;
 	/*
 	 * Copy the part if any after the removed entry, sliding it down.
Index: xfs/fs/xfs/xfs_dir2_sf.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_sf.h	2011-06-29 13:15:18.807097021 +0200
+++ xfs/fs/xfs/xfs_dir2_sf.h	2011-06-29 13:17:15.143133442 +0200
@@ -76,10 +76,13 @@ typedef struct xfs_dir2_sf_hdr {
 } __arch_pack xfs_dir2_sf_hdr_t;
 
 typedef struct xfs_dir2_sf_entry {
-	__uint8_t		namelen;	/* actual name length */
+	__u8			namelen;	/* actual name length */
 	xfs_dir2_sf_off_t	offset;		/* saved offset */
-	__uint8_t		name[1];	/* name, variable size */
-	xfs_dir2_inou_t		inumber;	/* inode number, var. offset */
+	__u8			name[];		/* name, variable size */
+	/*
+	 * A xfs_dir2_ino8_t or xfs_dir2_ino4_t follows here, at a
+	 * variable offset after the name.
+	 */
 } __arch_pack xfs_dir2_sf_entry_t; 
 
 static inline int xfs_dir2_sf_hdr_size(int i8count)
@@ -101,32 +104,27 @@ xfs_dir2_sf_put_offset(xfs_dir2_sf_entry
 	INT_SET_UNALIGNED_16_BE(&(sfep)->offset.i, off);
 }
 
-static inline int xfs_dir2_sf_entsize_byname(xfs_dir2_sf_hdr_t *sfp, int len)
-{
-	return ((uint)sizeof(xfs_dir2_sf_entry_t) - 1 + (len) - \
-		((sfp)->i8count == 0) * \
-		((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t)));
-}
-
 static inline int
-xfs_dir2_sf_entsize_byentry(xfs_dir2_sf_hdr_t *sfp, xfs_dir2_sf_entry_t *sfep)
+xfs_dir2_sf_entsize(xfs_dir2_sf_hdr_t *sfp, int len)
 {
-	return ((uint)sizeof(xfs_dir2_sf_entry_t) - 1 + (sfep)->namelen - \
-		((sfp)->i8count == 0) * \
-		((uint)sizeof(xfs_dir2_ino8_t) - (uint)sizeof(xfs_dir2_ino4_t)));
+	return sizeof(xfs_dir2_sf_entry_t) +	/* namelen + offset */
+		len +				/* name */
+		(sfp->i8count ?			/* ino */
+		 sizeof(xfs_dir2_ino8_t) :
+		 sizeof(xfs_dir2_ino4_t));
 }
 
 static inline xfs_dir2_sf_entry_t *xfs_dir2_sf_firstentry(xfs_dir2_sf_hdr_t *sfp)
 {
-	return ((xfs_dir2_sf_entry_t *) \
-		((char *)(sfp) + xfs_dir2_sf_hdr_size(sfp->i8count)));
+	return (xfs_dir2_sf_entry_t *)
+		((char *)sfp + xfs_dir2_sf_hdr_size(sfp->i8count));
 }
 
 static inline xfs_dir2_sf_entry_t *
 xfs_dir2_sf_nextentry(xfs_dir2_sf_hdr_t *sfp, xfs_dir2_sf_entry_t *sfep)
 {
-	return ((xfs_dir2_sf_entry_t *) \
-		((char *)(sfep) + xfs_dir2_sf_entsize_byentry(sfp,sfep)));
+	return (xfs_dir2_sf_entry_t *)
+		((char *)sfep + xfs_dir2_sf_entsize(sfp, sfep->namelen));
 }
 
 /*

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 17/27] xfs: avoid usage of struct xfs_dir2_block
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (14 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 16/27] xfs: cleanup the definition of struct xfs_dir2_sf_entry Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 18/27] xfs: kill " Christoph Hellwig
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-avoid-xfs_dir2_block_t --]
[-- Type: text/plain, Size: 27184 bytes --]

In most places we can simply pass around and use the struct xfs_dir2_data_hdr,
which is the first and most important member of struct xfs_dir2_block, instead
of the full structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
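
A standalone sketch of the general technique, under simplified hypothetical
types (not the XFS structures): because the header is the first member of the
block, a pointer to the block is also a valid pointer to the header, so helpers
that only care about the header can take the smaller type and callers can stop
naming the block type altogether.

#include <stdint.h>
#include <stdio.h>

struct data_hdr {			/* stand-in for xfs_dir2_data_hdr_t */
	uint32_t	magic;
};

struct dir_block {			/* stand-in for the old block struct */
	struct data_hdr	hdr;		/* header is the first member */
	uint8_t		payload[60];	/* entries, leaf, tail, ... */
};

/* works on the header alone - callers no longer need the full block type */
static int hdr_magic_ok(const struct data_hdr *hdr, uint32_t want)
{
	return hdr->magic == want;
}

int main(void)
{
	struct dir_block block = { .hdr = { .magic = 0x58443242 } }; /* "XD2B" */

	/* &block and &block.hdr are the same address, so either view works */
	struct data_hdr *hdr = &block.hdr;

	printf("magic ok: %d\n", hdr_magic_ok(hdr, 0x58443242));
	return 0;
}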

Index: xfs/fs/xfs/xfs_dir2_block.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_block.c	2011-06-29 13:06:35.083267610 +0200
+++ xfs/fs/xfs/xfs_dir2_block.c	2011-06-29 13:17:34.256363230 +0200
@@ -67,7 +67,7 @@ xfs_dir2_block_addname(
 	xfs_da_args_t		*args)		/* directory op arguments */
 {
 	xfs_dir2_data_free_t	*bf;		/* bestfree table in block */
-	xfs_dir2_block_t	*block;		/* directory block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dir2_leaf_entry_t	*blp;		/* block leaf entries */
 	xfs_dabuf_t		*bp;		/* buffer for block */
 	xfs_dir2_block_tail_t	*btp;		/* block tail */
@@ -105,13 +105,13 @@ xfs_dir2_block_addname(
 		return error;
 	}
 	ASSERT(bp != NULL);
-	block = bp->data;
+	hdr = bp->data;
 	/*
 	 * Check the magic number, corrupted if wrong.
 	 */
-	if (unlikely(be32_to_cpu(block->hdr.magic) != XFS_DIR2_BLOCK_MAGIC)) {
+	if (unlikely(hdr->magic != cpu_to_be32(XFS_DIR2_BLOCK_MAGIC))) {
 		XFS_CORRUPTION_ERROR("xfs_dir2_block_addname",
-				     XFS_ERRLEVEL_LOW, mp, block);
+				     XFS_ERRLEVEL_LOW, mp, hdr);
 		xfs_da_brelse(tp, bp);
 		return XFS_ERROR(EFSCORRUPTED);
 	}
@@ -119,8 +119,8 @@ xfs_dir2_block_addname(
 	/*
 	 * Set up pointers to parts of the block.
 	 */
-	bf = block->hdr.bestfree;
-	btp = xfs_dir2_block_tail_p(mp, block);
+	bf = hdr->bestfree;
+	btp = xfs_dir2_block_tail_p(mp, hdr);
 	blp = xfs_dir2_block_leaf_p(btp);
 	/*
 	 * No stale entries?  Need space for entry and new leaf.
@@ -133,7 +133,7 @@ xfs_dir2_block_addname(
 		/*
 		 * Data object just before the first leaf entry.
 		 */
-		enddup = (xfs_dir2_data_unused_t *)((char *)block + be16_to_cpu(*tagp));
+		enddup = (xfs_dir2_data_unused_t *)((char *)hdr + be16_to_cpu(*tagp));
 		/*
 		 * If it's not free then can't do this add without cleaning up:
 		 * the space before the first leaf entry needs to be free so it
@@ -146,7 +146,7 @@ xfs_dir2_block_addname(
 		 */
 		else {
 			dup = (xfs_dir2_data_unused_t *)
-			      ((char *)block + be16_to_cpu(bf[0].offset));
+			      ((char *)hdr + be16_to_cpu(bf[0].offset));
 			if (dup == enddup) {
 				/*
 				 * It is the biggest freespace, is it too small
@@ -159,7 +159,7 @@ xfs_dir2_block_addname(
 					 */
 					if (be16_to_cpu(bf[1].length) >= len)
 						dup = (xfs_dir2_data_unused_t *)
-						      ((char *)block +
+						      ((char *)hdr +
 						       be16_to_cpu(bf[1].offset));
 					else
 						dup = NULL;
@@ -182,7 +182,7 @@ xfs_dir2_block_addname(
 	 */
 	else if (be16_to_cpu(bf[0].length) >= len) {
 		dup = (xfs_dir2_data_unused_t *)
-		      ((char *)block + be16_to_cpu(bf[0].offset));
+		      ((char *)hdr + be16_to_cpu(bf[0].offset));
 		compact = 0;
 	}
 	/*
@@ -196,7 +196,7 @@ xfs_dir2_block_addname(
 		/*
 		 * Data object just before the first leaf entry.
 		 */
-		dup = (xfs_dir2_data_unused_t *)((char *)block + be16_to_cpu(*tagp));
+		dup = (xfs_dir2_data_unused_t *)((char *)hdr + be16_to_cpu(*tagp));
 		/*
 		 * If it's not free then the data will go where the
 		 * leaf data starts now, if it works at all.
@@ -272,7 +272,7 @@ xfs_dir2_block_addname(
 		lfloghigh -= be32_to_cpu(btp->stale) - 1;
 		be32_add_cpu(&btp->count, -(be32_to_cpu(btp->stale) - 1));
 		xfs_dir2_data_make_free(tp, bp,
-			(xfs_dir2_data_aoff_t)((char *)blp - (char *)block),
+			(xfs_dir2_data_aoff_t)((char *)blp - (char *)hdr),
 			(xfs_dir2_data_aoff_t)((be32_to_cpu(btp->stale) - 1) * sizeof(*blp)),
 			&needlog, &needscan);
 		blp += be32_to_cpu(btp->stale) - 1;
@@ -282,7 +282,7 @@ xfs_dir2_block_addname(
 		 * This needs to happen before the next call to use_free.
 		 */
 		if (needscan) {
-			xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)block, &needlog);
+			xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr, &needlog);
 			needscan = 0;
 		}
 	}
@@ -318,7 +318,7 @@ xfs_dir2_block_addname(
 		 */
 		xfs_dir2_data_use_free(tp, bp, enddup,
 			(xfs_dir2_data_aoff_t)
-			((char *)enddup - (char *)block + be16_to_cpu(enddup->length) -
+			((char *)enddup - (char *)hdr + be16_to_cpu(enddup->length) -
 			 sizeof(*blp)),
 			(xfs_dir2_data_aoff_t)sizeof(*blp),
 			&needlog, &needscan);
@@ -331,7 +331,7 @@ xfs_dir2_block_addname(
 		 * This needs to happen before the next call to use_free.
 		 */
 		if (needscan) {
-			xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)block,
+			xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr,
 				&needlog);
 			needscan = 0;
 		}
@@ -397,13 +397,13 @@ xfs_dir2_block_addname(
 	 */
 	blp[mid].hashval = cpu_to_be32(args->hashval);
 	blp[mid].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
-				(char *)dep - (char *)block));
+				(char *)dep - (char *)hdr));
 	xfs_dir2_block_log_leaf(tp, bp, lfloglow, lfloghigh);
 	/*
 	 * Mark space for the data entry used.
 	 */
 	xfs_dir2_data_use_free(tp, bp, dup,
-		(xfs_dir2_data_aoff_t)((char *)dup - (char *)block),
+		(xfs_dir2_data_aoff_t)((char *)dup - (char *)hdr),
 		(xfs_dir2_data_aoff_t)len, &needlog, &needscan);
 	/*
 	 * Create the new data entry.
@@ -412,12 +412,12 @@ xfs_dir2_block_addname(
 	dep->namelen = args->namelen;
 	memcpy(dep->name, args->name, args->namelen);
 	tagp = xfs_dir2_data_entry_tag_p(dep);
-	*tagp = cpu_to_be16((char *)dep - (char *)block);
+	*tagp = cpu_to_be16((char *)dep - (char *)hdr);
 	/*
 	 * Clean up the bestfree array and log the header, tail, and entry.
 	 */
 	if (needscan)
-		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)block, &needlog);
+		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr, &needlog);
 	if (needlog)
 		xfs_dir2_data_log_header(tp, bp);
 	xfs_dir2_block_log_tail(tp, bp);
@@ -438,6 +438,7 @@ xfs_dir2_block_getdents(
 	filldir_t		filldir)
 {
 	xfs_dir2_block_t	*block;		/* directory block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dabuf_t		*bp;		/* buffer for block */
 	xfs_dir2_block_tail_t	*btp;		/* block tail */
 	xfs_dir2_data_entry_t	*dep;		/* block data entry */
@@ -471,11 +472,12 @@ xfs_dir2_block_getdents(
 	 */
 	wantoff = xfs_dir2_dataptr_to_off(mp, *offset);
 	block = bp->data;
+	hdr = &block->hdr;
 	xfs_dir2_data_check(dp, bp);
 	/*
 	 * Set up values for the loop.
 	 */
-	btp = xfs_dir2_block_tail_p(mp, block);
+	btp = xfs_dir2_block_tail_p(mp, hdr);
 	ptr = (char *)block->u;
 	endptr = (char *)xfs_dir2_block_leaf_p(btp);
 
@@ -502,11 +504,11 @@ xfs_dir2_block_getdents(
 		/*
 		 * The entry is before the desired starting point, skip it.
 		 */
-		if ((char *)dep - (char *)block < wantoff)
+		if ((char *)dep - (char *)hdr < wantoff)
 			continue;
 
 		cook = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk,
-					    (char *)dep - (char *)block);
+					    (char *)dep - (char *)hdr);
 
 		/*
 		 * If it didn't fit, set the final offset to here & return.
@@ -540,17 +542,14 @@ xfs_dir2_block_log_leaf(
 	int			first,		/* index of first logged leaf */
 	int			last)		/* index of last logged leaf */
 {
-	xfs_dir2_block_t	*block;		/* directory block structure */
-	xfs_dir2_leaf_entry_t	*blp;		/* block leaf entries */
-	xfs_dir2_block_tail_t	*btp;		/* block tail */
-	xfs_mount_t		*mp;		/* filesystem mount point */
+	xfs_dir2_data_hdr_t	*hdr = bp->data;
+	xfs_dir2_leaf_entry_t	*blp;
+	xfs_dir2_block_tail_t	*btp;
 
-	mp = tp->t_mountp;
-	block = bp->data;
-	btp = xfs_dir2_block_tail_p(mp, block);
+	btp = xfs_dir2_block_tail_p(tp->t_mountp, hdr);
 	blp = xfs_dir2_block_leaf_p(btp);
-	xfs_da_log_buf(tp, bp, (uint)((char *)&blp[first] - (char *)block),
-		(uint)((char *)&blp[last + 1] - (char *)block - 1));
+	xfs_da_log_buf(tp, bp, (uint)((char *)&blp[first] - (char *)hdr),
+		(uint)((char *)&blp[last + 1] - (char *)hdr - 1));
 }
 
 /*
@@ -561,15 +560,12 @@ xfs_dir2_block_log_tail(
 	xfs_trans_t		*tp,		/* transaction structure */
 	xfs_dabuf_t		*bp)		/* block buffer */
 {
-	xfs_dir2_block_t	*block;		/* directory block structure */
-	xfs_dir2_block_tail_t	*btp;		/* block tail */
-	xfs_mount_t		*mp;		/* filesystem mount point */
+	xfs_dir2_data_hdr_t	*hdr = bp->data;
+	xfs_dir2_block_tail_t	*btp;
 
-	mp = tp->t_mountp;
-	block = bp->data;
-	btp = xfs_dir2_block_tail_p(mp, block);
-	xfs_da_log_buf(tp, bp, (uint)((char *)btp - (char *)block),
-		(uint)((char *)(btp + 1) - (char *)block - 1));
+	btp = xfs_dir2_block_tail_p(tp->t_mountp, hdr);
+	xfs_da_log_buf(tp, bp, (uint)((char *)btp - (char *)hdr),
+		(uint)((char *)(btp + 1) - (char *)hdr - 1));
 }
 
 /*
@@ -580,7 +576,7 @@ int						/* error */
 xfs_dir2_block_lookup(
 	xfs_da_args_t		*args)		/* dir lookup arguments */
 {
-	xfs_dir2_block_t	*block;		/* block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dir2_leaf_entry_t	*blp;		/* block leaf entries */
 	xfs_dabuf_t		*bp;		/* block buffer */
 	xfs_dir2_block_tail_t	*btp;		/* block tail */
@@ -600,14 +596,14 @@ xfs_dir2_block_lookup(
 		return error;
 	dp = args->dp;
 	mp = dp->i_mount;
-	block = bp->data;
+	hdr = bp->data;
 	xfs_dir2_data_check(dp, bp);
-	btp = xfs_dir2_block_tail_p(mp, block);
+	btp = xfs_dir2_block_tail_p(mp, hdr);
 	blp = xfs_dir2_block_leaf_p(btp);
 	/*
 	 * Get the offset from the leaf entry, to point to the data.
 	 */
-	dep = (xfs_dir2_data_entry_t *)((char *)block +
+	dep = (xfs_dir2_data_entry_t *)((char *)hdr +
 		xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address)));
 	/*
 	 * Fill in inode number, CI name if appropriate, release the block.
@@ -628,7 +624,7 @@ xfs_dir2_block_lookup_int(
 	int			*entno)		/* returned entry number */
 {
 	xfs_dir2_dataptr_t	addr;		/* data entry address */
-	xfs_dir2_block_t	*block;		/* block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dir2_leaf_entry_t	*blp;		/* block leaf entries */
 	xfs_dabuf_t		*bp;		/* block buffer */
 	xfs_dir2_block_tail_t	*btp;		/* block tail */
@@ -654,9 +650,9 @@ xfs_dir2_block_lookup_int(
 		return error;
 	}
 	ASSERT(bp != NULL);
-	block = bp->data;
+	hdr = bp->data;
 	xfs_dir2_data_check(dp, bp);
-	btp = xfs_dir2_block_tail_p(mp, block);
+	btp = xfs_dir2_block_tail_p(mp, hdr);
 	blp = xfs_dir2_block_leaf_p(btp);
 	/*
 	 * Loop doing a binary search for our hash value.
@@ -694,7 +690,7 @@ xfs_dir2_block_lookup_int(
 		 * Get pointer to the entry from the leaf.
 		 */
 		dep = (xfs_dir2_data_entry_t *)
-			((char *)block + xfs_dir2_dataptr_to_off(mp, addr));
+			((char *)hdr + xfs_dir2_dataptr_to_off(mp, addr));
 		/*
 		 * Compare name and if it's an exact match, return the index
 		 * and buffer. If it's the first case-insensitive match, store
@@ -733,7 +729,7 @@ int						/* error */
 xfs_dir2_block_removename(
 	xfs_da_args_t		*args)		/* directory operation args */
 {
-	xfs_dir2_block_t	*block;		/* block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dir2_leaf_entry_t	*blp;		/* block leaf pointer */
 	xfs_dabuf_t		*bp;		/* block buffer */
 	xfs_dir2_block_tail_t	*btp;		/* block tail */
@@ -760,20 +756,20 @@ xfs_dir2_block_removename(
 	dp = args->dp;
 	tp = args->trans;
 	mp = dp->i_mount;
-	block = bp->data;
-	btp = xfs_dir2_block_tail_p(mp, block);
+	hdr = bp->data;
+	btp = xfs_dir2_block_tail_p(mp, hdr);
 	blp = xfs_dir2_block_leaf_p(btp);
 	/*
 	 * Point to the data entry using the leaf entry.
 	 */
 	dep = (xfs_dir2_data_entry_t *)
-	      ((char *)block + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address)));
+	      ((char *)hdr + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address)));
 	/*
 	 * Mark the data entry's space free.
 	 */
 	needlog = needscan = 0;
 	xfs_dir2_data_make_free(tp, bp,
-		(xfs_dir2_data_aoff_t)((char *)dep - (char *)block),
+		(xfs_dir2_data_aoff_t)((char *)dep - (char *)hdr),
 		xfs_dir2_data_entsize(dep->namelen), &needlog, &needscan);
 	/*
 	 * Fix up the block tail.
@@ -789,15 +785,15 @@ xfs_dir2_block_removename(
 	 * Fix up bestfree, log the header if necessary.
 	 */
 	if (needscan)
-		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)block, &needlog);
+		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr, &needlog);
 	if (needlog)
 		xfs_dir2_data_log_header(tp, bp);
 	xfs_dir2_data_check(dp, bp);
 	/*
 	 * See if the size as a shortform is good enough.
 	 */
-	if ((size = xfs_dir2_block_sfsize(dp, block, &sfh)) >
-	    XFS_IFORK_DSIZE(dp)) {
+	size = xfs_dir2_block_sfsize(dp, hdr, &sfh);
+	if (size > XFS_IFORK_DSIZE(dp)) {
 		xfs_da_buf_done(bp);
 		return 0;
 	}
@@ -815,7 +811,7 @@ int						/* error */
 xfs_dir2_block_replace(
 	xfs_da_args_t		*args)		/* directory operation args */
 {
-	xfs_dir2_block_t	*block;		/* block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dir2_leaf_entry_t	*blp;		/* block leaf entries */
 	xfs_dabuf_t		*bp;		/* block buffer */
 	xfs_dir2_block_tail_t	*btp;		/* block tail */
@@ -836,14 +832,14 @@ xfs_dir2_block_replace(
 	}
 	dp = args->dp;
 	mp = dp->i_mount;
-	block = bp->data;
-	btp = xfs_dir2_block_tail_p(mp, block);
+	hdr = bp->data;
+	btp = xfs_dir2_block_tail_p(mp, hdr);
 	blp = xfs_dir2_block_leaf_p(btp);
 	/*
 	 * Point to the data entry we need to change.
 	 */
 	dep = (xfs_dir2_data_entry_t *)
-	      ((char *)block + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address)));
+	      ((char *)hdr + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(blp[ent].address)));
 	ASSERT(be64_to_cpu(dep->inumber) != args->inumber);
 	/*
 	 * Change the inode number to the new value.
@@ -882,7 +878,7 @@ xfs_dir2_leaf_to_block(
 	xfs_dabuf_t		*dbp)		/* data buffer */
 {
 	__be16			*bestsp;	/* leaf bests table */
-	xfs_dir2_block_t	*block;		/* block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dir2_block_tail_t	*btp;		/* block tail */
 	xfs_inode_t		*dp;		/* incore directory inode */
 	xfs_dir2_data_unused_t	*dup;		/* unused data entry */
@@ -917,7 +913,7 @@ xfs_dir2_leaf_to_block(
 	while (dp->i_d.di_size > mp->m_dirblksize) {
 		bestsp = xfs_dir2_leaf_bests_p(ltp);
 		if (be16_to_cpu(bestsp[be32_to_cpu(ltp->bestcount) - 1]) ==
-		    mp->m_dirblksize - (uint)sizeof(block->hdr)) {
+		    mp->m_dirblksize - (uint)sizeof(*hdr)) {
 			if ((error =
 			    xfs_dir2_leaf_trim_data(args, lbp,
 				    (xfs_dir2_db_t)(be32_to_cpu(ltp->bestcount) - 1))))
@@ -935,18 +931,18 @@ xfs_dir2_leaf_to_block(
 		    XFS_DATA_FORK))) {
 		goto out;
 	}
-	block = dbp->data;
-	ASSERT(be32_to_cpu(block->hdr.magic) == XFS_DIR2_DATA_MAGIC);
+	hdr = dbp->data;
+	ASSERT(be32_to_cpu(hdr->magic) == XFS_DIR2_DATA_MAGIC);
 	/*
 	 * Size of the "leaf" area in the block.
 	 */
-	size = (uint)sizeof(block->tail) +
+	size = (uint)sizeof(xfs_dir2_block_tail_t) +
 	       (uint)sizeof(*lep) * (be16_to_cpu(leaf->hdr.count) - be16_to_cpu(leaf->hdr.stale));
 	/*
 	 * Look at the last data entry.
 	 */
-	tagp = (__be16 *)((char *)block + mp->m_dirblksize) - 1;
-	dup = (xfs_dir2_data_unused_t *)((char *)block + be16_to_cpu(*tagp));
+	tagp = (__be16 *)((char *)hdr + mp->m_dirblksize) - 1;
+	dup = (xfs_dir2_data_unused_t *)((char *)hdr + be16_to_cpu(*tagp));
 	/*
 	 * If it's not free or is too short we can't do it.
 	 */
@@ -958,7 +954,7 @@ xfs_dir2_leaf_to_block(
 	/*
 	 * Start converting it to block form.
 	 */
-	block->hdr.magic = cpu_to_be32(XFS_DIR2_BLOCK_MAGIC);
+	hdr->magic = cpu_to_be32(XFS_DIR2_BLOCK_MAGIC);
 	needlog = 1;
 	needscan = 0;
 	/*
@@ -969,7 +965,7 @@ xfs_dir2_leaf_to_block(
 	/*
 	 * Initialize the block tail.
 	 */
-	btp = xfs_dir2_block_tail_p(mp, block);
+	btp = xfs_dir2_block_tail_p(mp, hdr);
 	btp->count = cpu_to_be32(be16_to_cpu(leaf->hdr.count) - be16_to_cpu(leaf->hdr.stale));
 	btp->stale = 0;
 	xfs_dir2_block_log_tail(tp, dbp);
@@ -988,7 +984,7 @@ xfs_dir2_leaf_to_block(
 	 * Scan the bestfree if we need it and log the data block header.
 	 */
 	if (needscan)
-		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)block, &needlog);
+		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr, &needlog);
 	if (needlog)
 		xfs_dir2_data_log_header(tp, dbp);
 	/*
@@ -1002,8 +998,8 @@ xfs_dir2_leaf_to_block(
 	/*
 	 * Now see if the resulting block can be shrunken to shortform.
 	 */
-	if ((size = xfs_dir2_block_sfsize(dp, block, &sfh)) >
-	    XFS_IFORK_DSIZE(dp)) {
+	size = xfs_dir2_block_sfsize(dp, hdr, &sfh);
+	if (size > XFS_IFORK_DSIZE(dp)) {
 		error = 0;
 		goto out;
 	}
@@ -1025,6 +1021,7 @@ xfs_dir2_sf_to_block(
 {
 	xfs_dir2_db_t		blkno;		/* dir-relative block # (0) */
 	xfs_dir2_block_t	*block;		/* block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dir2_leaf_entry_t	*blp;		/* block leaf entries */
 	xfs_dabuf_t		*bp;		/* block buffer */
 	xfs_dir2_block_tail_t	*btp;		/* block tail pointer */
@@ -1095,7 +1092,8 @@ xfs_dir2_sf_to_block(
 		return error;
 	}
 	block = bp->data;
-	block->hdr.magic = cpu_to_be32(XFS_DIR2_BLOCK_MAGIC);
+	hdr = &block->hdr;
+	hdr->magic = cpu_to_be32(XFS_DIR2_BLOCK_MAGIC);
 	/*
 	 * Compute size of block "tail" area.
 	 */
@@ -1113,45 +1111,45 @@ xfs_dir2_sf_to_block(
 	/*
 	 * Fill in the tail.
 	 */
-	btp = xfs_dir2_block_tail_p(mp, block);
+	btp = xfs_dir2_block_tail_p(mp, hdr);
 	btp->count = cpu_to_be32(sfp->count + 2);	/* ., .. */
 	btp->stale = 0;
 	blp = xfs_dir2_block_leaf_p(btp);
-	endoffset = (uint)((char *)blp - (char *)block);
+	endoffset = (uint)((char *)blp - (char *)hdr);
 	/*
 	 * Remove the freespace, we'll manage it.
 	 */
 	xfs_dir2_data_use_free(tp, bp, dup,
-		(xfs_dir2_data_aoff_t)((char *)dup - (char *)block),
+		(xfs_dir2_data_aoff_t)((char *)dup - (char *)hdr),
 		be16_to_cpu(dup->length), &needlog, &needscan);
 	/*
 	 * Create entry for .
 	 */
 	dep = (xfs_dir2_data_entry_t *)
-	      ((char *)block + XFS_DIR2_DATA_DOT_OFFSET);
+	      ((char *)hdr + XFS_DIR2_DATA_DOT_OFFSET);
 	dep->inumber = cpu_to_be64(dp->i_ino);
 	dep->namelen = 1;
 	dep->name[0] = '.';
 	tagp = xfs_dir2_data_entry_tag_p(dep);
-	*tagp = cpu_to_be16((char *)dep - (char *)block);
+	*tagp = cpu_to_be16((char *)dep - (char *)hdr);
 	xfs_dir2_data_log_entry(tp, bp, dep);
 	blp[0].hashval = cpu_to_be32(xfs_dir_hash_dot);
 	blp[0].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
-				(char *)dep - (char *)block));
+				(char *)dep - (char *)hdr));
 	/*
 	 * Create entry for ..
 	 */
 	dep = (xfs_dir2_data_entry_t *)
-		((char *)block + XFS_DIR2_DATA_DOTDOT_OFFSET);
+		((char *)hdr + XFS_DIR2_DATA_DOTDOT_OFFSET);
 	dep->inumber = cpu_to_be64(xfs_dir2_sf_get_parent_ino(sfp));
 	dep->namelen = 2;
 	dep->name[0] = dep->name[1] = '.';
 	tagp = xfs_dir2_data_entry_tag_p(dep);
-	*tagp = cpu_to_be16((char *)dep - (char *)block);
+	*tagp = cpu_to_be16((char *)dep - (char *)hdr);
 	xfs_dir2_data_log_entry(tp, bp, dep);
 	blp[1].hashval = cpu_to_be32(xfs_dir_hash_dotdot);
 	blp[1].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
-				(char *)dep - (char *)block));
+				(char *)dep - (char *)hdr));
 	offset = XFS_DIR2_DATA_FIRST_OFFSET;
 	/*
 	 * Loop over existing entries, stuff them in.
@@ -1177,14 +1175,13 @@ xfs_dir2_sf_to_block(
 		 * There should be a hole here, make one.
 		 */
 		if (offset < newoffset) {
-			dup = (xfs_dir2_data_unused_t *)
-			      ((char *)block + offset);
+			dup = (xfs_dir2_data_unused_t *)((char *)hdr + offset);
 			dup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG);
 			dup->length = cpu_to_be16(newoffset - offset);
 			*xfs_dir2_data_unused_tag_p(dup) = cpu_to_be16(
-				((char *)dup - (char *)block));
+				((char *)dup - (char *)hdr));
 			xfs_dir2_data_log_unused(tp, bp, dup);
-			(void)xfs_dir2_data_freeinsert((xfs_dir2_data_t *)block,
+			(void)xfs_dir2_data_freeinsert((xfs_dir2_data_t *)hdr,
 				dup, &dummy);
 			offset += be16_to_cpu(dup->length);
 			continue;
@@ -1192,20 +1189,20 @@ xfs_dir2_sf_to_block(
 		/*
 		 * Copy a real entry.
 		 */
-		dep = (xfs_dir2_data_entry_t *)((char *)block + newoffset);
+		dep = (xfs_dir2_data_entry_t *)((char *)hdr + newoffset);
 		dep->inumber = cpu_to_be64(xfs_dir2_sfe_get_ino(sfp, sfep));
 		dep->namelen = sfep->namelen;
 		memcpy(dep->name, sfep->name, dep->namelen);
 		tagp = xfs_dir2_data_entry_tag_p(dep);
-		*tagp = cpu_to_be16((char *)dep - (char *)block);
+		*tagp = cpu_to_be16((char *)dep - (char *)hdr);
 		xfs_dir2_data_log_entry(tp, bp, dep);
 		name.name = sfep->name;
 		name.len = sfep->namelen;
 		blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops->
 							hashname(&name));
 		blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
-						 (char *)dep - (char *)block));
-		offset = (int)((char *)(tagp + 1) - (char *)block);
+						 (char *)dep - (char *)hdr));
+		offset = (int)((char *)(tagp + 1) - (char *)hdr);
 		if (++i == sfp->count)
 			sfep = NULL;
 		else
Index: xfs/fs/xfs/xfs_dir2_data.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_data.c	2011-06-29 11:26:13.699221583 +0200
+++ xfs/fs/xfs/xfs_dir2_data.c	2011-06-29 13:17:34.256363230 +0200
@@ -72,7 +72,7 @@ xfs_dir2_data_check(
 	bf = d->hdr.bestfree;
 	p = (char *)d->u;
 	if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) {
-		btp = xfs_dir2_block_tail_p(mp, (xfs_dir2_block_t *)d);
+		btp = xfs_dir2_block_tail_p(mp, &d->hdr);
 		lep = xfs_dir2_block_leaf_p(btp);
 		endp = (char *)lep;
 	} else
@@ -348,7 +348,7 @@ xfs_dir2_data_freescan(
 	 */
 	p = (char *)d->u;
 	if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) {
-		btp = xfs_dir2_block_tail_p(mp, (xfs_dir2_block_t *)d);
+		btp = xfs_dir2_block_tail_p(mp, &d->hdr);
 		endp = (char *)xfs_dir2_block_leaf_p(btp);
 	} else
 		endp = (char *)d + mp->m_dirblksize;
@@ -537,7 +537,7 @@ xfs_dir2_data_make_free(
 		xfs_dir2_block_tail_t	*btp;	/* block tail */
 
 		ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
-		btp = xfs_dir2_block_tail_p(mp, (xfs_dir2_block_t *)d);
+		btp = xfs_dir2_block_tail_p(mp, &d->hdr);
 		endptr = (char *)xfs_dir2_block_leaf_p(btp);
 	}
 	/*
Index: xfs/fs/xfs/xfs_dir2_leaf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_leaf.c	2011-06-29 13:02:38.617881987 +0200
+++ xfs/fs/xfs/xfs_dir2_leaf.c	2011-06-29 13:17:34.259696546 +0200
@@ -64,7 +64,7 @@ xfs_dir2_block_to_leaf(
 {
 	__be16			*bestsp;	/* leaf's bestsp entries */
 	xfs_dablk_t		blkno;		/* leaf block's bno */
-	xfs_dir2_block_t	*block;		/* block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dir2_leaf_entry_t	*blp;		/* block's leaf entries */
 	xfs_dir2_block_tail_t	*btp;		/* block's tail */
 	xfs_inode_t		*dp;		/* incore directory inode */
@@ -101,9 +101,9 @@ xfs_dir2_block_to_leaf(
 	}
 	ASSERT(lbp != NULL);
 	leaf = lbp->data;
-	block = dbp->data;
+	hdr = dbp->data;
 	xfs_dir2_data_check(dp, dbp);
-	btp = xfs_dir2_block_tail_p(mp, block);
+	btp = xfs_dir2_block_tail_p(mp, hdr);
 	blp = xfs_dir2_block_leaf_p(btp);
 	/*
 	 * Set the counts in the leaf header.
@@ -123,23 +123,23 @@ xfs_dir2_block_to_leaf(
 	 * tail be free.
 	 */
 	xfs_dir2_data_make_free(tp, dbp,
-		(xfs_dir2_data_aoff_t)((char *)blp - (char *)block),
-		(xfs_dir2_data_aoff_t)((char *)block + mp->m_dirblksize -
+		(xfs_dir2_data_aoff_t)((char *)blp - (char *)hdr),
+		(xfs_dir2_data_aoff_t)((char *)hdr + mp->m_dirblksize -
 				       (char *)blp),
 		&needlog, &needscan);
 	/*
 	 * Fix up the block header, make it a data block.
 	 */
-	block->hdr.magic = cpu_to_be32(XFS_DIR2_DATA_MAGIC);
+	hdr->magic = cpu_to_be32(XFS_DIR2_DATA_MAGIC);
 	if (needscan)
-		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)block, &needlog);
+		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr, &needlog);
 	/*
 	 * Set up leaf tail and bests table.
 	 */
 	ltp = xfs_dir2_leaf_tail_p(mp, leaf);
 	ltp->bestcount = cpu_to_be32(1);
 	bestsp = xfs_dir2_leaf_bests_p(ltp);
-	bestsp[0] =  block->hdr.bestfree[0].length;
+	bestsp[0] =  hdr->bestfree[0].length;
 	/*
 	 * Log the data header and leaf bests table.
 	 */
Index: xfs/fs/xfs/xfs_dir2_sf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_sf.c	2011-06-29 13:15:26.800387051 +0200
+++ xfs/fs/xfs/xfs_dir2_sf.c	2011-06-29 13:17:34.263029862 +0200
@@ -145,7 +145,7 @@ xfs_dir2_sfe_put_ino(
 int						/* size for sf form */
 xfs_dir2_block_sfsize(
 	xfs_inode_t		*dp,		/* incore inode pointer */
-	xfs_dir2_block_t	*block,		/* block directory data */
+	xfs_dir2_data_hdr_t	*hdr,		/* block directory data */
 	xfs_dir2_sf_hdr_t	*sfhp)		/* output: header for sf form */
 {
 	xfs_dir2_dataptr_t	addr;		/* data entry address */
@@ -165,7 +165,7 @@ xfs_dir2_block_sfsize(
 	mp = dp->i_mount;
 
 	count = i8count = namelen = 0;
-	btp = xfs_dir2_block_tail_p(mp, block);
+	btp = xfs_dir2_block_tail_p(mp, hdr);
 	blp = xfs_dir2_block_leaf_p(btp);
 
 	/*
@@ -178,7 +178,7 @@ xfs_dir2_block_sfsize(
 		 * Calculate the pointer to the entry at hand.
 		 */
 		dep = (xfs_dir2_data_entry_t *)
-		      ((char *)block + xfs_dir2_dataptr_to_off(mp, addr));
+		      ((char *)hdr + xfs_dir2_dataptr_to_off(mp, addr));
 		/*
 		 * Detect . and .., so we can special-case them.
 		 * . is not included in sf directories.
@@ -259,6 +259,7 @@ xfs_dir2_block_to_sf(
 		ASSERT(error != ENOSPC);
 		goto out;
 	}
+
 	/*
 	 * The buffer is now unconditionally gone, whether
 	 * xfs_dir2_shrink_inode worked or not.
@@ -280,7 +281,7 @@ xfs_dir2_block_to_sf(
 	/*
 	 * Set up to loop over the block's entries.
 	 */
-	btp = xfs_dir2_block_tail_p(mp, block);
+	btp = xfs_dir2_block_tail_p(mp, &block->hdr);
 	ptr = (char *)block->u;
 	endptr = (char *)xfs_dir2_block_leaf_p(btp);
 	sfep = xfs_dir2_sf_firstentry(sfp);
Index: xfs/fs/xfs/xfs_dir2_sf.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_sf.h	2011-06-29 13:17:15.143133442 +0200
+++ xfs/fs/xfs/xfs_dir2_sf.h	2011-06-29 13:17:34.263029862 +0200
@@ -32,7 +32,7 @@
 struct uio;
 struct xfs_dabuf;
 struct xfs_da_args;
-struct xfs_dir2_block;
+struct xfs_dir2_data_hdr;
 struct xfs_inode;
 struct xfs_mount;
 struct xfs_trans;
@@ -134,7 +134,7 @@ extern xfs_ino_t xfs_dir2_sf_get_parent_
 extern xfs_ino_t xfs_dir2_sfe_get_ino(struct xfs_dir2_sf_hdr *sfp,
 				      struct xfs_dir2_sf_entry *sfep);
 extern int xfs_dir2_block_sfsize(struct xfs_inode *dp,
-				 struct xfs_dir2_block *block,
+				 struct xfs_dir2_data_hdr *block,
 				 xfs_dir2_sf_hdr_t *sfhp);
 extern int xfs_dir2_block_to_sf(struct xfs_da_args *args, struct xfs_dabuf *bp,
 				int size, xfs_dir2_sf_hdr_t *sfhp);
Index: xfs/fs/xfs/xfs_dir2_block.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_block.h	2011-06-29 11:26:13.725888103 +0200
+++ xfs/fs/xfs/xfs_dir2_block.h	2011-06-29 13:17:34.266363177 +0200
@@ -61,10 +61,9 @@ typedef struct xfs_dir2_block {
  * Pointer to the leaf header embedded in a data block (1-block format)
  */
 static inline xfs_dir2_block_tail_t *
-xfs_dir2_block_tail_p(struct xfs_mount *mp, xfs_dir2_block_t *block)
+xfs_dir2_block_tail_p(struct xfs_mount *mp, xfs_dir2_data_hdr_t *hdr)
 {
-	return (((xfs_dir2_block_tail_t *)
-		((char *)(block) + (mp)->m_dirblksize)) - 1);
+	return ((xfs_dir2_block_tail_t *)((char *)hdr + mp->m_dirblksize)) - 1;
 }
 
 /*


* [PATCH 18/27] xfs: kill struct xfs_dir2_block
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (15 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 17/27] xfs: avoid usage of struct xfs_dir2_block Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 19/27] xfs: avoid usage of struct xfs_dir2_data Christoph Hellwig
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-kill-xfs_dir2_block_t --]
[-- Type: text/plain, Size: 7148 bytes --]

Remove the confusing xfs_dir2_block structure.  It is supposed to describe
an XFS dir2 block format btree block, but due to the variable-sized nature
of almost all elements in it, it can't actually do anything close to that
job.  In addition to accessing the fixed-offset header structure it was
only used to get a pointer to the first dir or unused entry after it,
which can be trivially replaced by pointer arithmetic on the header
pointer.  For most users that is actually more natural anyway, as they
don't use a typed pointer but rather a character pointer for further
arithmetic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
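
A minimal sketch (not part of the patch) of the replacement pointer
arithmetic: the first data or unused entry simply follows the fixed-size
header, and the walk stops at the leaf table found through the block tail.
bp and mp are assumed to be in scope as in the converted functions:

	xfs_dir2_data_hdr_t	*hdr = bp->data;
	xfs_dir2_block_tail_t	*btp = xfs_dir2_block_tail_p(mp, hdr);
	char			*ptr = (char *)(hdr + 1);	/* was: (char *)block->u */
	char			*endptr = (char *)xfs_dir2_block_leaf_p(btp);

	while (ptr < endptr) {
		xfs_dir2_data_unused_t	*dup = (xfs_dir2_data_unused_t *)ptr;

		if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) {
			ptr += be16_to_cpu(dup->length);
			continue;
		}
		/* otherwise it is a used xfs_dir2_data_entry_t */
		ptr += xfs_dir2_data_entsize(
			((xfs_dir2_data_entry_t *)ptr)->namelen);
	}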

Index: xfs/fs/xfs/xfs_dir2_block.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_block.h	2011-06-29 13:19:30.639066065 +0200
+++ xfs/fs/xfs/xfs_dir2_block.h	2011-06-29 13:22:03.604904045 +0200
@@ -19,10 +19,30 @@
 #define	__XFS_DIR2_BLOCK_H__
 
 /*
- * xfs_dir2_block.h
- * Directory version 2, single block format structures
+ * Directory version 2, single block format structures.
+ *
+ * The single block format looks like the following drawing on disk:
+ *
+ *    +-------------------------------------------------+
+ *    | xfs_dir2_data_hdr_t                             |
+ *    +-------------------------------------------------+
+ *    | xfs_dir2_data_entry_t OR xfs_dir2_data_unused_t |
+ *    | xfs_dir2_data_entry_t OR xfs_dir2_data_unused_t |
+ *    | xfs_dir2_data_entry_t OR xfs_dir2_data_unused_t |
+ *    | ...                                             |
+ *    +-------------------------------------------------+
+ *    | unused space                                    |
+ *    +-------------------------------------------------+
+ *    | ...                                             |
+ *    | xfs_dir2_leaf_entry_t                           |
+ *    | xfs_dir2_leaf_entry_t                           |
+ *    +-------------------------------------------------+
+ *    | xfs_dir2_block_tail_t                           |
+ *    +-------------------------------------------------+
+ *
+ * As all the entries are variable sized structures the accessors in this
+ * file and xfs_dir2_data.h need to be used to iterate over them.
  */
-
 struct uio;
 struct xfs_dabuf;
 struct xfs_da_args;
@@ -32,14 +52,6 @@ struct xfs_inode;
 struct xfs_mount;
 struct xfs_trans;
 
-/*
- * The single block format is as follows:
- * xfs_dir2_data_hdr_t structure
- * xfs_dir2_data_entry_t and xfs_dir2_data_unused_t structures
- * xfs_dir2_leaf_entry_t structures
- * xfs_dir2_block_tail_t structure
- */
-
 #define	XFS_DIR2_BLOCK_MAGIC	0x58443242	/* XD2B: for one block dirs */
 
 typedef struct xfs_dir2_block_tail {
@@ -48,16 +60,6 @@ typedef struct xfs_dir2_block_tail {
 } xfs_dir2_block_tail_t;
 
 /*
- * Generic single-block structure, for xfs_db.
- */
-typedef struct xfs_dir2_block {
-	xfs_dir2_data_hdr_t	hdr;		/* magic XFS_DIR2_BLOCK_MAGIC */
-	xfs_dir2_data_union_t	u[1];
-	xfs_dir2_leaf_entry_t	leaf[1];
-	xfs_dir2_block_tail_t	tail;
-} xfs_dir2_block_t;
-
-/*
  * Pointer to the leaf header embedded in a data block (1-block format)
  */
 static inline xfs_dir2_block_tail_t *
Index: xfs/fs/xfs/xfs_dir2_block.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_block.c	2011-06-29 13:19:30.649066010 +0200
+++ xfs/fs/xfs/xfs_dir2_block.c	2011-06-29 13:19:45.908983340 +0200
@@ -437,7 +437,6 @@ xfs_dir2_block_getdents(
 	xfs_off_t		*offset,
 	filldir_t		filldir)
 {
-	xfs_dir2_block_t	*block;		/* directory block structure */
 	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dabuf_t		*bp;		/* buffer for block */
 	xfs_dir2_block_tail_t	*btp;		/* block tail */
@@ -471,14 +470,13 @@ xfs_dir2_block_getdents(
 	 * We'll skip entries before this.
 	 */
 	wantoff = xfs_dir2_dataptr_to_off(mp, *offset);
-	block = bp->data;
-	hdr = &block->hdr;
+	hdr = bp->data;
 	xfs_dir2_data_check(dp, bp);
 	/*
 	 * Set up values for the loop.
 	 */
 	btp = xfs_dir2_block_tail_p(mp, hdr);
-	ptr = (char *)block->u;
+	ptr = (char *)(hdr + 1);
 	endptr = (char *)xfs_dir2_block_leaf_p(btp);
 
 	/*
@@ -1020,7 +1018,6 @@ xfs_dir2_sf_to_block(
 	xfs_da_args_t		*args)		/* operation arguments */
 {
 	xfs_dir2_db_t		blkno;		/* dir-relative block # (0) */
-	xfs_dir2_block_t	*block;		/* block structure */
 	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dir2_leaf_entry_t	*blp;		/* block leaf entries */
 	xfs_dabuf_t		*bp;		/* block buffer */
@@ -1091,8 +1088,7 @@ xfs_dir2_sf_to_block(
 		kmem_free(sfp);
 		return error;
 	}
-	block = bp->data;
-	hdr = &block->hdr;
+	hdr = bp->data;
 	hdr->magic = cpu_to_be32(XFS_DIR2_BLOCK_MAGIC);
 	/*
 	 * Compute size of block "tail" area.
@@ -1103,7 +1099,7 @@ xfs_dir2_sf_to_block(
 	 * The whole thing is initialized to free by the init routine.
 	 * Say we're using the leaf and tail area.
 	 */
-	dup = (xfs_dir2_data_unused_t *)block->u;
+	dup = (xfs_dir2_data_unused_t *)(hdr + 1);
 	needlog = needscan = 0;
 	xfs_dir2_data_use_free(tp, bp, dup, mp->m_dirblksize - i, i, &needlog,
 		&needscan);
Index: xfs/fs/xfs/xfs_dir2_sf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_sf.c	2011-06-29 13:19:30.662399271 +0200
+++ xfs/fs/xfs/xfs_dir2_sf.c	2011-06-29 13:19:45.912316655 +0200
@@ -230,7 +230,7 @@ xfs_dir2_block_to_sf(
 	int			size,		/* shortform directory size */
 	xfs_dir2_sf_hdr_t	*sfhp)		/* shortform directory hdr */
 {
-	xfs_dir2_block_t	*block;		/* block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* block header */
 	xfs_dir2_block_tail_t	*btp;		/* block tail pointer */
 	xfs_dir2_data_entry_t	*dep;		/* data entry pointer */
 	xfs_inode_t		*dp;		/* incore directory inode */
@@ -252,8 +252,8 @@ xfs_dir2_block_to_sf(
 	 * Make a copy of the block data, so we can shrink the inode
 	 * and add local data.
 	 */
-	block = kmem_alloc(mp->m_dirblksize, KM_SLEEP);
-	memcpy(block, bp->data, mp->m_dirblksize);
+	hdr = kmem_alloc(mp->m_dirblksize, KM_SLEEP);
+	memcpy(hdr, bp->data, mp->m_dirblksize);
 	logflags = XFS_ILOG_CORE;
 	if ((error = xfs_dir2_shrink_inode(args, mp->m_dirdatablk, bp))) {
 		ASSERT(error != ENOSPC);
@@ -281,8 +281,8 @@ xfs_dir2_block_to_sf(
 	/*
 	 * Set up to loop over the block's entries.
 	 */
-	btp = xfs_dir2_block_tail_p(mp, &block->hdr);
-	ptr = (char *)block->u;
+	btp = xfs_dir2_block_tail_p(mp, hdr);
+	ptr = (char *)(hdr + 1);
 	endptr = (char *)xfs_dir2_block_leaf_p(btp);
 	sfep = xfs_dir2_sf_firstentry(sfp);
 	/*
@@ -318,7 +318,7 @@ xfs_dir2_block_to_sf(
 			sfep->namelen = dep->namelen;
 			xfs_dir2_sf_put_offset(sfep,
 				(xfs_dir2_data_aoff_t)
-				((char *)dep - (char *)block));
+				((char *)dep - (char *)hdr));
 			memcpy(sfep->name, dep->name, dep->namelen);
 			xfs_dir2_sfe_put_ino(sfp, sfep,
 					     be64_to_cpu(dep->inumber));
@@ -331,7 +331,7 @@ xfs_dir2_block_to_sf(
 	xfs_dir2_sf_check(args);
 out:
 	xfs_trans_log_inode(args->trans, dp, logflags);
-	kmem_free(block);
+	kmem_free(hdr);
 	return error;
 }
 


* [PATCH 19/27] xfs: avoid usage of struct xfs_dir2_data
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (16 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 18/27] xfs: kill " Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 20/27] xfs: kill " Christoph Hellwig
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-avoid-xfs_dir2_data_t --]
[-- Type: text/plain, Size: 45985 bytes --]

In most places we can simply pass around and use the struct xfs_dir2_data_hdr,
which is the first and most important member of struct xfs_dir2_data, instead
of the full structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
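
For illustration (not part of the patch), the resulting caller pattern; tp,
bp, mp, offset and len stand for the transaction, buffer, mount point and the
freed region already in scope:

	xfs_dir2_data_hdr_t	*hdr = bp->data;
	int			needlog = 0, needscan = 0;

	xfs_dir2_data_make_free(tp, bp, offset, len, &needlog, &needscan);
	if (needscan)
		xfs_dir2_data_freescan(mp, hdr, &needlog);	/* no (xfs_dir2_data_t *) cast needed */
	if (needlog)
		xfs_dir2_data_log_header(tp, bp);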

Index: xfs/fs/xfs/xfs_da_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_da_btree.c	2011-06-27 10:47:35.889487280 +0200
+++ xfs/fs/xfs/xfs_da_btree.c	2011-06-27 10:51:33.638199284 +0200
@@ -2079,16 +2079,13 @@ xfs_da_do_buf(
 	 * For read_buf, check the magic number.
 	 */
 	if (caller == 1) {
-		xfs_dir2_data_t		*data;
-		xfs_dir2_free_t		*free;
-		xfs_da_blkinfo_t	*info;
+		xfs_dir2_data_hdr_t	*hdr = rbp->data;
+		xfs_dir2_free_t		*free = rbp->data;
+		xfs_da_blkinfo_t	*info = rbp->data;
 		uint			magic, magic1;
 
-		info = rbp->data;
-		data = rbp->data;
-		free = rbp->data;
 		magic = be16_to_cpu(info->magic);
-		magic1 = be32_to_cpu(data->hdr.magic);
+		magic1 = be32_to_cpu(hdr->magic);
 		if (unlikely(
 		    XFS_TEST_ERROR((magic != XFS_DA_NODE_MAGIC) &&
 				   (magic != XFS_ATTR_LEAF_MAGIC) &&
Index: xfs/fs/xfs/xfs_dir2_block.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_block.c	2011-06-27 10:49:37.012164015 +0200
+++ xfs/fs/xfs/xfs_dir2_block.c	2011-06-27 10:51:33.641532599 +0200
@@ -282,7 +282,7 @@ xfs_dir2_block_addname(
 		 * This needs to happen before the next call to use_free.
 		 */
 		if (needscan) {
-			xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr, &needlog);
+			xfs_dir2_data_freescan(mp, hdr, &needlog);
 			needscan = 0;
 		}
 	}
@@ -331,8 +331,7 @@ xfs_dir2_block_addname(
 		 * This needs to happen before the next call to use_free.
 		 */
 		if (needscan) {
-			xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr,
-				&needlog);
+			xfs_dir2_data_freescan(mp, hdr, &needlog);
 			needscan = 0;
 		}
 		/*
@@ -417,7 +416,7 @@ xfs_dir2_block_addname(
 	 * Clean up the bestfree array and log the header, tail, and entry.
 	 */
 	if (needscan)
-		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr, &needlog);
+		xfs_dir2_data_freescan(mp, hdr, &needlog);
 	if (needlog)
 		xfs_dir2_data_log_header(tp, bp);
 	xfs_dir2_block_log_tail(tp, bp);
@@ -783,7 +782,7 @@ xfs_dir2_block_removename(
 	 * Fix up bestfree, log the header if necessary.
 	 */
 	if (needscan)
-		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr, &needlog);
+		xfs_dir2_data_freescan(mp, hdr, &needlog);
 	if (needlog)
 		xfs_dir2_data_log_header(tp, bp);
 	xfs_dir2_data_check(dp, bp);
@@ -982,7 +981,7 @@ xfs_dir2_leaf_to_block(
 	 * Scan the bestfree if we need it and log the data block header.
 	 */
 	if (needscan)
-		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr, &needlog);
+		xfs_dir2_data_freescan(mp, hdr, &needlog);
 	if (needlog)
 		xfs_dir2_data_log_header(tp, dbp);
 	/*
@@ -1177,8 +1176,7 @@ xfs_dir2_sf_to_block(
 			*xfs_dir2_data_unused_tag_p(dup) = cpu_to_be16(
 				((char *)dup - (char *)hdr));
 			xfs_dir2_data_log_unused(tp, bp, dup);
-			(void)xfs_dir2_data_freeinsert((xfs_dir2_data_t *)hdr,
-				dup, &dummy);
+			(void)xfs_dir2_data_freeinsert(hdr, dup, &dummy);
 			offset += be16_to_cpu(dup->length);
 			continue;
 		}
Index: xfs/fs/xfs/xfs_dir2_data.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_data.c	2011-06-27 10:47:35.909487170 +0200
+++ xfs/fs/xfs/xfs_dir2_data.c	2011-06-27 10:51:33.644865914 +0200
@@ -35,6 +35,9 @@
 #include "xfs_dir2_block.h"
 #include "xfs_error.h"
 
+STATIC xfs_dir2_data_free_t *
+xfs_dir2_data_freefind(xfs_dir2_data_hdr_t *hdr, xfs_dir2_data_unused_t *dup);
+
 #ifdef DEBUG
 /*
  * Check the consistency of the data block.
@@ -51,6 +54,7 @@ xfs_dir2_data_check(
 	xfs_dir2_block_tail_t	*btp=NULL;	/* block tail */
 	int			count;		/* count of entries found */
 	xfs_dir2_data_t		*d;		/* data block pointer */
+	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_dir2_data_entry_t	*dep;		/* data entry */
 	xfs_dir2_data_free_t	*dfp;		/* bestfree entry */
 	xfs_dir2_data_unused_t	*dup;		/* unused entry */
@@ -67,16 +71,19 @@ xfs_dir2_data_check(
 
 	mp = dp->i_mount;
 	d = bp->data;
-	ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC ||
-	       be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
-	bf = d->hdr.bestfree;
+	hdr = &d->hdr;
+	bf = hdr->bestfree;
 	p = (char *)d->u;
-	if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) {
-		btp = xfs_dir2_block_tail_p(mp, &d->hdr);
+
+	if (hdr->magic == cpu_to_be32(XFS_DIR2_BLOCK_MAGIC)) {
+		btp = xfs_dir2_block_tail_p(mp, hdr);
 		lep = xfs_dir2_block_leaf_p(btp);
 		endp = (char *)lep;
-	} else
-		endp = (char *)d + mp->m_dirblksize;
+	} else {
+		ASSERT(hdr->magic == cpu_to_be32(XFS_DIR2_DATA_MAGIC));
+		endp = (char *)hdr + mp->m_dirblksize;
+	}
+
 	count = lastfree = freeseen = 0;
 	/*
 	 * Account for zero bestfree entries.
@@ -108,8 +115,8 @@ xfs_dir2_data_check(
 		if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) {
 			ASSERT(lastfree == 0);
 			ASSERT(be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)) ==
-			       (char *)dup - (char *)d);
-			dfp = xfs_dir2_data_freefind(d, dup);
+			       (char *)dup - (char *)hdr);
+			dfp = xfs_dir2_data_freefind(hdr, dup);
 			if (dfp) {
 				i = (int)(dfp - bf);
 				ASSERT((freeseen & (1 << i)) == 0);
@@ -132,13 +139,13 @@ xfs_dir2_data_check(
 		ASSERT(dep->namelen != 0);
 		ASSERT(xfs_dir_ino_validate(mp, be64_to_cpu(dep->inumber)) == 0);
 		ASSERT(be16_to_cpu(*xfs_dir2_data_entry_tag_p(dep)) ==
-		       (char *)dep - (char *)d);
+		       (char *)dep - (char *)hdr);
 		count++;
 		lastfree = 0;
-		if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) {
+		if (hdr->magic == cpu_to_be32(XFS_DIR2_BLOCK_MAGIC)) {
 			addr = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk,
 				(xfs_dir2_data_aoff_t)
-				((char *)dep - (char *)d));
+				((char *)dep - (char *)hdr));
 			name.name = dep->name;
 			name.len = dep->namelen;
 			hash = mp->m_dirnameops->hashname(&name);
@@ -155,7 +162,7 @@ xfs_dir2_data_check(
 	 * Need to have seen all the entries and all the bestfree slots.
 	 */
 	ASSERT(freeseen == 7);
-	if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) {
+	if (hdr->magic == cpu_to_be32(XFS_DIR2_BLOCK_MAGIC)) {
 		for (i = stale = 0; i < be32_to_cpu(btp->count); i++) {
 			if (be32_to_cpu(lep[i].address) == XFS_DIR2_NULL_DATAPTR)
 				stale++;
@@ -172,9 +179,9 @@ xfs_dir2_data_check(
  * Given a data block and an unused entry from that block,
  * return the bestfree entry if any that corresponds to it.
  */
-xfs_dir2_data_free_t *
+STATIC xfs_dir2_data_free_t *
 xfs_dir2_data_freefind(
-	xfs_dir2_data_t		*d,		/* data block */
+	xfs_dir2_data_hdr_t	*hdr,		/* data block */
 	xfs_dir2_data_unused_t	*dup)		/* data unused entry */
 {
 	xfs_dir2_data_free_t	*dfp;		/* bestfree entry */
@@ -184,17 +191,17 @@ xfs_dir2_data_freefind(
 	int			seenzero;	/* saw a 0 bestfree entry */
 #endif
 
-	off = (xfs_dir2_data_aoff_t)((char *)dup - (char *)d);
+	off = (xfs_dir2_data_aoff_t)((char *)dup - (char *)hdr);
 #if defined(DEBUG) && defined(__KERNEL__)
 	/*
 	 * Validate some consistency in the bestfree table.
 	 * Check order, non-overlapping entries, and if we find the
 	 * one we're looking for it has to be exact.
 	 */
-	ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC ||
-	       be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
-	for (dfp = &d->hdr.bestfree[0], seenzero = matched = 0;
-	     dfp < &d->hdr.bestfree[XFS_DIR2_DATA_FD_COUNT];
+	ASSERT(be32_to_cpu(hdr->magic) == XFS_DIR2_DATA_MAGIC ||
+	       be32_to_cpu(hdr->magic) == XFS_DIR2_BLOCK_MAGIC);
+	for (dfp = &hdr->bestfree[0], seenzero = matched = 0;
+	     dfp < &hdr->bestfree[XFS_DIR2_DATA_FD_COUNT];
 	     dfp++) {
 		if (!dfp->offset) {
 			ASSERT(!dfp->length);
@@ -210,7 +217,7 @@ xfs_dir2_data_freefind(
 		else
 			ASSERT(be16_to_cpu(dfp->offset) + be16_to_cpu(dfp->length) <= off);
 		ASSERT(matched || be16_to_cpu(dfp->length) >= be16_to_cpu(dup->length));
-		if (dfp > &d->hdr.bestfree[0])
+		if (dfp > &hdr->bestfree[0])
 			ASSERT(be16_to_cpu(dfp[-1].length) >= be16_to_cpu(dfp[0].length));
 	}
 #endif
@@ -219,13 +226,13 @@ xfs_dir2_data_freefind(
 	 * it can't be there since they're sorted.
 	 */
 	if (be16_to_cpu(dup->length) <
-	    be16_to_cpu(d->hdr.bestfree[XFS_DIR2_DATA_FD_COUNT - 1].length))
+	    be16_to_cpu(hdr->bestfree[XFS_DIR2_DATA_FD_COUNT - 1].length))
 		return NULL;
 	/*
 	 * Look at the three bestfree entries for our guy.
 	 */
-	for (dfp = &d->hdr.bestfree[0];
-	     dfp < &d->hdr.bestfree[XFS_DIR2_DATA_FD_COUNT];
+	for (dfp = &hdr->bestfree[0];
+	     dfp < &hdr->bestfree[XFS_DIR2_DATA_FD_COUNT];
 	     dfp++) {
 		if (!dfp->offset)
 			return NULL;
@@ -243,7 +250,7 @@ xfs_dir2_data_freefind(
  */
 xfs_dir2_data_free_t *				/* entry inserted */
 xfs_dir2_data_freeinsert(
-	xfs_dir2_data_t		*d,		/* data block pointer */
+	xfs_dir2_data_hdr_t	*hdr,		/* data block pointer */
 	xfs_dir2_data_unused_t	*dup,		/* unused space */
 	int			*loghead)	/* log the data header (out) */
 {
@@ -251,12 +258,13 @@ xfs_dir2_data_freeinsert(
 	xfs_dir2_data_free_t	new;		/* new bestfree entry */
 
 #ifdef __KERNEL__
-	ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC ||
-	       be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
+	ASSERT(be32_to_cpu(hdr->magic) == XFS_DIR2_DATA_MAGIC ||
+	       be32_to_cpu(hdr->magic) == XFS_DIR2_BLOCK_MAGIC);
 #endif
-	dfp = d->hdr.bestfree;
+	dfp = hdr->bestfree;
 	new.length = dup->length;
-	new.offset = cpu_to_be16((char *)dup - (char *)d);
+	new.offset = cpu_to_be16((char *)dup - (char *)hdr);
+
 	/*
 	 * Insert at position 0, 1, or 2; or not at all.
 	 */
@@ -286,36 +294,36 @@ xfs_dir2_data_freeinsert(
  */
 STATIC void
 xfs_dir2_data_freeremove(
-	xfs_dir2_data_t		*d,		/* data block pointer */
+	xfs_dir2_data_hdr_t	*hdr,		/* data block header */
 	xfs_dir2_data_free_t	*dfp,		/* bestfree entry pointer */
 	int			*loghead)	/* out: log data header */
 {
 #ifdef __KERNEL__
-	ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC ||
-	       be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
+	ASSERT(be32_to_cpu(hdr->magic) == XFS_DIR2_DATA_MAGIC ||
+	       be32_to_cpu(hdr->magic) == XFS_DIR2_BLOCK_MAGIC);
 #endif
 	/*
 	 * It's the first entry, slide the next 2 up.
 	 */
-	if (dfp == &d->hdr.bestfree[0]) {
-		d->hdr.bestfree[0] = d->hdr.bestfree[1];
-		d->hdr.bestfree[1] = d->hdr.bestfree[2];
+	if (dfp == &hdr->bestfree[0]) {
+		hdr->bestfree[0] = hdr->bestfree[1];
+		hdr->bestfree[1] = hdr->bestfree[2];
 	}
 	/*
 	 * It's the second entry, slide the 3rd entry up.
 	 */
-	else if (dfp == &d->hdr.bestfree[1])
-		d->hdr.bestfree[1] = d->hdr.bestfree[2];
+	else if (dfp == &hdr->bestfree[1])
+		hdr->bestfree[1] = hdr->bestfree[2];
 	/*
 	 * Must be the last entry.
 	 */
 	else
-		ASSERT(dfp == &d->hdr.bestfree[2]);
+		ASSERT(dfp == &hdr->bestfree[2]);
 	/*
 	 * Clear the 3rd entry, must be zero now.
 	 */
-	d->hdr.bestfree[2].length = 0;
-	d->hdr.bestfree[2].offset = 0;
+	hdr->bestfree[2].length = 0;
+	hdr->bestfree[2].offset = 0;
 	*loghead = 1;
 }
 
@@ -325,9 +333,10 @@ xfs_dir2_data_freeremove(
 void
 xfs_dir2_data_freescan(
 	xfs_mount_t		*mp,		/* filesystem mount point */
-	xfs_dir2_data_t		*d,		/* data block pointer */
+	xfs_dir2_data_hdr_t	*hdr,		/* data block header */
 	int			*loghead)	/* out: log data header */
 {
+	xfs_dir2_data_t		*d = (xfs_dir2_data_t *)hdr;
 	xfs_dir2_block_tail_t	*btp;		/* block tail */
 	xfs_dir2_data_entry_t	*dep;		/* active data entry */
 	xfs_dir2_data_unused_t	*dup;		/* unused data entry */
@@ -335,23 +344,23 @@ xfs_dir2_data_freescan(
 	char			*p;		/* current entry pointer */
 
 #ifdef __KERNEL__
-	ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC ||
-	       be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
+	ASSERT(hdr->magic == cpu_to_be32(XFS_DIR2_DATA_MAGIC) ||
+	       hdr->magic == cpu_to_be32(XFS_DIR2_BLOCK_MAGIC));
 #endif
 	/*
 	 * Start by clearing the table.
 	 */
-	memset(d->hdr.bestfree, 0, sizeof(d->hdr.bestfree));
+	memset(hdr->bestfree, 0, sizeof(hdr->bestfree));
 	*loghead = 1;
 	/*
 	 * Set up pointers.
 	 */
 	p = (char *)d->u;
-	if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC) {
-		btp = xfs_dir2_block_tail_p(mp, &d->hdr);
+	if (be32_to_cpu(hdr->magic) == XFS_DIR2_BLOCK_MAGIC) {
+		btp = xfs_dir2_block_tail_p(mp, hdr);
 		endp = (char *)xfs_dir2_block_leaf_p(btp);
 	} else
-		endp = (char *)d + mp->m_dirblksize;
+		endp = (char *)hdr + mp->m_dirblksize;
 	/*
 	 * Loop over the block's entries.
 	 */
@@ -361,9 +370,9 @@ xfs_dir2_data_freescan(
 		 * If it's a free entry, insert it.
 		 */
 		if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) {
-			ASSERT((char *)dup - (char *)d ==
+			ASSERT((char *)dup - (char *)hdr ==
 			       be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)));
-			xfs_dir2_data_freeinsert(d, dup, loghead);
+			xfs_dir2_data_freeinsert(hdr, dup, loghead);
 			p += be16_to_cpu(dup->length);
 		}
 		/*
@@ -371,7 +380,7 @@ xfs_dir2_data_freescan(
 		 */
 		else {
 			dep = (xfs_dir2_data_entry_t *)p;
-			ASSERT((char *)dep - (char *)d ==
+			ASSERT((char *)dep - (char *)hdr ==
 			       be16_to_cpu(*xfs_dir2_data_entry_tag_p(dep)));
 			p += xfs_dir2_data_entsize(dep->namelen);
 		}
@@ -390,6 +399,7 @@ xfs_dir2_data_init(
 {
 	xfs_dabuf_t		*bp;		/* block buffer */
 	xfs_dir2_data_t		*d;		/* pointer to block */
+	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_inode_t		*dp;		/* incore directory inode */
 	xfs_dir2_data_unused_t	*dup;		/* unused entry pointer */
 	int			error;		/* error return value */
@@ -410,26 +420,29 @@ xfs_dir2_data_init(
 		return error;
 	}
 	ASSERT(bp != NULL);
+
 	/*
 	 * Initialize the header.
 	 */
 	d = bp->data;
-	d->hdr.magic = cpu_to_be32(XFS_DIR2_DATA_MAGIC);
-	d->hdr.bestfree[0].offset = cpu_to_be16(sizeof(d->hdr));
+	hdr = &d->hdr;
+	hdr->magic = cpu_to_be32(XFS_DIR2_DATA_MAGIC);
+	hdr->bestfree[0].offset = cpu_to_be16(sizeof(*hdr));
 	for (i = 1; i < XFS_DIR2_DATA_FD_COUNT; i++) {
-		d->hdr.bestfree[i].length = 0;
-		d->hdr.bestfree[i].offset = 0;
+		hdr->bestfree[i].length = 0;
+		hdr->bestfree[i].offset = 0;
 	}
+
 	/*
 	 * Set up an unused entry for the block's body.
 	 */
 	dup = &d->u[0].unused;
 	dup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG);
 
-	t=mp->m_dirblksize - (uint)sizeof(d->hdr);
-	d->hdr.bestfree[0].length = cpu_to_be16(t);
+	t = mp->m_dirblksize - (uint)sizeof(*hdr);
+	hdr->bestfree[0].length = cpu_to_be16(t);
 	dup->length = cpu_to_be16(t);
-	*xfs_dir2_data_unused_tag_p(dup) = cpu_to_be16((char *)dup - (char *)d);
+	*xfs_dir2_data_unused_tag_p(dup) = cpu_to_be16((char *)dup - (char *)hdr);
 	/*
 	 * Log it and return it.
 	 */
@@ -448,14 +461,14 @@ xfs_dir2_data_log_entry(
 	xfs_dabuf_t		*bp,		/* block buffer */
 	xfs_dir2_data_entry_t	*dep)		/* data entry pointer */
 {
-	xfs_dir2_data_t		*d;		/* data block pointer */
+	xfs_dir2_data_hdr_t	*hdr = bp->data;
 
-	d = bp->data;
-	ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC ||
-	       be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
-	xfs_da_log_buf(tp, bp, (uint)((char *)dep - (char *)d),
+	ASSERT(hdr->magic == cpu_to_be32(XFS_DIR2_DATA_MAGIC) ||
+	       hdr->magic == cpu_to_be32(XFS_DIR2_BLOCK_MAGIC));
+
+	xfs_da_log_buf(tp, bp, (uint)((char *)dep - (char *)hdr),
 		(uint)((char *)(xfs_dir2_data_entry_tag_p(dep) + 1) -
-		       (char *)d - 1));
+		       (char *)hdr - 1));
 }
 
 /*
@@ -466,13 +479,12 @@ xfs_dir2_data_log_header(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	xfs_dabuf_t		*bp)		/* block buffer */
 {
-	xfs_dir2_data_t		*d;		/* data block pointer */
+	xfs_dir2_data_hdr_t	*hdr = bp->data;
 
-	d = bp->data;
-	ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC ||
-	       be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
-	xfs_da_log_buf(tp, bp, (uint)((char *)&d->hdr - (char *)d),
-		(uint)(sizeof(d->hdr) - 1));
+	ASSERT(hdr->magic == cpu_to_be32(XFS_DIR2_DATA_MAGIC) ||
+	       hdr->magic == cpu_to_be32(XFS_DIR2_BLOCK_MAGIC));
+
+	xfs_da_log_buf(tp, bp, 0, sizeof(*hdr) - 1);
 }
 
 /*
@@ -484,23 +496,23 @@ xfs_dir2_data_log_unused(
 	xfs_dabuf_t		*bp,		/* block buffer */
 	xfs_dir2_data_unused_t	*dup)		/* data unused pointer */
 {
-	xfs_dir2_data_t		*d;		/* data block pointer */
+	xfs_dir2_data_hdr_t	*hdr = bp->data;
+
+	ASSERT(hdr->magic == cpu_to_be32(XFS_DIR2_DATA_MAGIC) ||
+	       hdr->magic == cpu_to_be32(XFS_DIR2_BLOCK_MAGIC));
 
-	d = bp->data;
-	ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC ||
-	       be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
 	/*
 	 * Log the first part of the unused entry.
 	 */
-	xfs_da_log_buf(tp, bp, (uint)((char *)dup - (char *)d),
+	xfs_da_log_buf(tp, bp, (uint)((char *)dup - (char *)hdr),
 		(uint)((char *)&dup->length + sizeof(dup->length) -
-		       1 - (char *)d));
+		       1 - (char *)hdr));
 	/*
 	 * Log the end (tag) of the unused entry.
 	 */
 	xfs_da_log_buf(tp, bp,
-		(uint)((char *)xfs_dir2_data_unused_tag_p(dup) - (char *)d),
-		(uint)((char *)xfs_dir2_data_unused_tag_p(dup) - (char *)d +
+		(uint)((char *)xfs_dir2_data_unused_tag_p(dup) - (char *)hdr),
+		(uint)((char *)xfs_dir2_data_unused_tag_p(dup) - (char *)hdr +
 		       sizeof(xfs_dir2_data_off_t) - 1));
 }
 
@@ -517,7 +529,7 @@ xfs_dir2_data_make_free(
 	int			*needlogp,	/* out: log header */
 	int			*needscanp)	/* out: regen bestfree */
 {
-	xfs_dir2_data_t		*d;		/* data block pointer */
+	xfs_dir2_data_hdr_t	*hdr;		/* data block pointer */
 	xfs_dir2_data_free_t	*dfp;		/* bestfree pointer */
 	char			*endptr;	/* end of data area */
 	xfs_mount_t		*mp;		/* filesystem mount point */
@@ -527,28 +539,29 @@ xfs_dir2_data_make_free(
 	xfs_dir2_data_unused_t	*prevdup;	/* unused entry before us */
 
 	mp = tp->t_mountp;
-	d = bp->data;
+	hdr = bp->data;
+
 	/*
 	 * Figure out where the end of the data area is.
 	 */
-	if (be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC)
-		endptr = (char *)d + mp->m_dirblksize;
+	if (hdr->magic == cpu_to_be32(XFS_DIR2_DATA_MAGIC))
+		endptr = (char *)hdr + mp->m_dirblksize;
 	else {
 		xfs_dir2_block_tail_t	*btp;	/* block tail */
 
-		ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
-		btp = xfs_dir2_block_tail_p(mp, &d->hdr);
+		ASSERT(hdr->magic == cpu_to_be32(XFS_DIR2_BLOCK_MAGIC));
+		btp = xfs_dir2_block_tail_p(mp, hdr);
 		endptr = (char *)xfs_dir2_block_leaf_p(btp);
 	}
 	/*
 	 * If this isn't the start of the block, then back up to
 	 * the previous entry and see if it's free.
 	 */
-	if (offset > sizeof(d->hdr)) {
+	if (offset > sizeof(*hdr)) {
 		__be16			*tagp;	/* tag just before us */
 
-		tagp = (__be16 *)((char *)d + offset) - 1;
-		prevdup = (xfs_dir2_data_unused_t *)((char *)d + be16_to_cpu(*tagp));
+		tagp = (__be16 *)((char *)hdr + offset) - 1;
+		prevdup = (xfs_dir2_data_unused_t *)((char *)hdr + be16_to_cpu(*tagp));
 		if (be16_to_cpu(prevdup->freetag) != XFS_DIR2_DATA_FREE_TAG)
 			prevdup = NULL;
 	} else
@@ -557,9 +570,9 @@ xfs_dir2_data_make_free(
 	 * If this isn't the end of the block, see if the entry after
 	 * us is free.
 	 */
-	if ((char *)d + offset + len < endptr) {
+	if ((char *)hdr + offset + len < endptr) {
 		postdup =
-			(xfs_dir2_data_unused_t *)((char *)d + offset + len);
+			(xfs_dir2_data_unused_t *)((char *)hdr + offset + len);
 		if (be16_to_cpu(postdup->freetag) != XFS_DIR2_DATA_FREE_TAG)
 			postdup = NULL;
 	} else
@@ -576,21 +589,21 @@ xfs_dir2_data_make_free(
 		/*
 		 * See if prevdup and/or postdup are in bestfree table.
 		 */
-		dfp = xfs_dir2_data_freefind(d, prevdup);
-		dfp2 = xfs_dir2_data_freefind(d, postdup);
+		dfp = xfs_dir2_data_freefind(hdr, prevdup);
+		dfp2 = xfs_dir2_data_freefind(hdr, postdup);
 		/*
 		 * We need a rescan unless there are exactly 2 free entries
 		 * namely our two.  Then we know what's happening, otherwise
 		 * since the third bestfree is there, there might be more
 		 * entries.
 		 */
-		needscan = (d->hdr.bestfree[2].length != 0);
+		needscan = (hdr->bestfree[2].length != 0);
 		/*
 		 * Fix up the new big freespace.
 		 */
 		be16_add_cpu(&prevdup->length, len + be16_to_cpu(postdup->length));
 		*xfs_dir2_data_unused_tag_p(prevdup) =
-			cpu_to_be16((char *)prevdup - (char *)d);
+			cpu_to_be16((char *)prevdup - (char *)hdr);
 		xfs_dir2_data_log_unused(tp, bp, prevdup);
 		if (!needscan) {
 			/*
@@ -600,18 +613,18 @@ xfs_dir2_data_make_free(
 			 * Remove entry 1 first then entry 0.
 			 */
 			ASSERT(dfp && dfp2);
-			if (dfp == &d->hdr.bestfree[1]) {
-				dfp = &d->hdr.bestfree[0];
+			if (dfp == &hdr->bestfree[1]) {
+				dfp = &hdr->bestfree[0];
 				ASSERT(dfp2 == dfp);
-				dfp2 = &d->hdr.bestfree[1];
+				dfp2 = &hdr->bestfree[1];
 			}
-			xfs_dir2_data_freeremove(d, dfp2, needlogp);
-			xfs_dir2_data_freeremove(d, dfp, needlogp);
+			xfs_dir2_data_freeremove(hdr, dfp2, needlogp);
+			xfs_dir2_data_freeremove(hdr, dfp, needlogp);
 			/*
 			 * Now insert the new entry.
 			 */
-			dfp = xfs_dir2_data_freeinsert(d, prevdup, needlogp);
-			ASSERT(dfp == &d->hdr.bestfree[0]);
+			dfp = xfs_dir2_data_freeinsert(hdr, prevdup, needlogp);
+			ASSERT(dfp == &hdr->bestfree[0]);
 			ASSERT(dfp->length == prevdup->length);
 			ASSERT(!dfp[1].length);
 			ASSERT(!dfp[2].length);
@@ -621,10 +634,10 @@ xfs_dir2_data_make_free(
 	 * The entry before us is free, merge with it.
 	 */
 	else if (prevdup) {
-		dfp = xfs_dir2_data_freefind(d, prevdup);
+		dfp = xfs_dir2_data_freefind(hdr, prevdup);
 		be16_add_cpu(&prevdup->length, len);
 		*xfs_dir2_data_unused_tag_p(prevdup) =
-			cpu_to_be16((char *)prevdup - (char *)d);
+			cpu_to_be16((char *)prevdup - (char *)hdr);
 		xfs_dir2_data_log_unused(tp, bp, prevdup);
 		/*
 		 * If the previous entry was in the table, the new entry
@@ -632,27 +645,27 @@ xfs_dir2_data_make_free(
 		 * the old one and add the new one.
 		 */
 		if (dfp) {
-			xfs_dir2_data_freeremove(d, dfp, needlogp);
-			(void)xfs_dir2_data_freeinsert(d, prevdup, needlogp);
+			xfs_dir2_data_freeremove(hdr, dfp, needlogp);
+			(void)xfs_dir2_data_freeinsert(hdr, prevdup, needlogp);
 		}
 		/*
 		 * Otherwise we need a scan if the new entry is big enough.
 		 */
 		else {
 			needscan = be16_to_cpu(prevdup->length) >
-				   be16_to_cpu(d->hdr.bestfree[2].length);
+				   be16_to_cpu(hdr->bestfree[2].length);
 		}
 	}
 	/*
 	 * The following entry is free, merge with it.
 	 */
 	else if (postdup) {
-		dfp = xfs_dir2_data_freefind(d, postdup);
-		newdup = (xfs_dir2_data_unused_t *)((char *)d + offset);
+		dfp = xfs_dir2_data_freefind(hdr, postdup);
+		newdup = (xfs_dir2_data_unused_t *)((char *)hdr + offset);
 		newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG);
 		newdup->length = cpu_to_be16(len + be16_to_cpu(postdup->length));
 		*xfs_dir2_data_unused_tag_p(newdup) =
-			cpu_to_be16((char *)newdup - (char *)d);
+			cpu_to_be16((char *)newdup - (char *)hdr);
 		xfs_dir2_data_log_unused(tp, bp, newdup);
 		/*
 		 * If the following entry was in the table, the new entry
@@ -660,28 +673,28 @@ xfs_dir2_data_make_free(
 		 * the old one and add the new one.
 		 */
 		if (dfp) {
-			xfs_dir2_data_freeremove(d, dfp, needlogp);
-			(void)xfs_dir2_data_freeinsert(d, newdup, needlogp);
+			xfs_dir2_data_freeremove(hdr, dfp, needlogp);
+			(void)xfs_dir2_data_freeinsert(hdr, newdup, needlogp);
 		}
 		/*
 		 * Otherwise we need a scan if the new entry is big enough.
 		 */
 		else {
 			needscan = be16_to_cpu(newdup->length) >
-				   be16_to_cpu(d->hdr.bestfree[2].length);
+				   be16_to_cpu(hdr->bestfree[2].length);
 		}
 	}
 	/*
 	 * Neither neighbor is free.  Make a new entry.
 	 */
 	else {
-		newdup = (xfs_dir2_data_unused_t *)((char *)d + offset);
+		newdup = (xfs_dir2_data_unused_t *)((char *)hdr + offset);
 		newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG);
 		newdup->length = cpu_to_be16(len);
 		*xfs_dir2_data_unused_tag_p(newdup) =
-			cpu_to_be16((char *)newdup - (char *)d);
+			cpu_to_be16((char *)newdup - (char *)hdr);
 		xfs_dir2_data_log_unused(tp, bp, newdup);
-		(void)xfs_dir2_data_freeinsert(d, newdup, needlogp);
+		(void)xfs_dir2_data_freeinsert(hdr, newdup, needlogp);
 	}
 	*needscanp = needscan;
 }
@@ -699,7 +712,7 @@ xfs_dir2_data_use_free(
 	int			*needlogp,	/* out: need to log header */
 	int			*needscanp)	/* out: need regen bestfree */
 {
-	xfs_dir2_data_t		*d;		/* data block */
+	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_dir2_data_free_t	*dfp;		/* bestfree pointer */
 	int			matchback;	/* matches end of freespace */
 	int			matchfront;	/* matches start of freespace */
@@ -708,24 +721,24 @@ xfs_dir2_data_use_free(
 	xfs_dir2_data_unused_t	*newdup2;	/* another new unused entry */
 	int			oldlen;		/* old unused entry's length */
 
-	d = bp->data;
-	ASSERT(be32_to_cpu(d->hdr.magic) == XFS_DIR2_DATA_MAGIC ||
-	       be32_to_cpu(d->hdr.magic) == XFS_DIR2_BLOCK_MAGIC);
+	hdr = bp->data;
+	ASSERT(be32_to_cpu(hdr->magic) == XFS_DIR2_DATA_MAGIC ||
+	       be32_to_cpu(hdr->magic) == XFS_DIR2_BLOCK_MAGIC);
 	ASSERT(be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG);
-	ASSERT(offset >= (char *)dup - (char *)d);
-	ASSERT(offset + len <= (char *)dup + be16_to_cpu(dup->length) - (char *)d);
-	ASSERT((char *)dup - (char *)d == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)));
+	ASSERT(offset >= (char *)dup - (char *)hdr);
+	ASSERT(offset + len <= (char *)dup + be16_to_cpu(dup->length) - (char *)hdr);
+	ASSERT((char *)dup - (char *)hdr == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)));
 	/*
 	 * Look up the entry in the bestfree table.
 	 */
-	dfp = xfs_dir2_data_freefind(d, dup);
+	dfp = xfs_dir2_data_freefind(hdr, dup);
 	oldlen = be16_to_cpu(dup->length);
-	ASSERT(dfp || oldlen <= be16_to_cpu(d->hdr.bestfree[2].length));
+	ASSERT(dfp || oldlen <= be16_to_cpu(hdr->bestfree[2].length));
 	/*
 	 * Check for alignment with front and back of the entry.
 	 */
-	matchfront = (char *)dup - (char *)d == offset;
-	matchback = (char *)dup + oldlen - (char *)d == offset + len;
+	matchfront = (char *)dup - (char *)hdr == offset;
+	matchback = (char *)dup + oldlen - (char *)hdr == offset + len;
 	ASSERT(*needscanp == 0);
 	needscan = 0;
 	/*
@@ -734,9 +747,9 @@ xfs_dir2_data_use_free(
 	 */
 	if (matchfront && matchback) {
 		if (dfp) {
-			needscan = (d->hdr.bestfree[2].offset != 0);
+			needscan = (hdr->bestfree[2].offset != 0);
 			if (!needscan)
-				xfs_dir2_data_freeremove(d, dfp, needlogp);
+				xfs_dir2_data_freeremove(hdr, dfp, needlogp);
 		}
 	}
 	/*
@@ -744,27 +757,27 @@ xfs_dir2_data_use_free(
 	 * Make a new entry with the remaining freespace.
 	 */
 	else if (matchfront) {
-		newdup = (xfs_dir2_data_unused_t *)((char *)d + offset + len);
+		newdup = (xfs_dir2_data_unused_t *)((char *)hdr + offset + len);
 		newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG);
 		newdup->length = cpu_to_be16(oldlen - len);
 		*xfs_dir2_data_unused_tag_p(newdup) =
-			cpu_to_be16((char *)newdup - (char *)d);
+			cpu_to_be16((char *)newdup - (char *)hdr);
 		xfs_dir2_data_log_unused(tp, bp, newdup);
 		/*
 		 * If it was in the table, remove it and add the new one.
 		 */
 		if (dfp) {
-			xfs_dir2_data_freeremove(d, dfp, needlogp);
-			dfp = xfs_dir2_data_freeinsert(d, newdup, needlogp);
+			xfs_dir2_data_freeremove(hdr, dfp, needlogp);
+			dfp = xfs_dir2_data_freeinsert(hdr, newdup, needlogp);
 			ASSERT(dfp != NULL);
 			ASSERT(dfp->length == newdup->length);
-			ASSERT(be16_to_cpu(dfp->offset) == (char *)newdup - (char *)d);
+			ASSERT(be16_to_cpu(dfp->offset) == (char *)newdup - (char *)hdr);
 			/*
 			 * If we got inserted at the last slot,
 			 * that means we don't know if there was a better
 			 * choice for the last slot, or not.  Rescan.
 			 */
-			needscan = dfp == &d->hdr.bestfree[2];
+			needscan = dfp == &hdr->bestfree[2];
 		}
 	}
 	/*
@@ -773,25 +786,25 @@ xfs_dir2_data_use_free(
 	 */
 	else if (matchback) {
 		newdup = dup;
-		newdup->length = cpu_to_be16(((char *)d + offset) - (char *)newdup);
+		newdup->length = cpu_to_be16(((char *)hdr + offset) - (char *)newdup);
 		*xfs_dir2_data_unused_tag_p(newdup) =
-			cpu_to_be16((char *)newdup - (char *)d);
+			cpu_to_be16((char *)newdup - (char *)hdr);
 		xfs_dir2_data_log_unused(tp, bp, newdup);
 		/*
 		 * If it was in the table, remove it and add the new one.
 		 */
 		if (dfp) {
-			xfs_dir2_data_freeremove(d, dfp, needlogp);
-			dfp = xfs_dir2_data_freeinsert(d, newdup, needlogp);
+			xfs_dir2_data_freeremove(hdr, dfp, needlogp);
+			dfp = xfs_dir2_data_freeinsert(hdr, newdup, needlogp);
 			ASSERT(dfp != NULL);
 			ASSERT(dfp->length == newdup->length);
-			ASSERT(be16_to_cpu(dfp->offset) == (char *)newdup - (char *)d);
+			ASSERT(be16_to_cpu(dfp->offset) == (char *)newdup - (char *)hdr);
 			/*
 			 * If we got inserted at the last slot,
 			 * that means we don't know if there was a better
 			 * choice for the last slot, or not.  Rescan.
 			 */
-			needscan = dfp == &d->hdr.bestfree[2];
+			needscan = dfp == &hdr->bestfree[2];
 		}
 	}
 	/*
@@ -800,15 +813,15 @@ xfs_dir2_data_use_free(
 	 */
 	else {
 		newdup = dup;
-		newdup->length = cpu_to_be16(((char *)d + offset) - (char *)newdup);
+		newdup->length = cpu_to_be16(((char *)hdr + offset) - (char *)newdup);
 		*xfs_dir2_data_unused_tag_p(newdup) =
-			cpu_to_be16((char *)newdup - (char *)d);
+			cpu_to_be16((char *)newdup - (char *)hdr);
 		xfs_dir2_data_log_unused(tp, bp, newdup);
-		newdup2 = (xfs_dir2_data_unused_t *)((char *)d + offset + len);
+		newdup2 = (xfs_dir2_data_unused_t *)((char *)hdr + offset + len);
 		newdup2->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG);
 		newdup2->length = cpu_to_be16(oldlen - len - be16_to_cpu(newdup->length));
 		*xfs_dir2_data_unused_tag_p(newdup2) =
-			cpu_to_be16((char *)newdup2 - (char *)d);
+			cpu_to_be16((char *)newdup2 - (char *)hdr);
 		xfs_dir2_data_log_unused(tp, bp, newdup2);
 		/*
 		 * If the old entry was in the table, we need to scan
@@ -819,12 +832,12 @@ xfs_dir2_data_use_free(
 		 * the 2 new will work.
 		 */
 		if (dfp) {
-			needscan = (d->hdr.bestfree[2].length != 0);
+			needscan = (hdr->bestfree[2].length != 0);
 			if (!needscan) {
-				xfs_dir2_data_freeremove(d, dfp, needlogp);
-				(void)xfs_dir2_data_freeinsert(d, newdup,
+				xfs_dir2_data_freeremove(hdr, dfp, needlogp);
+				(void)xfs_dir2_data_freeinsert(hdr, newdup,
 					needlogp);
-				(void)xfs_dir2_data_freeinsert(d, newdup2,
+				(void)xfs_dir2_data_freeinsert(hdr, newdup2,
 					needlogp);
 			}
 		}
Index: xfs/fs/xfs/xfs_dir2_data.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_data.h	2011-06-27 10:47:35.919487117 +0200
+++ xfs/fs/xfs/xfs_dir2_data.h	2011-06-27 10:51:33.644865914 +0200
@@ -157,12 +157,10 @@ extern void xfs_dir2_data_check(struct x
 #else
 #define	xfs_dir2_data_check(dp,bp)
 #endif
-extern xfs_dir2_data_free_t *xfs_dir2_data_freefind(xfs_dir2_data_t *d,
-				xfs_dir2_data_unused_t *dup);
-extern xfs_dir2_data_free_t *xfs_dir2_data_freeinsert(xfs_dir2_data_t *d,
+extern xfs_dir2_data_free_t *xfs_dir2_data_freeinsert(xfs_dir2_data_hdr_t *hdr,
 				xfs_dir2_data_unused_t *dup, int *loghead);
-extern void xfs_dir2_data_freescan(struct xfs_mount *mp, xfs_dir2_data_t *d,
-				int *loghead);
+extern void xfs_dir2_data_freescan(struct xfs_mount *mp,
+				xfs_dir2_data_hdr_t *hdr, int *loghead);
 extern int xfs_dir2_data_init(struct xfs_da_args *args, xfs_dir2_db_t blkno,
 				struct xfs_dabuf **bpp);
 extern void xfs_dir2_data_log_entry(struct xfs_trans *tp, struct xfs_dabuf *bp,
Index: xfs/fs/xfs/xfs_dir2_leaf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_leaf.c	2011-06-27 10:47:35.932820377 +0200
+++ xfs/fs/xfs/xfs_dir2_leaf.c	2011-06-27 10:51:33.648199229 +0200
@@ -132,7 +132,7 @@ xfs_dir2_block_to_leaf(
 	 */
 	hdr->magic = cpu_to_be32(XFS_DIR2_DATA_MAGIC);
 	if (needscan)
-		xfs_dir2_data_freescan(mp, (xfs_dir2_data_t *)hdr, &needlog);
+		xfs_dir2_data_freescan(mp, hdr, &needlog);
 	/*
 	 * Set up leaf tail and bests table.
 	 */
@@ -273,7 +273,7 @@ xfs_dir2_leaf_addname(
 {
 	__be16			*bestsp;	/* freespace table in leaf */
 	int			compact;	/* need to compact leaves */
-	xfs_dir2_data_t		*data;		/* data block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_dabuf_t		*dbp;		/* data block buffer */
 	xfs_dir2_data_entry_t	*dep;		/* data block entry */
 	xfs_inode_t		*dp;		/* incore directory inode */
@@ -481,8 +481,8 @@ xfs_dir2_leaf_addname(
 		 */
 		else
 			xfs_dir2_leaf_log_bests(tp, lbp, use_block, use_block);
-		data = dbp->data;
-		bestsp[use_block] = data->hdr.bestfree[0].length;
+		hdr = dbp->data;
+		bestsp[use_block] = hdr->bestfree[0].length;
 		grown = 1;
 	}
 	/*
@@ -496,7 +496,7 @@ xfs_dir2_leaf_addname(
 			xfs_da_brelse(tp, lbp);
 			return error;
 		}
-		data = dbp->data;
+		hdr = dbp->data;
 		grown = 0;
 	}
 	xfs_dir2_data_check(dp, dbp);
@@ -504,14 +504,14 @@ xfs_dir2_leaf_addname(
 	 * Point to the biggest freespace in our data block.
 	 */
 	dup = (xfs_dir2_data_unused_t *)
-	      ((char *)data + be16_to_cpu(data->hdr.bestfree[0].offset));
+	      ((char *)hdr + be16_to_cpu(hdr->bestfree[0].offset));
 	ASSERT(be16_to_cpu(dup->length) >= length);
 	needscan = needlog = 0;
 	/*
 	 * Mark the initial part of our freespace in use for the new entry.
 	 */
 	xfs_dir2_data_use_free(tp, dbp, dup,
-		(xfs_dir2_data_aoff_t)((char *)dup - (char *)data), length,
+		(xfs_dir2_data_aoff_t)((char *)dup - (char *)hdr), length,
 		&needlog, &needscan);
 	/*
 	 * Initialize our new entry (at last).
@@ -521,12 +521,12 @@ xfs_dir2_leaf_addname(
 	dep->namelen = args->namelen;
 	memcpy(dep->name, args->name, dep->namelen);
 	tagp = xfs_dir2_data_entry_tag_p(dep);
-	*tagp = cpu_to_be16((char *)dep - (char *)data);
+	*tagp = cpu_to_be16((char *)dep - (char *)hdr);
 	/*
 	 * Need to scan fix up the bestfree table.
 	 */
 	if (needscan)
-		xfs_dir2_data_freescan(mp, data, &needlog);
+		xfs_dir2_data_freescan(mp, hdr, &needlog);
 	/*
 	 * Need to log the data block's header.
 	 */
@@ -537,8 +537,8 @@ xfs_dir2_leaf_addname(
 	 * If the bests table needs to be changed, do it.
 	 * Log the change unless we've already done that.
 	 */
-	if (be16_to_cpu(bestsp[use_block]) != be16_to_cpu(data->hdr.bestfree[0].length)) {
-		bestsp[use_block] = data->hdr.bestfree[0].length;
+	if (be16_to_cpu(bestsp[use_block]) != be16_to_cpu(hdr->bestfree[0].length)) {
+		bestsp[use_block] = hdr->bestfree[0].length;
 		if (!grown)
 			xfs_dir2_leaf_log_bests(tp, lbp, use_block, use_block);
 	}
@@ -782,6 +782,7 @@ xfs_dir2_leaf_getdents(
 	xfs_dir2_db_t		curdb;		/* db for current block */
 	xfs_dir2_off_t		curoff;		/* current overall offset */
 	xfs_dir2_data_t		*data;		/* data block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_dir2_data_entry_t	*dep;		/* data entry */
 	xfs_dir2_data_unused_t	*dup;		/* unused entry */
 	int			error = 0;	/* error return value */
@@ -1040,6 +1041,7 @@ xfs_dir2_leaf_getdents(
 				ASSERT(xfs_dir2_byte_to_db(mp, curoff) ==
 				       curdb);
 			data = bp->data;
+			hdr = &data->hdr;
 			xfs_dir2_data_check(dp, bp);
 			/*
 			 * Find our position in the block.
@@ -1050,12 +1052,12 @@ xfs_dir2_leaf_getdents(
 			 * Skip past the header.
 			 */
 			if (byteoff == 0)
-				curoff += (uint)sizeof(data->hdr);
+				curoff += (uint)sizeof(*hdr);
 			/*
 			 * Skip past entries until we reach our offset.
 			 */
 			else {
-				while ((char *)ptr - (char *)data < byteoff) {
+				while ((char *)ptr - (char *)hdr < byteoff) {
 					dup = (xfs_dir2_data_unused_t *)ptr;
 
 					if (be16_to_cpu(dup->freetag)
@@ -1076,8 +1078,8 @@ xfs_dir2_leaf_getdents(
 				curoff =
 					xfs_dir2_db_off_to_byte(mp,
 					    xfs_dir2_byte_to_db(mp, curoff),
-					    (char *)ptr - (char *)data);
-				if (ptr >= (char *)data + mp->m_dirblksize) {
+					    (char *)ptr - (char *)hdr);
+				if (ptr >= (char *)hdr + mp->m_dirblksize) {
 					continue;
 				}
 			}
@@ -1458,7 +1460,7 @@ xfs_dir2_leaf_removename(
 	xfs_da_args_t		*args)		/* operation arguments */
 {
 	__be16			*bestsp;	/* leaf block best freespace */
-	xfs_dir2_data_t		*data;		/* data block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_dir2_db_t		db;		/* data block number */
 	xfs_dabuf_t		*dbp;		/* data block buffer */
 	xfs_dir2_data_entry_t	*dep;		/* data entry structure */
@@ -1488,7 +1490,7 @@ xfs_dir2_leaf_removename(
 	tp = args->trans;
 	mp = dp->i_mount;
 	leaf = lbp->data;
-	data = dbp->data;
+	hdr = dbp->data;
 	xfs_dir2_data_check(dp, dbp);
 	/*
 	 * Point to the leaf entry, use that to point to the data entry.
@@ -1496,9 +1498,9 @@ xfs_dir2_leaf_removename(
 	lep = &leaf->ents[index];
 	db = xfs_dir2_dataptr_to_db(mp, be32_to_cpu(lep->address));
 	dep = (xfs_dir2_data_entry_t *)
-	      ((char *)data + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address)));
+	      ((char *)hdr + xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address)));
 	needscan = needlog = 0;
-	oldbest = be16_to_cpu(data->hdr.bestfree[0].length);
+	oldbest = be16_to_cpu(hdr->bestfree[0].length);
 	ltp = xfs_dir2_leaf_tail_p(mp, leaf);
 	bestsp = xfs_dir2_leaf_bests_p(ltp);
 	ASSERT(be16_to_cpu(bestsp[db]) == oldbest);
@@ -1506,7 +1508,7 @@ xfs_dir2_leaf_removename(
 	 * Mark the former data entry unused.
 	 */
 	xfs_dir2_data_make_free(tp, dbp,
-		(xfs_dir2_data_aoff_t)((char *)dep - (char *)data),
+		(xfs_dir2_data_aoff_t)((char *)dep - (char *)hdr),
 		xfs_dir2_data_entsize(dep->namelen), &needlog, &needscan);
 	/*
 	 * We just mark the leaf entry stale by putting a null in it.
@@ -1520,23 +1522,23 @@ xfs_dir2_leaf_removename(
 	 * log the data block header if necessary.
 	 */
 	if (needscan)
-		xfs_dir2_data_freescan(mp, data, &needlog);
+		xfs_dir2_data_freescan(mp, hdr, &needlog);
 	if (needlog)
 		xfs_dir2_data_log_header(tp, dbp);
 	/*
 	 * If the longest freespace in the data block has changed,
 	 * put the new value in the bests table and log that.
 	 */
-	if (be16_to_cpu(data->hdr.bestfree[0].length) != oldbest) {
-		bestsp[db] = data->hdr.bestfree[0].length;
+	if (be16_to_cpu(hdr->bestfree[0].length) != oldbest) {
+		bestsp[db] = hdr->bestfree[0].length;
 		xfs_dir2_leaf_log_bests(tp, lbp, db, db);
 	}
 	xfs_dir2_data_check(dp, dbp);
 	/*
 	 * If the data block is now empty then get rid of the data block.
 	 */
-	if (be16_to_cpu(data->hdr.bestfree[0].length) ==
-	    mp->m_dirblksize - (uint)sizeof(data->hdr)) {
+	if (be16_to_cpu(hdr->bestfree[0].length) ==
+	    mp->m_dirblksize - (uint)sizeof(*hdr)) {
 		ASSERT(db != mp->m_dirdatablk);
 		if ((error = xfs_dir2_shrink_inode(args, db, dbp))) {
 			/*
@@ -1707,9 +1709,6 @@ xfs_dir2_leaf_trim_data(
 	xfs_dir2_db_t		db)		/* data block number */
 {
 	__be16			*bestsp;	/* leaf bests table */
-#ifdef DEBUG
-	xfs_dir2_data_t		*data;		/* data block structure */
-#endif
 	xfs_dabuf_t		*dbp;		/* data block buffer */
 	xfs_inode_t		*dp;		/* incore directory inode */
 	int			error;		/* error return value */
@@ -1728,20 +1727,21 @@ xfs_dir2_leaf_trim_data(
 			XFS_DATA_FORK))) {
 		return error;
 	}
-#ifdef DEBUG
-	data = dbp->data;
-	ASSERT(be32_to_cpu(data->hdr.magic) == XFS_DIR2_DATA_MAGIC);
-#endif
-	/* this seems to be an error
-	 * data is only valid if DEBUG is defined?
-	 * RMC 09/08/1999
-	 */
 
 	leaf = lbp->data;
 	ltp = xfs_dir2_leaf_tail_p(mp, leaf);
-	ASSERT(be16_to_cpu(data->hdr.bestfree[0].length) ==
-	       mp->m_dirblksize - (uint)sizeof(data->hdr));
+
+#ifdef DEBUG
+{
+	struct xfs_dir2_data_hdr *hdr = dbp->data;
+
+	ASSERT(be32_to_cpu(hdr->magic) == XFS_DIR2_DATA_MAGIC);
+	ASSERT(be16_to_cpu(hdr->bestfree[0].length) ==
+	       mp->m_dirblksize - (uint)sizeof(*hdr));
 	ASSERT(db == be32_to_cpu(ltp->bestcount) - 1);
+}
+#endif
+
 	/*
 	 * Get rid of the data block.
 	 */
Index: xfs/fs/xfs/xfs_dir2_node.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_node.c	2011-06-27 10:47:35.946153639 +0200
+++ xfs/fs/xfs/xfs_dir2_node.c	2011-06-27 10:51:33.648199229 +0200
@@ -843,7 +843,7 @@ xfs_dir2_leafn_remove(
 	xfs_da_state_blk_t	*dblk,		/* data block */
 	int			*rval)		/* resulting block needs join */
 {
-	xfs_dir2_data_t		*data;		/* data block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_dir2_db_t		db;		/* data block number */
 	xfs_dabuf_t		*dbp;		/* data block buffer */
 	xfs_dir2_data_entry_t	*dep;		/* data block entry */
@@ -888,9 +888,9 @@ xfs_dir2_leafn_remove(
 	 * in the data block in case it changes.
 	 */
 	dbp = dblk->bp;
-	data = dbp->data;
-	dep = (xfs_dir2_data_entry_t *)((char *)data + off);
-	longest = be16_to_cpu(data->hdr.bestfree[0].length);
+	hdr = dbp->data;
+	dep = (xfs_dir2_data_entry_t *)((char *)hdr + off);
+	longest = be16_to_cpu(hdr->bestfree[0].length);
 	needlog = needscan = 0;
 	xfs_dir2_data_make_free(tp, dbp, off,
 		xfs_dir2_data_entsize(dep->namelen), &needlog, &needscan);
@@ -899,7 +899,7 @@ xfs_dir2_leafn_remove(
 	 * Log the data block header if needed.
 	 */
 	if (needscan)
-		xfs_dir2_data_freescan(mp, data, &needlog);
+		xfs_dir2_data_freescan(mp, hdr, &needlog);
 	if (needlog)
 		xfs_dir2_data_log_header(tp, dbp);
 	xfs_dir2_data_check(dp, dbp);
@@ -907,7 +907,7 @@ xfs_dir2_leafn_remove(
 	 * If the longest data block freespace changes, need to update
 	 * the corresponding freeblock entry.
 	 */
-	if (longest < be16_to_cpu(data->hdr.bestfree[0].length)) {
+	if (longest < be16_to_cpu(hdr->bestfree[0].length)) {
 		int		error;		/* error return value */
 		xfs_dabuf_t	*fbp;		/* freeblock buffer */
 		xfs_dir2_db_t	fdb;		/* freeblock block number */
@@ -933,19 +933,19 @@ xfs_dir2_leafn_remove(
 		 * Calculate which entry we need to fix.
 		 */
 		findex = xfs_dir2_db_to_fdindex(mp, db);
-		longest = be16_to_cpu(data->hdr.bestfree[0].length);
+		longest = be16_to_cpu(hdr->bestfree[0].length);
 		/*
 		 * If the data block is now empty we can get rid of it
 		 * (usually).
 		 */
-		if (longest == mp->m_dirblksize - (uint)sizeof(data->hdr)) {
+		if (longest == mp->m_dirblksize - (uint)sizeof(*hdr)) {
 			/*
 			 * Try to punch out the data block.
 			 */
 			error = xfs_dir2_shrink_inode(args, db, dbp);
 			if (error == 0) {
 				dblk->bp = NULL;
-				data = NULL;
+				hdr = NULL;
 			}
 			/*
 			 * We can get ENOSPC if there's no space reservation.
@@ -961,7 +961,7 @@ xfs_dir2_leafn_remove(
 		 * If we got rid of the data block, we can eliminate that entry
 		 * in the free block.
 		 */
-		if (data == NULL) {
+		if (hdr == NULL) {
 			/*
 			 * One less used entry in the free table.
 			 */
@@ -1357,7 +1357,7 @@ xfs_dir2_node_addname_int(
 	xfs_da_args_t		*args,		/* operation arguments */
 	xfs_da_state_blk_t	*fblk)		/* optional freespace block */
 {
-	xfs_dir2_data_t		*data;		/* data block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_dir2_db_t		dbno;		/* data block number */
 	xfs_dabuf_t		*dbp;		/* data block buffer */
 	xfs_dir2_data_entry_t	*dep;		/* data entry pointer */
@@ -1642,8 +1642,8 @@ xfs_dir2_node_addname_int(
 		 * We haven't allocated the data entry yet so this will
 		 * change again.
 		 */
-		data = dbp->data;
-		free->bests[findex] = data->hdr.bestfree[0].length;
+		hdr = dbp->data;
+		free->bests[findex] = hdr->bestfree[0].length;
 		logfree = 1;
 	}
 	/*
@@ -1668,21 +1668,21 @@ xfs_dir2_node_addname_int(
 				xfs_da_buf_done(fbp);
 			return error;
 		}
-		data = dbp->data;
+		hdr = dbp->data;
 		logfree = 0;
 	}
-	ASSERT(be16_to_cpu(data->hdr.bestfree[0].length) >= length);
+	ASSERT(be16_to_cpu(hdr->bestfree[0].length) >= length);
 	/*
 	 * Point to the existing unused space.
 	 */
 	dup = (xfs_dir2_data_unused_t *)
-	      ((char *)data + be16_to_cpu(data->hdr.bestfree[0].offset));
+	      ((char *)hdr + be16_to_cpu(hdr->bestfree[0].offset));
 	needscan = needlog = 0;
 	/*
 	 * Mark the first part of the unused space, inuse for us.
 	 */
 	xfs_dir2_data_use_free(tp, dbp, dup,
-		(xfs_dir2_data_aoff_t)((char *)dup - (char *)data), length,
+		(xfs_dir2_data_aoff_t)((char *)dup - (char *)hdr), length,
 		&needlog, &needscan);
 	/*
 	 * Fill in the new entry and log it.
@@ -1692,13 +1692,13 @@ xfs_dir2_node_addname_int(
 	dep->namelen = args->namelen;
 	memcpy(dep->name, args->name, dep->namelen);
 	tagp = xfs_dir2_data_entry_tag_p(dep);
-	*tagp = cpu_to_be16((char *)dep - (char *)data);
+	*tagp = cpu_to_be16((char *)dep - (char *)hdr);
 	xfs_dir2_data_log_entry(tp, dbp, dep);
 	/*
 	 * Rescan the block for bestfree if needed.
 	 */
 	if (needscan)
-		xfs_dir2_data_freescan(mp, data, &needlog);
+		xfs_dir2_data_freescan(mp, hdr, &needlog);
 	/*
 	 * Log the data block header if needed.
 	 */
@@ -1707,8 +1707,8 @@ xfs_dir2_node_addname_int(
 	/*
 	 * If the freespace entry is now wrong, update it.
 	 */
-	if (be16_to_cpu(free->bests[findex]) != be16_to_cpu(data->hdr.bestfree[0].length)) {
-		free->bests[findex] = data->hdr.bestfree[0].length;
+	if (be16_to_cpu(free->bests[findex]) != be16_to_cpu(hdr->bestfree[0].length)) {
+		free->bests[findex] = hdr->bestfree[0].length;
 		logfree = 1;
 	}
 	/*
@@ -1858,7 +1858,7 @@ xfs_dir2_node_replace(
 	xfs_da_args_t		*args)		/* operation arguments */
 {
 	xfs_da_state_blk_t	*blk;		/* leaf block */
-	xfs_dir2_data_t		*data;		/* data block structure */
+	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_dir2_data_entry_t	*dep;		/* data entry changed */
 	int			error;		/* error return value */
 	int			i;		/* btree level */
@@ -1902,10 +1902,10 @@ xfs_dir2_node_replace(
 		/*
 		 * Point to the data entry.
 		 */
-		data = state->extrablk.bp->data;
-		ASSERT(be32_to_cpu(data->hdr.magic) == XFS_DIR2_DATA_MAGIC);
+		hdr = state->extrablk.bp->data;
+		ASSERT(be32_to_cpu(hdr->magic) == XFS_DIR2_DATA_MAGIC);
 		dep = (xfs_dir2_data_entry_t *)
-		      ((char *)data +
+		      ((char *)hdr +
 		       xfs_dir2_dataptr_to_off(state->mp, be32_to_cpu(lep->address)));
 		ASSERT(inum != be64_to_cpu(dep->inumber));
 		/*

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 20/27] xfs: kill struct xfs_dir2_data
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (17 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 19/27] xfs: avoid usage of struct xfs_dir2_data Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 21/27] xfs: cleanup the definition of struct xfs_dir2_data_entry Christoph Hellwig
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-kill-xfs_dir2_data_t --]
[-- Type: text/plain, Size: 5902 bytes --]

Remove the confusing xfs_dir2_data structure.  It is supposed to describe
an XFS dir2 data btree block, but due to the variable-sized nature of
almost all of its elements it can't actually do anything close to that
job.  Apart from accessing the fixed-offset header structure it was only
used to get a pointer to the first directory or unused entry after it,
which can be trivially replaced by pointer arithmetic on the header
pointer.  For most users that is actually more natural anyway, as they
don't use a typed pointer but rather a character pointer for further
arithmetic.
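
For illustration only (not part of the patch), the resulting idiom is a
plain pointer step over the fixed-size header:

        struct xfs_dir2_data_hdr *hdr = bp->data;
        /* first data or unused entry starts right behind the header */
        char *p = (char *)(hdr + 1);

which replaces the old (char *)d->u access through the removed union.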

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_dir2_data.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_data.c	2011-06-29 13:41:14.272003679 +0200
+++ xfs/fs/xfs/xfs_dir2_data.c	2011-06-29 13:41:15.021999615 +0200
@@ -53,7 +53,6 @@ xfs_dir2_data_check(
 	xfs_dir2_data_free_t	*bf;		/* bestfree table */
 	xfs_dir2_block_tail_t	*btp=NULL;	/* block tail */
 	int			count;		/* count of entries found */
-	xfs_dir2_data_t		*d;		/* data block pointer */
 	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_dir2_data_entry_t	*dep;		/* data entry */
 	xfs_dir2_data_free_t	*dfp;		/* bestfree entry */
@@ -70,10 +69,9 @@ xfs_dir2_data_check(
 	struct xfs_name		name;
 
 	mp = dp->i_mount;
-	d = bp->data;
-	hdr = &d->hdr;
+	hdr = bp->data;
 	bf = hdr->bestfree;
-	p = (char *)d->u;
+	p = (char *)(hdr + 1);
 
 	if (hdr->magic == cpu_to_be32(XFS_DIR2_BLOCK_MAGIC)) {
 		btp = xfs_dir2_block_tail_p(mp, hdr);
@@ -336,7 +334,6 @@ xfs_dir2_data_freescan(
 	xfs_dir2_data_hdr_t	*hdr,		/* data block header */
 	int			*loghead)	/* out: log data header */
 {
-	xfs_dir2_data_t		*d = (xfs_dir2_data_t *)hdr;
 	xfs_dir2_block_tail_t	*btp;		/* block tail */
 	xfs_dir2_data_entry_t	*dep;		/* active data entry */
 	xfs_dir2_data_unused_t	*dup;		/* unused data entry */
@@ -355,7 +352,7 @@ xfs_dir2_data_freescan(
 	/*
 	 * Set up pointers.
 	 */
-	p = (char *)d->u;
+	p = (char *)(hdr + 1);
 	if (be32_to_cpu(hdr->magic) == XFS_DIR2_BLOCK_MAGIC) {
 		btp = xfs_dir2_block_tail_p(mp, hdr);
 		endp = (char *)xfs_dir2_block_leaf_p(btp);
@@ -398,7 +395,6 @@ xfs_dir2_data_init(
 	xfs_dabuf_t		**bpp)		/* output block buffer */
 {
 	xfs_dabuf_t		*bp;		/* block buffer */
-	xfs_dir2_data_t		*d;		/* pointer to block */
 	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_inode_t		*dp;		/* incore directory inode */
 	xfs_dir2_data_unused_t	*dup;		/* unused entry pointer */
@@ -424,8 +420,7 @@ xfs_dir2_data_init(
 	/*
 	 * Initialize the header.
 	 */
-	d = bp->data;
-	hdr = &d->hdr;
+	hdr = bp->data;
 	hdr->magic = cpu_to_be32(XFS_DIR2_DATA_MAGIC);
 	hdr->bestfree[0].offset = cpu_to_be16(sizeof(*hdr));
 	for (i = 1; i < XFS_DIR2_DATA_FD_COUNT; i++) {
@@ -436,7 +431,7 @@ xfs_dir2_data_init(
 	/*
 	 * Set up an unused entry for the block's body.
 	 */
-	dup = &d->u[0].unused;
+	dup = (xfs_dir2_data_unused_t *)(hdr + 1);
 	dup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG);
 
 	t = mp->m_dirblksize - (uint)sizeof(*hdr);
Index: xfs/fs/xfs/xfs_dir2_data.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_data.h	2011-06-29 13:41:14.272003679 +0200
+++ xfs/fs/xfs/xfs_dir2_data.h	2011-06-29 13:42:35.521563513 +0200
@@ -20,6 +20,22 @@
 
 /*
  * Directory format 2, data block structures.
+ *
+ * A pure data block looks like the following drawing on disk:
+ *
+ *    +-------------------------------------------------+
+ *    | xfs_dir2_data_hdr_t                             |
+ *    +-------------------------------------------------+
+ *    | xfs_dir2_data_entry_t OR xfs_dir2_data_unused_t |
+ *    | xfs_dir2_data_entry_t OR xfs_dir2_data_unused_t |
+ *    | xfs_dir2_data_entry_t OR xfs_dir2_data_unused_t |
+ *    | ...                                             |
+ *    +-------------------------------------------------+
+ *    | unused space                                    |
+ *    +-------------------------------------------------+
+ *
+ * As all the entries are variable sized structures the accessors in this
+ * file need to be used to iterate over them.
  */
 
 struct xfs_dabuf;
@@ -103,23 +119,6 @@ typedef struct xfs_dir2_data_unused {
 	__be16			tag;		/* starting offset of us */
 } xfs_dir2_data_unused_t;
 
-typedef union {
-	xfs_dir2_data_entry_t	entry;
-	xfs_dir2_data_unused_t	unused;
-} xfs_dir2_data_union_t;
-
-/*
- * Generic data block structure, for xfs_db.
- */
-typedef struct xfs_dir2_data {
-	xfs_dir2_data_hdr_t	hdr;		/* magic XFS_DIR2_DATA_MAGIC */
-	xfs_dir2_data_union_t	u[1];
-} xfs_dir2_data_t;
-
-/*
- * Macros.
- */
-
 /*
  * Size of a data entry.
  */
Index: xfs/fs/xfs/xfs_dir2_leaf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_leaf.c	2011-06-29 13:41:14.275336994 +0200
+++ xfs/fs/xfs/xfs_dir2_leaf.c	2011-06-29 13:41:15.025332931 +0200
@@ -781,7 +781,6 @@ xfs_dir2_leaf_getdents(
 	int			byteoff;	/* offset in current block */
 	xfs_dir2_db_t		curdb;		/* db for current block */
 	xfs_dir2_off_t		curoff;		/* current overall offset */
-	xfs_dir2_data_t		*data;		/* data block structure */
 	xfs_dir2_data_hdr_t	*hdr;		/* data block header */
 	xfs_dir2_data_entry_t	*dep;		/* data entry */
 	xfs_dir2_data_unused_t	*dup;		/* unused entry */
@@ -1040,13 +1039,12 @@ xfs_dir2_leaf_getdents(
 			else if (curoff > newoff)
 				ASSERT(xfs_dir2_byte_to_db(mp, curoff) ==
 				       curdb);
-			data = bp->data;
-			hdr = &data->hdr;
+			hdr = bp->data;
 			xfs_dir2_data_check(dp, bp);
 			/*
 			 * Find our position in the block.
 			 */
-			ptr = (char *)&data->u;
+			ptr = (char *)(hdr + 1);
 			byteoff = xfs_dir2_byte_to_off(mp, curoff);
 			/*
 			 * Skip past the header.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 21/27] xfs: cleanup the definition of struct xfs_dir2_data_entry
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (18 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 20/27] xfs: kill " Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 22/27] xfs: cleanup struct xfs_dir2_leaf Christoph Hellwig
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-cleanup-xfs_dir2_data_entry --]
[-- Type: text/plain, Size: 1202 bytes --]

Remove the tag member, which sits at a variable offset after the actual
name, and make name a real variable-sized C99 flexible array instead of
the incorrect one-element array that confuses (not only) gcc.
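
For reference, the tag still occupies the last two bytes of the (8 byte
aligned) entry, after the variable-length name, and is reached through the
existing accessor.  A sketch of what that accessor looks like (assumed for
illustration, not part of this patch):

        static inline __be16 *
        xfs_dir2_data_entry_tag_p(struct xfs_dir2_data_entry *dep)
        {
                return (__be16 *)((char *)dep +
                        xfs_dir2_data_entsize(dep->namelen) - sizeof(__be16));
        }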

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_dir2_data.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_data.h	2011-06-29 13:42:35.521563513 +0200
+++ xfs/fs/xfs/xfs_dir2_data.h	2011-06-29 13:43:03.284746440 +0200
@@ -98,14 +98,14 @@ typedef struct xfs_dir2_data_hdr {
 
 /*
  * Active entry in a data block.  Aligned to 8 bytes.
- * Tag appears as the last 2 bytes.
+ *
+ * After the variable length name field there is a 2 byte tag field, which
+ * can be accessed using xfs_dir2_data_entry_tag_p.
  */
 typedef struct xfs_dir2_data_entry {
 	__be64			inumber;	/* inode number */
 	__u8			namelen;	/* name length */
-	__u8			name[1];	/* name bytes, no null */
-						/* variable offset */
-	__be16			tag;		/* starting offset of us */
+	__u8			name[];		/* name bytes, no null */
 } xfs_dir2_data_entry_t;
 
 /*

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 22/27] xfs: cleanup struct xfs_dir2_leaf
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (19 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 21/27] xfs: cleanup the definition of struct xfs_dir2_data_entry Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 23/27] xfs: remove the unused xfs_bufhash structure Christoph Hellwig
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-cleanup-xfs_dir2_leaf_t --]
[-- Type: text/plain, Size: 3927 bytes --]

Simplify the confusing xfs_dir2_leaf structure.  It is supposed to describe
an XFS dir2 leaf format btree block, but due to the variable-sized nature
of almost all of its elements it can't actually do anything close to that
job.  Remove the members that come after the first variable-sized array,
given that they could only be used in sizeof expressions that can just as
well use the underlying types directly, and make the ents array a real
C99 variable-sized array.

Also factor out an xfs_dir2_leaf_size() helper to make the sizing of a
leaf block, which was already convoluted, somewhat readable after
switching to the longer type names in the sizeof expressions.
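
For illustration, the byte count xfs_dir2_leaf_size() computes is the sum
of the pieces of a leaf1 block that must fit into one directory block
(sketch, not part of the patch):

        +-------------------------------------------+
        | xfs_dir2_leaf_hdr_t                        |
        | xfs_dir2_leaf_entry_t  x (count - stale)   |
        | xfs_dir2_data_off_t    x counts  (bests)   |
        | xfs_dir2_leaf_tail_t                       |
        +-------------------------------------------+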

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_dir2_leaf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_leaf.c	2011-06-29 13:41:15.025332931 +0200
+++ xfs/fs/xfs/xfs_dir2_leaf.c	2011-06-29 13:44:03.487753624 +0200
@@ -362,9 +362,12 @@ xfs_dir2_leaf_addname(
 	/*
 	 * How many bytes do we need in the leaf block?
 	 */
-	needbytes =
-		(leaf->hdr.stale ? 0 : (uint)sizeof(leaf->ents[0])) +
-		(use_block != -1 ? 0 : (uint)sizeof(leaf->bests[0]));
+	needbytes = 0;
+	if (!leaf->hdr.stale)
+		needbytes += sizeof(xfs_dir2_leaf_entry_t);
+	if (use_block == -1)
+		needbytes += sizeof(xfs_dir2_data_off_t);
+
 	/*
 	 * Now kill use_block if it refers to a missing block, so we
 	 * can use it as an indication of allocation needed.
@@ -1759,6 +1762,20 @@ xfs_dir2_leaf_trim_data(
 	return 0;
 }
 
+static inline size_t
+xfs_dir2_leaf_size(
+	struct xfs_dir2_leaf_hdr	*hdr,
+	int				counts)
+{
+	int			entries;
+
+	entries = be16_to_cpu(hdr->count) - be16_to_cpu(hdr->stale);
+	return sizeof(xfs_dir2_leaf_hdr_t) +
+	    entries * sizeof(xfs_dir2_leaf_entry_t) +
+	    counts * sizeof(xfs_dir2_data_off_t) +
+	    sizeof(xfs_dir2_leaf_tail_t);
+}
+
 /*
  * Convert node form directory to leaf form directory.
  * The root of the node form dir needs to already be a LEAFN block.
@@ -1840,18 +1857,17 @@ xfs_dir2_node_to_leaf(
 	free = fbp->data;
 	ASSERT(be32_to_cpu(free->hdr.magic) == XFS_DIR2_FREE_MAGIC);
 	ASSERT(!free->hdr.firstdb);
+
 	/*
 	 * Now see if the leafn and free data will fit in a leaf1.
 	 * If not, release the buffer and give up.
 	 */
-	if ((uint)sizeof(leaf->hdr) +
-	    (be16_to_cpu(leaf->hdr.count) - be16_to_cpu(leaf->hdr.stale)) * (uint)sizeof(leaf->ents[0]) +
-	    be32_to_cpu(free->hdr.nvalid) * (uint)sizeof(leaf->bests[0]) +
-	    (uint)sizeof(leaf->tail) >
-	    mp->m_dirblksize) {
+	if (xfs_dir2_leaf_size(&leaf->hdr, be32_to_cpu(free->hdr.nvalid)) >
+			mp->m_dirblksize) {
 		xfs_da_brelse(tp, fbp);
 		return 0;
 	}
+
 	/*
 	 * If the leaf has any stale entries in it, compress them out.
 	 * The compact routine will log the header.
@@ -1870,7 +1886,7 @@ xfs_dir2_node_to_leaf(
 	 * Set up the leaf bests table.
 	 */
 	memcpy(xfs_dir2_leaf_bests_p(ltp), free->bests,
-		be32_to_cpu(ltp->bestcount) * sizeof(leaf->bests[0]));
+		be32_to_cpu(ltp->bestcount) * sizeof(xfs_dir2_data_off_t));
 	xfs_dir2_leaf_log_bests(tp, lbp, 0, be32_to_cpu(ltp->bestcount) - 1);
 	xfs_dir2_leaf_log_tail(tp, lbp);
 	xfs_dir2_leaf_check(dp, lbp);
Index: xfs/fs/xfs/xfs_dir2_leaf.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dir2_leaf.h	2011-06-29 13:02:38.617881987 +0200
+++ xfs/fs/xfs/xfs_dir2_leaf.h	2011-06-29 13:44:03.491086939 +0200
@@ -72,10 +72,7 @@ typedef struct xfs_dir2_leaf_tail {
  */
 typedef struct xfs_dir2_leaf {
 	xfs_dir2_leaf_hdr_t	hdr;		/* leaf header */
-	xfs_dir2_leaf_entry_t	ents[1];	/* entries */
-						/* ... */
-	xfs_dir2_data_off_t	bests[1];	/* best free counts */
-	xfs_dir2_leaf_tail_t	tail;		/* leaf tail */
+	xfs_dir2_leaf_entry_t	ents[];	/* entries */
 } xfs_dir2_leaf_t;
 
 /*

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 23/27] xfs: remove the unused xfs_bufhash structure
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (20 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 22/27] xfs: cleanup struct xfs_dir2_leaf Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 24/27] xfs: clean up buffer locking helpers Christoph Hellwig
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-remove-bufhash --]
[-- Type: text/plain, Size: 686 bytes --]

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_buf.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_buf.h	2011-06-29 11:26:14.542550346 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_buf.h	2011-06-29 13:50:40.648935352 +0200
@@ -91,11 +91,6 @@ typedef enum {
 	XBT_FORCE_FLUSH = 1,
 } xfs_buftarg_flags_t;
 
-typedef struct xfs_bufhash {
-	struct list_head	bh_list;
-	spinlock_t		bh_lock;
-} xfs_bufhash_t;
-
 typedef struct xfs_buftarg {
 	dev_t			bt_dev;
 	struct block_device	*bt_bdev;

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 24/27] xfs: clean up buffer locking helpers
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (21 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 23/27] xfs: remove the unused xfs_bufhash structure Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 25/27] xfs: return the buffer locked from xfs_buf_get_uncached Christoph Hellwig
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-cleanup-buffer-locking --]
[-- Type: text/plain, Size: 11189 bytes --]

Rename xfs_buf_cond_lock to xfs_buf_trylock and reverse its return value to
match most other trylock operations in the kernel and XFS (with the
exception of down_trylock, after which xfs_buf_cond_lock was modelled), and
replace xfs_buf_lock_value with an xfs_buf_islocked helper for use in
asserts, or an open-coded variant in tracing.  Also remove the XFS_BUF_*
wrappers for all the locking helpers.
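
Illustration only: the calling convention changes from the down_trylock
style to the style of most other trylock helpers:

        /* before: zero means the lock was taken */
        if (xfs_buf_cond_lock(bp))
                return XFS_ITEM_LOCKED;

        /* after: non-zero means the lock was taken */
        if (!xfs_buf_trylock(bp))
                return XFS_ITEM_LOCKED;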

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_buf.c	2011-06-29 11:26:14.000000000 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_buf.c	2011-06-29 13:57:15.596795734 +0200
@@ -499,16 +499,14 @@ found:
 	spin_unlock(&pag->pag_buf_lock);
 	xfs_perag_put(pag);
 
-	if (xfs_buf_cond_lock(bp)) {
-		/* failed, so wait for the lock if requested. */
-		if (!(flags & XBF_TRYLOCK)) {
-			xfs_buf_lock(bp);
-			XFS_STATS_INC(xb_get_locked_waited);
-		} else {
+	if (!xfs_buf_trylock(bp)) {
+		if (flags & XBF_TRYLOCK) {
 			xfs_buf_rele(bp);
 			XFS_STATS_INC(xb_busy_locked);
 			return NULL;
 		}
+		xfs_buf_lock(bp);
+		XFS_STATS_INC(xb_get_locked_waited);
 	}
 
 	/*
@@ -896,8 +894,8 @@ xfs_buf_rele(
  *	to push on stale inode buffers.
  */
 int
-xfs_buf_cond_lock(
-	xfs_buf_t		*bp)
+xfs_buf_trylock(
+	struct xfs_buf		*bp)
 {
 	int			locked;
 
@@ -907,15 +905,8 @@ xfs_buf_cond_lock(
 	else if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
 		xfs_log_force(bp->b_target->bt_mount, 0);
 
-	trace_xfs_buf_cond_lock(bp, _RET_IP_);
-	return locked ? 0 : -EBUSY;
-}
-
-int
-xfs_buf_lock_value(
-	xfs_buf_t		*bp)
-{
-	return bp->b_sema.count;
+	trace_xfs_buf_trylock(bp, _RET_IP_);
+	return locked;
 }
 
 /*
@@ -929,7 +920,7 @@ xfs_buf_lock_value(
  */
 void
 xfs_buf_lock(
-	xfs_buf_t		*bp)
+	struct xfs_buf		*bp)
 {
 	trace_xfs_buf_lock(bp, _RET_IP_);
 
@@ -950,7 +941,7 @@ xfs_buf_lock(
  */
 void
 xfs_buf_unlock(
-	xfs_buf_t		*bp)
+	struct xfs_buf		*bp)
 {
 	if ((bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)) == XBF_DELWRI) {
 		atomic_inc(&bp->b_hold);
@@ -1694,7 +1685,7 @@ xfs_buf_delwri_split(
 	list_for_each_entry_safe(bp, n, dwq, b_list) {
 		ASSERT(bp->b_flags & XBF_DELWRI);
 
-		if (!XFS_BUF_ISPINNED(bp) && !xfs_buf_cond_lock(bp)) {
+		if (!XFS_BUF_ISPINNED(bp) && xfs_buf_trylock(bp)) {
 			if (!force &&
 			    time_before(jiffies, bp->b_queuetime + age)) {
 				xfs_buf_unlock(bp);
Index: xfs/fs/xfs/linux-2.6/xfs_buf.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_buf.h	2011-06-29 13:50:40.000000000 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_buf.h	2011-06-29 13:54:35.250997736 +0200
@@ -187,10 +187,11 @@ extern void xfs_buf_free(xfs_buf_t *);
 extern void xfs_buf_rele(xfs_buf_t *);
 
 /* Locking and Unlocking Buffers */
-extern int xfs_buf_cond_lock(xfs_buf_t *);
-extern int xfs_buf_lock_value(xfs_buf_t *);
+extern int xfs_buf_trylock(xfs_buf_t *);
 extern void xfs_buf_lock(xfs_buf_t *);
 extern void xfs_buf_unlock(xfs_buf_t *);
+#define xfs_buf_islocked(bp) \
+	((bp)->b_sema.count <= 0)
 
 /* Buffer Read and Write Routines */
 extern int xfs_bwrite(struct xfs_mount *mp, struct xfs_buf *bp);
@@ -308,10 +309,6 @@ xfs_buf_set_ref(
 
 #define XFS_BUF_ISPINNED(bp)	atomic_read(&((bp)->b_pin_count))
 
-#define XFS_BUF_VALUSEMA(bp)	xfs_buf_lock_value(bp)
-#define XFS_BUF_CPSEMA(bp)	(xfs_buf_cond_lock(bp) == 0)
-#define XFS_BUF_VSEMA(bp)	xfs_buf_unlock(bp)
-#define XFS_BUF_PSEMA(bp,x)	xfs_buf_lock(bp)
 #define XFS_BUF_FINISH_IOWAIT(bp)	complete(&bp->b_iowait);
 
 #define XFS_BUF_SET_TARGET(bp, target)	((bp)->b_target = (target))
Index: xfs/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_trace.h	2011-06-29 11:35:45.000000000 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_trace.h	2011-06-29 13:54:32.974343403 +0200
@@ -293,7 +293,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
 		__entry->buffer_length = bp->b_buffer_length;
 		__entry->hold = atomic_read(&bp->b_hold);
 		__entry->pincount = atomic_read(&bp->b_pin_count);
-		__entry->lockval = xfs_buf_lock_value(bp);
+		__entry->lockval = bp->b_sema.count;
 		__entry->flags = bp->b_flags;
 		__entry->caller_ip = caller_ip;
 	),
@@ -323,7 +323,7 @@ DEFINE_BUF_EVENT(xfs_buf_bawrite);
 DEFINE_BUF_EVENT(xfs_buf_bdwrite);
 DEFINE_BUF_EVENT(xfs_buf_lock);
 DEFINE_BUF_EVENT(xfs_buf_lock_done);
-DEFINE_BUF_EVENT(xfs_buf_cond_lock);
+DEFINE_BUF_EVENT(xfs_buf_trylock);
 DEFINE_BUF_EVENT(xfs_buf_unlock);
 DEFINE_BUF_EVENT(xfs_buf_iowait);
 DEFINE_BUF_EVENT(xfs_buf_iowait_done);
@@ -366,7 +366,7 @@ DECLARE_EVENT_CLASS(xfs_buf_flags_class,
 		__entry->flags = flags;
 		__entry->hold = atomic_read(&bp->b_hold);
 		__entry->pincount = atomic_read(&bp->b_pin_count);
-		__entry->lockval = xfs_buf_lock_value(bp);
+		__entry->lockval = bp->b_sema.count;
 		__entry->caller_ip = caller_ip;
 	),
 	TP_printk("dev %d:%d bno 0x%llx len 0x%zx hold %d pincount %d "
@@ -409,7 +409,7 @@ TRACE_EVENT(xfs_buf_ioerror,
 		__entry->buffer_length = bp->b_buffer_length;
 		__entry->hold = atomic_read(&bp->b_hold);
 		__entry->pincount = atomic_read(&bp->b_pin_count);
-		__entry->lockval = xfs_buf_lock_value(bp);
+		__entry->lockval = bp->b_sema.count;
 		__entry->error = error;
 		__entry->flags = bp->b_flags;
 		__entry->caller_ip = caller_ip;
@@ -454,7 +454,7 @@ DECLARE_EVENT_CLASS(xfs_buf_item_class,
 		__entry->buf_flags = bip->bli_buf->b_flags;
 		__entry->buf_hold = atomic_read(&bip->bli_buf->b_hold);
 		__entry->buf_pincount = atomic_read(&bip->bli_buf->b_pin_count);
-		__entry->buf_lockval = xfs_buf_lock_value(bip->bli_buf);
+		__entry->buf_lockval = bip->bli_buf->b_sema.count;
 		__entry->li_desc = bip->bli_item.li_desc;
 		__entry->li_flags = bip->bli_item.li_flags;
 	),
Index: xfs/fs/xfs/quota/xfs_dquot.c
===================================================================
--- xfs.orig/fs/xfs/quota/xfs_dquot.c	2011-05-11 08:41:56.000000000 +0200
+++ xfs/fs/xfs/quota/xfs_dquot.c	2011-06-29 13:53:07.801471491 +0200
@@ -318,7 +318,7 @@ xfs_qm_init_dquot_blk(
 
 	ASSERT(tp);
 	ASSERT(XFS_BUF_ISBUSY(bp));
-	ASSERT(XFS_BUF_VALUSEMA(bp) <= 0);
+	ASSERT(xfs_buf_islocked(bp));
 
 	d = (xfs_dqblk_t *)XFS_BUF_PTR(bp);
 
@@ -534,7 +534,7 @@ xfs_qm_dqtobp(
 	}
 
 	ASSERT(XFS_BUF_ISBUSY(bp));
-	ASSERT(XFS_BUF_VALUSEMA(bp) <= 0);
+	ASSERT(xfs_buf_islocked(bp));
 
 	/*
 	 * calculate the location of the dquot inside the buffer.
@@ -622,7 +622,7 @@ xfs_qm_dqread(
 	 * brelse it because we have the changes incore.
 	 */
 	ASSERT(XFS_BUF_ISBUSY(bp));
-	ASSERT(XFS_BUF_VALUSEMA(bp) <= 0);
+	ASSERT(xfs_buf_islocked(bp));
 	xfs_trans_brelse(tp, bp);
 
 	return (error);
Index: xfs/fs/xfs/xfs_buf_item.c
===================================================================
--- xfs.orig/fs/xfs/xfs_buf_item.c	2011-04-22 06:21:45.000000000 +0200
+++ xfs/fs/xfs/xfs_buf_item.c	2011-06-29 13:53:20.938066990 +0200
@@ -420,7 +420,7 @@ xfs_buf_item_unpin(
 
 	if (freed && stale) {
 		ASSERT(bip->bli_flags & XFS_BLI_STALE);
-		ASSERT(XFS_BUF_VALUSEMA(bp) <= 0);
+		ASSERT(xfs_buf_islocked(bp));
 		ASSERT(!(XFS_BUF_ISDELAYWRITE(bp)));
 		ASSERT(XFS_BUF_ISSTALE(bp));
 		ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL);
@@ -483,7 +483,7 @@ xfs_buf_item_trylock(
 
 	if (XFS_BUF_ISPINNED(bp))
 		return XFS_ITEM_PINNED;
-	if (!XFS_BUF_CPSEMA(bp))
+	if (!xfs_buf_trylock(bp))
 		return XFS_ITEM_LOCKED;
 
 	/* take a reference to the buffer.  */
@@ -905,7 +905,7 @@ xfs_buf_attach_iodone(
 	xfs_log_item_t	*head_lip;
 
 	ASSERT(XFS_BUF_ISBUSY(bp));
-	ASSERT(XFS_BUF_VALUSEMA(bp) <= 0);
+	ASSERT(xfs_buf_islocked(bp));
 
 	lip->li_cb = cb;
 	if (XFS_BUF_FSPRIVATE(bp, void *) != NULL) {
Index: xfs/fs/xfs/xfs_log.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log.c	2011-06-17 14:07:57.000000000 +0200
+++ xfs/fs/xfs/xfs_log.c	2011-06-29 13:53:33.954663139 +0200
@@ -1059,7 +1059,7 @@ xlog_alloc_log(xfs_mount_t	*mp,
 	XFS_BUF_SET_IODONE_FUNC(bp, xlog_iodone);
 	XFS_BUF_SET_FSPRIVATE2(bp, (unsigned long)1);
 	ASSERT(XFS_BUF_ISBUSY(bp));
-	ASSERT(XFS_BUF_VALUSEMA(bp) <= 0);
+	ASSERT(xfs_buf_islocked(bp));
 	log->l_xbuf = bp;
 
 	spin_lock_init(&log->l_icloglock);
@@ -1090,7 +1090,7 @@ xlog_alloc_log(xfs_mount_t	*mp,
 						log->l_iclog_size, 0);
 		if (!bp)
 			goto out_free_iclog;
-		if (!XFS_BUF_CPSEMA(bp))
+		if (!xfs_buf_trylock(bp))
 			ASSERT(0);
 		XFS_BUF_SET_IODONE_FUNC(bp, xlog_iodone);
 		XFS_BUF_SET_FSPRIVATE2(bp, (unsigned long)1);
@@ -1118,7 +1118,7 @@ xlog_alloc_log(xfs_mount_t	*mp,
 		iclog->ic_datap = (char *)iclog->ic_data + log->l_iclog_hsize;
 
 		ASSERT(XFS_BUF_ISBUSY(iclog->ic_bp));
-		ASSERT(XFS_BUF_VALUSEMA(iclog->ic_bp) <= 0);
+		ASSERT(xfs_buf_islocked(iclog->ic_bp));
 		init_waitqueue_head(&iclog->ic_force_wait);
 		init_waitqueue_head(&iclog->ic_write_wait);
 
Index: xfs/fs/xfs/xfs_log_recover.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_recover.c	2011-05-20 15:25:52.000000000 +0200
+++ xfs/fs/xfs/xfs_log_recover.c	2011-06-29 13:51:20.425386530 +0200
@@ -264,7 +264,7 @@ xlog_bwrite(
 	XFS_BUF_ZEROFLAGS(bp);
 	XFS_BUF_BUSY(bp);
 	XFS_BUF_HOLD(bp);
-	XFS_BUF_PSEMA(bp, PRIBIO);
+	xfs_buf_lock(bp);
 	XFS_BUF_SET_COUNT(bp, BBTOB(nbblks));
 	XFS_BUF_SET_TARGET(bp, log->l_mp->m_logdev_targp);
 
Index: xfs/fs/xfs/xfs_mount.c
===================================================================
--- xfs.orig/fs/xfs/xfs_mount.c	2011-06-29 11:38:53.000000000 +0200
+++ xfs/fs/xfs/xfs_mount.c	2011-06-29 13:51:20.425386530 +0200
@@ -1941,22 +1941,19 @@ unwind:
  * the superblock buffer if it can be locked without sleeping.
  * If it can't then we'll return NULL.
  */
-xfs_buf_t *
+struct xfs_buf *
 xfs_getsb(
-	xfs_mount_t	*mp,
-	int		flags)
+	struct xfs_mount	*mp,
+	int			flags)
 {
-	xfs_buf_t	*bp;
+	struct xfs_buf		*bp = mp->m_sb_bp;
 
-	ASSERT(mp->m_sb_bp != NULL);
-	bp = mp->m_sb_bp;
-	if (flags & XBF_TRYLOCK) {
-		if (!XFS_BUF_CPSEMA(bp)) {
+	if (!xfs_buf_trylock(bp)) {
+		if (flags & XBF_TRYLOCK)
 			return NULL;
-		}
-	} else {
-		XFS_BUF_PSEMA(bp, PRIBIO);
+		xfs_buf_lock(bp);
 	}
+
 	XFS_BUF_HOLD(bp);
 	ASSERT(XFS_BUF_ISDONE(bp));
 	return bp;
Index: xfs/fs/xfs/xfs_trans_buf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_buf.c	2011-03-27 23:52:57.000000000 +0200
+++ xfs/fs/xfs/xfs_trans_buf.c	2011-06-29 13:53:47.084592005 +0200
@@ -160,7 +160,7 @@ xfs_trans_get_buf(xfs_trans_t	*tp,
 	 */
 	bp = xfs_trans_buf_item_match(tp, target_dev, blkno, len);
 	if (bp != NULL) {
-		ASSERT(XFS_BUF_VALUSEMA(bp) <= 0);
+		ASSERT(xfs_buf_islocked(bp));
 		if (XFS_FORCED_SHUTDOWN(tp->t_mountp))
 			XFS_BUF_SUPER_STALE(bp);
 
@@ -327,7 +327,7 @@ xfs_trans_read_buf(
 	 */
 	bp = xfs_trans_buf_item_match(tp, target, blkno, len);
 	if (bp != NULL) {
-		ASSERT(XFS_BUF_VALUSEMA(bp) <= 0);
+		ASSERT(xfs_buf_islocked(bp));
 		ASSERT(XFS_BUF_FSPRIVATE2(bp, xfs_trans_t *) == tp);
 		ASSERT(XFS_BUF_FSPRIVATE(bp, void *) != NULL);
 		ASSERT((XFS_BUF_ISERROR(bp)) == 0);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 25/27] xfs: return the buffer locked from xfs_buf_get_uncached
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (22 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 24/27] xfs: clean up buffer locking helpers Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 26/27] xfs: cleanup I/O-related buffer flags Christoph Hellwig
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-buf_get_uncached-locked-buffer --]
[-- Type: text/plain, Size: 2875 bytes --]

All other xfs_buf_get/read-like helpers return the buffer locked, so make
sure xfs_buf_get_uncached isn't different for no reason.  Half of the
callers already lock it directly afterwards, and the others probably should
also keep it locked, if only for consistency and for being able to use
xfs_buf_rele, but I'll leave that for later.
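
Illustration only: a caller that still wants the old unlocked behaviour now
drops the lock itself right after getting the buffer, e.g. (sketch; target
and size arguments are made up for the example):

        bp = xfs_buf_get_uncached(mp->m_ddev_targp, size, 0);
        if (!bp)
                return XFS_ERROR(ENOMEM);
        xfs_buf_unlock(bp);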

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_buf.c	2011-06-29 13:57:15.596795734 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_buf.c	2011-06-29 13:57:32.243372220 +0200
@@ -679,7 +679,6 @@ xfs_buf_read_uncached(
 		return NULL;
 
 	/* set up the buffer for a read IO */
-	xfs_buf_lock(bp);
 	XFS_BUF_SET_ADDR(bp, daddr);
 	XFS_BUF_READ(bp);
 	XFS_BUF_BUSY(bp);
@@ -814,8 +813,6 @@ xfs_buf_get_uncached(
 		goto fail_free_mem;
 	}
 
-	xfs_buf_unlock(bp);
-
 	trace_xfs_buf_get_uncached(bp, _RET_IP_);
 	return bp;
 
Index: xfs/fs/xfs/xfs_log.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log.c	2011-06-29 13:53:33.954663139 +0200
+++ xfs/fs/xfs/xfs_log.c	2011-06-29 13:57:32.243372220 +0200
@@ -1090,8 +1090,7 @@ xlog_alloc_log(xfs_mount_t	*mp,
 						log->l_iclog_size, 0);
 		if (!bp)
 			goto out_free_iclog;
-		if (!xfs_buf_trylock(bp))
-			ASSERT(0);
+
 		XFS_BUF_SET_IODONE_FUNC(bp, xlog_iodone);
 		XFS_BUF_SET_FSPRIVATE2(bp, (unsigned long)1);
 		iclog->ic_bp = bp;
Index: xfs/fs/xfs/xfs_log_recover.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_recover.c	2011-06-29 13:51:20.425386530 +0200
+++ xfs/fs/xfs/xfs_log_recover.c	2011-06-29 13:57:32.246705535 +0200
@@ -91,6 +91,8 @@ xlog_get_bp(
 	xlog_t		*log,
 	int		nbblks)
 {
+	struct xfs_buf	*bp;
+
 	if (!xlog_buf_bbcount_valid(log, nbblks)) {
 		xfs_warn(log->l_mp, "Invalid block length (0x%x) for buffer",
 			nbblks);
@@ -118,8 +120,10 @@ xlog_get_bp(
 		nbblks += log->l_sectBBsize;
 	nbblks = round_up(nbblks, log->l_sectBBsize);
 
-	return xfs_buf_get_uncached(log->l_mp->m_logdev_targp,
-					BBTOB(nbblks), 0);
+	bp = xfs_buf_get_uncached(log->l_mp->m_logdev_targp, BBTOB(nbblks), 0);
+	if (bp)
+		xfs_buf_unlock(bp);
+	return bp;
 }
 
 STATIC void
Index: xfs/fs/xfs/xfs_vnodeops.c
===================================================================
--- xfs.orig/fs/xfs/xfs_vnodeops.c	2011-06-29 11:35:45.789455635 +0200
+++ xfs/fs/xfs/xfs_vnodeops.c	2011-06-29 13:57:32.250038850 +0200
@@ -1969,6 +1969,8 @@ xfs_zero_remaining_bytes(
 	if (!bp)
 		return XFS_ERROR(ENOMEM);
 
+	xfs_buf_unlock(bp);
+
 	for (offset = startoff; offset <= endoff; offset = lastoffset + 1) {
 		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 		nimap = 1;

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 26/27] xfs: cleanup I/O-related buffer flags
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (23 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 25/27] xfs: return the buffer locked from xfs_buf_get_uncached Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-29 14:01 ` [PATCH 27/27] xfs: avoid a few disk cache flushes Christoph Hellwig
  2011-06-30  6:36 ` [PATCH 00/27] patch queue for Linux 3.1 Dave Chinner
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-buf-cleanup-flags --]
[-- Type: text/plain, Size: 15219 bytes --]

Remove the unused and misnamed _XBF_RUN_QUEUES flag, rename XBF_LOG_BUFFER
to the more fitting XBF_SYNCIO, and split XBF_ORDERED into XBF_FUA and
XBF_FLUSH to allow more fine-grained control over the bio flags.  Also
clean up the processing of the flags in _xfs_buf_ioapply so it makes more
sense, and renumber the sparse flag number space to group flags by purpose.
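
Illustration only: where a log write previously set the single XBF_ORDERED
flag, the desired bio semantics are now selected individually (sketch, flag
usage assumed):

        bp->b_flags |= XBF_SYNCIO | XBF_FUA | XBF_FLUSH;

which _xfs_buf_ioapply turns into WRITE_SYNC | REQ_FUA | REQ_FLUSH, roughly
matching the old WRITE_FLUSH_FUA behaviour of XBF_ORDERED.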

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_buf.c	2011-06-29 14:04:28.084452749 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_buf.c	2011-06-29 14:13:45.171434748 +0200
@@ -592,10 +592,8 @@ _xfs_buf_read(
 	ASSERT(!(flags & (XBF_DELWRI|XBF_WRITE)));
 	ASSERT(bp->b_bn != XFS_BUF_DADDR_NULL);
 
-	bp->b_flags &= ~(XBF_WRITE | XBF_ASYNC | XBF_DELWRI | \
-			XBF_READ_AHEAD | _XBF_RUN_QUEUES);
-	bp->b_flags |= flags & (XBF_READ | XBF_ASYNC | \
-			XBF_READ_AHEAD | _XBF_RUN_QUEUES);
+	bp->b_flags &= ~(XBF_WRITE | XBF_ASYNC | XBF_DELWRI | XBF_READ_AHEAD);
+	bp->b_flags |= flags & (XBF_READ | XBF_ASYNC | XBF_READ_AHEAD);
 
 	status = xfs_buf_iorequest(bp);
 	if (status || XFS_BUF_ISERROR(bp) || (flags & XBF_ASYNC))
@@ -1211,23 +1209,21 @@ _xfs_buf_ioapply(
 	total_nr_pages = bp->b_page_count;
 	map_i = 0;
 
-	if (bp->b_flags & XBF_ORDERED) {
-		ASSERT(!(bp->b_flags & XBF_READ));
-		rw = WRITE_FLUSH_FUA;
-	} else if (bp->b_flags & XBF_LOG_BUFFER) {
-		ASSERT(!(bp->b_flags & XBF_READ_AHEAD));
-		bp->b_flags &= ~_XBF_RUN_QUEUES;
-		rw = (bp->b_flags & XBF_WRITE) ? WRITE_SYNC : READ_SYNC;
-	} else if (bp->b_flags & _XBF_RUN_QUEUES) {
-		ASSERT(!(bp->b_flags & XBF_READ_AHEAD));
-		bp->b_flags &= ~_XBF_RUN_QUEUES;
-		rw = (bp->b_flags & XBF_WRITE) ? WRITE_META : READ_META;
+	if (bp->b_flags & XBF_WRITE) {
+		if (bp->b_flags & XBF_SYNCIO)
+			rw = WRITE_SYNC;
+		else
+			rw = WRITE;
+		if (bp->b_flags & XBF_FUA)
+			rw |= REQ_FUA;
+		if (bp->b_flags & XBF_FLUSH)
+			rw |= REQ_FLUSH;
+	} else if (bp->b_flags & XBF_READ_AHEAD) {
+		rw = READ;
 	} else {
-		rw = (bp->b_flags & XBF_WRITE) ? WRITE :
-		     (bp->b_flags & XBF_READ_AHEAD) ? READA : READ;
+		rw = READ;
 	}
 
-
 next_chunk:
 	atomic_inc(&bp->b_io_remaining);
 	nr_pages = BIO_MAX_SECTORS >> (PAGE_SHIFT - BBSHIFT);
@@ -1689,8 +1685,7 @@ xfs_buf_delwri_split(
 				break;
 			}
 
-			bp->b_flags &= ~(XBF_DELWRI|_XBF_DELWRI_Q|
-					 _XBF_RUN_QUEUES);
+			bp->b_flags &= ~(XBF_DELWRI | _XBF_DELWRI_Q);
 			bp->b_flags |= XBF_WRITE;
 			list_move_tail(&bp->b_list, list);
 			trace_xfs_buf_delwri_split(bp, _RET_IP_);
Index: xfs/fs/xfs/linux-2.6/xfs_buf.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_buf.h	2011-06-29 14:03:57.994615760 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_buf.h	2011-06-29 14:18:16.806629842 +0200
@@ -46,43 +46,46 @@ typedef enum {
 
 #define XBF_READ	(1 << 0) /* buffer intended for reading from device */
 #define XBF_WRITE	(1 << 1) /* buffer intended for writing to device */
-#define XBF_MAPPED	(1 << 2) /* buffer mapped (b_addr valid) */
+#define XBF_READ_AHEAD	(1 << 2) /* asynchronous read-ahead */
+#define XBF_MAPPED	(1 << 3) /* buffer mapped (b_addr valid) */
 #define XBF_ASYNC	(1 << 4) /* initiator will not wait for completion */
 #define XBF_DONE	(1 << 5) /* all pages in the buffer uptodate */
 #define XBF_DELWRI	(1 << 6) /* buffer has dirty pages */
 #define XBF_STALE	(1 << 7) /* buffer has been staled, do not find it */
-#define XBF_ORDERED	(1 << 11)/* use ordered writes */
-#define XBF_READ_AHEAD	(1 << 12)/* asynchronous read-ahead */
-#define XBF_LOG_BUFFER	(1 << 13)/* this is a buffer used for the log */
+
+/* I/O hints for the BIO layer */
+#define XBF_SYNCIO	(1 << 10)/* treat this buffer as synchronous I/O */
+#define XBF_FUA		(1 << 11)/* force cache write through mode */
+#define XBF_FLUSH	(1 << 12)/* flush the disk cache before a write */
 
 /* flags used only as arguments to access routines */
-#define XBF_LOCK	(1 << 14)/* lock requested */
-#define XBF_TRYLOCK	(1 << 15)/* lock requested, but do not wait */
-#define XBF_DONT_BLOCK	(1 << 16)/* do not block in current thread */
+#define XBF_LOCK	(1 << 15)/* lock requested */
+#define XBF_TRYLOCK	(1 << 16)/* lock requested, but do not wait */
+#define XBF_DONT_BLOCK	(1 << 17)/* do not block in current thread */
 
 /* flags used only internally */
-#define _XBF_PAGES	(1 << 18)/* backed by refcounted pages */
-#define	_XBF_RUN_QUEUES	(1 << 19)/* run block device task queue	*/
-#define	_XBF_KMEM	(1 << 20)/* backed by heap memory */
-#define _XBF_DELWRI_Q	(1 << 21)/* buffer on delwri queue */
+#define _XBF_PAGES	(1 << 20)/* backed by refcounted pages */
+#define	_XBF_KMEM	(1 << 21)/* backed by heap memory */
+#define _XBF_DELWRI_Q	(1 << 22)/* buffer on delwri queue */
 
 typedef unsigned int xfs_buf_flags_t;
 
 #define XFS_BUF_FLAGS \
 	{ XBF_READ,		"READ" }, \
 	{ XBF_WRITE,		"WRITE" }, \
+	{ XBF_READ_AHEAD,	"READ_AHEAD" }, \
 	{ XBF_MAPPED,		"MAPPED" }, \
 	{ XBF_ASYNC,		"ASYNC" }, \
 	{ XBF_DONE,		"DONE" }, \
 	{ XBF_DELWRI,		"DELWRI" }, \
 	{ XBF_STALE,		"STALE" }, \
-	{ XBF_ORDERED,		"ORDERED" }, \
-	{ XBF_READ_AHEAD,	"READ_AHEAD" }, \
+	{ XBF_SYNCIO,		"SYNCIO" }, \
+	{ XBF_FUA,		"FUA" }, \
+	{ XBF_FLUSH,		"FLUSH" }, \
 	{ XBF_LOCK,		"LOCK" },  	/* should never be set */\
 	{ XBF_TRYLOCK,		"TRYLOCK" }, 	/* ditto */\
 	{ XBF_DONT_BLOCK,	"DONT_BLOCK" },	/* ditto */\
 	{ _XBF_PAGES,		"PAGES" }, \
-	{ _XBF_RUN_QUEUES,	"RUN_QUEUES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
 
@@ -230,8 +233,9 @@ extern void xfs_buf_terminate(void);
 
 
 #define XFS_BUF_BFLAGS(bp)	((bp)->b_flags)
-#define XFS_BUF_ZEROFLAGS(bp)	((bp)->b_flags &= \
-		~(XBF_READ|XBF_WRITE|XBF_ASYNC|XBF_DELWRI|XBF_ORDERED))
+#define XFS_BUF_ZEROFLAGS(bp) \
+	((bp)->b_flags &= ~(XBF_READ|XBF_WRITE|XBF_ASYNC|XBF_DELWRI| \
+			    XBF_SYNCIO|XBF_FUA|XBF_FLUSH))
 
 void xfs_buf_stale(struct xfs_buf *bp);
 #define XFS_BUF_STALE(bp)	xfs_buf_stale(bp);
@@ -263,10 +267,6 @@ void xfs_buf_stale(struct xfs_buf *bp);
 #define XFS_BUF_UNASYNC(bp)	((bp)->b_flags &= ~XBF_ASYNC)
 #define XFS_BUF_ISASYNC(bp)	((bp)->b_flags & XBF_ASYNC)
 
-#define XFS_BUF_ORDERED(bp)	((bp)->b_flags |= XBF_ORDERED)
-#define XFS_BUF_UNORDERED(bp)	((bp)->b_flags &= ~XBF_ORDERED)
-#define XFS_BUF_ISORDERED(bp)	((bp)->b_flags & XBF_ORDERED)
-
 #define XFS_BUF_HOLD(bp)	xfs_buf_hold(bp)
 #define XFS_BUF_READ(bp)	((bp)->b_flags |= XBF_READ)
 #define XFS_BUF_UNREAD(bp)	((bp)->b_flags &= ~XBF_READ)
Index: xfs/fs/xfs/xfs_log.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log.c	2011-06-29 14:04:18.587837528 +0200
+++ xfs/fs/xfs/xfs_log.c	2011-06-29 14:13:47.761420718 +0200
@@ -1268,7 +1268,6 @@ xlog_bdstrat(
 		return 0;
 	}
 
-	bp->b_flags |= _XBF_RUN_QUEUES;
 	xfs_buf_iorequest(bp);
 	return 0;
 }
@@ -1369,7 +1368,7 @@ xlog_sync(xlog_t		*log,
 	XFS_BUF_ZEROFLAGS(bp);
 	XFS_BUF_BUSY(bp);
 	XFS_BUF_ASYNC(bp);
-	bp->b_flags |= XBF_LOG_BUFFER;
+	bp->b_flags |= XBF_SYNCIO;
 
 	if (log->l_mp->m_flags & XFS_MOUNT_BARRIER) {
 		/*
@@ -1380,7 +1379,7 @@ xlog_sync(xlog_t		*log,
 		 */
 		if (log->l_mp->m_logdev_targp != log->l_mp->m_ddev_targp)
 			xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
-		XFS_BUF_ORDERED(bp);
+		bp->b_flags |= XBF_FUA | XBF_FLUSH;
 	}
 
 	ASSERT(XFS_BUF_ADDR(bp) <= log->l_logBBsize-1);
@@ -1413,9 +1412,9 @@ xlog_sync(xlog_t		*log,
 		XFS_BUF_ZEROFLAGS(bp);
 		XFS_BUF_BUSY(bp);
 		XFS_BUF_ASYNC(bp);
-		bp->b_flags |= XBF_LOG_BUFFER;
+		bp->b_flags |= XBF_SYNCIO;
 		if (log->l_mp->m_flags & XFS_MOUNT_BARRIER)
-			XFS_BUF_ORDERED(bp);
+			bp->b_flags |= XBF_FUA | XBF_FLUSH;
 		dptr = XFS_BUF_PTR(bp);
 		/*
 		 * Bump the cycle numbers at the start of each block

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 27/27] xfs: avoid a few disk cache flushes
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (24 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 26/27] xfs: cleanup I/O-related buffer flags Christoph Hellwig
@ 2011-06-29 14:01 ` Christoph Hellwig
  2011-06-30  6:36 ` [PATCH 00/27] patch queue for Linux 3.1 Dave Chinner
  26 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-29 14:01 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-avoid-cache-flushes --]
[-- Type: text/plain, Size: 1999 bytes --]

There is no need for a pre-flush when writing the second part of a
split log buffer, and if we are using an external log there is no need
to do a full cache flush of the log device at all given that all writes
to it use the FUA flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_log.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log.c	2011-06-29 14:27:59.166808258 +0200
+++ xfs/fs/xfs/xfs_log.c	2011-06-29 14:30:22.779363576 +0200
@@ -1371,15 +1371,21 @@ xlog_sync(xlog_t		*log,
 	bp->b_flags |= XBF_SYNCIO;
 
 	if (log->l_mp->m_flags & XFS_MOUNT_BARRIER) {
+		bp->b_flags |= XBF_FUA;
+
 		/*
-		 * If we have an external log device, flush the data device
-		 * before flushing the log to make sure all meta data
-		 * written back from the AIL actually made it to disk
-		 * before writing out the new log tail LSN in the log buffer.
+		 * Flush the data device before flushing the log to make
+		 * sure all meta data written back from the AIL actually made
+		 * it to disk before stamping the new log tail LSN into the
+		 * log buffer.  For an external log we need to issue the
+		 * flush explicitly, and unfortunately synchronously here;
+		 * for an internal log we can simply use the block layer
+		 * state machine for preflushes.
 		 */
 		if (log->l_mp->m_logdev_targp != log->l_mp->m_ddev_targp)
 			xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
-		bp->b_flags |= XBF_FUA | XBF_FLUSH;
+		else
+			bp->b_flags |= XBF_FLUSH;
 	}
 
 	ASSERT(XFS_BUF_ADDR(bp) <= log->l_logBBsize-1);
@@ -1414,7 +1420,7 @@ xlog_sync(xlog_t		*log,
 		XFS_BUF_ASYNC(bp);
 		bp->b_flags |= XBF_SYNCIO;
 		if (log->l_mp->m_flags & XFS_MOUNT_BARRIER)
-			bp->b_flags |= XBF_FUA | XBF_FLUSH;
+			bp->b_flags |= XBF_FUA;
 		dptr = XFS_BUF_PTR(bp);
 		/*
 		 * Bump the cycle numbers at the start of each block

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 04/27] xfs: cleanup xfs_add_to_ioend
  2011-06-29 14:01 ` [PATCH 04/27] xfs: cleanup xfs_add_to_ioend Christoph Hellwig
@ 2011-06-29 22:13   ` Alex Elder
  2011-06-30  2:00   ` Dave Chinner
  1 sibling, 0 replies; 100+ messages in thread
From: Alex Elder @ 2011-06-29 22:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, 2011-06-29 at 10:01 -0400, Christoph Hellwig wrote:
> plain text document attachment (xfs-cleanup-xfs_add_to_ioend)
> Pass the writeback context to xfs_add_to_ioend to make the ioend
> chain manipulations self-contained in this function.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.  I much prefer what you've done to how
it was before.

Reviewed-by: Alex Elder <aelder@sgi.com>

PS  I still have to review patch 3 in this series.


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 05/27] xfs: work around bogus gcc warning in xfs_allocbt_init_cursor
  2011-06-29 14:01 ` [PATCH 05/27] xfs: work around bogus gcc warning in xfs_allocbt_init_cursor Christoph Hellwig
@ 2011-06-29 22:13   ` Alex Elder
  0 siblings, 0 replies; 100+ messages in thread
From: Alex Elder @ 2011-06-29 22:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, 2011-06-29 at 10:01 -0400, Christoph Hellwig wrote:
> plain text document attachment
> (xfs-fix-xfs_allocbt_init_cursor-warning)
> GCC 4.6 complains that an array subscript is above array bounds when
> using the btree index to index into the agf_levels array.  The only
> two indices passed in are 0 and 1, and we have an assert ensuring that.
> 
> Replace the trick of using the array index directly with using constants
> in the already existing branch for assigning the XFS_BTREE_LASTREC_UPDATE
> flag.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 06/27] xfs: split xfs_setattr
  2011-06-29 14:01 ` [PATCH 06/27] xfs: split xfs_setattr Christoph Hellwig
@ 2011-06-29 22:13   ` Alex Elder
  2011-06-30  7:03     ` Christoph Hellwig
  2011-06-30  2:11   ` Dave Chinner
  1 sibling, 1 reply; 100+ messages in thread
From: Alex Elder @ 2011-06-29 22:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, 2011-06-29 at 10:01 -0400, Christoph Hellwig wrote:
> plain text document attachment (xfs-split-setattr)
> Split up xfs_setattr into two functions, one for the complex truncate
> handling, and one for the trivial attribute updates.  Also move both
> new routines to xfs_iops.c as they are fairly Linux-specific.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Nice simplification (good start anyway...).

Looks good but I think that you need to mask off the
ia_valid bits in the calls now made in xfs_vn_setattr().
Also, I think you may still need to check the file type
for the size-setting function.  Details below.

Seems like a semi-exhaustive setattr() checker
test program would be pretty easy to create.
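
Something along these lines would do as a starting point - a purely
hypothetical userspace sketch, nothing XFS-specific about it and not
tested:

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <utime.h>
	#include <sys/stat.h>

	int main(void)
	{
		const char *name = "setattr-check";
		int fd = open(name, O_CREAT | O_RDWR, 0644);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (ftruncate(fd, 4096))	/* ATTR_SIZE path */
			perror("ftruncate");
		if (fchmod(fd, 0600))		/* ATTR_MODE path */
			perror("fchmod");
		if (utime(name, NULL))		/* ATTR_ATIME/ATTR_MTIME path */
			perror("utime");
		close(fd);
		unlink(name);
		return 0;
	}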

If you fix these things (or you point out I'm
mistaken), you can consider this:
    Reviewed-by: Alex Elder <aelder@sgi.com>

> Index: xfs/fs/xfs/linux-2.6/xfs_iops.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/xfs_iops.c	2011-06-29 11:29:02.684972774 +0200
> +++ xfs/fs/xfs/linux-2.6/xfs_iops.c	2011-06-29 11:29:07.154948558 +0200
> @@ -39,6 +39,7 @@
>  #include "xfs_buf_item.h"
>  #include "xfs_utils.h"
>  #include "xfs_vnodeops.h"
> +#include "xfs_inode_item.h"
>  #include "xfs_trace.h"
>  
>  #include <linux/capability.h>
> @@ -497,12 +498,449 @@ xfs_vn_getattr(
>  	return 0;
>  }
>  
> +int
> +xfs_setattr_nonsize(
> +	struct xfs_inode	*ip,
> +	struct iattr		*iattr,
> +	int			flags)
> +{
> +	xfs_mount_t		*mp = ip->i_mount;
> +	struct inode		*inode = VFS_I(ip);
> +	int			mask = iattr->ia_valid;
> +	xfs_trans_t		*tp;
> +	int			error;
> +	uid_t			uid = 0, iuid = 0;
> +	gid_t			gid = 0, igid = 0;
> +	struct xfs_dquot	*udqp = NULL, *gdqp = NULL;
> +	struct xfs_dquot	*olddquot1 = NULL, *olddquot2 = NULL;
> +
> +	trace_xfs_setattr(ip);
> +
> +	if (mp->m_flags & XFS_MOUNT_RDONLY)
> +		return XFS_ERROR(EROFS);
> +
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return XFS_ERROR(EIO);
> +
> +	error = -inode_change_ok(inode, iattr);
> +	if (error)
> +		return XFS_ERROR(error);
> +
> +	ASSERT((mask & ATTR_SIZE) == 0);

You need to mask ATTR_SIZE off in xfs_vn_setattr() if
you're going to make this assertion.  But you might
as well just mask it off locally I suppose (though
it's nice to have the explicitness of the assertion
here).

> +
> +	/*
> +	 * If disk quotas is on, we make sure that the dquots do exist on disk,
> +	 * before we start any other transactions. Trying to do this later
> +	 * is messy. We don't care to take a readlock to look at the ids
> +	 * in inode here, because we can't hold it across the trans_reserve.

. . .

> +}
> +
> +/*
> + * Truncate file.  Must have write permission and not be a directory.
> + */
> +int
> +xfs_setattr_size(
> +	struct xfs_inode	*ip,
> +	struct iattr		*iattr,
> +	int			flags)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct inode		*inode = VFS_I(ip);
> +	int			mask = iattr->ia_valid;
> +	struct xfs_trans	*tp;
> +	int			error;
> +	uint			lock_flags;
> +	uint			commit_flags = 0;
> +
> +	trace_xfs_setattr(ip);
> +
> +	if (mp->m_flags & XFS_MOUNT_RDONLY)
> +		return XFS_ERROR(EROFS);
> +
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return XFS_ERROR(EIO);
> +
> +	error = -inode_change_ok(inode, iattr);
> +	if (error)
> +		return XFS_ERROR(error);
> +
> +	ASSERT(S_ISREG(ip->i_d.di_mode));

There is nothing in xfs_vn_setattr(), nor--as far as
I can tell--anywhere else that reaches xfs_vn_setattr()
that will ensure this.  (Maybe it's unreachable for
a directory or whatever, but that's not obvious.)

You may have to put back the code that returns
EISDIR or EINVAL based on the file type rather
than do this assertion.  Seems like inode_change_ok()
might have been able to do this check for us.

> +	ASSERT((mask & (ATTR_MODE|ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET|
> +			ATTR_MTIME_SET|ATTR_KILL_SUID|ATTR_KILL_SGID|
> +			ATTR_KILL_PRIV|ATTR_TIMES_SET)) == 0);

You'll have to mask these off in xfs_vn_setattr() if you're
going to make this assertion.
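
Purely to illustrate the masking I mean - hypothetical, untested, and
not part of the patch as posted:

	STATIC int
	xfs_vn_setattr(
		struct dentry	*dentry,
		struct iattr	*iattr)
	{
		if (iattr->ia_valid & ATTR_SIZE) {
			/* drop the attribute-only bits so the ASSERT holds */
			iattr->ia_valid &= ~(ATTR_MODE|ATTR_UID|ATTR_GID|
					     ATTR_ATIME|ATTR_ATIME_SET|
					     ATTR_MTIME_SET|ATTR_KILL_SUID|
					     ATTR_KILL_SGID|ATTR_KILL_PRIV|
					     ATTR_TIMES_SET);
			return -xfs_setattr_size(XFS_I(dentry->d_inode),
						 iattr, 0);
		}
		return -xfs_setattr_nonsize(XFS_I(dentry->d_inode), iattr, 0);
	}

Whether silently dropping those bits on the truncate path is what you
actually want is a separate question, of course.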

> +
> +	lock_flags = XFS_ILOCK_EXCL;
> +	if (!(flags & XFS_ATTR_NOLOCK))
> +		lock_flags |= XFS_IOLOCK_EXCL;
> +	xfs_ilock(ip, lock_flags);
> +
> +	/*
> +	 * Short circuit the truncate case for zero length files.
> +	 */

. . .

> +	goto out_unlock;
> +}
> +
>  STATIC int
>  xfs_vn_setattr(
>  	struct dentry	*dentry,
>  	struct iattr	*iattr)
>  {
> -	return -xfs_setattr(XFS_I(dentry->d_inode), iattr, 0);
> +	if (iattr->ia_valid & ATTR_SIZE)
> +		return -xfs_setattr_size(XFS_I(dentry->d_inode), iattr, 0);
> +	return -xfs_setattr_nonsize(XFS_I(dentry->d_inode), iattr, 0);
>  }

Just leaving this part in so you can see what I'm
talking about.
 
>  #define XFS_FIEMAP_FLAGS	(FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR)

. . .

> Index: xfs/fs/xfs/xfs_vnodeops.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_vnodeops.c	2011-06-29 11:29:02.721639242 +0200
> +++ xfs/fs/xfs/xfs_vnodeops.c	2011-06-29 11:29:07.158281874 +0200
> @@ -50,430 +50,6 @@
>  #include "xfs_vnodeops.h"
>  #include "xfs_trace.h"
>  
> -int
> -xfs_setattr(
> -	struct xfs_inode	*ip,
> -	struct iattr		*iattr,
> -	int			flags)
> -{
> -	xfs_mount_t		*mp = ip->i_mount;
> -	struct inode		*inode = VFS_I(ip);
> -	int			mask = iattr->ia_valid;
> -	xfs_trans_t		*tp;
> -	int			code;
> -	uint			lock_flags;
> -	uint			commit_flags=0;
> -	uid_t			uid=0, iuid=0;
> -	gid_t			gid=0, igid=0;
> -	struct xfs_dquot	*udqp, *gdqp, *olddquot1, *olddquot2;
> -	int			need_iolock = 1;
> -

. . .

> -	/*
> -	 * Truncate file.  Must have write permission and not be a directory.
> -	 */
> -	if (mask & ATTR_SIZE) {
> -		/* Short circuit the truncate case for zero length files */
> -		if (iattr->ia_size == 0 &&
> -		    ip->i_size == 0 && ip->i_d.di_nextents == 0) {
> -			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -			lock_flags &= ~XFS_ILOCK_EXCL;
> -			if (mask & ATTR_CTIME) {
> -				inode->i_mtime = inode->i_ctime =
> -						current_fs_time(inode->i_sb);
> -				xfs_mark_inode_dirty_sync(ip);
> -			}
> -			code = 0;
> -			goto error_return;
> -		}
> -
> -		if (S_ISDIR(ip->i_d.di_mode)) {
> -			code = XFS_ERROR(EISDIR);
> -			goto error_return;
> -		} else if (!S_ISREG(ip->i_d.di_mode)) {
> -			code = XFS_ERROR(EINVAL);
> -			goto error_return;
> -		}

This is the file type checking code I referred to above.

> -
> -		/*


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 08/27] xfs: kill xfs_itruncate_start
  2011-06-29 14:01 ` [PATCH 08/27] xfs: kill xfs_itruncate_start Christoph Hellwig
@ 2011-06-29 22:13   ` Alex Elder
  0 siblings, 0 replies; 100+ messages in thread
From: Alex Elder @ 2011-06-29 22:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, 2011-06-29 at 10:01 -0400, Christoph Hellwig wrote:
> plain text document attachment (xfs-kill-xfs_itruncate_start)
> xfs_itruncate_start is a rather lengthy wrapper that evaluates to a call
> to xfs_ioend_wait and xfs_tosspages, and only has two callers.
> 
> Instead of using the complicated checks left over from IRIX to decide
> how to truncate the pagecache, just call xfs_tosspages
> (aka truncate_inode_pages) directly as we want to get rid of all data
> after i_size, and truncate_inode_pages handles incorrect alignments
> and too large offsets just fine.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>

Nice cleanup.  Looks good.

I will continue reviewing this series tomorrow.

Reviewed-by: Alex Elder <aelder@sgi.com>



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 02/27] xfs: remove the unused ilock_nowait codepath in writepage
  2011-06-29 14:01 ` [PATCH 02/27] xfs: remove the unused ilock_nowait codepath in writepage Christoph Hellwig
@ 2011-06-30  0:15   ` Dave Chinner
  2011-06-30  1:26     ` Dave Chinner
  2011-06-30  6:55     ` Christoph Hellwig
  0 siblings, 2 replies; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  0:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:11AM -0400, Christoph Hellwig wrote:
> wbc->nonblocking is never set, so this whole code has been unreachable
> for a long time.  I'm also not sure it would make a lot of sense -
> we'd rather finish our writeout after a short wait for the ilock
> instead of cancelling the whole ioend.

The problem that the non-blocking code is trying to solve is only
obvious when the disk subsystem is fast enough to drive the flusher
thread to being CPU bound.

e.g. when you have a disk subsystem doing background writeback
10GB/s and the flusher thread is put to sleep for 50ms while we wait
for the lock, it can now only push 9.5GB/s. If we just move on, then
we'll spend that 50ms doing useful work on another dirty inode
rather than sleeping on this one, and hence maintain a 10GB/s
background write rate.

I'd suggest that the only thing that should be dropped is the
wbc->nonblocking check. Numbers would be good to validate that this
is still relevant, but I don't have a storage subsystem with enough
bandwidth to drive a flusher thread to being CPU bound...
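
To be clear, what I mean is keeping a trylock-and-defer pattern in the
writepage path rather than removing it entirely - an untested sketch,
using the existing helpers:

	/*
	 * If we can't get the ilock without blocking, redirty the page
	 * and let the flusher thread move on to the next dirty inode
	 * instead of sleeping here.
	 */
	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}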

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 02/27] xfs: remove the unused ilock_nowait codepath in writepage
  2011-06-30  0:15   ` Dave Chinner
@ 2011-06-30  1:26     ` Dave Chinner
  2011-06-30  6:55     ` Christoph Hellwig
  1 sibling, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  1:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Jun 30, 2011 at 10:15:25AM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2011 at 10:01:11AM -0400, Christoph Hellwig wrote:
> > wbc->nonblocking is never set, so this whole code has been unreachable
> > for a long time.  I'm also not sure it would make a lot of sense -
> > we'd rather finish our writeout after a short wait for the ilock
> > instead of cancelling the whole ioend.
> 
> The problem that the non-blocking code is trying to solve is only
> obvious when the disk subsystem is fast enough to drive the flusher
> thread to being CPU bound.
> 
> e.g. when you have a disk subsystem doing background writeback
> 10GB/s and the flusher thread is put to sleep for 50ms while we wait
> for the lock, it can now only push 9.5GB/s. If we just move on, then
> we'll spend that 50ms doing useful work on another dirty inode
> rather than sleeping on this one, and hence maintain a 10GB/s
> background write rate.
> 
> I'd suggest that the only thing that should be dropped is the
> wbc->nonblocking check. Numbers would be good to validate that this
> is still relevant, but I don't have a storage subsystem with enough
> bandwidth to drive a flusher thread to being CPU bound...

I just confirmed that I don't have a fast enough storage system to
test this - the flusher thread uses only ~15% of a CPU @ 800MB/s
writeback, so I'd need somewhere above 5GB/s of throughput to see
any sort of artifact from this change....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 01/27] xfs: PF_FSTRANS should never be set in ->writepage
  2011-06-29 14:01 ` [PATCH 01/27] xfs: PF_FSTRANS should never be set in ->writepage Christoph Hellwig
@ 2011-06-30  1:34   ` Dave Chinner
  0 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  1:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:10AM -0400, Christoph Hellwig wrote:
> Now that we reject direct reclaim in addition to always using GFP_NOFS
> allocation there's no chance we'll ever end up in ->writepage with
> PF_FSTRANS set.  Add a WARN_ON if we hit this case, and stop checking
> if we'd actually need to start a transaction.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Alex Elder <aelder@sgi.com>

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-06-29 14:01 ` [PATCH 03/27] xfs: use write_cache_pages for writeback clustering Christoph Hellwig
@ 2011-06-30  2:00   ` Dave Chinner
  2011-06-30  2:48     ` Dave Chinner
  2011-06-30  6:57     ` Christoph Hellwig
  2011-07-01  2:22   ` Dave Chinner
  1 sibling, 2 replies; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  2:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:12AM -0400, Christoph Hellwig wrote:
> Instead of implementing our own writeback clustering use write_cache_pages
> to do it for us.  This means the guts of the current writepage implementation
> become a new helper used both for implementing ->writepage and as a callback
> to write_cache_pages for ->writepages.  A new struct xfs_writeback_ctx
> is used to track block mapping state and the ioend chain over multiple
> invocations of it.
> 
> The advantage over the old code is that we avoid a double pagevec lookup,
> get more efficient handling of extent boundaries inside a page for
> small blocksize filesystems, and end up with less XFS-specific code.

Yes, it should be, but I can't actually measure any noticable CPU
usage difference @800MB/s writeback. The profiles change shape
around the changed code, but overall cpu usage does not change. I
think this is because the second pagevec lookup is pretty much free
because the radix tree is already hot in cache when we do the second
lookup...

> The downside is that we don't do writeback clustering when called from
> kswapd anymore, but that is a case that should be avoided anyway.  Note
> that we still convert the whole delalloc range from ->writepage, so
> the on-disk allocation pattern is not affected.

All the more reason to ensure the mm subsystem doesn't do this....

.....
>  error:
> -	if (iohead)
> -		xfs_cancel_ioend(iohead);
> -
> -	if (err == -EAGAIN)
> -		goto redirty;
> -

Should this EAGAIN handling be dealt with in the removing-the-non-
blocking-mode patch?

> +STATIC int
>  xfs_vm_writepages(
>  	struct address_space	*mapping,
>  	struct writeback_control *wbc)
>  {
> +	struct xfs_writeback_ctx ctx = { };
> +	int ret;
> +
>  	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> -	return generic_writepages(mapping, wbc);
> +
> +	ret = write_cache_pages(mapping, wbc, __xfs_vm_writepage, &ctx);
> +
> +	if (ctx.iohead) {
> +		if (ret)
> +			xfs_cancel_ioend(ctx.iohead);
> +		else
> +			xfs_submit_ioend(wbc, ctx.iohead);
> +	}

I think this error handling does not work.  If we have put pages into
the ioend (i.e. successful ->writepage calls) and then have a
->writepage call fail, all the pages under writeback (i.e. those on
the ioend) will remain in that state, never getting written back (and
so moved into the clean state) or redirtied (and so written again
later).

xfs_cancel_ioend() was only ever called for the first page sent down
to ->writepage, and on error that page was redirtied separately.
Hence it doesn't handle this case at all as it never occurs in the
existing code.

I'd suggest that regardless of whether an error is returned here,
the existence of ctx.iohead indicates a valid ioend that needs to be
submitted....
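
i.e. something like this (untested sketch):

	ret = write_cache_pages(mapping, wbc, __xfs_vm_writepage, &ctx);
	if (ctx.iohead)
		xfs_submit_ioend(wbc, ctx.iohead);
	return ret;

so pages that already went into an ioend always get completed, and the
error still gets propagated to the caller.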

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 04/27] xfs: cleanup xfs_add_to_ioend
  2011-06-29 14:01 ` [PATCH 04/27] xfs: cleanup xfs_add_to_ioend Christoph Hellwig
  2011-06-29 22:13   ` Alex Elder
@ 2011-06-30  2:00   ` Dave Chinner
  1 sibling, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  2:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:13AM -0400, Christoph Hellwig wrote:
> Pass the writeback context to xfs_add_to_ioend to make the ioend
> chain manipulations self-contained in this function.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 06/27] xfs: split xfs_setattr
  2011-06-29 14:01 ` [PATCH 06/27] xfs: split xfs_setattr Christoph Hellwig
  2011-06-29 22:13   ` Alex Elder
@ 2011-06-30  2:11   ` Dave Chinner
  1 sibling, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  2:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:15AM -0400, Christoph Hellwig wrote:
> Split up xfs_setattr into two functions, one for the complex truncate
> handling, and one for the trivial attribute updates.  Also move both
> new routines to xfs_iops.c as they are fairly Linux-specific.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 09/27] xfs: split xfs_itruncate_finish
  2011-06-29 14:01 ` [PATCH 09/27] xfs: split xfs_itruncate_finish Christoph Hellwig
@ 2011-06-30  2:44   ` Dave Chinner
  2011-06-30  7:18     ` Christoph Hellwig
  0 siblings, 1 reply; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  2:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:18AM -0400, Christoph Hellwig wrote:
> Split the guts of xfs_itruncate_finish that loops over the existing extents
> and calls xfs_bunmapi on them into a new helper, xfs_itruncate_extents.
> Make xfs_attr_inactive call it directly instead of xfs_itruncate_finish,
> which allows us to simplify the latter a lot, by only letting it deal with
> the data fork.  As a result xfs_itruncate_finish is renamed to
> xfs_itruncate_data to make its use case more obvious.
> 
> Also remove the sync parameter from xfs_itruncate_data, which has been
> unnecessary since the introduction of the busy extent list in 2002, and
> completely dead code since 2003 when the XFS_BMAPI_ASYNC parameter was
> made a no-op.
> 
> I can't actually see why xfs_attr_inactive needs to set the transaction
> sync, but let's keep this patch simple and without changes in behaviour.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Overall, looks good. A few minor comments in line, but consider it

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_trans	*ntp = *tpp;
> +	xfs_bmap_free_t		free_list;
> +	xfs_fsblock_t		first_block;
> +	xfs_fileoff_t		first_unmap_block;
> +	xfs_fileoff_t		last_block;
> +	xfs_filblks_t		unmap_len;
> +	int			committed;
> +	int			error = 0;
> +	int			done = 0;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_IOLOCK_EXCL));
> -	ASSERT((new_size == 0) || (new_size <= ip->i_size));
> -	ASSERT(*tp != NULL);
> -	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> -	ASSERT(ip->i_transp == *tp);
> +	ASSERT(new_size == 0 || new_size <= ip->i_size);

If new_size == 0, then it will always be <= ip->i_size, so that's
kind of a redundant check. I think this really should be two
different asserts, one that validates the data fork new_size range,
and one that validates the attr fork truncate to zero length only
condition:

	ASSERT(new_size <= ip->i_size);
	ASSERT(whichfork != XFS_ATTR_FORK || new_size == 0);


> @@ -1464,15 +1311,16 @@ xfs_itruncate_finish(
>  		}
>  
>  		ntp = xfs_trans_dup(ntp);
> -		error = xfs_trans_commit(*tp, 0);
> -		*tp = ntp;
> +		error = xfs_trans_commit(*tpp, 0);
> +		*tpp = ntp;

I've always found it a mess to follow which transaction is which here
because of the rewriting of ntp.  This is easier to follow:

		ntp = xfs_trans_dup(*tpp);
		error = xfs_trans_commit(*tpp, 0);
		*tpp = ntp;

Now it's clear that we are duplicating *tpp, then committing it, and
then setting it to the duplicated transaction. Now I don't have to
go look at all the surrounding code to remind myself what ntp
contains to validate that the fragment of code is doing the right
thing.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-06-30  2:00   ` Dave Chinner
@ 2011-06-30  2:48     ` Dave Chinner
  2011-06-30  6:57     ` Christoph Hellwig
  1 sibling, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  2:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Jun 30, 2011 at 12:00:13PM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2011 at 10:01:12AM -0400, Christoph Hellwig wrote:
> > Instead of implementing our own writeback clustering use write_cache_pages
> > to do it for us.  This means the guts of the current writepage implementation
> > become a new helper used both for implementing ->writepage and as a callback
> > to write_cache_pages for ->writepages.  A new struct xfs_writeback_ctx
> > is used to track block mapping state and the ioend chain over multiple
> > invocations of it.
> > 
> > The advantage over the old code is that we avoid a double pagevec lookup,
> > get more efficient handling of extent boundaries inside a page for
> > small blocksize filesystems, and end up with less XFS-specific code.
> 
> Yes, it should be, but I can't actually measure any noticable CPU
> usage difference @800MB/s writeback. The profiles change shape
> around the changed code, but overall cpu usage does not change. I
> think this is because the second pagevec lookup is pretty much free
> because the radix tree is already hot in cache when we do the second
> lookup...
> 
> > The downside is that we don't do writeback clustering when called from
> > kswapd anymore, but that is a case that should be avoided anyway.  Note
> > that we still convert the whole delalloc range from ->writepage, so
> > the on-disk allocation pattern is not affected.
> 
> All the more reason to ensure the mm subsystem doesn't do this....
> 
> .....
> >  error:
> > -	if (iohead)
> > -		xfs_cancel_ioend(iohead);
> > -
> > -	if (err == -EAGAIN)
> > -		goto redirty;
> > -
> 
> Should this EAGAIN handling be dealt with in the removing-the-non-
> blocking-mode patch?
> 
> > +STATIC int
> >  xfs_vm_writepages(
> >  	struct address_space	*mapping,
> >  	struct writeback_control *wbc)
> >  {
> > +	struct xfs_writeback_ctx ctx = { };
> > +	int ret;
> > +
> >  	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> > -	return generic_writepages(mapping, wbc);
> > +
> > +	ret = write_cache_pages(mapping, wbc, __xfs_vm_writepage, &ctx);
> > +
> > +	if (ctx.iohead) {
> > +		if (ret)
> > +			xfs_cancel_ioend(ctx.iohead);
> > +		else
> > +			xfs_submit_ioend(wbc, ctx.iohead);
> > +	}
> 
> I think this error handling does not work. If we have put pages into
> the ioend (i.e. successful ->writepage calls) and then have a
> ->writepage call fail, we'll get all the pages under writeback (i.e.
> those on the ioend) remain in that state, and not ever get written
> back (so move into the clean state) or redirtied (so written again
> later)
> 
> xfs_cancel_ioend() was only ever called for the first page sent down
> to ->writepage, and on error that page was redirtied separately.
> Hence it doesn't handle this case at all as it never occurs in the
> existing code.
> 
> I'd suggest that regardless of whether an error is returned here,
> the existence of ctx.iohead indicates a valid ioend that needs to be
> submitted....

I think I just tripped this.  I'm running a 1k block size filesystem,
and test 224 has hung waiting on IO completion after .writepage
errors:

[ 2850.300979] XFS (vdb): Mounting Filesystem
[ 2850.310069] XFS (vdb): Ending clean mount
[ 2867.246341] Filesystem "vdb": reserve blocks depleted! Consider increasing reserve pool size.
[ 2867.247652] XFS (vdb): page discard on page ffffea0000257b40, inode 0x1c6, offset 1187840.
[ 2867.254135] XFS (vdb): page discard on page ffffea0000025f40, inode 0x423, offset 1839104.
[ 2867.256289] XFS (vdb): page discard on page ffffea0000a21aa0, inode 0x34e, offset 28672.
[ 2867.258845] XFS (vdb): page discard on page ffffea00001830d0, inode 0xe5, offset 3637248.
[ 2867.260637] XFS (vdb): page discard on page ffffea0000776af8, inode 0x132, offset 6283264.
[ 2867.269380] XFS (vdb): page discard on page ffffea00009d5d38, inode 0xf1, offset 5632000.
[ 2867.277851] XFS (vdb): page discard on page ffffea0000017e60, inode 0x27a, offset 32768.
[ 2867.281165] XFS (vdb): page discard on page ffffea0000258278, inode 0x274, offset 32768.
[ 2867.282802] XFS (vdb): page discard on page ffffea00009a3c60, inode 0x48a, offset 32768.
[ 2867.284166] XFS (vdb): page discard on page ffffea0000cc7808, inode 0x42e, offset 32768.
[ 2867.287138] XFS (vdb): page discard on page ffffea00004d4440, inode 0x4e0, offset 32768.
[ 2867.288500] XFS (vdb): page discard on page ffffea0000b34978, inode 0x4cd, offset 32768.
[ 2867.289381] XFS (vdb): page discard on page ffffea00003f40f8, inode 0x4c4, offset 155648.
[ 2867.291536] XFS (vdb): page discard on page ffffea0000023578, inode 0x4c7, offset 32768.
[ 2867.300880] XFS (vdb): page discard on page ffffea00005276e8, inode 0x4cc, offset 32768.
[ 2867.318819] XFS (vdb): page discard on page ffffea0000777230, inode 0x449, offset 8581120.
[ 4701.141666] SysRq : Show Blocked State
[ 4701.142093]   task                        PC stack   pid father
[ 4701.142707] dd              D ffff8800076edbe8     0 14211   8946 0x00000000
[ 4701.143509]  ffff88002b03fa58 0000000000000086 ffffea00002db598 ffffea0000000000
[ 4701.144009]  ffff88002b03f9d8 ffffffff81113a35 ffff8800076ed860 0000000000010f80
[ 4701.144009]  ffff88002b03ffd8 ffff88002b03e010 ffff88002b03ffd8 0000000000010f80
[ 4701.144009] Call Trace:
[ 4701.144009]  [<ffffffff81113a35>] ? __free_pages+0x35/0x40
[ 4701.144009]  [<ffffffff81062f69>] ? default_spin_lock_flags+0x9/0x10
[ 4701.144009]  [<ffffffff8110b520>] ? __lock_page+0x70/0x70
[ 4701.144009]  [<ffffffff81afe2d0>] io_schedule+0x60/0x80
[ 4701.144009]  [<ffffffff8110b52e>] sleep_on_page+0xe/0x20
[ 4701.144009]  [<ffffffff81afec2f>] __wait_on_bit+0x5f/0x90
[ 4701.144009]  [<ffffffff8110b773>] wait_on_page_bit+0x73/0x80
[ 4701.144009]  [<ffffffff810a4110>] ? autoremove_wake_function+0x40/0x40
[ 4701.144009]  [<ffffffff81116365>] ? pagevec_lookup_tag+0x25/0x40
[ 4701.144009]  [<ffffffff8110bbc2>] filemap_fdatawait_range+0x112/0x1a0
[ 4701.144009]  [<ffffffff8145f469>] xfs_wait_on_pages+0x59/0x80
[ 4701.144009]  [<ffffffff8145f51d>] xfs_flush_pages+0x8d/0xb0
[ 4701.144009]  [<ffffffff8145f084>] xfs_file_buffered_aio_write+0x104/0x190
[ 4701.144009]  [<ffffffff81b03a98>] ? do_page_fault+0x1e8/0x450
[ 4701.144009]  [<ffffffff8145f2cf>] xfs_file_aio_write+0x1bf/0x300
[ 4701.144009]  [<ffffffff81160844>] ? path_openat+0x104/0x3f0
[ 4701.144009]  [<ffffffff8115251a>] do_sync_write+0xda/0x120
[ 4701.144009]  [<ffffffff816488b3>] ? security_file_permission+0x23/0x90
[ 4701.144009]  [<ffffffff81152a88>] vfs_write+0xc8/0x180
[ 4701.144009]  [<ffffffff81152c31>] sys_write+0x51/0x90
[ 4701.144009]  [<ffffffff81b07ec2>] system_call_fastpath+0x16/0x1b

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 10/27] xfs: improve sync behaviour in the fact of aggressive dirtying
  2011-06-29 14:01 ` [PATCH 10/27] xfs: improve sync behaviour in the fact of aggressive dirtying Christoph Hellwig
@ 2011-06-30  2:52   ` Dave Chinner
  0 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  2:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:19AM -0400, Christoph Hellwig wrote:
> The following script from Wu Fengguang shows very bad behaviour in XFS
> when aggressively dirtying data during a sync, with sync times up to
> almost 10 times as long as on ext4.
> 
> A large part of the issue is that XFS writes data out itself two times
> in the ->sync_fs method, overriding the livelock protection in the core
> writeback code, and another issue is the lock-less xfs_ioend_wait call,
> which doesn't prevent new ioends from being queued up while waiting for
> the count to reach zero.
> 
> This patch removes the XFS-internal sync calls and relies on the VFS
> to do its work just like all other filesystems do.  Note that the
> i_iocount wait, which is rather suboptimal, is simply removed here.
> We already do it in ->write_inode, which keeps the current suboptimal
> behaviour.  We'll eventually need to remove that as well, but that's
> material for a separate commit.
> 
> ------------------------------ snip ------------------------------
> #!/bin/sh
> 
> umount /dev/sda7
> mkfs.xfs -f /dev/sda7
> # mkfs.ext4 /dev/sda7
> # mkfs.btrfs /dev/sda7
> mount /dev/sda7 /fs
> 
> echo $((50<<20)) > /proc/sys/vm/dirty_bytes
> 
> pid=
> for i in `seq 10`
> do
> 	dd if=/dev/zero of=/fs/zero-$i bs=1M count=1000 &
> 	pid="$pid $!"
> done
> 
> sleep 1
> 
> tic=$(date +'%s')
> sync
> tac=$(date +'%s')
> 
> echo
> echo sync time: $((tac-tic))
> egrep '(Dirty|Writeback|NFS_Unstable)' /proc/meminfo
> 
> pidof dd > /dev/null && { kill -9 $pid; echo sync NOT livelocked; }
> ------------------------------ snip ------------------------------
> 
> Reported-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 11/27] xfs: fix filesystsem freeze race in xfs_trans_alloc
  2011-06-29 14:01 ` [PATCH 11/27] xfs: fix filesystsem freeze race in xfs_trans_alloc Christoph Hellwig
@ 2011-06-30  2:59   ` Dave Chinner
  0 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  2:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:20AM -0400, Christoph Hellwig wrote:
> As pointed out by Jan xfs_trans_alloc can race with a concurrent filesystem
> free when it sleeps during the memory allocation.  Fix this by moving the
  freeze

> wait_for_freeze call after the memory allocation.  This means moving the
> freeze into the low-level _xfs_trans_alloc helper, which thus grows a new
> argument.  Also fix up some comments in that area while at it.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Dave Chinner <david@fromorbit.com>

>  /*
> - * xfs_log_sbcount
> - *
>   * Called either periodically to keep the on disk superblock values
>   * roughly up to date or from unmount to make sure the values are
>   * correct on a clean unmount.
> - *
> - * Note this code can be called during the process of freezing, so
> - * we may need to use the transaction allocator which does not not
> - * block when the transaction subsystem is in its frozen state.
>   */

It's not called periodically any more from xfssyncd.  Hmmm, that was
removed because it was preventing the filesystem from idling, but we
really should be doing this every so often when the filesystem is
dirty. I'll have a think about that....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 12/27] xfs: remove i_transp
  2011-06-29 14:01 ` [PATCH 12/27] xfs: remove i_transp Christoph Hellwig
@ 2011-06-30  3:00   ` Dave Chinner
  0 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  3:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:21AM -0400, Christoph Hellwig wrote:
> Remove the transaction pointer in the inode.  It's only used to avoid
> passing down an argument in the bmap code, and for a few asserts in
> the transaction code right now.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 13/27] xfs: factor out xfs_dir2_leaf_find_entry
  2011-06-29 14:01 ` [PATCH 13/27] xfs: factor out xfs_dir2_leaf_find_entry Christoph Hellwig
@ 2011-06-30  6:11   ` Dave Chinner
  2011-06-30  7:34     ` Christoph Hellwig
  0 siblings, 1 reply; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  6:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:22AM -0400, Christoph Hellwig wrote:
> Add a new xfs_dir2_leaf_find_entry helper to factor out some duplicate code
> from xfs_dir2_leaf_addname and xfs_dir2_leafn_add.  Found by Eric Sandeen
> using an automated code duplication checker.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks sane - a couple of minor whitespacy comments, otherwise:

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> 
> Index: xfs/fs/xfs/xfs_dir2_leaf.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_dir2_leaf.c	2011-06-22 21:56:26.102462981 +0200
> +++ xfs/fs/xfs/xfs_dir2_leaf.c	2011-06-23 12:41:51.716439911 +0200
> @@ -152,6 +152,118 @@ xfs_dir2_block_to_leaf(
>  	return 0;
>  }
>  
> +xfs_dir2_leaf_entry_t *
> +xfs_dir2_leaf_find_entry(
> +	xfs_dir2_leaf_t		*leaf,		/* leaf structure */
> +	int			index,		/* leaf table position */
> +	int			compact,	/* need to compact leaves */
> +	int			lowstale,	/* index of prev stale leaf */
> +	int			highstale,	/* index of next stale leaf */
> +	int			*lfloglow,	/* low leaf logging index */
> +	int			*lfloghigh)	/* high leaf logging index */
> +{
> +	xfs_dir2_leaf_entry_t	*lep;		/* leaf entry table pointer */
> +
> +	if (!leaf->hdr.stale) {
> +		/*
> +		 * Now we need to make room to insert the leaf entry.
> +		 *
> +		 * If there are no stale entries, just insert a hole at index.
> +		 */
> +		lep = &leaf->ents[index];
> +		if (index < be16_to_cpu(leaf->hdr.count))
> +			memmove(lep + 1, lep,
> +				(be16_to_cpu(leaf->hdr.count) - index) *
> +				 sizeof(*lep));
> +
> +		/*
> +		 * Record low and high logging indices for the leaf.
> +		 */
> +		*lfloglow = index;
> +		*lfloghigh = be16_to_cpu(leaf->hdr.count);
> +		be16_add_cpu(&leaf->hdr.count, 1);

You could probably just return here, and that would remove the:

> +	} else {

and the indenting that the else branch causes.
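
i.e. something like this (sketch only):

	if (!leaf->hdr.stale) {
		/*
		 * No stale entries: just open a hole at index and log
		 * the range we touched.
		 */
		lep = &leaf->ents[index];
		if (index < be16_to_cpu(leaf->hdr.count))
			memmove(lep + 1, lep,
				(be16_to_cpu(leaf->hdr.count) - index) *
				 sizeof(*lep));
		*lfloglow = index;
		*lfloghigh = be16_to_cpu(leaf->hdr.count);
		be16_add_cpu(&leaf->hdr.count, 1);
		return lep;
	}

and then everything in the else branch moves out a level of indent.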

> +		/*
> +		 * There are stale entries.
> +		 *
> +		 * We will use one of them for the new entry.  It's probably
> +		 * not at the right location, so we'll have to shift some up
> +		 * or down first.
> +		 *
> +		 * If we didn't compact before, we need to find the nearest
> +		 * stale entries before and after our insertion point.
> +		 */
> +		if (compact == 0) {
> +			/*
> +			 * Find the first stale entry before the insertion
> +			 * point, if any.
> +			 */
> +			for (lowstale = index - 1;
> +			     lowstale >= 0 &&
> +				be32_to_cpu(leaf->ents[lowstale].address) !=
> +				XFS_DIR2_NULL_DATAPTR;
> +			     lowstale--)
> +				continue;
> +			/*
> +			 * Find the next stale entry at or after the insertion
> +			 * point, if any.   Stop if we go so far that the
> +			 * lowstale entry would be better.
> +			 */
> +			for (highstale = index;
> +			     highstale < be16_to_cpu(leaf->hdr.count) &&
> +				be32_to_cpu(leaf->ents[highstale].address) !=
> +				XFS_DIR2_NULL_DATAPTR &&
> +				(lowstale < 0 ||
> +				 index - lowstale - 1 >= highstale - index);
> +			     highstale++)
> +				continue;
> +		}
> +		/*
> +		 * If the low one is better, use it.
> +		 */

Line of whitespace before the comment.

> +		if (lowstale >= 0 &&
> +		    (highstale == be16_to_cpu(leaf->hdr.count) ||
> +		     index - lowstale - 1 < highstale - index)) {
> +			ASSERT(index - lowstale - 1 >= 0);
> +			ASSERT(be32_to_cpu(leaf->ents[lowstale].address) ==
> +			       XFS_DIR2_NULL_DATAPTR);
> +			/*
> +			 * Copy entries up to cover the stale entry
> +			 * and make room for the new entry.
> +			 */
> +			if (index - lowstale - 1 > 0)
> +				memmove(&leaf->ents[lowstale],
> +					&leaf->ents[lowstale + 1],
> +					(index - lowstale - 1) * sizeof(*lep));
> +			lep = &leaf->ents[index - 1];
> +			*lfloglow = MIN(lowstale, *lfloglow);
> +			*lfloghigh = MAX(index - 1, *lfloghigh);
> +
> +		/*
> +		 * The high one is better, so use that one.
> +		 */
> +		} else {

I prefer comments inside the else branch...

> +			ASSERT(highstale - index >= 0);
> +			ASSERT(be32_to_cpu(leaf->ents[highstale].address) ==
> +			       XFS_DIR2_NULL_DATAPTR);
> +			/*
> +			 * Copy entries down to cover the stale entry
> +			 * and make room for the new entry.
> +			 */
> +			if (highstale - index > 0)
> +				memmove(&leaf->ents[index + 1],
> +					&leaf->ents[index],
> +					(highstale - index) * sizeof(*lep));
> +			lep = &leaf->ents[index];
> +			*lfloglow = MIN(index, *lfloglow);
> +			*lfloghigh = MAX(highstale, *lfloghigh);
> +		}
> +		be16_add_cpu(&leaf->hdr.stale, -1);
> +	}
> +
> +	return lep;
> +}
> +
>  /*
>   * Add an entry to a leaf form directory.
>   */
> @@ -430,102 +542,11 @@ xfs_dir2_leaf_addname(
.....
> -		}
> -		be16_add_cpu(&leaf->hdr.stale, -1);
> -	}
> +
> +
> +	lep = xfs_dir2_leaf_find_entry(leaf, index, compact, lowstale,
> +				       highstale, &lfloglow, &lfloghigh);
> +

Only need one line of whitespace before the function call.

.....
> -			lep = &leaf->ents[index];
> -			lfloglow = MIN(index, lfloglow);
> -			lfloghigh = MAX(highstale, lfloghigh);
> -		}
> -		be16_add_cpu(&leaf->hdr.stale, -1);
> -	}
> +
> +
>  	/*
>  	 * Insert the new entry, log everything.
>  	 */
> +	lep = xfs_dir2_leaf_find_entry(leaf, index, compact, lowstale,
> +				       highstale, &lfloglow, &lfloghigh);
> +

Same for the whitespace before the comment.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 14/27] xfs: cleanup shortform directory inode number handling
  2011-06-29 14:01 ` [PATCH 14/27] xfs: cleanup shortform directory inode number handling Christoph Hellwig
@ 2011-06-30  6:35   ` Dave Chinner
  2011-06-30  7:39     ` Christoph Hellwig
  0 siblings, 1 reply; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  6:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:23AM -0400, Christoph Hellwig wrote:
> Refactor the shortform directory helpers that deal with the 32-bit vs
> 64-bit wide inode numbers into more sensible helpers, and kill the
> xfs_intino_t typedef that is now superflous.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

A few consistency things, and a bit whitespacy, otherwise:

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> 
> Index: xfs/fs/xfs/xfs_dir2_sf.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_dir2_sf.c	2010-05-25 11:40:59.357006075 +0200
> +++ xfs/fs/xfs/xfs_dir2_sf.c	2010-05-27 14:48:16.709004470 +0200
> @@ -59,6 +59,83 @@ static void xfs_dir2_sf_toino4(xfs_da_ar
>  static void xfs_dir2_sf_toino8(xfs_da_args_t *args);
>  #endif /* XFS_BIG_INUMS */
>  
> +
> +/*
> + * Inode numbers in short-form directories can come in two versions,
> + * either 4 bytes or 8 bytes wide.  These helpers deal with the
> + * two forms transparently by looking at the headers i8count field.
> + */
> +
> +static xfs_ino_t

Extra line there...

> +xfs_dir2_sf_get_ino(
> +	struct xfs_dir2_sf	*sfp,
> +	xfs_dir2_inou_t		*from)
> +{
> +	if (sfp->hdr.i8count)
> +		return XFS_GET_DIR_INO8(from->i8);
> +	else
> +		return XFS_GET_DIR_INO4(from->i4);
> +}
> +static void

And none there.

> +xfs_dir2_sf_put_inumber(
> +	struct xfs_dir2_sf	*sfp,
> +	xfs_dir2_inou_t		*to,
> +	xfs_ino_t		ino)
> +{
> +	if (sfp->hdr.i8count)
> +		XFS_PUT_DIR_INO8(ino, to->i8);
> +	else
> +		XFS_PUT_DIR_INO4(ino, to->i4);
> +}

Also, xfs_dir2_sf_get_ino() vs xfs_dir2_sf_put_inumber() - either
use _ino or _inumber as the suffix for both. _ino is probably more
consistent with the other functions...

> +
> +xfs_ino_t
> +xfs_dir2_sf_get_parent_ino(
> +	struct xfs_dir2_sf	*sfp)
> +{
> +	return xfs_dir2_sf_get_ino(sfp, &sfp->hdr.parent);
> +}
> +
> +

Extra whitespace.

> +static void
> +xfs_dir2_sf_put_parent_ino(
> +	struct xfs_dir2_sf	*sfp,
> +	xfs_ino_t		ino)
> +{
> +	xfs_dir2_sf_put_inumber(sfp, &sfp->hdr.parent, ino);
> +}
> +
> +

Extra whitespace.

> +/*
> + * In short-form directory entries the inode numbers are stored at variable
> + * offset behind the entry name.  The inode numbers may only be accessed
> + * through the helpers below.
> + */
> +

Extra whitespace.

> +static xfs_dir2_inou_t *
> +xfs_dir2_sf_inop(
> +	struct xfs_dir2_sf_entry *sfep)
> +{
> +	return (xfs_dir2_inou_t *)&sfep->name[sfep->namelen];
> +}

Probably should be called xfs_dir2_sfe_inop() because it takes an
xfs_dir2_sf_entry, similar to how the following functions use "sfe".

> +
> +xfs_ino_t
> +xfs_dir2_sfe_get_ino(
> +	struct xfs_dir2_sf	*sfp,
> +	struct xfs_dir2_sf_entry *sfep)
> +{
> +	return xfs_dir2_sf_get_ino(sfp, xfs_dir2_sf_inop(sfep));
> +}
> +
> +static void
> +xfs_dir2_sfe_put_ino(
> +	struct xfs_dir2_sf	*sfp,
> +	struct xfs_dir2_sf_entry *sfep,
> +	xfs_ino_t		ino)
> +{
> +	xfs_dir2_sf_put_inumber(sfp, xfs_dir2_sf_inop(sfep), ino);
> +}
> +
> +

Extra whitespace.

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 00/27] patch queue for Linux 3.1
  2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
                   ` (25 preceding siblings ...)
  2011-06-29 14:01 ` [PATCH 27/27] xfs: avoid a few disk cache flushes Christoph Hellwig
@ 2011-06-30  6:36 ` Dave Chinner
  2011-06-30  6:50   ` Christoph Hellwig
  26 siblings, 1 reply; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  6:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:09AM -0400, Christoph Hellwig wrote:
> This is my current patch queue for Linux 3.1.  It includes the previously
> all previously sent patches I'm planning for Linux 3.1 inclusion through
> the XFS tree and a few new ones.  The most important new bits is a cleanup
> of the structures describing the dir2 on-disk format, which got a bit
> more urgent due to more recent gcc versions complaining about the hacks
> used in the current version.
> 
> The sync lifelock fix is included only in a minimal version that removes
> the data syncs.  I plan to sort out the iocount waiting via the i_alloc_sem
> removal patches that have been sent for inclusion in the VFS tree.  I'll
> cc the XFS list on the updated version with XFS chances.

With this series I'm seeing test 180 fail relatively frequently with
1k block size filesystem. I shall try to debug this given the other
reports of 180 failing that have recently come to light...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 00/27] patch queue for Linux 3.1
  2011-06-30  6:36 ` [PATCH 00/27] patch queue for Linux 3.1 Dave Chinner
@ 2011-06-30  6:50   ` Christoph Hellwig
  0 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-30  6:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Jun 30, 2011 at 04:36:58PM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2011 at 10:01:09AM -0400, Christoph Hellwig wrote:
> > This is my current patch queue for Linux 3.1.  It includes the previously
> > all previously sent patches I'm planning for Linux 3.1 inclusion through
> > the XFS tree and a few new ones.  The most important new bits is a cleanup
> > of the structures describing the dir2 on-disk format, which got a bit
> > more urgent due to more recent gcc versions complaining about the hacks
> > used in the current version.
> > 
> > The sync lifelock fix is included only in a minimal version that removes
> > the data syncs.  I plan to sort out the iocount waiting via the i_alloc_sem
> > removal patches that have been sent for inclusion in the VFS tree.  I'll
> > cc the XFS list on the updated version with XFS chances.
> 
> With this series I'm seeing test 180 fail relatively frequently with
> 1k block size filesystem. I shall try to debug this given the other
> reports of 180 failing that have recently come to light...

Interesting.  I've not seen 180 fail any time recently, with either 4k
or 512 byte block size filesystems.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 02/27] xfs: remove the unused ilock_nowait codepath in writepage
  2011-06-30  0:15   ` Dave Chinner
  2011-06-30  1:26     ` Dave Chinner
@ 2011-06-30  6:55     ` Christoph Hellwig
  1 sibling, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-30  6:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Thu, Jun 30, 2011 at 10:15:25AM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2011 at 10:01:11AM -0400, Christoph Hellwig wrote:
> > wbc->nonblocking is never set, so this whole code has been unreachable
> > for a long time.  I'm also not sure it would make a lot of sense -
> > we'd rather finish our writeout after a short wait for the ilock
> > instead of cancelling the whole ioend.
> 
> I'd suggest that the only thing that should be dropped is the
> wbc->nonblocking check. Numbers would be good to validate that this
> is still relevant, but I don't have a storage subsystem with enough
> bandwidth to drive a flusher thread to being CPU bound...

I don't mind re-introducing this if we actually have a testcase for it.
Note that simply keeping the code won't work for the writepages
implementation as we'd cancel the whole ioend if one lock fails,
discarding potentially a lot of I/O.  It's already bad enough with
the simpler clustering we have in the current code.  Back in SLES10 /
2.6.16 when the code could still be reached we only did it for the
bmap calls directly from writepage, but not from the writeout
clustering.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-06-30  2:00   ` Dave Chinner
  2011-06-30  2:48     ` Dave Chinner
@ 2011-06-30  6:57     ` Christoph Hellwig
  1 sibling, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-30  6:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Thu, Jun 30, 2011 at 12:00:13PM +1000, Dave Chinner wrote:
> > -	if (iohead)
> > -		xfs_cancel_ioend(iohead);
> > -
> > -	if (err == -EAGAIN)
> > -		goto redirty;
> > -
> 
> Should this EAGAIN handling be dealt with in the removing-the-non-
> blocking-mode patch?

Probably.

> > +	ret = write_cache_pages(mapping, wbc, __xfs_vm_writepage, &ctx);
> > +
> > +	if (ctx.iohead) {
> > +		if (ret)
> > +			xfs_cancel_ioend(ctx.iohead);
> > +		else
> > +			xfs_submit_ioend(wbc, ctx.iohead);
> > +	}
> 
> I think this error handling does not work. If we have put pages into
> the ioend (i.e. successful ->writepage calls) and then have a
> ->writepage call fail, we'll get all the pages under writeback (i.e.
> those on the ioend) remain in that state, and not ever get written
> back (so move into the clean state) or redirtied (so written again
> later)
> 
> xfs_cancel_ioend() was only ever called for the first page sent down
> to ->writepage, and on error that page was redirtied separately.
> Hence it doesn't handle this case at all as it never occurs in the
> existing code.
> 
> I'd suggest that regardless of whether an error is returned here,
> the existence of ctx.iohead indicates a valid ioend that needs to be
> submitted....

Ok.  That would also solve the problem of the trylock failures.  I'll
see how we can deal with it nicely.
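
Something like this is what I have in mind for xfs_vm_writepages
(untested sketch, not the final patch):

	STATIC int
	xfs_vm_writepages(
		struct address_space	*mapping,
		struct writeback_control *wbc)
	{
		struct xfs_writeback_ctx ctx = { };
		int			ret;

		xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
		ret = write_cache_pages(mapping, wbc, __xfs_vm_writepage, &ctx);

		/*
		 * Submit whatever we built up even if an error occurred,
		 * otherwise pages already attached to the ioend stay in
		 * writeback forever.
		 */
		if (ctx.iohead)
			xfs_submit_ioend(wbc, ctx.iohead);
		return ret;
	}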

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 06/27] xfs: split xfs_setattr
  2011-06-29 22:13   ` Alex Elder
@ 2011-06-30  7:03     ` Christoph Hellwig
  2011-06-30 12:28       ` Alex Elder
  0 siblings, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-30  7:03 UTC (permalink / raw)
  To: Alex Elder; +Cc: Christoph Hellwig, xfs

On Wed, Jun 29, 2011 at 05:13:16PM -0500, Alex Elder wrote:
> Looks good but I think that you need to mask off the
> ia_valid bits in the calls now made in xfs_vn_setattr().

Why? We call xfs_setattr_size if ATTR_SIZE is set.  An ATTR_SIZE call
may also carry a few other attributes that we can handle, and we assert
on those that it can't, just to make sure.  Similarly xfs_setattr_nonsize
can handle everything but ATTR_SIZE, and again we have an assert to
protect against breeding incorrect XFS-internal callers.
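
The dispatch in xfs_vn_setattr then is trivial - roughly this (sketch):

	STATIC int
	xfs_vn_setattr(
		struct dentry	*dentry,
		struct iattr	*iattr)
	{
		if (iattr->ia_valid & ATTR_SIZE)
			return -xfs_setattr_size(XFS_I(dentry->d_inode),
						 iattr, 0);
		return -xfs_setattr_nonsize(XFS_I(dentry->d_inode), iattr, 0);
	}

with the two helpers asserting that they only ever see the attribute
combinations they are written for.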

> Also, I think you may still need to check the file type
> for the size-setting function.  Details below.

The VFS only ever does an ATTR_SIZE setattr on regular files.  We have
an assert to ensure that for debug builds, which is a lot more than
most other filesystems do.

> > +	ASSERT((mask & (ATTR_MODE|ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET|
> > +			ATTR_MTIME_SET|ATTR_KILL_SUID|ATTR_KILL_SGID|
> > +			ATTR_KILL_PRIV|ATTR_TIMES_SET)) == 0);
> 
> You'll have to mask these off in xfs_vn_setattr() if you're
> going to make this assertion.

No, this is the (implicit) calling convention by the VFS.

> > -		if (S_ISDIR(ip->i_d.di_mode)) {
> > -			code = XFS_ERROR(EISDIR);
> > -			goto error_return;
> > -		} else if (!S_ISREG(ip->i_d.di_mode)) {
> > -			code = XFS_ERROR(EINVAL);
> > -			goto error_return;
> > -		}
> 
> This is the file type checking code I referred to above.

It simply was a leftover from IRIX that we can't hit on Linux.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 15/27] xfs: kill struct xfs_dir2_sf
  2011-06-29 14:01 ` [PATCH 15/27] xfs: kill struct xfs_dir2_sf Christoph Hellwig
@ 2011-06-30  7:04   ` Dave Chinner
  2011-06-30  7:09     ` Christoph Hellwig
  0 siblings, 1 reply; 100+ messages in thread
From: Dave Chinner @ 2011-06-30  7:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:24AM -0400, Christoph Hellwig wrote:
> The list field of it is never cactually used, so all uses can simply be
> replaced with the xfs_dir2_sf_hdr_t type that it has as first member.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

I can't help but think this would have been a much smaller patch if
you rolled the contents of the xfs_dir2_sf_hdr_t into xfs_dir2_sf_t
and killed the xfs_dir2_sf_hdr_t instead. There's many more
occurrences of xfs_dir2_sf_t than there are xfs_dir2_sf_hdr_t.

Anyway, gotta run now so I'll look at this more later....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 15/27] xfs: kill struct xfs_dir2_sf
  2011-06-30  7:04   ` Dave Chinner
@ 2011-06-30  7:09     ` Christoph Hellwig
  0 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-30  7:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Thu, Jun 30, 2011 at 05:04:49PM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2011 at 10:01:24AM -0400, Christoph Hellwig wrote:
> > The list field of it is never cactually used, so all uses can simply be
> > replaced with the xfs_dir2_sf_hdr_t type that it has as first member.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> I can't help but think this would have been a much smaller patch if
> you rolled the contents of the xfs_dir2_sf_hdr_t into xfs_dir2_sf_t
> and killed the xfs_dir2_sf_hdr_t instead. There's many more
> occurrences of xfs_dir2_sf_t than there are xfs_dir2_sf_hdr_t.

It would have been simpler, but using the xfs_dir2_sf_hdr_t makes
more sense to me.  If there's broad disagreement with this (and
the similar changes for the other structures) I'll revisit the naming
scheme.
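
For reference, the two structures look roughly like this today
(simplified, from memory):

	typedef struct xfs_dir2_sf_hdr {
		__uint8_t	count;		/* count of entries */
		__uint8_t	i8count;	/* count of 8-byte inode #s */
		xfs_dir2_inou_t	parent;		/* parent dir inode number */
	} xfs_dir2_sf_hdr_t;

	typedef struct xfs_dir2_sf {
		xfs_dir2_sf_hdr_t	hdr;	/* shortform header */
		xfs_dir2_sf_entry_t	list[1];/* shortform entries */
	} xfs_dir2_sf_t;

Given that the list member is never referenced, every xfs_dir2_sf_t user
only ever touches the header anyway, which is why switching them over to
xfs_dir2_sf_hdr_t looks like the more honest representation to me.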

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 09/27] xfs: split xfs_itruncate_finish
  2011-06-30  2:44   ` Dave Chinner
@ 2011-06-30  7:18     ` Christoph Hellwig
  0 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-30  7:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Thu, Jun 30, 2011 at 12:44:28PM +1000, Dave Chinner wrote:
> >  
> >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_IOLOCK_EXCL));
> > -	ASSERT((new_size == 0) || (new_size <= ip->i_size));
> > -	ASSERT(*tp != NULL);
> > -	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > -	ASSERT(ip->i_transp == *tp);
> > +	ASSERT(new_size == 0 || new_size <= ip->i_size);
> 
> If new_size == 0, then it will always be <= ip->i_size, so that's
> kind of a redundant check. I think this really should be two
> different asserts, one that validates the data fork new_size range,
> and one that validates the attr fork truncate to zero length only
> condition:
> 
> 	ASSERT(new_size <= ip->i_size);
> 	ASSERT(whichfork != XFS_ATTR_FORK || new_size == 0);

For now I was just keeping the existing assert, but changing this one
sounds ok.  OTOH I kept the whole routine fork agnostic, so I think
I'd rather just make the assert read:

	ASSERT(new_size <= ip->i_size);

and assume the one and only attr fork caller does the right thing.

> > @@ -1464,15 +1311,16 @@ xfs_itruncate_finish(
> >  		}
> >  
> >  		ntp = xfs_trans_dup(ntp);
> > -		error = xfs_trans_commit(*tp, 0);
> > -		*tp = ntp;
> > +		error = xfs_trans_commit(*tpp, 0);
> > +		*tpp = ntp;
> 
> I've always found this a mess to follow which transaction is which
> because of the rewriting of ntp. This is easier to follow:
> 
> 		ntp = xfs_trans_dup(*tpp);
> 		error = xfs_trans_commit(*tpp, 0);
> 		*tpp = ntp;
> 
> Now it's clear that we are duplicating *tpp, then committing it, and
> then setting it to the duplicated transaction. Now I don't have to
> go look at all the surrounding code to remind myself what ntp
> contains to validate that the fragment of code is doing the right
> thing.....

I've cleaned this up even further and added a local tp variable that
has the current transaction as a normal pointer.  *tpp is only assigned
back to in a single place after goto out, and ntp is only used for
the switching around to the duplicated transaction.
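
So the commit/dup dance inside the loop now reads roughly like this
(sketch):

	struct xfs_trans	*tp = *tpp;
	struct xfs_trans	*ntp;

	...
		/*
		 * Duplicate the transaction that has the permanent
		 * reservation, commit the old one and carry on with
		 * the duplicate.
		 */
		ntp = xfs_trans_dup(tp);
		error = xfs_trans_commit(tp, 0);
		tp = ntp;
		if (error)
			goto out;
	...
out:
	*tpp = tp;
	return error;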

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 13/27] xfs: factor out xfs_dir2_leaf_find_entry
  2011-06-30  6:11   ` Dave Chinner
@ 2011-06-30  7:34     ` Christoph Hellwig
  0 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-30  7:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Thu, Jun 30, 2011 at 04:11:02PM +1000, Dave Chinner wrote:
> You could probably just return here, and that would remove the:

Indeed.  We can do the same near the end, too.  It means duplicating
the be16_add_cpu, but it allows us to directly return the expression
that leads to lep, so I did it.
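
i.e. the lowstale branch now ends up looking roughly like this (sketch):

	if (index - lowstale - 1 > 0)
		memmove(&leaf->ents[lowstale],
			&leaf->ents[lowstale + 1],
			(index - lowstale - 1) *
			sizeof(xfs_dir2_leaf_entry_t));
	*lfloglow = MIN(lowstale, *lfloglow);
	*lfloghigh = MAX(index - 1, *lfloghigh);
	be16_add_cpu(&leaf->hdr.stale, -1);
	return &leaf->ents[index - 1];

and the highstale case mirrors it.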

> > +				continue;
> > +		}
> > +		/*
> > +		 * If the low one is better, use it.
> > +		 */
> 
> Line of whitespace before the comment.

Fixed.

> > +		/*
> > +		 * The high one is better, so use that one.
> > +		 */
> > +		} else {
> 
> I prefer comments inside the else branch...

The else is completely gone now, fixing that issue.

> > -		be16_add_cpu(&leaf->hdr.stale, -1);
> > -	}
> > +
> > +
> > +	lep = xfs_dir2_leaf_find_entry(leaf, index, compact, lowstale,
> > +				       highstale, &lfloglow, &lfloghigh);
> > +
> 
> Only need one line of whitespace before the function call.

Fixed.

> > +
> > +
> >  	/*
> >  	 * Insert the new entry, log everything.
> >  	 */
> > +	lep = xfs_dir2_leaf_find_entry(leaf, index, compact, lowstale,
> > +				       highstale, &lfloglow, &lfloghigh);
> > +
> 
> Same for the whitespace before the comment.

Fixed.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 14/27] xfs: cleanup shortform directory inode number handling
  2011-06-30  6:35   ` Dave Chinner
@ 2011-06-30  7:39     ` Christoph Hellwig
  0 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-06-30  7:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

> > +static xfs_ino_t
> 
> Extra line there...

Fixed.

> > +}
> > +static void
> 
> And none there.

Fixed.

> > +xfs_dir2_sf_put_inumber(
> > +	struct xfs_dir2_sf	*sfp,
> > +	xfs_dir2_inou_t		*to,
> > +	xfs_ino_t		ino)
> > +{
> > +	if (sfp->hdr.i8count)
> > +		XFS_PUT_DIR_INO8(ino, to->i8);
> > +	else
> > +		XFS_PUT_DIR_INO4(ino, to->i4);
> > +}
> 
> Also, xfs_dir2_sf_get_ino() vs xfs_dir2_sf_put_inumber() - either
> use _ino or _inumber as the suffix for both. _ino is probably more
> consistent with the other functions...

xfs_dir2_sf_put_inumber already exists in the current code and I just
moved it blindly.  I've renamed it to xfs_dir2_sf_put_ino for the
next version.

> > +
> > +xfs_ino_t
> > +xfs_dir2_sf_get_parent_ino(
> > +	struct xfs_dir2_sf	*sfp)
> > +{
> > +	return xfs_dir2_sf_get_ino(sfp, &sfp->hdr.parent);
> > +}
> > +
> > +
> 
> Extra whitespace.

Fixed.

> > +static void
> > +xfs_dir2_sf_put_parent_ino(
> > +	struct xfs_dir2_sf	*sfp,
> > +	xfs_ino_t		ino)
> > +{
> > +	xfs_dir2_sf_put_inumber(sfp, &sfp->hdr.parent, ino);
> > +}
> > +
> > +
> 
> Extra whitespace.
> 

Fixed.

> > +/*
> > + * In short-form directory entries the inode numbers are stored at variable
> > + * offset behind the entry name.  The inode numbers may only be accessed
> > + * through the helpers below.
> > + */
> > +
> 
> Extra whitespace.
> 

Fixed.

> > +static xfs_dir2_inou_t *
> > +xfs_dir2_sf_inop(
> > +	struct xfs_dir2_sf_entry *sfep)
> > +{
> > +	return (xfs_dir2_inou_t *)&sfep->name[sfep->namelen];
> > +}
> 
> Probably should be called xfs_dir2_sfe_inop() because it takes an
> xfs_dir2_sf_entry, similar to how the following functions use "sfe".

Ok.

> 
> > +
> > +xfs_ino_t
> > +xfs_dir2_sfe_get_ino(
> > +	struct xfs_dir2_sf	*sfp,
> > +	struct xfs_dir2_sf_entry *sfep)
> > +{
> > +	return xfs_dir2_sf_get_ino(sfp, xfs_dir2_sf_inop(sfep));
> > +}
> > +
> > +static void
> > +xfs_dir2_sfe_put_ino(
> > +	struct xfs_dir2_sf	*sfp,
> > +	struct xfs_dir2_sf_entry *sfep,
> > +	xfs_ino_t		ino)
> > +{
> > +	xfs_dir2_sf_put_inumber(sfp, xfs_dir2_sf_inop(sfep), ino);
> > +}
> > +
> > +
> 
> Extra whitespace.

Fixed.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 06/27] xfs: split xfs_setattr
  2011-06-30  7:03     ` Christoph Hellwig
@ 2011-06-30 12:28       ` Alex Elder
  0 siblings, 0 replies; 100+ messages in thread
From: Alex Elder @ 2011-06-30 12:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, 2011-06-30 at 03:03 -0400, Christoph Hellwig wrote:
> On Wed, Jun 29, 2011 at 05:13:16PM -0500, Alex Elder wrote:
> > Looks good but I think that you need to mask off the
> > ia_valid bits in the calls now made in xfs_vn_setattr().
> 
> Why? We call xfs_setattr_size if ATTR_SIZE is set.  An ATTR_SIZE call
> may also carry a few other attributes that we can handle, and we assert
> on those that it can't, just to make sure.  Similarly xfs_setattr_nonsize
> can handle everything but ATTR_SIZE, and again we have an assert to
> protect against breeding incorrect XFS-internal callers.

OK.  I didn't go that far back, or at least didn't check all
the notify_change() calls.  I now have, and although I'm not
100% sure on ecryptfs and nfsd setattr it looks like you're
right.  That basically explains all of my comments--the VFS
prevents those conditions from occurring, and therefore an
assertion to communicate and enforce that is proper.

Reviewed-by: Alex Elder <aelder@sgi.com>


> > Also, I think you may still need to check the file type
> > for the size-setting function.  Details below.
> 
> The VFS only ever does an ATTR_SIZE setattr on regular files.  We have
> an assert to ensure that for debug builds, which is a lot more than
> most other filesystems do.
> 
> > > +	ASSERT((mask & (ATTR_MODE|ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET|
> > > +			ATTR_MTIME_SET|ATTR_KILL_SUID|ATTR_KILL_SGID|
> > > +			ATTR_KILL_PRIV|ATTR_TIMES_SET)) == 0);
> > 
> > You'll have to mask these off in xfs_vn_setattr() if you're
> > going to make this assertion.
> 
> No, this is the (implicit) calling convention by the VFS.
> 
> > > -		if (S_ISDIR(ip->i_d.di_mode)) {
> > > -			code = XFS_ERROR(EISDIR);
> > > -			goto error_return;
> > > -		} else if (!S_ISREG(ip->i_d.di_mode)) {
> > > -			code = XFS_ERROR(EINVAL);
> > > -			goto error_return;
> > > -		}
> > 
> > This is the file type checking code I referred to above.
> 
> It simply was a leftover from IRIX that we can't hit on Linux.
> 



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-06-29 14:01 ` [PATCH 03/27] xfs: use write_cache_pages for writeback clustering Christoph Hellwig
  2011-06-30  2:00   ` Dave Chinner
@ 2011-07-01  2:22   ` Dave Chinner
  2011-07-01  4:18     ` Dave Chinner
  2011-07-01  8:51     ` Christoph Hellwig
  1 sibling, 2 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-01  2:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Jun 29, 2011 at 10:01:12AM -0400, Christoph Hellwig wrote:
> Instead of implementing our own writeback clustering use write_cache_pages
> to do it for us.  This means the guts of the current writepage implementation
> become a new helper used both for implementing ->writepage and as a callback
> to write_cache_pages for ->writepages.  A new struct xfs_writeback_ctx
> is used to track block mapping state and the ioend chain over multiple
> invocation of it.
> 
> The advantage over the old code is that we avoid a double pagevec lookup,
> and a more efficient handling of extent boundaries inside a page for
> small blocksize filesystems, as well as having less XFS specific code.

It's not more efficient right now, due to a little bug:

> @@ -973,36 +821,38 @@ xfs_vm_writepage(
>  		 * buffers covering holes here.
>  		 */
>  		if (!buffer_mapped(bh) && buffer_uptodate(bh)) {
> -			imap_valid = 0;
> +			ctx->imap_valid = 0;
>  			continue;
>  		}
>  
>  		if (buffer_unwritten(bh)) {
>  			if (type != IO_UNWRITTEN) {
>  				type = IO_UNWRITTEN;
> -				imap_valid = 0;
> +				ctx->imap_valid = 0;
>  			}
>  		} else if (buffer_delay(bh)) {
>  			if (type != IO_DELALLOC) {
>  				type = IO_DELALLOC;
> -				imap_valid = 0;
> +				ctx->imap_valid = 0;
>  			}
>  		} else if (buffer_uptodate(bh)) {
>  			if (type != IO_OVERWRITE) {
>  				type = IO_OVERWRITE;
> -				imap_valid = 0;
> +				ctx->imap_valid = 0;
>  			}
>  		} else {
>  			if (PageUptodate(page)) {
>  				ASSERT(buffer_mapped(bh));
> -				imap_valid = 0;
> +				ctx->imap_valid = 0;
>  			}
>  			continue;
>  		}

This piece of logic checks if the type of buffer has changed from the
previous buffer. This used to work just fine, but now "type" is
local to the __xfs_vm_writepage() function, while the imap life
spans multiple calls to the __xfs_vm_writepage() function. Hence
type is reinitialised to IO_OVERWRITE on every page that is written,
and so for delalloc we are invalidating the imap and looking it up
again on every page. Traces show this sort of behaviour:

           <...>-514   [000] 689640.881953: xfs_writepage:        dev 253:16 ino 0x552248 pgoff 0xf7000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-514   [000] 689640.881954: xfs_ilock:            dev 253:16 ino 0x552248 flags ILOCK_SHARED caller xfs_map_blocks
           <...>-514   [000] 689640.881954: xfs_iunlock:          dev 253:16 ino 0x552248 flags ILOCK_SHARED caller xfs_map_blocks
           <...>-514   [000] 689640.881954: xfs_map_blocks_found: dev 253:16 ino 0x552248 size 0x0 new_size 0x0 offset 0xf7000 count 1024 type  startoff 0x0 startblock 6297609 blockcount 0x2800
           <...>-514   [000] 689640.881956: xfs_writepage:        dev 253:16 ino 0x552248 pgoff 0xf8000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-514   [000] 689640.881957: xfs_ilock:            dev 253:16 ino 0x552248 flags ILOCK_SHARED caller xfs_map_blocks
           <...>-514   [000] 689640.881957: xfs_iunlock:          dev 253:16 ino 0x552248 flags ILOCK_SHARED caller xfs_map_blocks
           <...>-514   [000] 689640.881957: xfs_map_blocks_found: dev 253:16 ino 0x552248 size 0x0 new_size 0x0 offset 0xf8000 count 1024 type  startoff 0x0 startblock 6297609 blockcount 0x2800
           <...>-514   [000] 689640.881960: xfs_writepage:        dev 253:16 ino 0x552248 pgoff 0xf9000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-514   [000] 689640.881960: xfs_ilock:            dev 253:16 ino 0x552248 flags ILOCK_SHARED caller xfs_map_blocks
           <...>-514   [000] 689640.881961: xfs_iunlock:          dev 253:16 ino 0x552248 flags ILOCK_SHARED caller xfs_map_blocks
           <...>-514   [000] 689640.881961: xfs_map_blocks_found: dev 253:16 ino 0x552248 size 0x0 new_size 0x0 offset 0xf9000 count 1024 type  startoff 0x0 startblock 6297609 blockcount 0x2800

IOWs, the type field also needs to be moved into the writepage
context structure so that we don't keep doing needless extent map
lookups.

With the following patch, the trace output now looks like this for
delalloc writeback:

           <...>-12623 [000] 694093.594883: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x505000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-12623 [000] 694093.594884: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x506000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-12623 [000] 694093.594884: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x507000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-12623 [000] 694093.594885: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x508000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-12623 [000] 694093.594885: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x509000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-12623 [000] 694093.594886: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x50a000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-12623 [000] 694093.594887: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x50b000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-12623 [000] 694093.594888: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x50c000 size 0xa00000 offset 0 delalloc 1 unwritten 0


i.e. the mapping lookup is no longer occurring for every page.

As a side effect, the failure case I'm seeing with test 180 has gone
from 5-10 files with the wrong size to >200 files with the wrong
size with this patch, so clearly there is something wrong with file
size updates getting to disk that this patch set makes worse.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: io type needs to be part of the writepage context

From: Dave Chinner <dchinner@redhat.com>

If we don't pass the IO type we are mapping in the writepage
context, then the imap is recalculated on every delalloc page that
is passed to __xfs_vm_writepage(). This defeats the purpose of having
a cached imap between calls and increases the overhead of delalloc
writeback significantly.

Fix this by moving the io type into the writepage context structure
so that it moves with the cached imap through the stack.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_aops.c |   30 ++++++++++++++++++------------
 1 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 73dac4b..25b63cd 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -40,6 +40,7 @@
 
 struct xfs_writeback_ctx {
 	unsigned int		imap_valid;
+	unsigned int		io_type;
 	struct xfs_bmbt_irec	imap;
 	struct xfs_ioend	*iohead;
 	struct xfs_ioend	*ioend;
@@ -804,7 +805,6 @@ __xfs_vm_writepage(
 
 	bh = head = page_buffers(page);
 	offset = page_offset(page);
-	type = IO_OVERWRITE;
 
 	do {
 		int new_ioend = 0;
@@ -826,18 +826,18 @@ __xfs_vm_writepage(
 		}
 
 		if (buffer_unwritten(bh)) {
-			if (type != IO_UNWRITTEN) {
-				type = IO_UNWRITTEN;
+			if (ctx->io_type != IO_UNWRITTEN) {
+				ctx->io_type = IO_UNWRITTEN;
 				ctx->imap_valid = 0;
 			}
 		} else if (buffer_delay(bh)) {
-			if (type != IO_DELALLOC) {
-				type = IO_DELALLOC;
+			if (ctx->io_type != IO_DELALLOC) {
+				ctx->io_type = IO_DELALLOC;
 				ctx->imap_valid = 0;
 			}
 		} else if (buffer_uptodate(bh)) {
-			if (type != IO_OVERWRITE) {
-				type = IO_OVERWRITE;
+			if (ctx->io_type != IO_OVERWRITE) {
+				ctx->io_type = IO_OVERWRITE;
 				ctx->imap_valid = 0;
 			}
 		} else {
@@ -862,7 +862,8 @@ __xfs_vm_writepage(
 			 * time.
 			 */
 			new_ioend = 1;
-			err = xfs_map_blocks(inode, offset, &ctx->imap, type);
+			err = xfs_map_blocks(inode, offset, &ctx->imap,
+					     ctx->io_type);
 			if (err)
 				goto error;
 			ctx->imap_valid =
@@ -870,11 +871,12 @@ __xfs_vm_writepage(
 		}
 		if (ctx->imap_valid) {
 			lock_buffer(bh);
-			if (type != IO_OVERWRITE) {
+			if (ctx->io_type != IO_OVERWRITE) {
 				xfs_map_at_offset(inode, bh, &ctx->imap,
 						  offset);
 			}
-			xfs_add_to_ioend(ctx, inode, bh, offset, type, new_ioend);
+			xfs_add_to_ioend(ctx, inode, bh, offset, ctx->io_type,
+					 new_ioend);
 			count++;
 		}
 	} while (offset += len, ((bh = bh->b_this_page) != head));
@@ -902,7 +904,9 @@ xfs_vm_writepage(
 	struct page		*page,
 	struct writeback_control *wbc)
 {
-	struct xfs_writeback_ctx ctx = { };
+	struct xfs_writeback_ctx ctx = {
+		.io_type = IO_OVERWRITE,
+	};
 	int ret;
 
 	/*
@@ -939,7 +943,9 @@ xfs_vm_writepages(
 	struct address_space	*mapping,
 	struct writeback_control *wbc)
 {
-	struct xfs_writeback_ctx ctx = { };
+	struct xfs_writeback_ctx ctx = {
+		.io_type = IO_OVERWRITE,
+	};
 	int ret;
 
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-01  2:22   ` Dave Chinner
@ 2011-07-01  4:18     ` Dave Chinner
  2011-07-01  8:59       ` Christoph Hellwig
  2011-07-01  9:33         ` Christoph Hellwig
  2011-07-01  8:51     ` Christoph Hellwig
  1 sibling, 2 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-01  4:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Jul 01, 2011 at 12:22:48PM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2011 at 10:01:12AM -0400, Christoph Hellwig wrote:
> > Instead of implementing our own writeback clustering use write_cache_pages
> > to do it for us.  This means the guts of the current writepage implementation
> > become a new helper used both for implementing ->writepage and as a callback
> > to write_cache_pages for ->writepages.  A new struct xfs_writeback_ctx
> > is used to track block mapping state and the ioend chain over multiple
> > invocation of it.
> > 
> > The advantage over the old code is that we avoid a double pagevec lookup,
> > and a more efficient handling of extent boundaries inside a page for
> > small blocksize filesystems, as well as having less XFS specific code.
> 
> It's not more efficient right now, due to a little bug:
.....
> With the following patch, the trace output now looks like this for
> delalloc writeback:
> 
>            <...>-12623 [000] 694093.594883: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x505000 size 0xa00000 offset 0 delalloc 1 unwritten 0
>            <...>-12623 [000] 694093.594884: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x506000 size 0xa00000 offset 0 delalloc 1 unwritten 0
>            <...>-12623 [000] 694093.594884: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x507000 size 0xa00000 offset 0 delalloc 1 unwritten 0
>            <...>-12623 [000] 694093.594885: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x508000 size 0xa00000 offset 0 delalloc 1 unwritten 0
>            <...>-12623 [000] 694093.594885: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x509000 size 0xa00000 offset 0 delalloc 1 unwritten 0
>            <...>-12623 [000] 694093.594886: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x50a000 size 0xa00000 offset 0 delalloc 1 unwritten 0
>            <...>-12623 [000] 694093.594887: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x50b000 size 0xa00000 offset 0 delalloc 1 unwritten 0
>            <...>-12623 [000] 694093.594888: xfs_writepage:        dev 253:16 ino 0x2300a5 pgoff 0x50c000 size 0xa00000 offset 0 delalloc 1 unwritten 0
> 
> 
> i.e. the mapping lookup is no longer occurring for every page.
> 
> As a side effect, the failure case I'm seeing with test 180 has gone
> from 5-10 files with the wrong size to >200 files with the wrong
> size with this patch, so clearly there is something wrong with file
> size updates getting to disk that this patch set makes worse.

I'm now only running test 180 on 100 files rather than the 1000 the
test normally runs on, because it's faster and still shows the
problem.  That means the test is only using 1GB of disk space, and
I'm running on a VM with 1GB RAM. It appears to be related to the VM
triggering random page writeback from the LRU - 100x10MB files more
than fills memory, hence it being the smallest test case i could
reproduce the problem on.

My triage notes are as follows, and the patch that fixes the bug is
attached below.

--- 180.out     2010-04-28 15:00:22.000000000 +1000
+++ 180.out.bad 2011-07-01 12:44:12.000000000 +1000
@@ -1 +1,9 @@
 QA output created by 180
+file /mnt/scratch/81 has incorrect size 10473472 - sync failed
+file /mnt/scratch/86 has incorrect size 10371072 - sync failed
+file /mnt/scratch/87 has incorrect size 10104832 - sync failed
+file /mnt/scratch/88 has incorrect size 10125312 - sync failed
+file /mnt/scratch/89 has incorrect size 10469376 - sync failed
+file /mnt/scratch/90 has incorrect size 10240000 - sync failed
+file /mnt/scratch/91 has incorrect size 10362880 - sync failed
+file /mnt/scratch/92 has incorrect size 10366976 - sync failed

$ ls -li /mnt/scratch/ | awk '/rw/ { printf("0x%x %d %d\n", $1, $6, $10); }'
0x244093 10473472 81
0x244098 10371072 86
0x244099 10104832 87
0x24409a 10125312 88
0x24409b 10469376 89
0x24409c 10240000 90
0x24409d 10362880 91
0x24409e 10366976 92

So looking at inode 0x244099 (/mnt/scratch/87), the last setfilesize
call in the trace (got a separate patch for that) is:

           <...>-393   [000] 696245.229559: xfs_ilock_nowait:     dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
           <...>-393   [000] 696245.229560: xfs_setfilesize:      dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
           <...>-393   [000] 696245.229561: xfs_iunlock:          dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize

For an IO that was from offset 0x600000 for just under 4MB. The end
of that IO is at byte 10104832, which is _exactly_ what the inode
size says it is.

It is very clear from the IO completions that we are getting a
*lot* of kswapd driven writeback directly through .writepage:

$ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
801
$ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
78

So there's ~900 IO completions that change the file size, and 90% of
them are single page updates.

$ ps -ef |grep [k]swap
root       514     2  0 12:43 ?        00:00:00 [kswapd0]
$ grep "writepage:" t.t | grep "514 " |wc -l
799

Oh, now that is too close to just be a coincidence. We're getting
significant amounts of random page writeback from the ends of
the LRUs done by the VM.

<sigh>

back on topic:

           <...>-393   [000] 696245.511905: xfs_ilock_nowait:     dev 253:16 ino 0x24409e flags ILOCK_EXCL caller xfs_setfilesize
           <...>-393   [000] 696245.511906: xfs_setfilesize:      dev 253:16 ino 0x24409e isize 0xa00000 disize 0x99e000 new_size 0x0 offset 0x99e000 count 4096
           <...>-393   [000] 696245.511906: xfs_iunlock:          dev 253:16 ino 0x24409e flags ILOCK_EXCL caller xfs_setfilesize

Completion that updated the file size

           <...>-393   [000] 696245.515279: xfs_ilock_nowait:     dev 253:16 ino 0x24409e flags ILOCK_EXCL caller xfs_setfilesize
           <...>-393   [000] 696245.515280: xfs_iunlock:          dev 253:16 ino 0x24409e flags ILOCK_EXCL caller xfs_setfilesize

Immediately followed by one that didn't.

           <...>-2619  [000] 696245.806576: xfs_writepage:        dev 253:16 ino 0x24409e pgoff 0x858000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-2619  [000] 696245.806578: xfs_ilock:            dev 253:16 ino 0x24409e flags ILOCK_SHARED caller xfs_map_blocks
           <...>-2619  [000] 696245.806579: xfs_iunlock:          dev 253:16 ino 0x24409e flags ILOCK_SHARED caller xfs_map_blocks
           <...>-2619  [000] 696245.806579: xfs_map_blocks_found: dev 253:16 ino 0x24409e size 0x99f000 new_size 0x0 offset 0x858000 count 1024 type  startoff 0x0 startblock 931888 blockcount 0x2800

New writepage call, showing the on-disk file size matches the last xfs_setfilesize call.

           <...>-2619  [000] 696245.806581: xfs_writepage:        dev 253:16 ino 0x24409e pgoff 0x859000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-2619  [000] 696245.806582: xfs_writepage:        dev 253:16 ino 0x24409e pgoff 0x85a000 size 0xa00000 offset 0 delalloc 1 unwritten 0
.....
           <...>-2619  [000] 696245.806825: xfs_writepage:        dev 253:16 ino 0x24409e pgoff 0x9fc000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-2619  [000] 696245.806826: xfs_writepage:        dev 253:16 ino 0x24409e pgoff 0x9fd000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-2619  [000] 696245.806827: xfs_writepage:        dev 253:16 ino 0x24409e pgoff 0x9fe000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-2619  [000] 696245.806828: xfs_writepage:        dev 253:16 ino 0x24409e pgoff 0x9ff000 size 0xa00000 offset 0 delalloc 1 unwritten 0

Ummmm, hold on just a second there. We've already written the page
at pgoff 0x9fe000: how else did we get that completion and file size
update?  So how come that page is still considered to be dirty *and*
delalloc?  WTF?

Ok, so limit the tracing to writepage, block map and setfilesize
events to try to see what is going on.

           <...>-514   [000] 699227.049423: xfs_writepage:        dev 253:16 ino 0x21c098 pgoff 0x88a000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-514   [000] 699227.049426: xfs_map_blocks_found: dev 253:16 ino 0x21c098 size 0x0 new_size 0x0 offset 0x88a000 count 1024 type  startoff 0x0 startblock 870448 blockcount 0x2800
           <...>-393   [000] 699227.229449: xfs_setfilesize:      dev 253:16 ino 0x21c098 isize 0xa00000 disize 0x0 new_size 0x0 offset 0x0 count 2097152
           <...>-514   [000] 699227.251726: xfs_writepage:        dev 253:16 ino 0x21c098 pgoff 0x88b000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-514   [000] 699227.251729: xfs_map_blocks_found: dev 253:16 ino 0x21c098 size 0x200000 new_size 0x0 offset 0x88b000 count 1024 type  startoff 0x0 startblock 870448 blockcount 0x2800
.....

Ok, a bunch of kswapd writeback, then:

           <...>-4070  [000] 699227.987373: xfs_writepage:        dev 253:16 ino 0x21c098 pgoff 0x800000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-4070  [000] 699227.987376: xfs_map_blocks_found: dev 253:16 ino 0x21c098 size 0x8ab000 new_size 0x0 offset 0x800000 count 1024 type  startoff 0x0 startblock 870448 blockcount 0x2800
           <...>-4070  [000] 699227.987377: xfs_writepage:        dev 253:16 ino 0x21c098 pgoff 0x801000 size 0xa00000 offset 0 delalloc 1 unwritten 0
.....
           <...>-4070  [000] 699227.987706: xfs_writepage:        dev 253:16 ino 0x21c098 pgoff 0x9fe000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-4070  [000] 699227.987707: xfs_writepage:        dev 253:16 ino 0x21c098 pgoff 0x9ff000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-393   [000] 699228.154118: xfs_setfilesize:      dev 253:16 ino 0x21c098 isize 0xa00000 disize 0x8ab000 new_size 0x0 offset 0x800000 count 1961984
 
Normal writeback. Ok, writeback there spanned a range of 0x200000
bytes (2^21 bytes, or 2MiB) of pages, but we get an ioend count of only
1961984 bytes, which is 132KiB short. Ok, looking back at the kswapd
writeback, it fell right in the middle of this range, and what we
see is this during the scanning:

           <...>-4070  [000] 699227.987474: xfs_writepage:        dev 253:16 ino 0x21c098 pgoff 0x888000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-4070  [000] 699227.987475: xfs_writepage:        dev 253:16 ino 0x21c098 pgoff 0x889000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-4070  [000] 699227.987476: xfs_writepage:        dev 253:16 ino 0x21c098 pgoff 0x8ab000 size 0xa00000 offset 0 delalloc 1 unwritten 0
           <...>-4070  [000] 699227.987477: xfs_writepage:        dev 253:16 ino 0x21c098 pgoff 0x8ac000 size 0xa00000 offset 0 delalloc 1 unwritten 0

A non-contiguous page range. That hole is 132KiB long, which matches
the bytes missing from the ioend exactly.
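
(Doing the numbers: the hole runs from 0x88a000 to 0x8aa000 inclusive,
i.e. 0x8ab000 - 0x88a000 = 0x21000 = 135168 bytes = 33 pages, and
0x200000 - 1961984 = 2097152 - 1961984 = 135168 as well, so the hole
accounts for all of the missing bytes.)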

What this means is that the ioend is aggregating a non-contiguous
range of pages, which is being submitted as multiple IOs, so the data
is being written to the correct place. The problem is that the ioend
size doesn't include the holes, so it doesn't reflect the range of IO
correctly and so is not setting the file size correctly.

The old code used to terminate an ioend when a discontiguity in the
mapping was discovered by the clustering page cache lookup. The
callback we have now doesn't do this discontiguity discovery, so is
simply placing discontiguous pages in the same ioend. We need to
start a new ioend when we get a discontiguity.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: writepage context needs to handle discontiguous page ranges

From: Dave Chinner <dchinner@redhat.com>

If the pages sent down by write_cache_pages to the writepage
callback are discontiguous, we need to detect this and put each
discontiguous page range into individual ioends. This is needed to
ensure that the ioend accurately represents the range of the file
that it covers so that file size updates during IO completion set
the size correctly. Failure to take into account the discontiguous
ranges results in files being too small when writeback patterns are
non-sequential.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_aops.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 9f3f387..eadff82 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -44,6 +44,7 @@ struct xfs_writeback_ctx {
 	struct xfs_bmbt_irec	imap;
 	struct xfs_ioend	*iohead;
 	struct xfs_ioend	*ioend;
+	sector_t		last_block;
 };
 
 /*
@@ -575,7 +576,10 @@ xfs_add_to_ioend(
 	unsigned int		type,
 	int			need_ioend)
 {
-	if (!ctx->ioend || need_ioend || type != ctx->ioend->io_type) {
+	if (!ctx->ioend ||
+	     need_ioend ||
+	     type != ctx->ioend->io_type ||
+	     bh->b_blocknr != ctx->last_block + 1) {
 		struct xfs_ioend	*new;
 
 		new = xfs_alloc_ioend(inode, type);
@@ -595,6 +599,7 @@ xfs_add_to_ioend(
 
 	bh->b_private = NULL;
 	ctx->ioend->io_size += bh->b_size;
+	ctx->last_block = bh->b_blocknr;
 }
 
 STATIC void

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-01  2:22   ` Dave Chinner
  2011-07-01  4:18     ` Dave Chinner
@ 2011-07-01  8:51     ` Christoph Hellwig
  1 sibling, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-07-01  8:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

> 
> This piece of logic checks if the type of buffer has changed from the
> previous buffer. This used to work just fine, but now "type" is
> local to the __xfs_vm_writepage() function, while the imap life
> spans multiple calls to the __xfs_vm_writepage() function. Hence
> type is reinitialised to IO_OVERWRITE on every page that is written,
> and so for delalloc we are invalidating the imap and looking it up
> again on every page. Traces show this sort of behaviour:

Ah crap.  I actually had it that way initially, but it got lost during
a rebase due to a minimal context change screwing most hunks of the
patch.

Thanks for tracking this down!

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-01  4:18     ` Dave Chinner
@ 2011-07-01  8:59       ` Christoph Hellwig
  2011-07-01  9:20         ` Dave Chinner
  2011-07-01  9:33         ` Christoph Hellwig
  1 sibling, 1 reply; 100+ messages in thread
From: Christoph Hellwig @ 2011-07-01  8:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

> xfs: writepage context needs to handle discontiguous page ranges
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> If the pages sent down by write_cache_pages to the writepage
> callback are discontiguous, we need to detect this and put each
> discontiguous page range into individual ioends. This is needed to
> ensure that the ioend accurately represents the range of the file
> that it covers so that file size updates during IO completion set
> the size correctly. Failure to take into account the discontiguous
> ranges results in files being too small when writeback patterns are
> non-sequential.

Looks good.  I still wonder why I haven't been able to hit this.
Haven't seen any 180 failure for a long time, with both 4k and 512 byte
filesystems and since yesterday 1k as well.

I'll merge this, and to avoid bisect regressions it'll have to go into
the main writepages patch.  That probaby means folding the add_to_ioend
cleanup into it as well to not make the calling convention too ugly.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-01  8:59       ` Christoph Hellwig
@ 2011-07-01  9:20         ` Dave Chinner
  0 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-01  9:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Jul 01, 2011 at 04:59:58AM -0400, Christoph Hellwig wrote:
> > xfs: writepage context needs to handle discontiguous page ranges
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > If the pages sent down by write_cache_pages to the writepage
> > callback are discontiguous, we need to detect this and put each
> > discontiguous page range into individual ioends. This is needed to
> > ensure that the ioend accurately represents the range of the file
> > that it covers so that file size updates during IO completion set
> > the size correctly. Failure to take into account the discontiguous
> > ranges results in files being too small when writeback patterns are
> > non-sequential.
> 
> Looks good.  I still wonder why I haven't been able to hit this.
> Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> filesystems and since yesterday 1k as well.

It requires the test to run the VM out of RAM and then force enough
memory pressure for kswapd to start writeback from the LRU. The
reproducer I have is a 1p, 1GB RAM VM with its disk image on a
100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.

When kswapd starts doing writeback from the LRU, the iops rate goes
through the roof (from ~300iops @~320k/io to ~7000iops @4k/io) and
throughput drops from 100MB/s to ~30MB/s. BBWC is the only reason
the IOPS stays as high as it does - maybe that is why I saw this and
you haven't.

As it is, the kswapd writeback behaviour is utterly atrocious and,
ultimately, quite easy to provoke. I wish the MM folk would fix that
goddamn problem already - we've only been complaining about it for
the last 6 or 7 years. As such, I'm wondering if it's a bad idea to
even consider removing the .writepage clustering...

> I'll merge this, and to avoid bisect regressions it'll have to go into
> the main writepages patch.  That probably means folding the add_to_ioend
> cleanup into it as well to not make the calling convention too ugly.

Yup, I figured you'd want to do that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-01  4:18     ` Dave Chinner
@ 2011-07-01  9:33         ` Christoph Hellwig
  2011-07-01  9:33         ` Christoph Hellwig
  1 sibling, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-07-01  9:33 UTC (permalink / raw)
  To: Mel Gorman, Johannes Weiner, Wu Fengguang; +Cc: linux-mm, xfs

Johannes, Mel, Wu,

Dave has been stressing some XFS patches of mine that remove the XFS
internal writeback clustering in favour of using write_cache_pages.

As part of investigating the behaviour he found out that we're still
doing lots of I/O from the end of the LRU in kswapd.  Not only is that
pretty bad behaviour in general, but it also means we really can't
just remove the writeback clustering in writepage given how much
I/O is still done through that.

Any chance we could get the writeback vs kswap behaviour sorted out a
bit better, finally?

Some excerpts from the previous discussion:

On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> I'm now only running test 180 on 100 files rather than the 1000 the
> test normally runs on, because it's faster and still shows the
> problem.  That means the test is only using 1GB of disk space, and
> I'm running on a VM with 1GB RAM. It appears to be related to the VM
> triggering random page writeback from the LRU - 100x10MB files more
> than fills memory, hence it being the smallest test case i could
> reproduce the problem on.
> 
> My triage notes are as follows, and the patch that fixes the bug is
> attached below.
> 
> --- 180.out     2010-04-28 15:00:22.000000000 +1000
> +++ 180.out.bad 2011-07-01 12:44:12.000000000 +1000
> @@ -1 +1,9 @@
>  QA output created by 180
> +file /mnt/scratch/81 has incorrect size 10473472 - sync failed
> +file /mnt/scratch/86 has incorrect size 10371072 - sync failed
> +file /mnt/scratch/87 has incorrect size 10104832 - sync failed
> +file /mnt/scratch/88 has incorrect size 10125312 - sync failed
> +file /mnt/scratch/89 has incorrect size 10469376 - sync failed
> +file /mnt/scratch/90 has incorrect size 10240000 - sync failed
> +file /mnt/scratch/91 has incorrect size 10362880 - sync failed
> +file /mnt/scratch/92 has incorrect size 10366976 - sync failed
> 
> $ ls -li /mnt/scratch/ | awk '/rw/ { printf("0x%x %d %d\n", $1, $6, $10); }'
> 0x244093 10473472 81
> 0x244098 10371072 86
> 0x244099 10104832 87
> 0x24409a 10125312 88
> 0x24409b 10469376 89
> 0x24409c 10240000 90
> 0x24409d 10362880 91
> 0x24409e 10366976 92
> 
> So looking at inode 0x244099 (/mnt/scratch/87), the last setfilesize
> call in the trace (got a separate patch for that) is:
> 
>            <...>-393   [000] 696245.229559: xfs_ilock_nowait:     dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
>            <...>-393   [000] 696245.229560: xfs_setfilesize:      dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
>            <...>-393   [000] 696245.229561: xfs_iunlock:          dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> 
> For an IO that was from offset 0x600000 for just under 4MB. The end
> of that IO is at byte 10104832, which is _exactly_ what the inode
> size says it is.
> 
> It is very clear from the IO completions that we are getting a
> *lot* of kswapd driven writeback directly through .writepage:
> 
> $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> 801
> $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> 78
> 
> So there's ~900 IO completions that change the file size, and 90% of
> them are single page updates.
> 
> $ ps -ef |grep [k]swap
> root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> $ grep "writepage:" t.t | grep "514 " |wc -l
> 799
> 
> Oh, now that is too close to just be a coincidence. We're getting
> significant amounts of random page writeback from the ends of
> the LRUs done by the VM.
> 
> <sigh>


On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > Looks good.  I still wonder why I haven't been able to hit this.
> > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > filesystems and since yesterday 1k as well.
> 
> It requires the test to run the VM out of RAM and then force enough
> memory pressure for kswapd to start writeback from the LRU. The
> reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> 
> When kswapd starts doing writeback from the LRU, the iops rate goes
> through the roof (from ~300iops @~320k/io to ~7000iops @4k/io) and
> throughput drops from 100MB/s to ~30MB/s. BBWC is the only reason
> the IOPS stays as high as it does - maybe that is why I saw this and
> you haven't.
> 
> As it is, the kswapd writeback behaviour is utterly atrocious and,
> ultimately, quite easy to provoke. I wish the MM folk would fix that
> goddamn problem already - we've only been complaining about it for
> the last 6 or 7 years. As such, I'm wondering if it's a bad idea to
> even consider removing the .writepage clustering...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-01  9:33         ` Christoph Hellwig
@ 2011-07-01 14:59           ` Mel Gorman
  -1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-07-01 14:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: jack, xfs, linux-mm, Wu Fengguang, Johannes Weiner

On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> Johannes, Mel, Wu,

Am adding Jan Kara as he has been working on writeback efficiency
recently as well.

> Dave has been stressing some XFS patches of mine that remove the XFS
> internal writeback clustering in favour of using write_cache_pages.
> 

Against what kernel? 2.6.38 was a disaster for reclaim I've been
finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.

> As part of investigating the behaviour he found out that we're still
> doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> pretty bad behaviour in general, but it also means we really can't
> just remove the writeback clustering in writepage given how much
> I/O is still done through that.
> 
> Any chance we could get the writeback vs kswapd behaviour sorted out a bit
> better finally?
> 
> Some excerpts from the previous discussion:
> 
> On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> > I'm now only running test 180 on 100 files rather than the 1000 the
> > test normally runs on, because it's faster and still shows the
> > problem. 

I had stopped looking at writeback problems while Wu and Jan were
working on various writeback patchsets like io-less throttling. I
don't know where they currently stand and while I submitted a number
of reclaim patches since I last looked at this problem around 2.6.37,
they were related to migration, kswapd reclaiming too much memory
and kswapd using too much CPU - not writeback.

At the time I stopped, the tests I was looking at were writing very
few pages off the end of the LRU. Unfortunately I no longer have the
results to see, but for unrelated reasons I've been running other
regression tests. Here is an example fsmark report over a number of
kernels. The machine used is old but unfortunately it's the only one
I have a full range of results for at the moment.

FS-Mark
            fsmark-2.6.32.42-mainline-fsmark  fsmark-2.6.34.10-mainline-fsmark  fsmark-2.6.37.6-mainline-fsmark  fsmark-2.6.38-mainline-fsmark  fsmark-2.6.39-mainline-fsmark
            2.6.32.42-mainline 2.6.34.10-mainline 2.6.37.6-mainline   2.6.38-mainline   2.6.39-mainline
Files/s  min         162.80 ( 0.00%)      156.20 (-4.23%)      155.60 (-4.63%)      157.80 (-3.17%)      151.10 (-7.74%)
Files/s  mean        173.77 ( 0.00%)      176.27 ( 1.42%)      168.19 (-3.32%)      172.98 (-0.45%)      172.05 (-1.00%)
Files/s  stddev        7.64 ( 0.00%)       12.54 (39.05%)        8.55 (10.57%)        8.39 ( 8.90%)       10.30 (25.77%)
Files/s  max         190.30 ( 0.00%)      206.80 ( 7.98%)      185.20 (-2.75%)      198.90 ( 4.32%)      201.00 ( 5.32%)
Overhead min     1742851.00 ( 0.00%)  1612311.00 ( 8.10%)  1251552.00 (39.26%)  1239859.00 (40.57%)  1393047.00 (25.11%)
Overhead mean    2443021.87 ( 0.00%)  2486525.60 (-1.75%)  2024365.53 (20.68%)  1849402.47 (32.10%)  1886692.53 (29.49%)
Overhead stddev   744034.70 ( 0.00%)   359446.19 (106.99%)   335986.49 (121.45%)   375627.48 (98.08%)   320901.34 (131.86%)
Overhead max     4744130.00 ( 0.00%)  3082235.00 (53.92%)  2561054.00 (85.24%)  2626346.00 (80.64%)  2559170.00 (85.38%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)        624.12    647.61     658.8    670.78    653.98
Total Elapsed Time (seconds)               5767.71   5742.30   5974.45   5852.32   5760.49

MMTests Statistics: vmstat
Page Ins                                   3143712   3367600   3108596   3371952   3102548
Page Outs                                104939296 105255268 105126820 105130540 105226620
Swap Ins                                         0         0         0         0         0
Swap Outs                                        0         0         0         0         0
Direct pages scanned                          3521       131      7035         0         0
Kswapd pages scanned                      23596104  23662641  23588211  23695015  23638226
Kswapd pages reclaimed                    23594758  23661359  23587478  23693447  23637005
Direct pages reclaimed                        3521       131      7031         0         0
Kswapd efficiency                              99%       99%       99%       99%       99%
Kswapd velocity                           4091.070  4120.760  3948.181  4048.824  4103.510
Direct efficiency                             100%      100%       99%      100%      100%
Direct velocity                              0.610     0.023     1.178     0.000     0.000
Percentage direct scans                         0%        0%        0%        0%        0%
Page writes by reclaim                          75        32        37       252        44
Slabs scanned                              1843200   1927168   2714112   2801280   2738816
Direct inode steals                              0         0         0         0         0
Kswapd inode steals                        1827970   1822770   1669879   1819583   1681155
Compaction stalls                                0         0         0         0         0
Compaction success                               0         0         0         0         0
Compaction failures                              0         0         0         0         0
Compaction pages moved                           0         0         0    228180         0
Compaction move failure                          0         0         0    637776         0

The number of pages written from reclaim is exceptionally low (2.6.38
was a total disaster but that release was bad for a number of reasons,
haven't tested 2.6.38.8 yet) but reduced by 2.6.37 as expected. Direct
reclaim usage was reduced and efficiency (the ratio of pages reclaimed
to pages scanned) was high.
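
(As a concrete example from the 2.6.32.42 column above: 23594758 pages
reclaimed out of 23596104 scanned by kswapd is about 99.99%, which the
report shows as the 99% "Kswapd efficiency" figure.)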

As I look through the results I have at the moment, the number of
pages written back was simply really low which is why the problem fell
off my radar.

> > That means the test is only using 1GB of disk space, and
> > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > triggering random page writeback from the LRU - 100x10MB files more
> > than fills memory, hence it being the smallest test case I could
> > reproduce the problem on.
> > 

My tests were on a machine with 8G and ext3. I'm running some of
the tests against ext4 and xfs to see if that makes a difference but
it's possible the tests are simply not aggressive enough so I want to
reproduce Dave's test if possible.

I'm assuming "test 180" is from xfstests which was not one of the tests
I used previously. To run with 100 files instead of 1000, was the file
"180" simply edited to make it look like this loop instead?

# create files and sync them
i=1;
while [ $i -lt 100 ]
do
        file=$SCRATCH_MNT/$i
        xfs_io -f -c "pwrite -b 64k -S 0xff 0 10m" $file > /dev/null
        if [ $? -ne 0 ]
        then
                echo error creating/writing file $file
                exit
        fi
        let i=$i+1
done

> > My triage notes are as follows, and the patch that fixes the bug is
> > attached below.
> > 
> > <SNIP>
> > 
> >            <...>-393   [000] 696245.229559: xfs_ilock_nowait:     dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> >            <...>-393   [000] 696245.229560: xfs_setfilesize:      dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
> >            <...>-393   [000] 696245.229561: xfs_iunlock:          dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> > 
> > For an IO that was from offset 0x600000 for just under 4MB. The end
> > of that IO is at byte 10104832, which is _exactly_ what the inode
> > size says it is.
> > 
> > It is very clear from the IO completions that we are getting a
> > *lot* of kswapd driven writeback directly through .writepage:
> > 
> > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > 801
> > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > 78
> > 
> > So there's ~900 IO completions that change the file size, and 90% of
> > them are single page updates.
> > 
> > $ ps -ef |grep [k]swap
> > root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> > $ grep "writepage:" t.t | grep "514 " |wc -l
> > 799
> > 
> > Oh, now that is too close to just be a co-incidence. We're getting
> > significant amounts of random page writeback from the ends of
> > the LRUs done by the VM.
> > 
> > <sigh>

Does the value for nr_vmscan_write in /proc/vmstat correlate? It must,
but let me be sure, because I'm using that figure rather than ftrace to
count writebacks at the moment. A more relevant question is this -
how many pages were reclaimed by kswapd and what percentage is 799
pages of that? What do you consider an acceptable percentage?
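
Something along these lines around the test run would answer that from
the same /proc/vmstat counters I'm using (a rough sketch, untested):

$ grep -E 'nr_vmscan_write|kswapd' /proc/vmstat > vmstat.before
$ # ... run the workload ...
$ grep -E 'nr_vmscan_write|kswapd' /proc/vmstat > vmstat.after
$ diff vmstat.before vmstat.after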

> On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > Looks good.  I still wonder why I haven't been able to hit this.
> > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > filesystems and since yesterday 1k as well.
> > 
> > It requires the test to run the VM out of RAM and then force enough
> > memory pressure for kswapd to start writeback from the LRU. The
> > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> > 

You say it's a 1G VM but you don't say what architecture. What is
the size of the highest zone? If this is 32-bit x86 for example, the
highest zone is HighMem and it would be really small. Unfortunately
it would always be the first choice for allocating and reclaiming
from, which would drastically increase the number of pages written back
from reclaim.
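
A quick way to check (a sketch; zone names and sizes obviously depend
on the arch and config) is:

$ grep -E '^Node|present' /proc/zoneinfo

A tiny HighMem zone showing up there would go a long way towards
explaining writeback from reclaim on that machine.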

-- 
Mel Gorman
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-01 14:59           ` Mel Gorman
@ 2011-07-01 15:15             ` Christoph Hellwig
  -1 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-07-01 15:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: jack, xfs, Christoph Hellwig, linux-mm, Wu Fengguang, Johannes Weiner

On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> > Johannes, Mel, Wu,
> 
> Am adding Jan Kara as he has been working on writeback efficiency
> recently as well.
> 
> > Dave has been stressing some XFS patches of mine that remove the XFS
> > internal writeback clustering in favour of using write_cache_pages.
> > 
> 
> Against what kernel? 2.6.38 was a disaster for reclaim I've been
> finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.

The patch series is against current 3.0-rc, I assume that's what Dave
tested as well.

> I'm assuming "test 180" is from xfstests which was not one of the tests
> I used previously. To run with 100 files instead of 1000, was the file
> "180" simply edited to make it look like this loop instead?

Yes, to both questions.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-01  9:33         ` Christoph Hellwig
@ 2011-07-01 15:41           ` Wu Fengguang
  -1 siblings, 0 replies; 100+ messages in thread
From: Wu Fengguang @ 2011-07-01 15:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-mm, xfs, Mel Gorman, Johannes Weiner

Christoph,

On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> Johannes, Mel, Wu,
> 
> Dave has been stressing some XFS patches of mine that remove the XFS
> internal writeback clustering in favour of using write_cache_pages.
> 
> As part of investigating the behaviour he found out that we're still
> doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> pretty bad behaviour in general, but it also means we really can't
> just remove the writeback clustering in writepage given how much
> I/O is still done through that.
> 
> Any chance we could get the writeback vs kswapd behaviour sorted out a bit
> better finally?

I once tried this approach:

http://www.spinics.net/lists/linux-mm/msg09202.html

It used a list structure that is not linearly scalable; however, that
part should be independently improvable when necessary.

The real problem was that it did not seem very effective in my test runs.
I found many ->nr_pages works queued before the ->inode works, which
effectively makes the flusher work on more dispersed pages rather
than focusing on the dirty pages encountered in LRU reclaim.

So for the patch to work efficiently, we'll need to first merge the
->nr_pages works and make them lower priority than the ->inode works.

Thanks,
Fengguang

> Some excerpts from the previous discussion:
> 
> On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> > I'm now only running test 180 on 100 files rather than the 1000 the
> > test normally runs on, because it's faster and still shows the
> > problem.  That means the test is only using 1GB of disk space, and
> > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > triggering random page writeback from the LRU - 100x10MB files more
> > than fills memory, hence it being the smallest test case I could
> > reproduce the problem on.
> > 
> > My triage notes are as follows, and the patch that fixes the bug is
> > attached below.
> > 
> > --- 180.out     2010-04-28 15:00:22.000000000 +1000
> > +++ 180.out.bad 2011-07-01 12:44:12.000000000 +1000
> > @@ -1 +1,9 @@
> >  QA output created by 180
> > +file /mnt/scratch/81 has incorrect size 10473472 - sync failed
> > +file /mnt/scratch/86 has incorrect size 10371072 - sync failed
> > +file /mnt/scratch/87 has incorrect size 10104832 - sync failed
> > +file /mnt/scratch/88 has incorrect size 10125312 - sync failed
> > +file /mnt/scratch/89 has incorrect size 10469376 - sync failed
> > +file /mnt/scratch/90 has incorrect size 10240000 - sync failed
> > +file /mnt/scratch/91 has incorrect size 10362880 - sync failed
> > +file /mnt/scratch/92 has incorrect size 10366976 - sync failed
> > 
> > $ ls -li /mnt/scratch/ | awk '/rw/ { printf("0x%x %d %d\n", $1, $6, $10); }'
> > 0x244093 10473472 81
> > 0x244098 10371072 86
> > 0x244099 10104832 87
> > 0x24409a 10125312 88
> > 0x24409b 10469376 89
> > 0x24409c 10240000 90
> > 0x24409d 10362880 91
> > 0x24409e 10366976 92
> > 
> > So looking at inode 0x244099 (/mnt/scratch/87), the last setfilesize
> > call in the trace (got a separate patch for that) is:
> > 
> >            <...>-393   [000] 696245.229559: xfs_ilock_nowait:     dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> >            <...>-393   [000] 696245.229560: xfs_setfilesize:      dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
> >            <...>-393   [000] 696245.229561: xfs_iunlock:          dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> > 
> > For an IO that was from offset 0x600000 for just under 4MB. The end
> > of that IO is at byte 10104832, which is _exactly_ what the inode
> > size says it is.
> > 
> > It is very clear from the IO completions that we are getting a
> > *lot* of kswapd driven writeback directly through .writepage:
> > 
> > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > 801
> > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > 78
> > 
> > So there's ~900 IO completions that change the file size, and 90% of
> > them are single page updates.
> > 
> > $ ps -ef |grep [k]swap
> > root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> > $ grep "writepage:" t.t | grep "514 " |wc -l
> > 799
> > 
> > Oh, now that is too close to just be a co-incidence. We're getting
> > significant amounts of random page writeback from the ends of
> > the LRUs done by the VM.
> > 
> > <sigh>
> 
> 
> On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > Looks good.  I still wonder why I haven't been able to hit this.
> > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > filesystems and since yesterday 1k as well.
> > 
> > It requires the test to run the VM out of RAM and then force enough
> > memory pressure for kswapd to start writeback from the LRU. The
> > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> > 
> > When kswapd starts doing writeback from the LRU, the iops rate goes
> > through the roof (from ~300iops @~320k/io to ~7000iops @4k/io) and
> > throughput drops from 100MB/s to ~30MB/s. BBWC is the only reason
> > the IOPS stays as high as it does - maybe that is why I saw this and
> > you haven't.
> > 
> > As it is, the kswapd writeback behaviour is utterly atrocious and,
> > ultimately, quite easy to provoke. I wish the MM folk would fix that
> > goddamn problem already - we've only been complaining about it for
> > the last 6 or 7 years. As such, I'm wondering if it's a bad idea to
> > even consider removing the .writepage clustering...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-01 14:59           ` Mel Gorman
@ 2011-07-02  2:42             ` Dave Chinner
  -1 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-02  2:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: jack, xfs, Christoph Hellwig, linux-mm, Wu Fengguang, Johannes Weiner

On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> > Johannes, Mel, Wu,
> 
> Am adding Jan Kara as he has been working on writeback efficiency
> recently as well.

Writeback looks to be working fine - it's kswapd screwing up the
writeback patterns that appears to be the problem....

> > Dave has been stressing some XFS patches of mine that remove the XFS
> > internal writeback clustering in favour of using write_cache_pages.
> 
> Against what kernel? 2.6.38 was a disaster for reclaim I've been
> finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.

3.0-rc4

....
> The number of pages written from reclaim is exceptionally low (2.6.38
> was a total disaster but that release was bad for a number of reasons,
> haven't tested 2.6.38.8 yet) but reduced by 2.6.37 as expected. Direct
> reclaim usage was reduced and efficiency (the ratio of pages reclaimed
> to pages scanned) was high.

And is that consistent across ext3/ext4/xfs/btrfs filesystems? I
doubt it very much, as all have very different .writepage
behaviours...

BTW, calling a workload "fsmark" tells us nothing about the workload
being tested - fsmark can do a lot of interesting things. IOWs, you
need to quote the command line for it to be meaningful to anyone...

> As I look through the results I have at the moment, the number of
> pages written back was simply really low which is why the problem fell
> off my radar.

It doesn't take many to completely screw up writeback IO patterns.
Write a few random pages to a 10MB file well before writeback would
get to the file, and instead of getting optimal sequential writeback
patterns when writeback gets to it, we get multiple disjoint IOs
that require multiple seeks to complete.

Slower, less efficient writeback IO causes memory pressure to last
longer and hence makes kswapd writeback more likely, and it's
just a downward spiral from there....
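
A crude way to picture the effect (hypothetical path, arbitrary offsets -
the point is only that the dirty pages end up disjoint):

$ xfs_io -f -c "pwrite 1m 4k" -c "pwrite 4m 4k" -c "pwrite 9m 4k" /mnt/scratch/testfile

A file in that state forces writeback to issue three separate IOs and
seek between them instead of doing one sequential pass.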

> > > That means the test is only using 1GB of disk space, and
> > > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > > triggering random page writeback from the LRU - 100x10MB files more
> > > than fills memory, hence it being the smallest test case I could
> > > reproduce the problem on.
> > > 
> 
> My tests were on a machine with 8G and ext3. I'm running some of
> the tests against ext4 and xfs to see if that makes a difference but
> it's possible the tests are simply not aggressive enough so I want to
> reproduce Dave's test if possible.

To tell the truth, I don't think anyone really cares how ext3
performs these days. XFS seems to be the filesystem that brings out
all the bad behaviour in the mm subsystem....

FWIW, the mm subsystem works well enough when there is RAM
available, so I'd suggest that your reclaim testing needs to focus
on smaller memory configurations to really stress the reclaim
algorithms. That's one of the reasons why I regularly test on 1GB, 1p
machines - they show problems that are hard to reproduce on larger
configs....
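
Getting such a config doesn't need dedicated hardware, either - a sketch
(qemu/KVM, image name made up):

$ qemu-system-x86_64 -enable-kvm -m 1024 -smp 1 \
        -drive file=scratch.img,cache=none,if=virtio

or just boot an existing test box with mem=1G on the kernel command line.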

> I'm assuming "test 180" is from xfstests which was not one of the tests
> I used previously. To run with 100 files instead of 1000, was the file
> "180" simply edited to make it look like this loop instead?

I reduced it to 100 files simply to speed up the testing process for
the "bad file size" problem I was trying to find. If you want to
reproduce the IO collapse in a big way, run it with 1000 files, and
it happens about 2/3rds of the way through the test on my hardware.

> > > It is very clear from the IO completions that we are getting a
> > > *lot* of kswapd driven writeback directly through .writepage:
> > > 
> > > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > > 801
> > > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > > 78
> > > 
> > > So there's ~900 IO completions that change the file size, and 90% of
> > > them are single page updates.
> > > 
> > > $ ps -ef |grep [k]swap
> > > root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> > > $ grep "writepage:" t.t | grep "514 " |wc -l
> > > 799
> > > 
> > > Oh, now that is too close to just be a co-incidence. We're getting
> > > significant amounts of random page writeback from the ends of
> > > the LRUs done by the VM.
> > > 
> > > <sigh>
> 
> Does the value for nr_vmscan_write in /proc/vmstat correlate? It must,
> but let me be sure, because I'm using that figure rather than ftrace to
> count writebacks at the moment.

The number in /proc/vmstat is higher. Much higher.  I just ran the
test at 1000 files (only collapsed to ~3000 iops this time because I
ran it on a plain 3.0-rc4 kernel that still has the .writepage
clustering in XFS), and I see:

nr_vmscan_write 6723

after the test. The event trace only captured ~1400 writepage events
from kswapd, but it tends to miss a lot of events as the system is
quite unresponsive at times under this workload - it's not uncommon
to have ssh sessions not echo a character for 10s... e.g. I started
the workload at ~11:08:22:

$ while [ 1 ]; do date; sleep 1; done
Sat Jul  2 11:08:15 EST 2011
Sat Jul  2 11:08:16 EST 2011
Sat Jul  2 11:08:17 EST 2011
Sat Jul  2 11:08:18 EST 2011
Sat Jul  2 11:08:19 EST 2011
Sat Jul  2 11:08:20 EST 2011
Sat Jul  2 11:08:21 EST 2011
Sat Jul  2 11:08:22 EST 2011         <<<<<<<< start test here
Sat Jul  2 11:08:23 EST 2011
Sat Jul  2 11:08:24 EST 2011
Sat Jul  2 11:08:25 EST 2011
Sat Jul  2 11:08:26 EST 2011         <<<<<<<<
Sat Jul  2 11:08:27 EST 2011         <<<<<<<<
Sat Jul  2 11:08:30 EST 2011         <<<<<<<<
Sat Jul  2 11:08:35 EST 2011         <<<<<<<<
Sat Jul  2 11:08:36 EST 2011
Sat Jul  2 11:08:37 EST 2011
Sat Jul  2 11:08:38 EST 2011         <<<<<<<<
Sat Jul  2 11:08:40 EST 2011         <<<<<<<<
Sat Jul  2 11:08:41 EST 2011
Sat Jul  2 11:08:42 EST 2011
Sat Jul  2 11:08:43 EST 2011

And there are quite a few more multi-second holdoffs during the
test, too.
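
For reference, roughly how such an event trace can be gathered (assuming
debugfs is mounted at /sys/kernel/debug; the xfs_setfilesize event comes
from my separate tracing patch, the rest are stock tracepoints):

$ echo 1 > /sys/kernel/debug/tracing/events/xfs/enable
$ echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_writepage/enable
$ cat /sys/kernel/debug/tracing/trace_pipe > t.t &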

> A more relevant question is this -
> how many pages were reclaimed by kswapd and what percentage is 799
> pages of that? What do you consider an acceptable percentage?

I don't care what the percentage is or what the number is. kswapd is
reclaiming pages most of the time without affecting IO patterns, and
when that happens I just don't care because it is working just fine.

What I care about is what kswapd is doing when it finds dirty pages
and it decides they need to be written back. It's not a problem that
they are found or need to be written, the problem is the utterly
crap way that memory reclaim is throwing the pages at the filesystem.

I'm not sure how to get through to you guys that single, random page
writeback is *BAD*. Using .writepage directly is considered harmful
to IO throughput, and memory reclaim needs to stop doing that.
We've got hacks in the filesystems to try to make the IO memory
reclaim executes suck less, but ultimately the problem is the IO that
memory reclaim is doing. And now the memory reclaim IO patterns are
getting in the way of further improving the writeback path in XFS
because we're finding the hacks we've been carrying for years are
*still* the only thing that is making IO under memory pressure not
suck completely.

What I find extremely frustrating is that this is not a new issue.
We (filesystem people) have been asking for a long time to have the
memory reclaim subsystem either defer IO to the writeback threads or
to use the .writepages interface. We're not asking this to be
difficult, we're asking for this so that we can cluster IO in an
optimal manner to avoid these IO collapses that memory reclaim
currently triggers.  We now have generic methods of handing off IO
to flusher threads that also provide some level of throttling/
blocking while IO is submitted (e.g.  writeback_inodes_sb_nr()), so
this shouldn't be a difficult problem to solve for the memory
reclaim subsystem.

Hell, maybe memory reclaim should take a leaf from the IO-less
throttle work we are doing - hit a bunch of dirty pages on the LRU,
just back off and let the writeback subsystem clean a few more pages
before starting another scan.  Letting the writeback code clean
pages is the fastest way to get pages cleaned in the system, so if
we've already got a generic method for cleaning and/or waiting for
pages to be cleaned, why not aim to use that?

And while I'm ranting, when on earth is the issue-writeback-from-
direct-reclaim problem going to be fixed so we can remove the hacks
in the filesystem .writepage implementations to prevent this from
occurring?

I mean, when we combine the two issues, doesn't it imply that the
memory reclaim subsystem needs to be redesigned around the fact it
*can't clean pages directly*?  This IO collapse issue shows that we
really don't want kswapd issuing IO directly via .writepage, and
we already reject IO from direct reclaim in .writepage in ext4, XFS
and BTRFS because we'll overrun the stack on anything other than
trivial storage configurations.

That says to me in a big, flashing bright pink neon sign way that
memory reclaim simply should not be issuing IO at all. Perhaps it's
time to rethink the way memory reclaim deals with dirty pages to
take into account the current reality?

</rant>

> > On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > > Looks good.  I still wonder why I haven't been able to hit this.
> > > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > > filesystems and since yesterday 1k as well.
> > > 
> > > It requires the test to run the VM out of RAM and then force enough
> > > memory pressure for kswapd to start writeback from the LRU. The
> > > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> > > 
> 
> You say it's a 1G VM but you don't say what architecture.

x86-64 for both the guest and the host.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
@ 2011-07-02  2:42             ` Dave Chinner
  0 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-02  2:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Johannes Weiner, Wu Fengguang, xfs, jack, linux-mm

On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> > Johannes, Mel, Wu,
> 
> Am adding Jan Kara as he has been working on writeback efficiency
> recently as well.

Writeback looks to be working fine - it's kswapd screwing up the
writeback patterns that appears to be the problem....

> > Dave has been stressing some XFS patches of mine that remove the XFS
> > internal writeback clustering in favour of using write_cache_pages.
> 
> Against what kernel? 2.6.38 was a disaster for reclaim I've been
> finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.

3.0-rc4

....
> The number of pages written from reclaim is exceptionally low (2.6.38
> was a total disaster but that release was bad for a number of reasons,
> haven't tested 2.6.38.8 yet) but reduced by 2.6.37 as expected. Direct
> reclaim usage was reduced and efficiency (the ratio of pages reclaimed
> to pages scanned) was high.

And is that consistent across ext3/ext4/xfs/btrfs filesystems? I
doubt it very much, as all have very different .writepage
behaviours...

BTW, calling a workload "fsmark" tells us nothing about the workload
being tested - fsmark can do a lot of interesting things. IOWs, you
need to quote the command line for it to be meaningful to anyone...

> As I look through the results I have at the moment, the number of
> pages written back was simply really low which is why the problem fell
> off my radar.

It doesn't take many to completely screw up writeback IO patterns.
Write a few random pages to a 10MB file well before writeback would
get to the file, and instead of getting optimal sequential writeback
patterns when writeback gets to it, we get multiple disjoint IOs
that require multiple seeks to complete.

Slower, less efficient writeback IO causes memory pressure to last
longer and hence makes kswapd writeback more likely, and it's
just a downward spiral from there....

> > > That means the test is only using 1GB of disk space, and
> > > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > > triggering random page writeback from the LRU - 100x10MB files more
> > > than fills memory, hence it being the smallest test case I could
> > > reproduce the problem on.
> > > 
> 
> My tests were on a machine with 8G and ext3. I'm running some of
> the tests against ext4 and xfs to see if that makes a difference but
> it's possible the tests are simply not aggressive enough so I want to
> reproduce Dave's test if possible.

To tell the truth, I don't think anyone really cares how ext3
performs these days. XFS seems to be the filesystem that brings out
all the bad behaviour in the mm subsystem....

FWIW, the mm subsystem works well enough when there is RAM
available, so I'd suggest that your reclaim testing needs to focus
on smaller memory configurations to really stress the reclaim
algorithms. That's one of the reasons why I regularly test on 1GB, 1p
machines - they show problems that are hard to reproduce on larger
configs....

> I'm assuming "test 180" is from xfstests which was not one of the tests
> I used previously. To run with 100 files instead of 1000, was the file
> "180" simply edited to make it look like this loop instead?

I reduced it to 100 files simply to speed up the testing process for
the "bad file size" problem I was trying to find. If you want to
reproduce the IO collapse in a big way, run it with 1000 files, and
it happens about 2/3rds of the way through the test on my hardware.

> > > It is very clear from the IO completions that we are getting a
> > > *lot* of kswapd driven writeback directly through .writepage:
> > > 
> > > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > > 801
> > > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > > 78
> > > 
> > > So there's ~900 IO completions that change the file size, and 90% of
> > > them are single page updates.
> > > 
> > > $ ps -ef |grep [k]swap
> > > root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> > > $ grep "writepage:" t.t | grep "514 " |wc -l
> > > 799
> > > 
> > > Oh, now that is too close to just be a co-incidence. We're getting
> > > significant amounts of random page writeback from the the ends of
> > > the LRUs done by the VM.
> > > 
> > > <sigh>
> 
> Does the value for nr_vmscan_write in /proc/vmstat correlate? It must
> but lets me sure because I'm using that figure rather than ftrace to
> count writebacks at the moment.

The number in /proc/vmstat is higher. Much higher.  I just ran the
test at 1000 files (only collapsed to ~3000 iops this time because I
ran it on a plain 3.0-rc4 kernel that still has the .writepage
clustering in XFS), and I see:

nr_vmscan_write 6723

after the test. The event trace only captured ~1400 writepage events
from kswapd, but it tends to miss a lot of events as the system is
quite unresponsive at times under this workload - it's not uncommon
to have ssh sessions not echo a character for 10s... e.g: I started
the workload ~11:08:22:

$ while [ 1 ]; do date; sleep 1; done
Sat Jul  2 11:08:15 EST 2011
Sat Jul  2 11:08:16 EST 2011
Sat Jul  2 11:08:17 EST 2011
Sat Jul  2 11:08:18 EST 2011
Sat Jul  2 11:08:19 EST 2011
Sat Jul  2 11:08:20 EST 2011
Sat Jul  2 11:08:21 EST 2011
Sat Jul  2 11:08:22 EST 2011         <<<<<<<< start test here
Sat Jul  2 11:08:23 EST 2011
Sat Jul  2 11:08:24 EST 2011
Sat Jul  2 11:08:25 EST 2011
Sat Jul  2 11:08:26 EST 2011         <<<<<<<<
Sat Jul  2 11:08:27 EST 2011         <<<<<<<<
Sat Jul  2 11:08:30 EST 2011         <<<<<<<<
Sat Jul  2 11:08:35 EST 2011         <<<<<<<<
Sat Jul  2 11:08:36 EST 2011
Sat Jul  2 11:08:37 EST 2011
Sat Jul  2 11:08:38 EST 2011         <<<<<<<<
Sat Jul  2 11:08:40 EST 2011         <<<<<<<<
Sat Jul  2 11:08:41 EST 2011
Sat Jul  2 11:08:42 EST 2011
Sat Jul  2 11:08:43 EST 2011

And there are quite a few more multi-second holdoffs during the
test, too.

> A more relevant question is this -
> how many pages were reclaimed by kswapd and what percentage is 799
> pages of that? What do you consider an acceptable percentage?

I don't care what the percentage is or what the number is. kswapd is
reclaiming pages most of the time without affecting IO patterns, and
when that happens I just don't care because it is working just fine.

What I care about is what kswapd is doing when it finds dirty pages
and it decides they need to be written back. It's not a problem that
they are found or need to be written, the problem is the utterly
crap way that memory reclaim is throwing the pages at the filesystem.

I'm not sure how to get through to you guys that single, random page
writeback is *BAD*. Using .writepage directly is considered harmful
to IO throughput, and memory reclaim needs to stop doing that.
We've got hacks in the filesystems to try to make the IO that memory
reclaim executes suck less, but ultimately the problem is the IO that
memory reclaim is doing. And now the memory reclaim IO patterns are
getting in the way of further improving the writeback path in XFS
because we're finding the hacks we've been carrying for years are
*still* the only thing that is making IO under memory pressure not
suck completely.

What I find extremely frustrating is that this is not a new issue.
We (filesystem people) have been asking for a long time to have the
memory reclaim subsystem either defer IO to the writeback threads or
to use the .writepages interface. We're not asking this to be
difficult, we're asking for this so that we can cluster IO in an
optimal manner to avoid these IO collapses that memory reclaim
currently triggers.  We now have generic methods of handing off IO
to flusher threads that also provide some level of throttling/
blocking while IO is submitted (e.g.  writeback_inodes_sb_nr()), so
this shouldn't be a difficult problem to solve for the memory
reclaim subsystem.

Hell, maybe memory reclaim should take a leaf from the IO-less
throttle work we are doing - hit a bunch of dirty pages on the LRU,
just back off and let the writeback subsystem clean a few more pages
before starting another scan.  Letting the writeback code clean
pages is the fastest way to get pages cleaned in the system, so if
we've already got a generic method for cleaning and/or waiting for
pages to be cleaned, why not aim to use that?
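
(To make that concrete: a rough, untested sketch of the "back off and
kick the flusher" idea, using interfaces that already exist in 3.0 -
wakeup_flusher_threads() and congestion_wait() are real, but where this
would be called from and how nr_dirty gets accounted is hand-waving on
my part:)

/*
 * Sketch only: instead of calling ->writepage on a random dirty page,
 * ask the flusher threads for a batch of oldest-first writeback and
 * throttle the scanner until some of it has completed.
 * (needs linux/writeback.h, linux/backing-dev.h, linux/swap.h)
 */
static void reclaim_defer_to_writeback(unsigned long nr_dirty)
{
	/* clean at least the dirty pages we tripped over, plus slack */
	wakeup_flusher_threads(nr_dirty + SWAP_CLUSTER_MAX);

	/* back off rather than issuing single-page IO ourselves */
	congestion_wait(BLK_RW_ASYNC, HZ/10);
}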

And while I'm ranting, when on earth is the issue-writeback-from-
direct-reclaim problem going to be fixed so we can remove the hacks
in the filesystem .writepage implementations to prevent this from
occurring?

I mean, when we combine the two issues, doesn't it imply that the
memory reclaim subsystem needs to be redesigned around the fact it
*can't clean pages directly*?  This IO collapse issue shows that we
really don't want kswapd issuing IO directly via .writepage, and
we already reject IO from direct reclaim in .writepage in ext4, XFS
and BTRFS because we'll overrun the stack on anything other than
trivial storage configurations.

That says to me in a big, flashing bright pink neon sign way that
memory reclaim simply should not be issuing IO at all. Perhaps it's
time to rethink the way memory reclaim deals with dirty pages to
take into account the current reality?

</rant>

> > On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > > Looks good.  I still wonder why I haven't been able to hit this.
> > > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > > filesystems and since yesterday 1k as well.
> > > 
> > > It requires the test to run the VM out of RAM and then force enough
> > > memory pressure for kswapd to start writeback from the LRU. The
> > > reproducer I have is a 1p, 1GB RAM VM with it's disk image on a
> > > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> > > 
> 
> You say it's a 1G VM but you don't say what architecure.

x86-64 for both the guest and the host.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-01 15:41           ` Wu Fengguang
@ 2011-07-04  3:25             ` Dave Chinner
  -1 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-04  3:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, linux-mm, xfs, Mel Gorman, Johannes Weiner

On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> Christoph,
> 
> On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > Johannes, Mel, Wu,
> > 
> > Dave has been stressing some XFS patches of mine that remove the XFS
> > internal writeback clustering in favour of using write_cache_pages.
> > 
> > As part of investigating the behaviour he found out that we're still
> > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> > pretty bad behaviour in general, but it also means we really can't
> > just remove the writeback clustering in writepage given how much
> > I/O is still done through that.
> > 
> > Any chance we could the writeback vs kswap behaviour sorted out a bit
> > better finally?
> 
> I once tried this approach:
> 
> http://www.spinics.net/lists/linux-mm/msg09202.html
> 
> It used a list structure that is not linearly scalable, however that
> part should be independently improvable when necessary.

I don't think that handing random writeback to the flusher thread is
much better than doing random writeback directly.  Yes, you added
some clustering, but I still don't think writing specific pages is
the best solution.

> The real problem was, it seem to not very effective in my test runs.
> I found many ->nr_pages works queued before the ->inode works, which
> effectively makes the flusher working on more dispersed pages rather
> than focusing on the dirty pages encountered in LRU reclaim.

But that's really just an implementation issue related to how you
tried to solve the problem. That could be addressed.

However, what I'm questioning is whether we should even care what
page memory reclaim wants to write - it seems to make fundamentally
bad decisions from an IO perspective.

We have to remember that memory reclaim is doing LRU reclaim and the
flusher threads are doing "oldest first" writeback. IOWs, both are trying
to operate in the same direction (oldest to youngest) for the same
purpose.  The fundamental problem that occurs when memory reclaim
starts writing pages back from the LRU is this:

	- memory reclaim has run ahead of IO writeback -

The LRU usually looks like this:

	oldest					youngest
	+---------------+---------------+--------------+
	clean		writeback	dirty
			^		^
			|		|
			|		Where flusher will next work from
			|		Where kswapd is working from
			|
			IO submitted by flusher, waiting on completion


If memory reclaim is hitting dirty pages on the LRU, it means it has
got ahead of writeback without being throttled - it's passed over
all the pages currently under writeback and is trying to write back
pages that are *newer* than what writeback is working on. IOWs, it
starts trying to do the job of the flusher threads, and it does that
very badly.

The $100 question is *why is it getting ahead of writeback*?

From a brief look at the vmscan code, it appears that scanning does
not throttle/block until reclaim priority has got pretty high. That
means at low priority reclaim, it *skips pages under writeback*.
However, if it comes across a dirty page, it will trigger writeback
of the page.

Now call me crazy, but if we've already got a large number of pages
under writeback, why would we want to *start more IO* when clearly
the system is taking care of cleaning pages already and all we have
to do is wait for a short while to get clean pages ready for
reclaim?

Indeed, I added this quick hack to prevent the VM from doing
writeback via pageout until after it starts blocking on writeback
pages:

@@ -825,6 +825,8 @@ static unsigned long shrink_page_list(struct list_head *page_l
 		if (PageDirty(page)) {
 			nr_dirty++;
 
+			if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
+				goto keep_locked;
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)

IOWs, we don't write pages from kswapd unless there is no IO
writeback going on at all (waited on all the writeback pages or none
exist) and there are dirty pages on the LRU.
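
(For context, paraphrasing the 3.0-era shrink_page_list() from memory -
so treat the surrounding lines as approximate - this is where those two
lines sit, and the pageout() call below them is exactly what gets
skipped:)

		if (PageDirty(page)) {
			nr_dirty++;

			/* the hack: no pageout() from the async LRU scan */
			if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
				goto keep_locked;

			if (references == PAGEREF_RECLAIM_CLEAN)
				goto keep_locked;
			if (!may_enter_fs)
				goto keep_locked;
			if (!sc->may_writepage)
				goto keep_locked;

			/* Page is dirty, try to write it out here */
			switch (pageout(page, mapping, sc)) {
			...
			}
		}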

This doesn't completely stop the IO collapse (it looks like foreground
throttling is the other cause, which IO-less write throttling fixes),
but the collapse was significantly reduced in duration and intensity
by removing kswapd writeback. In fact, the IO rate only dropped to
~60MB/s instead of 30MB/s, and the improvement is easily measured by
the runtime of the test:

			run 1	run 2	run 3
3.0-rc5-vanilla		135s	137s	138s
3.0-rc5-patched		117s	115s	115s

That's a pretty massive improvement for a 2-line patch. ;) I expect
the IO-less write throttling patchset will further improve this.

FWIW, the nr_vmscan_write values changed like this:

			run 1	run 2	run 3
3.0-rc5-vanilla		6751	6893	6465
3.0-rc5-patched		0	0	0

These results support my argument that memory reclaim should not be
doing dirty page writeback at all - deferring writeback to the
writeback infrastructure and just waiting for it to complete
appropriately is the Right Thing To Do. i.e. IO-less memory reclaim
works better than the current code for the same reason IO-less write
throttling works better than the current code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-02  2:42             ` Dave Chinner
@ 2011-07-05 14:10               ` Mel Gorman
  -1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-07-05 14:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: jack, xfs, Christoph Hellwig, linux-mm, Wu Fengguang, Johannes Weiner

On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> > On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > 
> > Am adding Jan Kara as he has been working on writeback efficiency
> > recently as well.
> 
> Writeback looks to be working fine - it's kswapd screwing up the
> writeback patterns that appears to be the problem....
> 

Not a new complaint.

> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > 
> > Against what kernel? 2.6.38 was a disaster for reclaim I've been
> > finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.
> 
> 3.0-rc4
> 

Ok.

> ....
> > The number of pages written from reclaim is exceptionally low (2.6.38
> > was a total disaster but that release was bad for a number of reasons,
> > haven't tested 2.6.38.8 yet) but reduced by 2.6.37 as expected. Direct
> > reclaim usage was reduced and efficiency (ratio of pages scanned to
> > pages reclaimed) was high.
> 
> And is that consistent across ext3/ext4/xfs/btrfs filesystems? I
> doubt it very much, as all have very different .writepage
> behaviours...
> 

Some preliminary results are in and it looks like it is close to the
same across filesystems, which was a surprise to me. Sometimes the
filesystem makes a difference to how many pages are written back but
it's not consistent across all tests, i.e. in comparing ext3, ext4 and
xfs, there are big differences in performance but moderate differences
in pages written back. This implies that, for the configurations I was
testing, pages are generally cleaned before reaching the end of the
LRU.

In all cases, the machines had ample memory. More on that later.

> BTW, called a workload "fsmark" tells us nothing about the workload
> being tested - fsmark can do a lot of interesting things. IOWs, you
> need to quote the command line for it to be meaningful to anyone...
> 

My bad.

./fs_mark -d /tmp/fsmark-14880 -D 225  -N  22500  -n  3125  -L  15 -t  16  -S0  -s  131072

> > As I look through the results I have at the moment, the number of
> > pages written back was simply really low which is why the problem fell
> > off my radar.
> 
> It doesn't take many to completely screw up writeback IO patterns.
> Write a few random pages to a 10MB file well before writeback would
> get to the file, and instead of getting optimal sequential writeback
> patterns when writeback gets to it, we get multiple disjoint IOs
> that require multiple seeks to complete.
> 
> Slower, less efficient writeback IO causes memory pressure to last
> longer and hence more likely to result in kswapd writeback, and it's
> just a downward spiral from there....
> 

Yes, I see the negative feedback loop. This has always been a struggle
in that kswapd needs pages from a particular zone to be cleaned and
freed but calling writepage can make things slower. There were
prototypes in the past to give hints to the flusher threads on what
inodes and pages should be freed and they were never met with any
degree of satisfaction.
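
(Purely to illustrate the idea - this is not what any of those
prototypes actually looked like - a "hint" amounts to handing the
mapping and offset to writeback so it can clean a whole range with
->writepages instead of reclaim writing the single page it found:)

/*
 * Hypothetical sketch, not a real prototype: write back a cluster of
 * pages around the one reclaim wanted cleaned.  A real implementation
 * would queue this as work for the per-bdi flusher thread rather than
 * run it in reclaim context.
 */
static void writeback_hint(struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= SWAP_CLUSTER_MAX,
		.range_start	= (loff_t)page->index << PAGE_CACHE_SHIFT,
		.range_end	= LLONG_MAX,
	};

	do_writepages(mapping, &wbc);
}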

The consensus (among VM people at least) was that as long as that
number was low, it wasn't much of a problem. I know you disagree.

> > > > That means the test is only using 1GB of disk space, and
> > > > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > > > triggering random page writeback from the LRU - 100x10MB files more
> > > > than fills memory, hence it being the smallest test case i could
> > > > reproduce the problem on.
> > > > 
> > 
> > My tests were on a machine with 8G and ext3. I'm running some of
> > the tests against ext4 and xfs to see if that makes a difference but
> > it's possible the tests are simply not agressive enough so I want to
> > reproduce Dave's test if possible.
> 
> To tell the truth, I don't think anyone really cares how ext3
> performs these days.

I do but the reasoning is weak. I wanted to be able to compare kernels
between 2.6.32 and today with few points of variability. ext3 changed
relatively little between those times.

> XFS seems to be the filesystem that brings out
> all the bad behaviour in the mm subsystem....
> 
> FWIW, the mm subsystem works well enough when there is RAM
> available, so I'd suggest that your reclaim testing needs to focus
> on smaller memory configurations to really stress the reclaim
> algorithms. That's one of the reason why I regularly test on 1GB, 1p
> machines - they show problems that are hard to reproduce on larger
> configs....
> 

Based on the results coming in, I fully agree. I'm going to let the
tests run to completion so I'll have the data in the future. I'll then
go back and test for 1G, 1P configurations and it should be
reproducible.

> > I'm assuming "test 180" is from xfstests which was not one of the tests
> > I used previously. To run with 1000 files instead of 100, was the file
> > "180" simply editted to make it look like this loop instead?
> 
> I reduced it to 100 files simply to speed up the testing process for
> the "bad file size" problem I was trying to find. If you want to
> reproduce the IO collapse in a big way, run it with 1000 files, and
> it happens about 2/3rds of the way through the test on my hardware.
> 

Ok, I have a test prepared that will run this. At the rate tests are
currently going, though, it could be Thursday before I can run them :(

> > > > It is very clear that from the IO completions that we are getting a
> > > > *lot* of kswapd driven writeback directly through .writepage:
> > > > 
> > > > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > > > 801
> > > > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > > > 78
> > > > 
> > > > So there's ~900 IO completions that change the file size, and 90% of
> > > > them are single page updates.
> > > > 
> > > > $ ps -ef |grep [k]swap
> > > > root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> > > > $ grep "writepage:" t.t | grep "514 " |wc -l
> > > > 799
> > > > 
> > > > Oh, now that is too close to just be a co-incidence. We're getting
> > > > significant amounts of random page writeback from the the ends of
> > > > the LRUs done by the VM.
> > > > 
> > > > <sigh>
> > 
> > Does the value for nr_vmscan_write in /proc/vmstat correlate? It must
> > but lets me sure because I'm using that figure rather than ftrace to
> > count writebacks at the moment.
> 
> The number in /proc/vmstat is higher. Much higher.  I just ran the
> test at 1000 files (only collapsed to ~3000 iops this time because I
> ran it on a plain 3.0-rc4 kernel that still has the .writepage
> clustering in XFS), and I see:
> 
> nr_vmscan_write 6723
> 
> after the test. The event trace only capture ~1400 writepage events
> from kswapd, but it tends to miss a lot of events as the system is
> quite unresponsive at times under this workload - it's not uncommon
> to have ssh sessions not echo a character for 10s... e.g: I started
> the workload ~11:08:22:
> 

Ok, I'll be looking at nr_vmscan_write as the basis for "badness".

> $ while [ 1 ]; do date; sleep 1; done
> Sat Jul  2 11:08:15 EST 2011
> Sat Jul  2 11:08:16 EST 2011
> Sat Jul  2 11:08:17 EST 2011
> Sat Jul  2 11:08:18 EST 2011
> Sat Jul  2 11:08:19 EST 2011
> Sat Jul  2 11:08:20 EST 2011
> Sat Jul  2 11:08:21 EST 2011
> Sat Jul  2 11:08:22 EST 2011         <<<<<<<< start test here
> Sat Jul  2 11:08:23 EST 2011
> Sat Jul  2 11:08:24 EST 2011
> Sat Jul  2 11:08:25 EST 2011
> Sat Jul  2 11:08:26 EST 2011         <<<<<<<<
> Sat Jul  2 11:08:27 EST 2011         <<<<<<<<
> Sat Jul  2 11:08:30 EST 2011         <<<<<<<<
> Sat Jul  2 11:08:35 EST 2011         <<<<<<<<
> Sat Jul  2 11:08:36 EST 2011
> Sat Jul  2 11:08:37 EST 2011
> Sat Jul  2 11:08:38 EST 2011         <<<<<<<<
> Sat Jul  2 11:08:40 EST 2011         <<<<<<<<
> Sat Jul  2 11:08:41 EST 2011
> Sat Jul  2 11:08:42 EST 2011
> Sat Jul  2 11:08:43 EST 2011
> 
> And there are quite a few more multi-second holdoffs during the
> test, too.
> 
> > A more relevant question is this -
> > how many pages were reclaimed by kswapd and what percentage is 799
> > pages of that? What do you consider an acceptable percentage?
> 
> I don't care what the percentage is or what the number is. kswapd is
> reclaiming pages most of the time without affect IO patterns, and
> when that happens I just don't care because it is working just fine.
> 

I do care. I'm looking at some early XFS results here based on a laptop
(4G). For fsmark with the command line above, the number of pages
written back by kswapd was 0. The worst test by far was sysbench using a
particularly large database. The number of writes was 48745 which is
0.27% of pages scanned or 0.28% of pages reclaimed. Ordinarily I would
ignore that.

If I run this at 1G and get a similar ratio, I will assume that I
am not reproducing your problem at all unless I know what ratio you
are seeing.

So .... How many pages were reclaimed by kswapd and what percentage
is 799 pages of that?

You answered my second question. You consider 0% to be the acceptable
percentage.

> What I care about is what kswapd is doing when it finds dirty pages
> and it decides they need to be written back. It's not a problem that
> they are found or need to be written, the problem is the utterly
> crap way that memory reclaim is throwing the pages at the filesystem.
> 
> I'm not sure how to get through to you guys that single, random page
> writeback is *BAD*.

It got through. The feedback during discussions on the VM side was
that as long as the percentage was sufficiently low it wasn't a problem
because on occasion, the VM really needs pages from a particular zone.
A solution that addressed both problems has never been agreed on, and
energy and time run out before it gets fixed each time.

> Using .writepage directly is considered harmful
> to IO throughput, and memory reclaim needs to stop doing that.
> We've got hacks in the filesystems to try to make the IO memory
> reclaim executes suck less, but ultimately the problem is the IO
> memory reclaim is doing. And now the memory reclaim IO patterns are
> getting in the way of further improving the writeback path in XFS
> because were finding the hacks we've been carrying for years are
> *still* the only thing that is making IO under memory pressure not
> suck completely.
> 
> What I find extremely frustrating is that this is not a new issue.

I know.

> We (filesystem people) have been asking for a long time to have the
> memory reclaim subsystem either defer IO to the writeback threads or
> to use the .writepages interface.

There were prototypes along these lines. One of the criticisms was
that it was fixing the wrong problem because dirty pages shouldn't be
at the end of the LRU at all. Later work focused on fixing that and
it was never revisited (at least not by me).

There was a bucket of complaints about the initial series at
https://lkml.org/lkml/2010/6/8/82 . Despite the fact I wrote it,
I will have to read back to see why I stopped working on it but I
think it's because I focused on avoiding dirty pages reaching the
end of the LRU judging by https://lkml.org/lkml/2010/6/11/157 and
eventually was satisfied that the ratio of pages scanned to pages
written was acceptable.

> We're not asking this to be
> difficult, we're asking for this so that we can cluster IO in an
> optimal manner to avoid these IO collapses that memory reclaim
> currently triggers.  We now have generic methods of handing off IO
> to flusher threads that also provide some level of throttling/
> blocking while IO is submitted (e.g.  writeback_inodes_sb_nr()), so
> this shouldn't be a difficult problem to solve for the memory
> reclaim subsystem.
> 
> Hell, maybe memory reclaim should take a leaf from the IO-less
> throttle work we are doing - hit a bunch of dirty pages on the LRU,
> just back off and let the writeback subsystem clean a few more pages
> before starting another scan. 

Prototyped this before although I can't find it now. I think I
concluded at the time that it didn't really help and another direction
was taken. There was also the problem that the time to clean a page
from a particular zone was potentially unbounded and a solution didn't
present itself.

> Letting the writeback code clean
> pages is the fastest way to get pages cleaned in the system, so if
> we've already got a generic method for cleaning and/or waiting for
> pages to be cleaned, why not aim to use that?
> 
> And while I'm ranting, when on earth is the issue-writeback-from-
> direct-reclaim problem going to be fixed so we can remove the hacks
> in the filesystem .writepage implementations to prevent this from
> occurring?
> 

Prototyped that too, same thread. Same type of problem, writeback
from direct reclaim should happen so rarely that it should not be
optimised for. See https://lkml.org/lkml/2010/6/11/32

> I mean, when we combine the two issues, doesn't it imply that the
> memory reclaim subsystem needs to be redesigned around the fact it
> *can't clean pages directly*?  This IO collapse issue shows that we
> really don't 't want kswapd issuing IO directly via .writepage, and
> we already reject IO from direct reclaim in .writepage in ext4, XFS
> and BTRFS because we'll overrun the stack on anything other than
> trivial storage configurations.
> 
> That says to me in a big, flashing bright pink neon sign way that
> memory reclaim simply should not be issuing IO at all. Perhaps it's
> time to rethink the way memory reclaim deals with dirty pages to
> take into account the current reality?
> 
> </rant>
> 

At the risk of pissing you off, this isn't new information, so I'll
consider myself duly nudged into revisiting it.

> > > On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > > > Looks good.  I still wonder why I haven't been able to hit this.
> > > > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > > > filesystems and since yesterday 1k as well.
> > > > 
> > > > It requires the test to run the VM out of RAM and then force enough
> > > > memory pressure for kswapd to start writeback from the LRU. The
> > > > reproducer I have is a 1p, 1GB RAM VM with it's disk image on a
> > > > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> > > > 
> > 
> > You say it's a 1G VM but you don't say what architecure.
> 
> x86-64 for both the guest and the host.
> 

Grand.

-- 
Mel Gorman
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-04  3:25             ` Dave Chinner
@ 2011-07-05 14:34               ` Mel Gorman
  -1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-07-05 14:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-mm, Wu Fengguang, Johannes Weiner, xfs

On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> > 
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > > 
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > > 
> > > As part of investigating the behaviour he found out that we're still
> > > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> > > pretty bad behaviour in general, but it also means we really can't
> > > just remove the writeback clustering in writepage given how much
> > > I/O is still done through that.
> > > 
> > > Any chance we could the writeback vs kswap behaviour sorted out a bit
> > > better finally?
> > 
> > I once tried this approach:
> > 
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> > 
> > It used a list structure that is not linearly scalable, however that
> > part should be independently improvable when necessary.
> 
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly.  Yes, you added
> some clustering, but I'm still don't think writing specific pages is
> the best solution.
> 
> > The real problem was, it seem to not very effective in my test runs.
> > I found many ->nr_pages works queued before the ->inode works, which
> > effectively makes the flusher working on more dispersed pages rather
> > than focusing on the dirty pages encountered in LRU reclaim.
> 
> But that's really just an implementation issue related to how you
> tried to solve the problem. That could be addressed.
> 
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO persepctive.
> 

It sucks from an IO perspective but from the perspective of the VM that
needs memory to be free in a particular zone or node, it's a reasonable
request.

> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback. IOWs, both are trying
> to operate in the same direction (oldest to youngest) for the same
> purpose.  The fundamental problem that occurs when memory reclaim
> starts writing pages back from the LRU is this:
> 
> 	- memory reclaim has run ahead of IO writeback -
> 

This reasoning was the basis for this patch
http://www.gossamer-threads.com/lists/linux/kernel/1251235?do=post_view_threaded#1251235

i.e. if old pages are dirty then the flusher threads are either not
awake or not doing enough work so wake them. It was flawed in a number
of respects and never finished though.

> The LRU usually looks like this:
> 
> 	oldest					youngest
> 	+---------------+---------------+--------------+
> 	clean		writeback	dirty
> 			^		^
> 			|		|
> 			|		Where flusher will next work from
> 			|		Where kswapd is working from
> 			|
> 			IO submitted by flusher, waiting on completion
> 
> 
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on. IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
> 
> The $100 question is *why is it getting ahead of writeback*?
> 

Allocating and dirtying memory faster than writeback. Large dd to USB
stick would also trigger it.

> From a brief look at the vmscan code, it appears that scanning does
> not throttle/block until reclaim priority has got pretty high. That
> means at low priority reclaim, it *skips pages under writeback*.
> However, if it comes across a dirty page, it will trigger writeback
> of the page.
> 
> Now call me crazy, but if we've already got a large number of pages
> under writeback, why would we want to *start more IO* when clearly
> the system is taking care of cleaning pages already and all we have
> to do is wait for a short while to get clean pages ready for
> reclaim?
> 

It doesn't check how many pages are under writeback. Direct reclaim
will check if the block device is congested but that is about
it. Otherwise the expectation was that the elevator would handle the
merging of requests into a sensible pattern. Also, while filesystem
pages are getting cleaned by the flusher threads, that does not cover
anonymous pages being written to swap.
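
(For reference, the congestion check in question is may_write_to_queue()
in the pageout() path; from memory it is roughly the following, so treat
the details as approximate:)

static int may_write_to_queue(struct backing_dev_info *bdi,
			      struct scan_control *sc)
{
	/* kswapd runs with PF_SWAPWRITE set, so it skips the check */
	if (current->flags & PF_SWAPWRITE)
		return 1;
	if (!bdi_write_congested(bdi))
		return 1;
	if (bdi == current->backing_dev_info)
		return 1;
	/* (there is also a lumpy reclaim exception for high-order
	 *  allocations which I'm omitting here) */
	return 0;
}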

> Indeed, I added this quick hack to prevent the VM from doing
> writeback via pageout until after it starts blocking on writeback
> pages:
> 
> @@ -825,6 +825,8 @@ static unsigned long shrink_page_list(struct list_head *page_l
>  		if (PageDirty(page)) {
>  			nr_dirty++;
>  
> +			if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
> +				goto keep_locked;
>  			if (references == PAGEREF_RECLAIM_CLEAN)
>  				goto keep_locked;
>  			if (!may_enter_fs)
> 
> IOWs, we don't write pages from kswapd unless there is no IO
> writeback going on at all (waited on all the writeback pages or none
> exist) and there are dirty pages on the LRU.
> 

A side effect of this patch is that kswapd is no longer writing
anonymous pages to swap and possibly never will. RECLAIM_MODE_SYNC is
only set for lumpy reclaim which, if you have CONFIG_COMPACTION set,
will never happen.
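
(FWIW, if the anonymous-page side is the objection, the check could
presumably be narrowed to file-backed pages - an untested variation on
your hack, not something you posted:)

 		if (PageDirty(page)) {
 			nr_dirty++;
 
-			if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
+			/* only defer file-backed pages; anon still goes
+			 * through pageout() to swap as before */
+			if (page_is_file_cache(page) &&
+			    !(sc->reclaim_mode & RECLAIM_MODE_SYNC))
 				goto keep_locked;
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;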

I see your figures and know why you want this but it never was that
straight-forward :/

-- 
Mel Gorman
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-05 14:10               ` Mel Gorman
@ 2011-07-05 15:55                 ` Dave Chinner
  -1 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-05 15:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: jack, xfs, Christoph Hellwig, linux-mm, Wu Fengguang, Johannes Weiner

On Tue, Jul 05, 2011 at 03:10:16PM +0100, Mel Gorman wrote:
> On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> > BTW, calling a workload "fsmark" tells us nothing about the workload
> > being tested - fsmark can do a lot of interesting things. IOWs, you
> > need to quote the command line for it to be meaningful to anyone...
> > 
> 
> My bad.
> 
> ./fs_mark -d /tmp/fsmark-14880 -D 225  -N  22500  -n  3125  -L  15 -t  16  -S0  -s  131072

Ok, so 16 threads, 3125 files per thread, 128k per file, all created
into the same directory, which rolls over when it gets to 22500
files in the directory. Yeah, it generates a bit of memory pressure,
but I think the file sizes are too small to really stress writeback
much. You need to use files that are at least 10MB in size to really
start to mix up the writeback lists and the way they juggle new and
old inodes to try not to starve any particular inode of writeback
bandwidth....
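
(If I'm reading those options right, each loop dirties 16 x 3125 x 128k
= ~6.1GiB of file data, ~92GiB over the 15 loops, but any one inode only
ever has 32 dirty pages outstanding, so writeback retires whole files in
a single small IO and never has to juggle competing large inodes.)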

Also, I don't use the "-t <num>" threading mechanism because all it
does is bash on the directory mutex without really improving
parallelism for creates. perf top on my system shows:

           samples  pcnt function                           DSO
             _______ _____ __________________________________ __________________________________

             2799.00  9.3% mutex_spin_on_owner                [kernel.kallsyms]
             2049.00  6.8% copy_user_generic_string           [kernel.kallsyms]
             1912.00  6.3% _raw_spin_unlock_irqrestore        [kernel.kallsyms]

A contended mutex as the prime CPU consumer. That's more CPU than
copying 750MB/s of data.

Hence I normally drive parallelism with fsmark by using multiple "-d
<dir>" options, which runs a thread per directory and a workload
unit per directory and so you don't get directory mutex contention
causing serialisation and interference with what you are really
trying to measure....
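
Something along these lines - mount points and per-directory file counts
are purely illustrative here, not the exact command I run:

./fs_mark  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  \
	   -d  /mnt/scratch/3  -D  225  -N  22500  -n  12500  -L  15  \
	   -S0  -s  131072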

> > > As I look through the results I have at the moment, the number of
> > > pages written back was simply really low which is why the problem fell
> > > off my radar.
> > 
> > It doesn't take many to completely screw up writeback IO patterns.
> > Write a few random pages to a 10MB file well before writeback would
> > get to the file, and instead of getting optimal sequential writeback
> > patterns when writeback gets to it, we get multiple disjoint IOs
> > that require multiple seeks to complete.
> > 
> > Slower, less efficient writeback IO causes memory pressure to last
> > longer and hence more likely to result in kswapd writeback, and it's
> > just a downward spiral from there....
> > 
> 
> Yes, I see the negative feedback loop. This has always been a struggle
> in that kswapd needs pages from a particular zone to be cleaned and
> freed but calling writepage can make things slower. There were
> prototypes in the past to give hints to the flusher threads on what
> inodes and pages should be freed, and they were never met with any degree of
> satisfaction.
> 
> The consensus (among VM people at least) was as long as that number was
> low, it wasn't much of a problem.

Therein lies the problem. You've got storage people telling you
there is an IO problem with memory reclaim, but the mm community
then put their heads together somewhere private, decide it isn't
a problem worth fixing and do nothing. Rinse, lather, repeat.

I expect memory reclaim to play nicely with writeback that is
already in progress. These subsystems do not work in isolation, yet
memory reclaim treats it that way - as though it is the most
important IO submitter and everything else can suffer while memory
reclaim does its stuff.  Memory reclaim needs to co-ordinate with
writeback effectively for the system as a whole to work well
together.

> I know you disagree.

Right, that's because it doesn't have to be a very high number to be
a problem. IO is orders of magnitude slower than the CPU time it
takes to flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost always a
bad flush decision.
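
(Back-of-the-envelope, with round numbers rather than measurements: a
disk streaming ~100MB/s retires ~25,000 sequential 4k pages a second,
but at ~8ms per seek it completes only ~125 random single-page writes a
second - so each page reclaim submits out of order can cost the
equivalent of a couple of hundred sequentially written pages.)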

> > > > > Oh, now that is too close to just be a co-incidence. We're getting
> > > > > significant amounts of random page writeback from the the ends of
> > > > > the LRUs done by the VM.
> > > > > 
> > > > > <sigh>
> > > 
> > > Does the value for nr_vmscan_write in /proc/vmstat correlate? It must,
> > > but let me be sure because I'm using that figure rather than ftrace to
> > > count writebacks at the moment.
> > 
> > The number in /proc/vmstat is higher. Much higher.  I just ran the
> > test at 1000 files (only collapsed to ~3000 iops this time because I
> > ran it on a plain 3.0-rc4 kernel that still has the .writepage
> > clustering in XFS), and I see:
> > 
> > nr_vmscan_write 6723
> > 
> > after the test. The event trace only captured ~1400 writepage events
> > from kswapd, but it tends to miss a lot of events as the system is
> > quite unresponsive at times under this workload - it's not uncommon
> > to have ssh sessions not echo a character for 10s... e.g: I started
> > the workload ~11:08:22:
> > 
> 
> Ok, I'll be looking at nr_vmscan_write as the basis for "badness".

Perhaps you should look at my other reply (and two line "fix") in
the thread about stopping dirty page writeback until after waiting
on pages under writeback.....

> > > A more relevant question is this -
> > > how many pages were reclaimed by kswapd and what percentage is 799
> > > pages of that? What do you consider an acceptable percentage?
> > 
> > I don't care what the percentage is or what the number is. kswapd is
> > reclaiming pages most of the time without affecting IO patterns, and
> > when that happens I just don't care because it is working just fine.
> > 
> 
> I do care. I'm looking at some early XFS results here based on a laptop
> (4G). For fsmark with the command line above, the number of pages
> written back by kswapd was 0. The worst test by far was sysbench using a
> particularly large database. The number of writes was 48745 which is
> 0.27% of pages scanned or 0.28% of pages reclaimed. Ordinarily I would
> ignore that.
> 
> If I run this at 1G and get a similar ratio, I will assume that I
> am not reproducing your problem at all unless I know what ratio you
> are seeing.

Single threaded writing of files should -never- cause writeback from
the LRUs. If that is happening, then the memory reclaim throttling
is broken. See my other email.

> So .... How many pages were reclaimed by kswapd and what percentage
> is 799 pages of that?

No idea. That information is long gone....

> You answered my second question. You consider 0% to be the acceptable
> percentage.

No, I expect memory reclaim to behave nicely with writeback that is
already in progress. These subsystems do not work in isolation - they
need to co-ordinate.

> > What I care about is what kswapd is doing when it finds dirty pages
> > and it decides they need to be written back. It's not a problem that
> > they are found or need to be written, the problem is the utterly
> > crap way that memory reclaim is throwing the pages at the filesystem.
> > 
> > I'm not sure how to get through to you guys that single, random page
> > writeback is *BAD*.
> 
> It got through. The feedback during discussions on the VM side was
> that as long as the percentage was sufficiently low it wasn't a problem
> because on occasion, the VM really needs pages from a particular zone.
> A solution that addressed both problems has never been agreed on and
> energy and time run out before it gets fixed each time.

<sigh>

> > And while I'm ranting, when on earth is the issue-writeback-from-
> > direct-reclaim problem going to be fixed so we can remove the hacks
> > in the filesystem .writepage implementations to prevent this from
> > occurring?
> > 
> 
> Prototyped that too, same thread. Same type of problem, writeback
> from direct reclaim should happen so rarely that it should not be
> optimised for. See https://lkml.org/lkml/2010/6/11/32

Writeback from direct reclaim crashes systems by causing stack
overruns - that's why we've disabled it. It's not an "optimisation"
problem - it's a _memory corruption_ bug that needs to be fixed.....

> At the risk of pissing you off, this isn't new information so I'll
> consider myself duly nudged into revisiting.

No, I've had a rant to express my displeasure at the lack of
progress on this front.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-05 14:34               ` Mel Gorman
@ 2011-07-06  1:23                 ` Dave Chinner
  -1 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-06  1:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-mm, Wu Fengguang, Johannes Weiner, xfs

On Tue, Jul 05, 2011 at 03:34:10PM +0100, Mel Gorman wrote:
> On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > > Christoph,
> > > 
> > > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > > Johannes, Mel, Wu,
> > > > 
> > > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > > internal writeback clustering in favour of using write_cache_pages.
> > > > 
> > > > As part of investigating the behaviour he found out that we're still
> > > > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> > > > pretty bad behaviour in general, but it also means we really can't
> > > > just remove the writeback clustering in writepage given how much
> > > > I/O is still done through that.
> > > > 
> > > > Any chance we could get the writeback vs kswapd behaviour sorted out a bit
> > > > better finally?
> > > 
> > > I once tried this approach:
> > > 
> > > http://www.spinics.net/lists/linux-mm/msg09202.html
> > > 
> > > It used a list structure that is not linearly scalable, however that
> > > part should be independently improvable when necessary.
> > 
> > I don't think that handing random writeback to the flusher thread is
> > much better than doing random writeback directly.  Yes, you added
> > some clustering, but I still don't think writing specific pages is
> > the best solution.
> > 
> > > The real problem was, it seemed to not be very effective in my test runs.
> > > I found many ->nr_pages works queued before the ->inode works, which
> > > effectively makes the flusher work on more dispersed pages rather
> > > than focusing on the dirty pages encountered in LRU reclaim.
> > 
> > But that's really just an implementation issue related to how you
> > tried to solve the problem. That could be addressed.
> > 
> > However, what I'm questioning is whether we should even care what
> > page memory reclaim wants to write - it seems to make fundamentally
> > bad decisions from an IO persepctive.
> > 
> 
> It sucks from an IO perspective but from the perspective of the VM that
> needs memory to be free in a particular zone or node, it's a reasonable
> request.

Sure, I'm not suggesting there is anything wrong with the requirement of
being able to clean pages in a particular zone. My comments are
aimed at the fact the implementation of this feature is about as
friendly to the IO subsystem as a game of Roshambeau....

If someone comes to us complaining about an application that causes
this sort of IO behaviour, our answer is always "fix the
application" because it is not something we can fix in the
filesystem. Same here - we need to have the "application" fixed to
play well with others.

> > We have to remember that memory reclaim is doing LRU reclaim and the
> > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > to operate in the same direction (oldest to youngest) for the same
> > purpose.  The fundamental problem that occurs when memory reclaim
> > starts writing pages back from the LRU is this:
> > 
> > 	- memory reclaim has run ahead of IO writeback -
> > 
> 
> This reasoning was the basis for this patch
> http://www.gossamer-threads.com/lists/linux/kernel/1251235?do=post_view_threaded#1251235
> 
> i.e. if old pages are dirty then the flusher threads are either not
> awake or not doing enough work so wake them. It was flawed in a number
> of respects and never finished though.

But that's dealing with a different situation - you're assuming that the
writeback threads are not running or are running inefficiently.

What I'm seeing is bad behaviour when the IO subsystem is already
running flat out with perfectly formed IO. No additional IO
submission is going to make it clean pages faster than it already
is. It is in this situation that memory reclaim should never, ever
be trying to write dirty pages.

IIRC, the situation was that there were about 15,000 dirty pages and
~20,000 pages under writeback when memory reclaim started pushing
pages from the LRU. This is on a single node machine, with all IO
being single threaded (so a single source of memory pressure) and
writeback doing its job.  Memory reclaim should *never* get ahead
of writeback under such a simple workload on such a simple
configuration....

> > The LRU usually looks like this:
> > 
> > 	oldest					youngest
> > 	+---------------+---------------+--------------+
> > 	clean		writeback	dirty
> > 			^		^
> > 			|		|
> > 			|		Where flusher will next work from
> > 			|		Where kswapd is working from
> > 			|
> > 			IO submitted by flusher, waiting on completion
> > 
> > 
> > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > got ahead of writeback without being throttled - it's passed over
> > all the pages currently under writeback and is trying to write back
> > pages that are *newer* than what writeback is working on. IOWs, it
> > starts trying to do the job of the flusher threads, and it does that
> > very badly.
> > 
> > The $100 question is *why is it getting ahead of writeback*?
> > 
> 
> Allocating and dirtying memory faster than writeback. Large dd to USB
> stick would also trigger it.

Write throttling is supposed to prevent that situation from being
problematic. Its entire purpose is to throttle the dirtying rate to
match the writeback rate. If that's a problem, the memory reclaim
subsystem is the wrong place to be trying to fix it.

And as such, that is not the case here; foreground throttling is
definitely occurring and works fine for 70-80s, then memory reclaim
gets ahead of writeback and it all goes to shit.

> > From a brief look at the vmscan code, it appears that scanning does
> > not throttle/block until reclaim priority has got pretty high. That
> > means at low priority reclaim, it *skips pages under writeback*.
> > However, if it comes across a dirty page, it will trigger writeback
> > of the page.
> > 
> > Now call me crazy, but if we've already got a large number of pages
> > under writeback, why would we want to *start more IO* when clearly
> > the system is taking care of cleaning pages already and all we have
> > to do is wait for a short while to get clean pages ready for
> > reclaim?
> > 
> 
> It doesn't check how many pages are under writeback.

Isn't that an indication of a design flaw? You want to clean
pages, but you don't even bother to check on how many pages are
currently being cleaned and will soon be reclaimable?

> Direct reclaim
> will check if the block device is congested but that is about
> it.

FWIW, we've removed all the congestion logic from the writeback
subsystem because IO throttling never really worked well that way.
Writeback IO throttling now works by foreground blocking during IO
submission on request queue slots in the elevator. That's why we
have flusher threads per-bdi - so writeback can block on a congested
bdi and not block writeback to other bdis. It's simpler, more
extensible and far more scalable than the old method.

Anyway, it's a moot point because direct reclaim can't issue IO
through xfs, ext4 or btrfs and as such I have doubts that the
throttling logic in vmscan is completely robust.
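
For illustration, the guard the filesystems carry looks something like
this - details differ between xfs, ext4 and btrfs, so take it as a
sketch of the idea rather than a quote of any one ->writepage:

	/* refuse IO submitted from direct reclaim; kswapd is let through */
	if ((current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}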

> Otherwise the expectation was the elevator would handle the
> merging of requests into a sensible pattern. Also, while filesystem
> pages are getting cleaned by the flushers, that does not cover anonymous
> pages being written to swap.

Anonymous pages written to swap are not the issue here - I couldn't
care less what you do with them. It's writeback of dirty file pages
that I care about...

> 
> > Indeed, I added this quick hack to prevent the VM from doing
> > writeback via pageout until after it starts blocking on writeback
> > pages:
> > 
> > @@ -825,6 +825,8 @@ static unsigned long shrink_page_list(struct list_head *page_l
> >  		if (PageDirty(page)) {
> >  			nr_dirty++;
> >  
> > +			if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
> > +				goto keep_locked;
> >  			if (references == PAGEREF_RECLAIM_CLEAN)
> >  				goto keep_locked;
> >  			if (!may_enter_fs)
> > 
> > IOWs, we don't write pages from kswapd unless there is no IO
> > writeback going on at all (waited on all the writeback pages or none
> > exist) and there are dirty pages on the LRU.
> > 
> 
> A side effect of this patch is that kswapd is no longer writing
> anonymous pages to swap and possibly never will.

For dirty anon pages to still get written, all that needs to be
done is pass the file parameter to shrink_page_list() and change the
test to: 

+			if (file && !(sc->reclaim_mode & RECLAIM_MODE_SYNC))
+				goto keep_locked;

As it is, I haven't had any of my test systems (which run tests that
deliberately cause OOM conditions) fail with this patch. While I
agree it is just a hack, its naivety has also demonstrated that a
working system does not need to write back dirty file pages from
memory reclaim -at all-. i.e. it makes my argument stronger, not
weaker....

> RECLAIM_MODE_SYNC is
> only set for lumpy reclaim which if you have CONFIG_COMPACTION set, will
> never happen.

Which means that memory reclaim does not throttle reliably on
writeback in progress. Even when the priority has ratcheted right up
and it is obvious that the zone in question has pages being cleaned
and will soon be available for reclaim, memory reclaim won't wait
for them directly.
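
The relevant shrink_page_list() logic looks roughly like this in 3.0 -
again paraphrasing from memory, so read it as a sketch rather than the
literal code:

		if (PageWriteback(page)) {
			/*
			 * Only the synchronous (lumpy) reclaim pass waits
			 * for IO already in flight; every other pass just
			 * skips the page and keeps scanning.
			 */
			if (sc->reclaim_mode & RECLAIM_MODE_SYNC)
				wait_on_page_writeback(page);
			else {
				unlock_page(page);
				goto keep_lumpy;
			}
		}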

Once again this points to the throttling mechanism being sub-optimal
- it relies on second order effects (congestion_wait) to try to
block long enough for pages to be cleaned in the zone being
reclaimed from before doing another scan to find those pages. It's a
"wait and hope" approach to throttling, and that's one of the
reasons it never worked well in the writeback subsystem.

Instead, if memory reclaim waits directly on a page under writeback on
the given LRU, it guarantees that when you are woken there has been at
least some progress made by the IO subsystem that would allow the
memory reclaim subsystem to move forward.

What it comes down to is the fact that you can scan tens of
thousands of pages in the time it takes for IO on a single page to
complete. If there are pages already under IO, then why start more
IO when what ends up getting reclaimed is one of the pages that is
already under IO when the new IO was issued?

BTW:

# CONFIG_COMPACTION is not set

> I see your figures and know why you want this but it never was that
> straight-forward :/

If the code is complex enough that implementing a basic policy such
as "don't writeback pages if there are already pages under
writeback" is difficult, then maybe the code needs to be
simplified....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-04  3:25             ` Dave Chinner
@ 2011-07-06  4:53               ` Wu Fengguang
  -1 siblings, 0 replies; 100+ messages in thread
From: Wu Fengguang @ 2011-07-06  4:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-mm, xfs, Mel Gorman, Johannes Weiner

On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> > 
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > > 
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > > 
> > > As part of investigating the behaviour he found out that we're still
> > > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> > > pretty bad behaviour in general, but it also means we really can't
> > > just remove the writeback clustering in writepage given how much
> > > I/O is still done through that.
> > > 
> > > Any chance we could get the writeback vs kswapd behaviour sorted out a bit
> > > better finally?
> > 
> > I once tried this approach:
> > 
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> > 
> > It used a list structure that is not linearly scalable, however that
> > part should be independently improvable when necessary.
> 
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly.  Yes, you added
> some clustering, but I still don't think writing specific pages is
> the best solution.

I agree that the VM should avoid writing specific pages as much as
possible. Most often, it's indeed OK to just skip a sporadically
encountered dirty page and reclaim the clean pages presumably not
far away in the LRU list. So your 2-liner patch is all good if
constrained to low scan pressure, which would look like

        if (priority == DEF_PRIORITY)
                tag PG_reclaim on encountered dirty pages and
                skip writing it

However the VM in general does need the ability to write specific
pages, such as when reclaiming from a specific zone or memcg. So I'll still
propose to do bdi_start_inode_writeback().

Below is the patch rebased to linux-next. It's good enough for testing
purposes, and I guess even with the ->nr_pages work issue, it's
complete enough to get roughly the same performance as your 2-liner
patch.

> > The real problem was, it seemed to not be very effective in my test runs.
> > I found many ->nr_pages works queued before the ->inode works, which
> > effectively makes the flusher work on more dispersed pages rather
> > than focusing on the dirty pages encountered in LRU reclaim.
> 
> But that's really just an implementation issue related to how you
> tried to solve the problem. That could be addressed.
> 
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO persepctive.
> 
> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback. IOWs, both are trying
> to operate in the same direction (oldest to youngest) for the same
> purpose.  The fundamental problem that occurs when memory reclaim
> starts writing pages back from the LRU is this:
> 
> 	- memory reclaim has run ahead of IO writeback -
> 
> The LRU usually looks like this:
> 
> 	oldest					youngest
> 	+---------------+---------------+--------------+
> 	clean		writeback	dirty
> 			^		^
> 			|		|
> 			|		Where flusher will next work from
> 			|		Where kswapd is working from
> 			|
> 			IO submitted by flusher, waiting on completion
> 
> 
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on. IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
> 
> The $100 question is *why is it getting ahead of writeback*?

The most important case is: faster reader + relatively slow writer.

Assume for every 10 pages read, 1 page is dirtied, and the dirty speed
is fast enough to trigger the 20% dirty ratio and hence dirty balancing.

That pattern is able to evenly distribute dirty pages all over the LRU
list and hence trigger lots of pageout()s. The "skip reclaim writes on
low pressure" approach can fix this case.
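
(Concretely: with that 10:1 mix roughly one page in eleven on the LRU is
dirty, so even a moderate scan of a few hundred pages runs into dozens
of dirty pages that are far from where the flusher is currently working,
each of them a candidate for a random 4k pageout() unless reclaim
deliberately skips them at low pressure.)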

Thanks,
Fengguang
---
Subject: writeback: introduce bdi_start_inode_writeback()
Date: Thu Jul 29 14:41:19 CST 2010

This relays ASYNC file writeback IOs to the flusher threads; the less
frequent SYNC pageout()s keep working as before, as a last resort.

The SYNC path continues to provide the throttling needed to prevent OOM,
which may happen if the LRU list is small and/or the storage is slow, so
that the flusher cannot clean enough pages before the LRU is fully
scanned.

The flusher will piggy back more dirty pages for IO
- it's more IO efficient
- it helps clean more pages, a good number of them may sit in the same
  LRU list that is being scanned.

To avoid memory allocations at page reclaim, a mempool is created.

Background/periodic works will quit automatically (as done in another
patch), so as to clean the pages under reclaim ASAP. However for now the
sync work can still block us for a long time.

Jan Kara: limit the search scope.

CC: Jan Kara <jack@suse.cz>
CC: Rik van Riel <riel@redhat.com>
CC: Mel Gorman <mel@linux.vnet.ibm.com>
CC: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |  156 ++++++++++++++++++++++++++++-
 include/linux/backing-dev.h      |    1 
 include/trace/events/writeback.h |   15 ++
 mm/vmscan.c                      |    8 +
 4 files changed, 174 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/vmscan.c	2011-06-29 20:43:10.000000000 -0700
+++ linux-next/mm/vmscan.c	2011-07-05 18:30:19.000000000 -0700
@@ -825,6 +825,14 @@ static unsigned long shrink_page_list(st
 		if (PageDirty(page)) {
 			nr_dirty++;
 
+			if (page_is_file_cache(page) && mapping &&
+			    sc->reclaim_mode != RECLAIM_MODE_SYNC) {
+				if (flush_inode_page(page, mapping) >= 0) {
+					SetPageReclaim(page);
+					goto keep_locked;
+				}
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
--- linux-next.orig/fs/fs-writeback.c	2011-07-05 18:30:16.000000000 -0700
+++ linux-next/fs/fs-writeback.c	2011-07-05 18:30:52.000000000 -0700
@@ -30,12 +30,21 @@
 #include "internal.h"
 
 /*
+ * When flushing an inode page (for page reclaim), try to piggy back up to
+ * 4MB nearby pages for IO efficiency. These pages will have good opportunity
+ * to be in the same LRU list.
+ */
+#define WRITE_AROUND_PAGES	MIN_WRITEBACK_PAGES
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
 	unsigned long *older_than_this;
+	struct inode *inode;
+	pgoff_t offset;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -59,6 +68,27 @@ struct wb_writeback_work {
  */
 int nr_pdflush_threads;
 
+static mempool_t *wb_work_mempool;
+
+static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
+{
+	/*
+	 * bdi_start_inode_writeback() may be called on page reclaim
+	 */
+	if (current->flags & PF_MEMALLOC)
+		return NULL;
+
+	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
+}
+
+static __init int wb_work_init(void)
+{
+	wb_work_mempool = mempool_create(1024,
+					 wb_work_alloc, mempool_kfree, NULL);
+	return wb_work_mempool ? 0 : -ENOMEM;
+}
+fs_initcall(wb_work_init);
+
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -123,7 +153,7 @@ __bdi_start_writeback(struct backing_dev
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
 	if (!work) {
 		if (bdi->wb.task) {
 			trace_writeback_nowork(bdi);
@@ -132,6 +162,7 @@ __bdi_start_writeback(struct backing_dev
 		return;
 	}
 
+	memset(work, 0, sizeof(*work));
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
@@ -177,6 +208,107 @@ void bdi_start_background_writeback(stru
 	spin_unlock_bh(&bdi->wb_lock);
 }
 
+static bool extend_writeback_range(struct wb_writeback_work *work,
+				   pgoff_t offset)
+{
+	pgoff_t end = work->offset + work->nr_pages;
+
+	if (offset >= work->offset && offset < end)
+		return true;
+
+	/* the unsigned comparison helps eliminate one compare */
+	if (work->offset - offset < WRITE_AROUND_PAGES) {
+		work->nr_pages += WRITE_AROUND_PAGES;
+		work->offset -= WRITE_AROUND_PAGES;
+		return true;
+	}
+
+	if (offset - end < WRITE_AROUND_PAGES) {
+		work->nr_pages += WRITE_AROUND_PAGES;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * schedule writeback on a range of inode pages.
+ */
+static struct wb_writeback_work *
+bdi_flush_inode_range(struct backing_dev_info *bdi,
+		      struct inode *inode,
+		      pgoff_t offset,
+		      pgoff_t len)
+{
+	struct wb_writeback_work *work;
+
+	if (!igrab(inode))
+		return ERR_PTR(-ENOENT);
+
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
+	if (!work)
+		return ERR_PTR(-ENOMEM);
+
+	memset(work, 0, sizeof(*work));
+	work->sync_mode		= WB_SYNC_NONE;
+	work->inode		= inode;
+	work->offset		= offset;
+	work->nr_pages		= len;
+
+	bdi_queue_work(bdi, work);
+
+	return work;
+}
+
+/*
+ * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
+ * improve IO throughput. The nearby pages will have good chance to reside in
+ * the same LRU list that vmscan is working on, and even close to each other
+ * inside the LRU list in the common case of sequential read/write.
+ *
+ * ret > 0: success, found/reused a previous writeback work
+ * ret = 0: success, allocated/queued a new writeback work
+ * ret < 0: failed
+ */
+long flush_inode_page(struct page *page, struct address_space *mapping)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct inode *inode = mapping->host;
+	pgoff_t offset = page->index;
+	pgoff_t len = 0;
+	struct wb_writeback_work *work;
+	long ret = -ENOENT;
+
+	if (unlikely(!inode))
+		goto out;
+
+	len = 1;
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_reverse(work, &bdi->work_list, list) {
+		if (work->inode != inode)
+			continue;
+		if (extend_writeback_range(work, offset)) {
+			ret = len;
+			offset = work->offset;
+			len = work->nr_pages;
+			break;
+		}
+		if (len++ > 30)	/* do limited search */
+			break;
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	if (ret > 0)
+		goto out;
+
+	offset = round_down(offset, WRITE_AROUND_PAGES);
+	len = WRITE_AROUND_PAGES;
+	work = bdi_flush_inode_range(bdi, inode, offset, len);
+	ret = IS_ERR(work) ? PTR_ERR(work) : 0;
+out:
+	return ret;
+}
+
 /*
  * Remove the inode from the writeback list it is on.
  */
@@ -830,6 +962,21 @@ static unsigned long get_nr_dirty_pages(
 		get_nr_dirty_inodes();
 }
 
+static long wb_flush_inode(struct bdi_writeback *wb,
+			   struct wb_writeback_work *work)
+{
+	loff_t start = work->offset;
+	loff_t end   = work->offset + work->nr_pages - 1;
+	int wrote;
+
+	wrote = __filemap_fdatawrite_range(work->inode->i_mapping,
+					   start << PAGE_CACHE_SHIFT,
+					   end   << PAGE_CACHE_SHIFT,
+					   WB_SYNC_NONE);
+	iput(work->inode);
+	return wrote;
+}
+
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
 	if (over_bground_thresh()) {
@@ -900,7 +1047,10 @@ long wb_do_writeback(struct bdi_writebac
 
 		trace_writeback_exec(bdi, work);
 
-		wrote += wb_writeback(wb, work);
+		if (work->inode)
+			wrote += wb_flush_inode(wb, work);
+		else
+			wrote += wb_writeback(wb, work);
 
 		/*
 		 * Notify the caller of completion if this is a synchronous
@@ -909,7 +1059,7 @@ long wb_do_writeback(struct bdi_writebac
 		if (work->done)
 			complete(work->done);
 		else
-			kfree(work);
+			mempool_free(work, wb_work_mempool);
 	}
 
 	/*
--- linux-next.orig/include/linux/backing-dev.h	2011-07-03 20:03:37.000000000 -0700
+++ linux-next/include/linux/backing-dev.h	2011-07-05 18:30:19.000000000 -0700
@@ -109,6 +109,7 @@ void bdi_unregister(struct backing_dev_i
 int bdi_setup_and_register(struct backing_dev_info *, char *, unsigned int);
 void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages);
 void bdi_start_background_writeback(struct backing_dev_info *bdi);
+long flush_inode_page(struct page *page, struct address_space *mapping);
 int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
--- linux-next.orig/include/trace/events/writeback.h	2011-07-05 18:30:16.000000000 -0700
+++ linux-next/include/trace/events/writeback.h	2011-07-05 18:30:19.000000000 -0700
@@ -28,31 +28,40 @@ DECLARE_EVENT_CLASS(writeback_work_class
 	TP_ARGS(bdi, work),
 	TP_STRUCT__entry(
 		__array(char, name, 32)
+		__field(struct wb_writeback_work*, work)
 		__field(long, nr_pages)
 		__field(dev_t, sb_dev)
 		__field(int, sync_mode)
 		__field(int, for_kupdate)
 		__field(int, range_cyclic)
 		__field(int, for_background)
+		__field(unsigned long, ino)
+		__field(unsigned long, offset)
 	),
 	TP_fast_assign(
 		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->work = work;
 		__entry->nr_pages = work->nr_pages;
 		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
 		__entry->sync_mode = work->sync_mode;
 		__entry->for_kupdate = work->for_kupdate;
 		__entry->range_cyclic = work->range_cyclic;
 		__entry->for_background	= work->for_background;
+		__entry->ino		= work->inode ? work->inode->i_ino : 0;
+		__entry->offset		= work->offset;
 	),
-	TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
-		  "kupdate=%d range_cyclic=%d background=%d",
+	TP_printk("bdi %s: sb_dev %d:%d %p nr_pages=%ld sync_mode=%d "
+		  "kupdate=%d range_cyclic=%d background=%d ino=%lu offset=%lu",
 		  __entry->name,
 		  MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
+		  __entry->work,
 		  __entry->nr_pages,
 		  __entry->sync_mode,
 		  __entry->for_kupdate,
 		  __entry->range_cyclic,
-		  __entry->for_background
+		  __entry->for_background,
+		  __entry->ino,
+		  __entry->offset
 	)
 );
 #define DEFINE_WRITEBACK_WORK_EVENT(name) \

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
@ 2011-07-06  4:53               ` Wu Fengguang
  0 siblings, 0 replies; 100+ messages in thread
From: Wu Fengguang @ 2011-07-06  4:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Mel Gorman, Johannes Weiner, xfs, linux-mm

On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> > 
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > > 
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > > 
> > > As part of investigating the behaviour he found out that we're still
> > > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> > > pretty bad behaviour in general, but it also means we really can't
> > > just remove the writeback clustering in writepage given how much
> > > I/O is still done through that.
> > > 
> > > Any chance we could the writeback vs kswap behaviour sorted out a bit
> > > better finally?
> > 
> > I once tried this approach:
> > 
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> > 
> > It used a list structure that is not linearly scalable, however that
> > part should be independently improvable when necessary.
> 
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly.  Yes, you added
> some clustering, but I'm still don't think writing specific pages is
> the best solution.

I agree that the VM should avoid writing specific pages as much as
possible. Most often, it's indeed OK to just skip a sporadically
encountered dirty page and reclaim the clean pages that are presumably
not far away in the LRU list. So your 2-liner patch is all good if we
constrain it to low scan pressure, which will look like
        if (priority == DEF_PRIORITY)
                tag PG_reclaim on encountered dirty pages and
                skip writing it
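
Spelled out, that check would sit in shrink_page_list()'s dirty-page
branch and look roughly like the following (a sketch only: in this
kernel the scan priority is not actually visible inside
shrink_page_list(), so plumbing it down, e.g. via struct scan_control,
is assumed):

	if (priority == DEF_PRIORITY &&
	    page_is_file_cache(page) && PageDirty(page)) {
		SetPageReclaim(page);	/* rotate to the tail once the flusher cleans it */
		goto keep_locked;	/* don't write it from reclaim */
	}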

However, the VM in general does need the ability to write specific
pages, such as when reclaiming from a specific zone/memcg. So I'll still
propose to do bdi_start_inode_writeback().

Below is the patch rebased to linux-next. It's good enough for testing
purposes, and I guess even with the ->nr_pages work issue it's
complete enough to get roughly the same performance as your 2-liner
patch.

> > The real problem was, it seem to not very effective in my test runs.
> > I found many ->nr_pages works queued before the ->inode works, which
> > effectively makes the flusher working on more dispersed pages rather
> > than focusing on the dirty pages encountered in LRU reclaim.
> 
> But that's really just an implementation issue related to how you
> tried to solve the problem. That could be addressed.
> 
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO persepctive.
> 
> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback. IOWs, both are trying
> to operate in the same direction (oldest to youngest) for the same
> purpose.  The fundamental problem that occurs when memory reclaim
> starts writing pages back from the LRU is this:
> 
> 	- memory reclaim has run ahead of IO writeback -
> 
> The LRU usually looks like this:
> 
> 	oldest					youngest
> 	+---------------+---------------+--------------+
> 	clean		writeback	dirty
> 			^		^
> 			|		|
> 			|		Where flusher will next work from
> 			|		Where kswapd is working from
> 			|
> 			IO submitted by flusher, waiting on completion
> 
> 
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on. IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
> 
> The $100 question is *why is it getting ahead of writeback*?

The most important case is: faster reader + relatively slow writer.

Assume for every 10 pages read, 1 page is dirtied, and the dirty speed
is fast enough to trigger the 20% dirty ratio and hence dirty balancing.

That pattern is able to evenly distribute dirty pages all over the LRU
list and hence trigger lots of pageout()s. The "skip reclaim writes on
low pressure" approach can fix this case.

Thanks,
Fengguang
---
Subject: writeback: introduce bdi_start_inode_writeback()
Date: Thu Jul 29 14:41:19 CST 2010

This relays ASYNC file writeback IOs to the flusher threads.

pageout() will continue to serve the SYNC file page writes, to provide the
throttling necessary for preventing OOM, which may happen if the LRU list is
small and/or the storage is slow, so that the flusher cannot clean enough
pages before the LRU is fully scanned.

Only ASYNC pageout() is relayed to the flusher threads; the less
frequent SYNC pageout()s will work as before, as a last resort.
This helps to avoid OOM when the LRU list is small and/or the storage is
slow, and the flusher cannot clean enough pages before the LRU is
fully scanned.

The flusher will piggyback more dirty pages for IO:
- it's more IO efficient
- it helps clean more pages, a good number of which may sit in the same
  LRU list that is being scanned.

To avoid memory allocations at page reclaim, a mempool is created.

Background/periodic works will quit automatically (as done in another
patch), so as to clean the pages under reclaim ASAP. However, for now the
sync work can still block us for a long time.

Jan Kara: limit the search scope.

CC: Jan Kara <jack@suse.cz>
CC: Rik van Riel <riel@redhat.com>
CC: Mel Gorman <mel@linux.vnet.ibm.com>
CC: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |  156 ++++++++++++++++++++++++++++-
 include/linux/backing-dev.h      |    1 
 include/trace/events/writeback.h |   15 ++
 mm/vmscan.c                      |    8 +
 4 files changed, 174 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/vmscan.c	2011-06-29 20:43:10.000000000 -0700
+++ linux-next/mm/vmscan.c	2011-07-05 18:30:19.000000000 -0700
@@ -825,6 +825,14 @@ static unsigned long shrink_page_list(st
 		if (PageDirty(page)) {
 			nr_dirty++;
 
+			if (page_is_file_cache(page) && mapping &&
+			    sc->reclaim_mode != RECLAIM_MODE_SYNC) {
+				if (flush_inode_page(page, mapping) >= 0) {
+					SetPageReclaim(page);
+					goto keep_locked;
+				}
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
--- linux-next.orig/fs/fs-writeback.c	2011-07-05 18:30:16.000000000 -0700
+++ linux-next/fs/fs-writeback.c	2011-07-05 18:30:52.000000000 -0700
@@ -30,12 +30,21 @@
 #include "internal.h"
 
 /*
+ * When flushing an inode page (for page reclaim), try to piggy back up to
+ * 4MB nearby pages for IO efficiency. These pages will have good opportunity
+ * to be in the same LRU list.
+ */
+#define WRITE_AROUND_PAGES	MIN_WRITEBACK_PAGES
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
 	unsigned long *older_than_this;
+	struct inode *inode;
+	pgoff_t offset;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -59,6 +68,27 @@ struct wb_writeback_work {
  */
 int nr_pdflush_threads;
 
+static mempool_t *wb_work_mempool;
+
+static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
+{
+	/*
+	 * bdi_start_inode_writeback() may be called on page reclaim
+	 */
+	if (current->flags & PF_MEMALLOC)
+		return NULL;
+
+	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
+}
+
+static __init int wb_work_init(void)
+{
+	wb_work_mempool = mempool_create(1024,
+					 wb_work_alloc, mempool_kfree, NULL);
+	return wb_work_mempool ? 0 : -ENOMEM;
+}
+fs_initcall(wb_work_init);
+
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -123,7 +153,7 @@ __bdi_start_writeback(struct backing_dev
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
 	if (!work) {
 		if (bdi->wb.task) {
 			trace_writeback_nowork(bdi);
@@ -132,6 +162,7 @@ __bdi_start_writeback(struct backing_dev
 		return;
 	}
 
+	memset(work, 0, sizeof(*work));
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
@@ -177,6 +208,107 @@ void bdi_start_background_writeback(stru
 	spin_unlock_bh(&bdi->wb_lock);
 }
 
+static bool extend_writeback_range(struct wb_writeback_work *work,
+				   pgoff_t offset)
+{
+	pgoff_t end = work->offset + work->nr_pages;
+
+	if (offset >= work->offset && offset < end)
+		return true;
+
+	/* the unsigned comparison helps eliminate one compare */
+	if (work->offset - offset < WRITE_AROUND_PAGES) {
+		work->nr_pages += WRITE_AROUND_PAGES;
+		work->offset -= WRITE_AROUND_PAGES;
+		return true;
+	}
+
+	if (offset - end < WRITE_AROUND_PAGES) {
+		work->nr_pages += WRITE_AROUND_PAGES;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * schedule writeback on a range of inode pages.
+ */
+static struct wb_writeback_work *
+bdi_flush_inode_range(struct backing_dev_info *bdi,
+		      struct inode *inode,
+		      pgoff_t offset,
+		      pgoff_t len)
+{
+	struct wb_writeback_work *work;
+
+	if (!igrab(inode))
+		return ERR_PTR(-ENOENT);
+
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
+	if (!work)
+		return ERR_PTR(-ENOMEM);
+
+	memset(work, 0, sizeof(*work));
+	work->sync_mode		= WB_SYNC_NONE;
+	work->inode		= inode;
+	work->offset		= offset;
+	work->nr_pages		= len;
+
+	bdi_queue_work(bdi, work);
+
+	return work;
+}
+
+/*
+ * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
+ * improve IO throughput. The nearby pages will have good chance to reside in
+ * the same LRU list that vmscan is working on, and even close to each other
+ * inside the LRU list in the common case of sequential read/write.
+ *
+ * ret > 0: success, found/reused a previous writeback work
+ * ret = 0: success, allocated/queued a new writeback work
+ * ret < 0: failed
+ */
+long flush_inode_page(struct page *page, struct address_space *mapping)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct inode *inode = mapping->host;
+	pgoff_t offset = page->index;
+	pgoff_t len = 0;
+	struct wb_writeback_work *work;
+	long ret = -ENOENT;
+
+	if (unlikely(!inode))
+		goto out;
+
+	len = 1;
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_reverse(work, &bdi->work_list, list) {
+		if (work->inode != inode)
+			continue;
+		if (extend_writeback_range(work, offset)) {
+			ret = len;
+			offset = work->offset;
+			len = work->nr_pages;
+			break;
+		}
+		if (len++ > 30)	/* do limited search */
+			break;
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	if (ret > 0)
+		goto out;
+
+	offset = round_down(offset, WRITE_AROUND_PAGES);
+	len = WRITE_AROUND_PAGES;
+	work = bdi_flush_inode_range(bdi, inode, offset, len);
+	ret = IS_ERR(work) ? PTR_ERR(work) : 0;
+out:
+	return ret;
+}
+
 /*
  * Remove the inode from the writeback list it is on.
  */
@@ -830,6 +962,21 @@ static unsigned long get_nr_dirty_pages(
 		get_nr_dirty_inodes();
 }
 
+static long wb_flush_inode(struct bdi_writeback *wb,
+			   struct wb_writeback_work *work)
+{
+	loff_t start = work->offset;
+	loff_t end   = work->offset + work->nr_pages - 1;
+	int wrote;
+
+	wrote = __filemap_fdatawrite_range(work->inode->i_mapping,
+					   start << PAGE_CACHE_SHIFT,
+					   end   << PAGE_CACHE_SHIFT,
+					   WB_SYNC_NONE);
+	iput(work->inode);
+	return wrote;
+}
+
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
 	if (over_bground_thresh()) {
@@ -900,7 +1047,10 @@ long wb_do_writeback(struct bdi_writebac
 
 		trace_writeback_exec(bdi, work);
 
-		wrote += wb_writeback(wb, work);
+		if (work->inode)
+			wrote += wb_flush_inode(wb, work);
+		else
+			wrote += wb_writeback(wb, work);
 
 		/*
 		 * Notify the caller of completion if this is a synchronous
@@ -909,7 +1059,7 @@ long wb_do_writeback(struct bdi_writebac
 		if (work->done)
 			complete(work->done);
 		else
-			kfree(work);
+			mempool_free(work, wb_work_mempool);
 	}
 
 	/*
--- linux-next.orig/include/linux/backing-dev.h	2011-07-03 20:03:37.000000000 -0700
+++ linux-next/include/linux/backing-dev.h	2011-07-05 18:30:19.000000000 -0700
@@ -109,6 +109,7 @@ void bdi_unregister(struct backing_dev_i
 int bdi_setup_and_register(struct backing_dev_info *, char *, unsigned int);
 void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages);
 void bdi_start_background_writeback(struct backing_dev_info *bdi);
+long flush_inode_page(struct page *page, struct address_space *mapping);
 int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
--- linux-next.orig/include/trace/events/writeback.h	2011-07-05 18:30:16.000000000 -0700
+++ linux-next/include/trace/events/writeback.h	2011-07-05 18:30:19.000000000 -0700
@@ -28,31 +28,40 @@ DECLARE_EVENT_CLASS(writeback_work_class
 	TP_ARGS(bdi, work),
 	TP_STRUCT__entry(
 		__array(char, name, 32)
+		__field(struct wb_writeback_work*, work)
 		__field(long, nr_pages)
 		__field(dev_t, sb_dev)
 		__field(int, sync_mode)
 		__field(int, for_kupdate)
 		__field(int, range_cyclic)
 		__field(int, for_background)
+		__field(unsigned long, ino)
+		__field(unsigned long, offset)
 	),
 	TP_fast_assign(
 		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->work = work;
 		__entry->nr_pages = work->nr_pages;
 		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
 		__entry->sync_mode = work->sync_mode;
 		__entry->for_kupdate = work->for_kupdate;
 		__entry->range_cyclic = work->range_cyclic;
 		__entry->for_background	= work->for_background;
+		__entry->ino		= work->inode ? work->inode->i_ino : 0;
+		__entry->offset		= work->offset;
 	),
-	TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
-		  "kupdate=%d range_cyclic=%d background=%d",
+	TP_printk("bdi %s: sb_dev %d:%d %p nr_pages=%ld sync_mode=%d "
+		  "kupdate=%d range_cyclic=%d background=%d ino=%lu offset=%lu",
 		  __entry->name,
 		  MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
+		  __entry->work,
 		  __entry->nr_pages,
 		  __entry->sync_mode,
 		  __entry->for_kupdate,
 		  __entry->range_cyclic,
-		  __entry->for_background
+		  __entry->for_background,
+		  __entry->ino,
+		  __entry->offset
 	)
 );
 #define DEFINE_WRITEBACK_WORK_EVENT(name) \
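
As an aside, the merging logic above can be modelled in a few lines of
userspace C for experimentation (a sketch only: a plain array stands in
for bdi->work_list, there is no locking or inode matching, and
WRITE_AROUND_PAGES is assumed to be 1024, i.e. 4MB with 4KB pages):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long pgoff_t;
#define WRITE_AROUND_PAGES	1024UL
#define round_down(x, y)	((x) & ~((y) - 1))

struct work { pgoff_t offset, nr_pages; };

/* same logic as extend_writeback_range() in the patch above */
static bool extend(struct work *w, pgoff_t off)
{
	pgoff_t end = w->offset + w->nr_pages;

	if (off >= w->offset && off < end)
		return true;
	if (w->offset - off < WRITE_AROUND_PAGES) {	/* just below the range */
		w->offset -= WRITE_AROUND_PAGES;
		w->nr_pages += WRITE_AROUND_PAGES;
		return true;
	}
	if (off - end < WRITE_AROUND_PAGES) {		/* just above the range */
		w->nr_pages += WRITE_AROUND_PAGES;
		return true;
	}
	return false;
}

int main(void)
{
	struct work queue[8];
	pgoff_t dirty[] = { 130000, 130031, 130900, 500000 };
	int nr_works = 0;

	for (int i = 0; i < 4; i++) {
		pgoff_t off = dirty[i];
		bool merged = false;

		/* search the most recently queued works first, like the real code */
		for (int j = nr_works - 1; j >= 0; j--)
			if ((merged = extend(&queue[j], off)))
				break;
		if (!merged) {		/* queue a fresh write-around window */
			queue[nr_works].offset = round_down(off, WRITE_AROUND_PAGES);
			queue[nr_works].nr_pages = WRITE_AROUND_PAGES;
			nr_works++;
		}
		printf("page %7lu -> %-8s (%d work item(s) pending)\n",
		       off, merged ? "merged" : "new work", nr_works);
	}
	return 0;
}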

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-06  4:53               ` Wu Fengguang
@ 2011-07-06  6:47                 ` Minchan Kim
  -1 siblings, 0 replies; 100+ messages in thread
From: Minchan Kim @ 2011-07-06  6:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: xfs, Christoph Hellwig, linux-mm, Mel Gorman, Johannes Weiner

On Wed, Jul 6, 2011 at 1:53 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote:
>> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
>> > Christoph,
>> >
>> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
>> > > Johannes, Mel, Wu,
>> > >
>> > > Dave has been stressing some XFS patches of mine that remove the XFS
>> > > internal writeback clustering in favour of using write_cache_pages.
>> > >
>> > > As part of investigating the behaviour he found out that we're still
>> > > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
>> > > pretty bad behaviour in general, but it also means we really can't
>> > > just remove the writeback clustering in writepage given how much
>> > > I/O is still done through that.
>> > >
>> > > Any chance we could the writeback vs kswap behaviour sorted out a bit
>> > > better finally?
>> >
>> > I once tried this approach:
>> >
>> > http://www.spinics.net/lists/linux-mm/msg09202.html
>> >
>> > It used a list structure that is not linearly scalable, however that
>> > part should be independently improvable when necessary.
>>
>> I don't think that handing random writeback to the flusher thread is
>> much better than doing random writeback directly.  Yes, you added
>> some clustering, but I'm still don't think writing specific pages is
>> the best solution.
>
> I agree that the VM should avoid writing specific pages as much as
> possible. Mostly often, it's indeed OK to just skip sporadically
> encountered dirty page and reclaim the clean pages presumably not
> far away in the LRU list. So your 2-liner patch is all good if
> constraining it to low scan pressure, which will look like
>
>        if (priority == DEF_PRIORITY)
>                tag PG_reclaim on encountered dirty pages and
>                skip writing it
>
> However the VM in general does need the ability to write specific
> pages, such as when reclaiming from specific zone/memcg. So I'll still
> propose to do bdi_start_inode_writeback().
>
> Below is the patch rebased to linux-next. It's good enough for testing
> purpose, and I guess even with the ->nr_pages work issue, it's
> complete enough to get roughly the same performance as your 2-liner
> patch.
>
>> > The real problem was, it seem to not very effective in my test runs.
>> > I found many ->nr_pages works queued before the ->inode works, which
>> > effectively makes the flusher working on more dispersed pages rather
>> > than focusing on the dirty pages encountered in LRU reclaim.
>>
>> But that's really just an implementation issue related to how you
>> tried to solve the problem. That could be addressed.
>>
>> However, what I'm questioning is whether we should even care what
>> page memory reclaim wants to write - it seems to make fundamentally
>> bad decisions from an IO persepctive.
>>
>> We have to remember that memory reclaim is doing LRU reclaim and the
>> flusher threads are doing "oldest first" writeback. IOWs, both are trying
>> to operate in the same direction (oldest to youngest) for the same
>> purpose.  The fundamental problem that occurs when memory reclaim
>> starts writing pages back from the LRU is this:
>>
>>       - memory reclaim has run ahead of IO writeback -
>>
>> The LRU usually looks like this:
>>
>>       oldest                                  youngest
>>       +---------------+---------------+--------------+
>>       clean           writeback       dirty
>>                       ^               ^
>>                       |               |
>>                       |               Where flusher will next work from
>>                       |               Where kswapd is working from
>>                       |
>>                       IO submitted by flusher, waiting on completion
>>
>>
>> If memory reclaim is hitting dirty pages on the LRU, it means it has
>> got ahead of writeback without being throttled - it's passed over
>> all the pages currently under writeback and is trying to write back
>> pages that are *newer* than what writeback is working on. IOWs, it
>> starts trying to do the job of the flusher threads, and it does that
>> very badly.
>>
>> The $100 question is *why is it getting ahead of writeback*?
>
> The most important case is: faster reader + relatively slow writer.
>
> Assume for every 10 pages read, 1 page is dirtied, and the dirty speed
> is fast enough to trigger the 20% dirty ratio and hence dirty balancing.
>
> That pattern is able to evenly distribute dirty pages all over the LRU
> list and hence trigger lots of pageout()s. The "skip reclaim writes on
> low pressure" approach can fix this case.
>
> Thanks,
> Fengguang
> ---
> Subject: writeback: introduce bdi_start_inode_writeback()
> Date: Thu Jul 29 14:41:19 CST 2010
>
> This relays ASYNC file writeback IOs to the flusher threads.
>
> pageout() will continue to serve the SYNC file page writes for necessary
> throttling for preventing OOM, which may happen if the LRU list is small
> and/or the storage is slow, so that the flusher cannot clean enough
> pages before the LRU is full scanned.
>
> Only ASYNC pageout() is relayed to the flusher threads, the less
> frequent SYNC pageout()s will work as before as a last resort.
> This helps to avoid OOM when the LRU list is small and/or the storage is
> slow, and the flusher cannot clean enough pages before the LRU is
> full scanned.
>
> The flusher will piggy back more dirty pages for IO
> - it's more IO efficient
> - it helps clean more pages, a good number of them may sit in the same
>  LRU list that is being scanned.
>
> To avoid memory allocations at page reclaim, a mempool is created.
>
> Background/periodic works will quit automatically (as done in another
> patch), so as to clean the pages under reclaim ASAP. However for now the
> sync work can still block us for long time.
>
> Jan Kara: limit the search scope.
>
> CC: Jan Kara <jack@suse.cz>
> CC: Rik van Riel <riel@redhat.com>
> CC: Mel Gorman <mel@linux.vnet.ibm.com>
> CC: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

It seems to be an enhanced version of what Mel did earlier.
I support this approach :) but I have some questions.

> ---
>  fs/fs-writeback.c                |  156 ++++++++++++++++++++++++++++-
>  include/linux/backing-dev.h      |    1
>  include/trace/events/writeback.h |   15 ++
>  mm/vmscan.c                      |    8 +
>  4 files changed, 174 insertions(+), 6 deletions(-)
>
> --- linux-next.orig/mm/vmscan.c 2011-06-29 20:43:10.000000000 -0700
> +++ linux-next/mm/vmscan.c      2011-07-05 18:30:19.000000000 -0700
> @@ -825,6 +825,14 @@ static unsigned long shrink_page_list(st
>                if (PageDirty(page)) {
>                        nr_dirty++;
>
> +                       if (page_is_file_cache(page) && mapping &&
> +                           sc->reclaim_mode != RECLAIM_MODE_SYNC) {
> +                               if (flush_inode_page(page, mapping) >= 0) {
> +                                       SetPageReclaim(page);
> +                                       goto keep_locked;

keep_locked changes the old behavior.
Normally, in async mode, we go to keep_lumpy (i.e., we don't reset
reclaim_mode), but now you are always resetting reclaim_mode, so the
sync call of shrink_page_list() never happens if flush_inode_page() is
successful.
Is that your intention?


> +                               }
> +                       }
> +

If flush_inode_page() fails (i.e., the page isn't near any current work's
writeback range), we still do pageout() even though it's async mode. Is that
your intention?

-- 
Kind regards,
Minchan Kim

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-06  4:53               ` Wu Fengguang
@ 2011-07-06  7:17                 ` Dave Chinner
  -1 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-06  7:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, linux-mm, xfs, Mel Gorman, Johannes Weiner

On Tue, Jul 05, 2011 at 09:53:01PM -0700, Wu Fengguang wrote:
> On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > We have to remember that memory reclaim is doing LRU reclaim and the
> > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > to operate in the same direction (oldest to youngest) for the same
> > purpose.  The fundamental problem that occurs when memory reclaim
> > starts writing pages back from the LRU is this:
> > 
> > 	- memory reclaim has run ahead of IO writeback -
> > 
> > The LRU usually looks like this:
> > 
> > 	oldest					youngest
> > 	+---------------+---------------+--------------+
> > 	clean		writeback	dirty
> > 			^		^
> > 			|		|
> > 			|		Where flusher will next work from
> > 			|		Where kswapd is working from
> > 			|
> > 			IO submitted by flusher, waiting on completion
> > 
> > 
> > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > got ahead of writeback without being throttled - it's passed over
> > all the pages currently under writeback and is trying to write back
> > pages that are *newer* than what writeback is working on. IOWs, it
> > starts trying to do the job of the flusher threads, and it does that
> > very badly.
> > 
> > The $100 question is *why is it getting ahead of writeback*?
> 
> The most important case is: faster reader + relatively slow writer.

Same thing I said to Mel: that is not the workload that is causing
this problem I am seeing.

> Assume for every 10 pages read, 1 page is dirtied, and the dirty speed
> is fast enough to trigger the 20% dirty ratio and hence dirty balancing.
> 
> That pattern is able to evenly distribute dirty pages all over the LRU
> list and hence trigger lots of pageout()s. The "skip reclaim writes on
> low pressure" approach can fix this case.

Sure it can, but even better would be to simply skip the dirty pages
and reclaim the interspersed clean pages which greatly
outnumber the dirty pages. That then lets writeback deal with
cleaning the dirty pages in the most optimal manner, and no
writeback from memory reclaim is needed.

IOWs, I don't think writeback from the LRU is the right solution to
the problem you've described, either.
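
As a toy illustration of that point (userspace, nothing kernel about it;
the 1-in-10 dirty ratio is taken from the example quoted above and the
batch size mirrors SWAP_CLUSTER_MAX), skipping the dirty pages barely
lengthens the scan needed to free a batch:

#include <stdio.h>

int main(void)
{
	const int batch = 32;		/* reclaim batch, cf. SWAP_CLUSTER_MAX */
	const int dirty_every = 10;	/* assumed: one dirty page per ten */
	int scanned = 0, freed = 0, skipped = 0;

	while (freed < batch) {
		scanned++;
		if (scanned % dirty_every == 0)
			skipped++;	/* dirty: leave it for the flusher */
		else
			freed++;	/* clean: reclaim it immediately */
	}
	printf("freed %d clean pages after scanning %d (skipped %d dirty)\n",
	       freed, scanned, skipped);
	return 0;
}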

> 
> Thanks,
> Fengguang
> ---
> Subject: writeback: introduce bdi_start_inode_writeback()
> Date: Thu Jul 29 14:41:19 CST 2010
> 
> This relays ASYNC file writeback IOs to the flusher threads.
> 
> pageout() will continue to serve the SYNC file page writes for necessary
> throttling for preventing OOM, which may happen if the LRU list is small
> and/or the storage is slow, so that the flusher cannot clean enough
> pages before the LRU is full scanned.
> 
> Only ASYNC pageout() is relayed to the flusher threads, the less
> frequent SYNC pageout()s will work as before as a last resort.
> This helps to avoid OOM when the LRU list is small and/or the storage is
> slow, and the flusher cannot clean enough pages before the LRU is
> full scanned.

Which ignores the fact that async pageout should not be happening in
most cases. Let's try and fix the root cause of the problem, not
paper over it again...

> The flusher will piggy back more dirty pages for IO
> - it's more IO efficient
> - it helps clean more pages, a good number of them may sit in the same
>   LRU list that is being scanned.
> 
> To avoid memory allocations at page reclaim, a mempool is created.
> 
> Background/periodic works will quit automatically (as done in another
> patch), so as to clean the pages under reclaim ASAP. However for now the
> sync work can still block us for long time.

>  /*
> + * When flushing an inode page (for page reclaim), try to piggy back up to
> + * 4MB nearby pages for IO efficiency. These pages will have good opportunity
> + * to be in the same LRU list.
> + */
> +#define WRITE_AROUND_PAGES	MIN_WRITEBACK_PAGES

Regardless of the trigger, I think you're going too far in the other
direction here. If we have to do one IO to clean the page that the
VM wants, then it has to be done with as little latency as possible,
while still being large enough to maintain decent throughput.

With the above patch, for every single dirty page the VM wants
cleaned, we'll clean 4MB of pages around it. Ok, but once the VM has
tripped over pages on 25 different inodes, we've now got 100MB of
writeback work to chew through before we can get to the 26th page
the VM wanted cleaned.

At which point, we may as well just ignore what the VM wants and
continue to clean pages via the existing mechanisms, because the
latency for cleaning a specific page will be worse than if the VM had
just skipped it in the first place....

FWIW, XFS limited such clustering to 64 pages at a time to try to
balance the bandwidth vs completion latency problem.
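
To put rough numbers on that (illustrative only, assuming 4KB pages and
roughly 100MB/s of streaming writeback): 25 inodes x a 1024-page window
each is 25600 pages = 100MB, i.e. on the order of a second of IO queued
ahead of the 26th page, whereas 25 x a 64-page cluster is 1600 pages =
6.25MB, i.e. a few tens of milliseconds.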


> +/*
> + * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
> + * improve IO throughput. The nearby pages will have good chance to reside in
> + * the same LRU list that vmscan is working on, and even close to each other
> + * inside the LRU list in the common case of sequential read/write.
> + *
> + * ret > 0: success, found/reused a previous writeback work
> + * ret = 0: success, allocated/queued a new writeback work
> + * ret < 0: failed
> + */
> +long flush_inode_page(struct page *page, struct address_space *mapping)
> +{
> +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	struct inode *inode = mapping->host;
> +	pgoff_t offset = page->index;
> +	pgoff_t len = 0;
> +	struct wb_writeback_work *work;
> +	long ret = -ENOENT;
> +
> +	if (unlikely(!inode))
> +		goto out;
> +
> +	len = 1;
> +	spin_lock_bh(&bdi->wb_lock);
> +	list_for_each_entry_reverse(work, &bdi->work_list, list) {
> +		if (work->inode != inode)
> +			continue;
> +		if (extend_writeback_range(work, offset)) {
> +			ret = len;
> +			offset = work->offset;
> +			len = work->nr_pages;
> +			break;
> +		}
> +		if (len++ > 30)	/* do limited search */
> +			break;
> +	}
> +	spin_unlock_bh(&bdi->wb_lock);

I don't think this is a necessary or scalable optimisation. It won't
be useful when there are lots of dirty inodes and dirty pages are
tripped over in their hundreds or thousands - it'll just burn CPU
doing nothing, and serialise against other reclaim and writeback
work. It looks like a case of premature optimisation to me....

Anyway, if there's a page flush near to an existing piece of work the
IO elevator should merge them appropriately.

> +static long wb_flush_inode(struct bdi_writeback *wb,
> +			   struct wb_writeback_work *work)
> +{
> +	loff_t start = work->offset;
> +	loff_t end   = work->offset + work->nr_pages - 1;
> +	int wrote;
> +
> +	wrote = __filemap_fdatawrite_range(work->inode->i_mapping,
> +					   start << PAGE_CACHE_SHIFT,
> +					   end   << PAGE_CACHE_SHIFT,
> +					   WB_SYNC_NONE);
> +	iput(work->inode);
> +	return wrote;
> +}

Out of curiosity, before going down the complex route, did you try
just calling this directly and seeing if it solved the problem? i.e.

	igrab()
	get start/end
	unlock page
	__filemap_fdatawrite_range()
	iput()
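
Fleshed out a little, that direct path might look like the following
(a sketch only, untested; the 64-page window echoes the old XFS
clustering limit mentioned above rather than anything that has been
posted, and the reclaim-side locking details are glossed over):

static int reclaim_write_around(struct page *page,
				struct address_space *mapping)
{
	struct inode *inode = mapping->host;
	loff_t start, end;

	if (!inode || !igrab(inode))
		return -ENOENT;

	/* write a 64-page aligned window around the target page */
	start = (loff_t)round_down(page->index, 64) << PAGE_CACHE_SHIFT;
	end   = start + ((loff_t)64 << PAGE_CACHE_SHIFT) - 1;

	unlock_page(page);
	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
	iput(inode);
	return 0;
}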

I mean, much as I dislike the idea of writeback from the LRU, if all
we need to do is call through .writepages() to get decent IO from
reclaim (when it occurs), then why do we need to add this async
complexity to the generic writeback code to achieve the same end?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-04  3:25             ` Dave Chinner
@ 2011-07-06 15:12               ` Johannes Weiner
  -1 siblings, 0 replies; 100+ messages in thread
From: Johannes Weiner @ 2011-07-06 15:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-mm, Wu Fengguang, Mel Gorman, xfs

On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> > 
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > > 
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > > 
> > > As part of investigating the behaviour he found out that we're still
> > > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> > > pretty bad behaviour in general, but it also means we really can't
> > > just remove the writeback clustering in writepage given how much
> > > I/O is still done through that.
> > > 
> > > Any chance we could the writeback vs kswap behaviour sorted out a bit
> > > better finally?
> > 
> > I once tried this approach:
> > 
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> > 
> > It used a list structure that is not linearly scalable, however that
> > part should be independently improvable when necessary.
> 
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly.  Yes, you added
> some clustering, but I'm still don't think writing specific pages is
> the best solution.
> 
> > The real problem was, it seem to not very effective in my test runs.
> > I found many ->nr_pages works queued before the ->inode works, which
> > effectively makes the flusher working on more dispersed pages rather
> > than focusing on the dirty pages encountered in LRU reclaim.
> 
> But that's really just an implementation issue related to how you
> tried to solve the problem. That could be addressed.
> 
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO persepctive.
> 
> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback. IOWs, both are trying
> to operate in the same direction (oldest to youngest) for the same
> purpose.  The fundamental problem that occurs when memory reclaim
> starts writing pages back from the LRU is this:
> 
> 	- memory reclaim has run ahead of IO writeback -
> 
> The LRU usually looks like this:
> 
> 	oldest					youngest
> 	+---------------+---------------+--------------+
> 	clean		writeback	dirty
> 			^		^
> 			|		|
> 			|		Where flusher will next work from
> 			|		Where kswapd is working from
> 			|
> 			IO submitted by flusher, waiting on completion
> 
> 
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on. IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
> 
> > The $100 question is *why is it getting ahead of writeback*?

Unless you have a purely sequential writer, the LRU order is - at
least in theory - diverging away from the writeback order.

According to the reasoning behind generational garbage collection,
they should in fact be inverse to each other.  The oldest pages still
in use are the most likely to be still needed in the future.

In practice we only make a generational distinction between used-once
and used-many, which manifests in the inactive and the active list.
But still, when reclaim starts off with a localized writer, the oldest
pages are likely to be at the end of the active list.

So pages from the inactive list are likely to be written in the right
order, but at the same time active pages are even older, thus written
before them.  Memory reclaim starts with the inactive pages, and this
is why it gets ahead.

Then there is also the case where a fast writer pushes dirty pages to
the end of the LRU list, of course, but you already said that this is
not applicable to your workload.

My point is that I don't think it's unexpected that dirty pages come
off the inactive list in practice.  It just sucks how we handle them.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
@ 2011-07-06 15:12               ` Johannes Weiner
  0 siblings, 0 replies; 100+ messages in thread
From: Johannes Weiner @ 2011-07-06 15:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, xfs, linux-mm

On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> > 
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > > 
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > > 
> > > As part of investigating the behaviour he found out that we're still
> > > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> > > pretty bad behaviour in general, but it also means we really can't
> > > just remove the writeback clustering in writepage given how much
> > > I/O is still done through that.
> > > 
> > > Any chance we could the writeback vs kswap behaviour sorted out a bit
> > > better finally?
> > 
> > I once tried this approach:
> > 
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> > 
> > It used a list structure that is not linearly scalable, however that
> > part should be independently improvable when necessary.
> 
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly.  Yes, you added
> some clustering, but I still don't think writing specific pages is
> the best solution.
> 
> > The real problem was, it seem to not very effective in my test runs.
> > I found many ->nr_pages works queued before the ->inode works, which
> > effectively makes the flusher working on more dispersed pages rather
> > than focusing on the dirty pages encountered in LRU reclaim.
> 
> But that's really just an implementation issue related to how you
> tried to solve the problem. That could be addressed.
> 
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO persepctive.
> 
> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback. IOWs, both are trying
> to operate in the same direction (oldest to youngest) for the same
> purpose.  The fundamental problem that occurs when memory reclaim
> starts writing pages back from the LRU is this:
> 
> 	- memory reclaim has run ahead of IO writeback -
> 
> The LRU usually looks like this:
> 
> 	oldest					youngest
> 	+---------------+---------------+--------------+
> 	clean		writeback	dirty
> 			^		^
> 			|		|
> 			|		Where flusher will next work from
> 			|		Where kswapd is working from
> 			|
> 			IO submitted by flusher, waiting on completion
> 
> 
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on. IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
> 
> The $100 question is *why is it getting ahead of writeback*?

Unless you have a purely sequential writer, the LRU order is - at
least in theory - diverging away from the writeback order.

According to the reasoning behind generational garbage collection,
they should in fact be inverse to each other.  The oldest pages still
in use are the most likely to be still needed in the future.

In practice we only make a generational distinction between used-once
and used-many, which manifests in the inactive and the active list.
But still, when reclaim starts off with a localized writer, the oldest
pages are likely to be at the end of the active list.

So pages from the inactive list are likely to be written in the right
order, but at the same time active pages are even older, thus written
before them.  Memory reclaim starts with the inactive pages, and this
is why it gets ahead.

Then there is also the case where a fast writer pushes dirty pages to
the end of the LRU list, of course, but you already said that this is
not applicable to your workload.

My point is that I don't think it's unexpected that dirty pages come
off the inactive list in practice.  It just sucks how we handle them.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-06 15:12               ` Johannes Weiner
@ 2011-07-08  9:54                 ` Dave Chinner
  -1 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-08  9:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christoph Hellwig, linux-mm, Wu Fengguang, Mel Gorman, xfs

On Wed, Jul 06, 2011 at 05:12:29PM +0200, Johannes Weiner wrote:
> On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > We have to remember that memory reclaim is doing LRU reclaim and the
> > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > to operate in the same direction (oldest to youngest) for the same
> > purpose.  The fundamental problem that occurs when memory reclaim
> > starts writing pages back from the LRU is this:
> > 
> > 	- memory reclaim has run ahead of IO writeback -
> > 
> > The LRU usually looks like this:
> > 
> > 	oldest					youngest
> > 	+---------------+---------------+--------------+
> > 	clean		writeback	dirty
> > 			^		^
> > 			|		|
> > 			|		Where flusher will next work from
> > 			|		Where kswapd is working from
> > 			|
> > 			IO submitted by flusher, waiting on completion
> > 
> > 
> > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > got ahead of writeback without being throttled - it's passed over
> > all the pages currently under writeback and is trying to write back
> > pages that are *newer* than what writeback is working on. IOWs, it
> > starts trying to do the job of the flusher threads, and it does that
> > very badly.
> > 
> > The $100 question is *why is it getting ahead of writeback*?
> 
> Unless you have a purely sequential writer, the LRU order is - at
> least in theory - diverging away from the writeback order.

Which is the root cause of the IO collapse that writeback from the
LRU causes, yes?

> According to the reasoning behind generational garbage collection,
> they should in fact be inverse to each other.  The oldest pages still
> in use are the most likely to be still needed in the future.
> 
> In practice we only make a generational distinction between used-once
> and used-many, which manifests in the inactive and the active list.
> But still, when reclaim starts off with a localized writer, the oldest
> pages are likely to be at the end of the active list.

Yet the file pages on the active list are unlikely to be dirty -
overwrite-in-place, cache-hot workloads are pretty scarce in my
experience.  Hence writeback of dirty pages from the active LRU is
unlikely to be a problem.

That leaves all the use-once pages cycling through the inactive
list. The oldest pages on this list are the ones that get reclaimed,
and if we are getting lots of dirty pages here it seems pretty clear
that memory demand is mostly for pages being rapidly dirtied. In
which case, trying to speed up the rate at which they are cleaned by
issuing IO is only effective if there is no IO already in progress.

Who knows if IO is already in progress? The writeback subsystem....

> So pages from the inactive list are likely to be written in the right
> order, but at the same time active pages are even older, thus written
> before them.  Memory reclaim starts with the inactive pages, and this
> is why it gets ahead.

All right, if the design is such that you can't avoid having reclaim
write back dirty pages as it encounters them on the inactive LRU,
should the dirty pages even be on that LRU?

That is, dirty pages cannot be reclaimed immediately but they are
intertwined with pages that can be reclaimed immediately. We really
want to reclaim pages that can be reclaimed quickly while not
blocking on or continually having to skip over pages that cannot be
reclaimed.

So why not make a distinction between clean and dirty file pages on
the inactive list? That is, consider dirty pages to still be "in
use" and "owned" by the writeback subsystem. while pages are dirty
they are kept on a separate "dirty file page LRU" that memory
reclaim does not ever touch unless it runs out of clean pages on the
inactive list to reclaim. And then when it runs out of clean pages,
it can go find pages under writeback on the dirty list and block on
them before going back to reclaiming off the clean list....
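
As a rough sketch of that fallback order (every structure and function
name below is made up purely for illustration, this is not existing mm
code):

	static unsigned long shrink_inactive_file(struct myzone *zone,
						  unsigned long nr_to_reclaim)
	{
		unsigned long nr;

		/* clean inactive pages can be reclaimed immediately */
		nr = reclaim_clean_pages(&zone->inactive_clean_file,
					 nr_to_reclaim);
		if (nr >= nr_to_reclaim)
			return nr;

		/*
		 * Out of clean pages: throttle on IO the flusher has
		 * already submitted instead of issuing more from here.
		 */
		wait_for_dirty_list_writeback(&zone->inactive_dirty_file);
		return nr + reclaim_clean_pages(&zone->inactive_clean_file,
						nr_to_reclaim - nr);
	}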

And given that cgroups have their own LRUs for reclaim now, this
problem of dirty pages being written from the LRUs has a much larger
scope.  It's not just whether the global LRU reclaim is hitting
dirty pages, it's a per-cgroup problem and they are much more likely
to have low memory limits that lead to such problems. And
concurrently at that, too.  Writeback simply doesn't scale to having
multiple sources of random page IO being despatched concurrently.

> Then there is also the case where a fast writer pushes dirty pages to
> the end of the LRU list, of course, but you already said that this is
> not applicable to your workload.
> 
> My point is that I don't think it's unexpected that dirty pages come
> off the inactive list in practice.  It just sucks how we handle them.

Exactly what I've been saying.

And what I'm also trying to say is the way to fix the "we do shitty
IO on dirty pages" problem is *not to do IO*. That's -exactly- why
the IO-less write throttling is a significant improvement: we've
turned shitty IO into good IO by *waiting for IO* during throttling
rather than submitting IO.

Fundamentally, scaling to N IO waiters is far easier and more
efficient than scaling to N IO submitters. All I'm asking is that
you apply that same principle to memory reclaim, please.
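
A hedged sketch of what that principle looks like in the reclaim loop --
PageDirty(), PageWriteback() and wait_on_page_writeback() are the real
page interfaces, the policy and placement here are illustrative only:

	list_for_each_entry_safe(page, next, &page_list, lru) {
		if (PageDirty(page))
			continue;	/* leave IO submission to the flusher */
		if (PageWriteback(page)) {
			/* throttle by waiting on in-flight IO */
			wait_on_page_writeback(page);
		}
		/* page is clean at this point, reclaim it ... */
	}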

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
@ 2011-07-08  9:54                 ` Dave Chinner
  0 siblings, 0 replies; 100+ messages in thread
From: Dave Chinner @ 2011-07-08  9:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, xfs, linux-mm

On Wed, Jul 06, 2011 at 05:12:29PM +0200, Johannes Weiner wrote:
> On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > We have to remember that memory reclaim is doing LRU reclaim and the
> > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > to operate in the same direction (oldest to youngest) for the same
> > purpose.  The fundamental problem that occurs when memory reclaim
> > starts writing pages back from the LRU is this:
> > 
> > 	- memory reclaim has run ahead of IO writeback -
> > 
> > The LRU usually looks like this:
> > 
> > 	oldest					youngest
> > 	+---------------+---------------+--------------+
> > 	clean		writeback	dirty
> > 			^		^
> > 			|		|
> > 			|		Where flusher will next work from
> > 			|		Where kswapd is working from
> > 			|
> > 			IO submitted by flusher, waiting on completion
> > 
> > 
> > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > got ahead of writeback without being throttled - it's passed over
> > all the pages currently under writeback and is trying to write back
> > pages that are *newer* than what writeback is working on. IOWs, it
> > starts trying to do the job of the flusher threads, and it does that
> > very badly.
> > 
> > The $100 question is *why is it getting ahead of writeback*?
> 
> Unless you have a purely sequential writer, the LRU order is - at
> least in theory - diverging away from the writeback order.

Which is the root cause of the IO collapse that writeback from the
LRU causes, yes?

> According to the reasoning behind generational garbage collection,
> they should in fact be inverse to each other.  The oldest pages still
> in use are the most likely to be still needed in the future.
> 
> In practice we only make a generational distinction between used-once
> and used-many, which manifests in the inactive and the active list.
> But still, when reclaim starts off with a localized writer, the oldest
> pages are likely to be at the end of the active list.

Yet the file pages on the active list are unlikely to be dirty -
overwrite-in-place, cache-hot workloads are pretty scarce in my
experience.  Hence writeback of dirty pages from the active LRU is
unlikely to be a problem.

That leaves all the use-once pages cycling through the inactive
list. The oldest pages on this list are the ones that get reclaimed,
and if we are getting lots of dirty pages here it seems pretty clear
that memory demand is mostly for pages being rapidly dirtied. In
which case, trying to speed up the rate at which they are cleaned by
issuing IO is only effective if there is no IO already in progress.

Who knows if IO is already in progress? The writeback subsystem....

> So pages from the inactive list are likely to be written in the right
> order, but at the same time active pages are even older, thus written
> before them.  Memory reclaim starts with the inactive pages, and this
> is why it gets ahead.

All right, if the design is such that you can't avoid having reclaim
write back dirty pages as it encounters them on the inactive LRU,
should the dirty pages even be on that LRU?

That is, dirty pages cannot be reclaimed immediately but they are
intertwined with pages that can be reclaimed immediately. We really
want to reclaim pages that can be reclaimed quickly while not
blocking on or continually having to skip over pages that cannot be
reclaimed.

So why not make a distinction between clean and dirty file pages on
the inactive list? That is, consider dirty pages to still be "in
use" and "owned" by the writeback subsystem. while pages are dirty
they are kept on a separate "dirty file page LRU" that memory
reclaim does not ever touch unless it runs out of clean pages on the
inactive list to reclaim. And then when it runs out of clean pages,
it can go find pages under writeback on the dirty list and block on
them before going back to reclaiming off the clean list....

And given that cgroups have their own LRUs for reclaim now, this
problem of dirty pages being written from the LRUs has a much larger
scope.  It's not just whether the global LRU reclaim is hitting
dirty pages, it's a per-cgroup problem and they are much more likely
to have low memory limits that lead to such problems. And
concurrently at that, too.  Writeback simply doesn't scale to having
multiple sources of random page IO being despatched concurrently.

> Then there is also the case where a fast writer pushes dirty pages to
> the end of the LRU list, of course, but you already said that this is
> not applicable to your workload.
> 
> My point is that I don't think it's unexpected that dirty pages come
> off the inactive list in practice.  It just sucks how we handle them.

Exactly what I've been saying.

And what I'm also trying to say is the way to fix the "we do shitty
IO on dirty pages" problem is *not to do IO*. That's -exactly- why
the IO-less write throttling is a significant improvement: we've
turned shitty IO into good IO by *waiting for IO* during throttling
rather than submitting IO.

Fundamentally, scaling to N IO waiters is far easier and more
efficient than scaling to N IO submitters. All I'm asking is that
you apply that same principle to memory reclaim, please.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-02  2:42             ` Dave Chinner
@ 2011-07-11 10:26               ` Christoph Hellwig
  -1 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-07-11 10:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: jack, xfs, Christoph Hellwig, linux-mm, Mel Gorman, Wu Fengguang,
	Johannes Weiner

On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> To tell the truth, I don't think anyone really cares how ext3
> performs these days. XFS seems to be the filesystem that brings out
> all the bad behaviour in the mm subsystem....

Maybe that's because XFS actually plays by the rules?

btrfs simply rejects all attempts from kswapd to write back, as it
has the following check:

	if (current->flags & PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

while XFS tries to play nice and allow writeback from kswapd:

	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)
		goto redirty;

ext4 can't perform delalloc conversions from writepage:

	if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
			      ext4_bh_delay_or_unwritten)) {
		/*
		 * We don't want to do block allocation, so redirty
		 * the page and return.  We may reach here when we do
		 * a journal commit via journal_submit_inode_data_buffers.
		 * We can also reach here via shrink_page_list
		 */
		goto redirty_pages;
	}

so any normal workloads that don't involve overwrites will never get
any writeback from kswapd.

This should tell us that the VM can live just fine without doing
writeback from kswapd, as otherwise all systems using btrfs or ext4
would have completely fallen over.

It also suggests we should have standardized helpers in the VFS to work
around the braindead VM behaviour.
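
Something like the following, perhaps -- just a sketch, the helper name
is invented, but redirty_page_for_writepage() and the PF_ flags are the
existing interfaces:

	/* reject writeback from direct reclaim, allow it from kswapd */
	static inline bool vfs_skip_reclaim_writeback(struct page *page,
					struct writeback_control *wbc)
	{
		if ((current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC) {
			redirty_page_for_writepage(wbc, page);
			unlock_page(page);
			return true;
		}
		return false;
	}

with every ->writepage starting off with
"if (vfs_skip_reclaim_writeback(page, wbc)) return 0;".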

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
@ 2011-07-11 10:26               ` Christoph Hellwig
  0 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-07-11 10:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mel Gorman, Christoph Hellwig, Johannes Weiner, Wu Fengguang,
	xfs, jack, linux-mm

On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> To tell the truth, I don't think anyone really cares how ext3
> performs these days. XFS seems to be the filesystem that brings out
> all the bad behaviour in the mm subsystem....

Maybe that's because XFS actually plays by the rules?

btrfs simply rejects all attempts from kswapd to write back, as it
has the following check:

	if (current->flags & PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

while XFS tries to play nice and allow writeback from kswapd:

	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)
		goto redirty;

ext4 can't perform delalloc conversions from writepage:

	if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
			      ext4_bh_delay_or_unwritten)) {
		/*
		 * We don't want to do block allocation, so redirty
		 * the page and return.  We may reach here when we do
		 * a journal commit via journal_submit_inode_data_buffers.
		 * We can also reach here via shrink_page_list
		 */
		goto redirty_pages;
	}

so any normal workloads that don't involve overwrites will never get
any writeback from kswapd.

This should tell us that the VM can live just fine without doing
writeback from kswapd, as otherwise all systems using btrfs or ext4
would have completely fallen over.

It also suggests we should have standardized helpers in the VFS to work
around the braindead VM behaviour.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-05 14:34               ` Mel Gorman
@ 2011-07-11 11:10                 ` Christoph Hellwig
  -1 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-07-11 11:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: xfs, Christoph Hellwig, linux-mm, Wu Fengguang, Johannes Weiner

On Tue, Jul 05, 2011 at 03:34:10PM +0100, Mel Gorman wrote:
> > However, what I'm questioning is whether we should even care what
> > page memory reclaim wants to write - it seems to make fundamentally
> > bad decisions from an IO persepctive.
> > 
> 
> It sucks from an IO perspective but from the perspective of the VM that
> needs memory to be free in a particular zone or node, it's a reasonable
> request.

It might appear reasonable, but it's not.

What the VM wants underneath is generally (1):

 - free N pages in zone Z

and it then goes on to free the pages one by one through kswapd,
which leads to freeing those N pages, but unless they were already
clean it will take a very long time to get there and bog down the whole
system.

So we need a better way to actually perform that underlying request.
Dave's suggestion of keeping different lists for clean vs dirty pages
in the VM and preferentially reclaiming the clean ones when under
zone pressure is a first step.  The second one will be to tell the
writeback threads to preferentially write back pages from a given
zone.  I'm actually not sure how to do that yet, as we could have
memory from different zones on a single inode.  Taking an inode that
has memory from the right zone and then writing that out will
probably work fine for different zones on 64-bit NUMA systems where
zones more or less equal nodes.  It probably won't work very well if
we need to free up memory in the various low memory zones, as those
will be spread over random inodes.

> It doesn't check how many pages are under writeback. Direct reclaim
> will check if the block device is congested but that is about
> it. Otherwise the expectation was the elevator would handle the
> merging of requests into a sensible pattern.

It can't.  The elevator has a relatively small window it can operate
on, and can never fix up a bad large scale writeback pattern. 

> Also, while filesystem
> pages are getting cleaned by flushs, that does not cover anonymous
> pages being written to swap.

At least for now we will have to keep kswapd writeback for swap.  It
is just as inefficient as on a filesystem, but given that people don't
rely on swap performance we can probably live with it.  Note that we
can't simply use background flushing for swap, as that would mean
we'd need backing space allocated for all main memory, which isn't
very practical with today's memory sizes.  The whole concept of demand
paging anonymous memory leads to pretty bad I/O patterns.  If you're
actually making heavy use of it, the old-school unix full process paging
would be a lot faster.

(1) modulo things like compaction

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
@ 2011-07-11 11:10                 ` Christoph Hellwig
  0 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-07-11 11:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Johannes Weiner,
	xfs, linux-mm

On Tue, Jul 05, 2011 at 03:34:10PM +0100, Mel Gorman wrote:
> > However, what I'm questioning is whether we should even care what
> > page memory reclaim wants to write - it seems to make fundamentally
> > bad decisions from an IO persepctive.
> > 
> 
> It sucks from an IO perspective but from the perspective of the VM that
> needs memory to be free in a particular zone or node, it's a reasonable
> request.

It might appear reasonable, but it's not.

What the VM wants underneath is generally (1):

 - free N pages in zone Z

and it then goes on to free the pages one by one through kswapd,
which leads to freeing those N pages, but unless they were already
clean it will take a very long time to get there and bog down the whole
system.

So we need a better way to actually perform that underlying request.
Dave's suggestion of keeping different lists for clean vs dirty pages
in the VM and preferentially reclaiming the clean ones when under
zone pressure is a first step.  The second one will be to tell the
writeback threads to preferentially write back pages from a given
zone.  I'm actually not sure how to do that yet, as we could have
memory from different zones on a single inode.  Taking an inode that
has memory from the right zone and then writing that out will
probably work fine for different zones on 64-bit NUMA systems where
zones more or less equal nodes.  It probably won't work very well if
we need to free up memory in the various low memory zones, as those
will be spread over random inodes.

> It doesn't check how many pages are under writeback. Direct reclaim
> will check if the block device is congested but that is about
> it. Otherwise the expectation was the elevator would handle the
> merging of requests into a sensible pattern.

It can't.  The elevator has a relatively small window it can operate
on, and can never fix up a bad large scale writeback pattern. 

> Also, while filesystem
> pages are getting cleaned by flushs, that does not cover anonymous
> pages being written to swap.

At least for now we will have to keep kswapd writeback for swap.  It
is just as inefficient as on a filesystem, but given that people don't
rely on swap performance we can probably live with it.  Note that we
can't simply use background flushing for swap, as that would mean
we'd need backing space allocated for all main memory, which isn't
very practical with today's memory sizes.  The whole concept of demand
paging anonymous memory leads to pretty bad I/O patterns.  If you're
actually making heavy use of it, the old-school unix full process paging
would be a lot faster.

(1) modulo things like compaction

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-08  9:54                 ` Dave Chinner
@ 2011-07-11 17:20                   ` Johannes Weiner
  -1 siblings, 0 replies; 100+ messages in thread
From: Johannes Weiner @ 2011-07-11 17:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rik van Riel, xfs, Christoph Hellwig, linux-mm, Mel Gorman, Wu Fengguang

On Fri, Jul 08, 2011 at 07:54:56PM +1000, Dave Chinner wrote:
> On Wed, Jul 06, 2011 at 05:12:29PM +0200, Johannes Weiner wrote:
> > On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > > We have to remember that memory reclaim is doing LRU reclaim and the
> > > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > > to operate in the same direction (oldest to youngest) for the same
> > > purpose.  The fundamental problem that occurs when memory reclaim
> > > starts writing pages back from the LRU is this:
> > > 
> > > 	- memory reclaim has run ahead of IO writeback -
> > > 
> > > The LRU usually looks like this:
> > > 
> > > 	oldest					youngest
> > > 	+---------------+---------------+--------------+
> > > 	clean		writeback	dirty
> > > 			^		^
> > > 			|		|
> > > 			|		Where flusher will next work from
> > > 			|		Where kswapd is working from
> > > 			|
> > > 			IO submitted by flusher, waiting on completion
> > > 
> > > 
> > > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > > got ahead of writeback without being throttled - it's passed over
> > > all the pages currently under writeback and is trying to write back
> > > pages that are *newer* than what writeback is working on. IOWs, it
> > > starts trying to do the job of the flusher threads, and it does that
> > > very badly.
> > > 
> > > The $100 question is *why is it getting ahead of writeback*?
> > 
> > Unless you have a purely sequential writer, the LRU order is - at
> > least in theory - diverging away from the writeback order.
> 
> Which is the root cause of the IO collapse that writeback from the
> LRU causes, yes?
> 
> > According to the reasoning behind generational garbage collection,
> > they should in fact be inverse to each other.  The oldest pages still
> > in use are the most likely to be still needed in the future.
> > 
> > In practice we only make a generational distinction between used-once
> > and used-many, which manifests in the inactive and the active list.
> > But still, when reclaim starts off with a localized writer, the oldest
> > pages are likely to be at the end of the active list.
> 
> Yet the file pages on the active list are unlikely to be dirty -
> overwrite-in-place cache hot workloads are pretty scarce in my
> experience. hence writeback of dirty pages from the active LRU is
> unlikely to be a problem.

Just to clarify, I looked at this too much from the reclaim POV, where
use-once applies to full pages, not bytes.

Even if you do not overwrite the same bytes over and over again,
issuing two subsequent write()s that end up against the same page will
have it activated.

Are your workloads writing in perfectly page-aligned chunks?
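
For example (hypothetical userspace snippet, assuming 4k pages and an
already open fd being appended to):

	char buf[2048] = "";

	write(fd, buf, sizeof(buf));	/* first half of a page: referenced */
	write(fd, buf, sizeof(buf));	/* same page again: activated */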

This effect may build up slowly, but every page that is written from
the active list makes room for a dirty page on the inactive list wrt
the dirty limit.  I.e. without the active pages, you have 10-20% dirty
pages at the head of the inactive list (default dirty ratio), or an
80-90% clean tail, and for every page cleaned, a new dirty page can
appear at the inactive head.

But taking the active list into account, some of these clean pages are
taken away from the headstart the flusher has over the reclaimer; they
sit on the active list.  For every page cleaned, a new dirty page can
appear at the inactive head, plus a few deactivated clean pages.

Now, the active list is not scanned anymore until it is bigger than
the inactive list, giving the flushers plenty of time to clean the
pages on it and let them accumulate even while memory pressure is
already occurring.  For every page cleaned, a new dirty page can
appear at the inactive head, plus a LOT of deactivated clean pages.

So when memory needs to be reclaimed, the LRU lists in those three
scenarios look like this:

	inactive-only: [CCCCCCCCDD][]

	active-small:  [CCCCCCDD][CC]

	active-huge:   [CCCDD][CCCCC]

where the third scenario is the most likely for the reclaimer to run
into dirty pages.

I CC'd Rik for reclaim-wizardry.  But if I am not completely off with
this, there is a chance that the change that let the active list grow
unscanned may actually have contributed to this single-page writing
problem becoming worse?

commit 56e49d218890f49b0057710a4b6fef31f5ffbfec
Author: Rik van Riel <riel@redhat.com>
Date:   Tue Jun 16 15:32:28 2009 -0700

    vmscan: evict use-once pages first
    
    When the file LRU lists are dominated by streaming IO pages, evict those
    pages first, before considering evicting other pages.
    
    This should be safe from deadlocks or performance problems
    because only three things can happen to an inactive file page:
    
    1) referenced twice and promoted to the active list
    2) evicted by the pageout code
    3) under IO, after which it will get evicted or promoted
    
    The pages freed in this way can either be reused for streaming IO, or
    allocated for something else.  If the pages are used for streaming IO,
    this pageout pattern continues.  Otherwise, we will fall back to the
    normal pageout pattern.
    
    Signed-off-by: Rik van Riel <riel@redhat.com>
    Reported-by: Elladan <elladan@eskimo.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
@ 2011-07-11 17:20                   ` Johannes Weiner
  0 siblings, 0 replies; 100+ messages in thread
From: Johannes Weiner @ 2011-07-11 17:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, Rik van Riel, xfs, linux-mm

On Fri, Jul 08, 2011 at 07:54:56PM +1000, Dave Chinner wrote:
> On Wed, Jul 06, 2011 at 05:12:29PM +0200, Johannes Weiner wrote:
> > On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > > We have to remember that memory reclaim is doing LRU reclaim and the
> > > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > > to operate in the same direction (oldest to youngest) for the same
> > > purpose.  The fundamental problem that occurs when memory reclaim
> > > starts writing pages back from the LRU is this:
> > > 
> > > 	- memory reclaim has run ahead of IO writeback -
> > > 
> > > The LRU usually looks like this:
> > > 
> > > 	oldest					youngest
> > > 	+---------------+---------------+--------------+
> > > 	clean		writeback	dirty
> > > 			^		^
> > > 			|		|
> > > 			|		Where flusher will next work from
> > > 			|		Where kswapd is working from
> > > 			|
> > > 			IO submitted by flusher, waiting on completion
> > > 
> > > 
> > > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > > got ahead of writeback without being throttled - it's passed over
> > > all the pages currently under writeback and is trying to write back
> > > pages that are *newer* than what writeback is working on. IOWs, it
> > > starts trying to do the job of the flusher threads, and it does that
> > > very badly.
> > > 
> > > The $100 question is *why is it getting ahead of writeback*?
> > 
> > Unless you have a purely sequential writer, the LRU order is - at
> > least in theory - diverging away from the writeback order.
> 
> Which is the root cause of the IO collapse that writeback from the
> LRU causes, yes?
> 
> > According to the reasoning behind generational garbage collection,
> > they should in fact be inverse to each other.  The oldest pages still
> > in use are the most likely to be still needed in the future.
> > 
> > In practice we only make a generational distinction between used-once
> > and used-many, which manifests in the inactive and the active list.
> > But still, when reclaim starts off with a localized writer, the oldest
> > pages are likely to be at the end of the active list.
> 
> Yet the file pages on the active list are unlikely to be dirty -
> overwrite-in-place cache hot workloads are pretty scarce in my
> experience. hence writeback of dirty pages from the active LRU is
> unlikely to be a problem.

Just to clarify, I looked at this too much from the reclaim POV, where
use-once applies to full pages, not bytes.

Even if you do not overwrite the same bytes over and over again,
issuing two subsequent write()s that end up against the same page will
have it activated.

Are your workloads writing in perfectly page-aligned chunks?

This effect may build up slowly, but every page that is written from
the active list makes room for a dirty page on the inactive list wrt
the dirty limit.  I.e. without the active pages, you have 10-20% dirty
pages at the head of the inactive list (default dirty ratio), or an
80-90% clean tail, and for every page cleaned, a new dirty page can
appear at the inactive head.

But taking the active list into account, some of these clean pages are
taken away from the headstart the flusher has over the reclaimer; they
sit on the active list.  For every page cleaned, a new dirty page can
appear at the inactive head, plus a few deactivated clean pages.

Now, the active list is not scanned anymore until it is bigger than
the inactive list, giving the flushers plenty of time to clean the
pages on it and let them accumulate even while memory pressure is
already occurring.  For every page cleaned, a new dirty page can
appear at the inactive head, plus a LOT of deactivated clean pages.

So when memory needs to be reclaimed, the LRU lists in those three
scenarios look like this:

	inactive-only: [CCCCCCCCDD][]

	active-small:  [CCCCCCDD][CC]

	active-huge:   [CCCDD][CCCCC]

where the third scenario is the most likely for the reclaimer to run
into dirty pages.

I CC'd Rik for reclaim-wizardry.  But if I am not completely off with
this, there is a chance that the change that let the active list grow
unscanned may actually have contributed to this single-page writing
problem becoming worse?

commit 56e49d218890f49b0057710a4b6fef31f5ffbfec
Author: Rik van Riel <riel@redhat.com>
Date:   Tue Jun 16 15:32:28 2009 -0700

    vmscan: evict use-once pages first
    
    When the file LRU lists are dominated by streaming IO pages, evict those
    pages first, before considering evicting other pages.
    
    This should be safe from deadlocks or performance problems
    because only three things can happen to an inactive file page:
    
    1) referenced twice and promoted to the active list
    2) evicted by the pageout code
    3) under IO, after which it will get evicted or promoted
    
    The pages freed in this way can either be reused for streaming IO, or
    allocated for something else.  If the pages are used for streaming IO,
    this pageout pattern continues.  Otherwise, we will fall back to the
    normal pageout pattern.
    
    Signed-off-by: Rik van Riel <riel@redhat.com>
    Reported-by: Elladan <elladan@eskimo.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-11 17:20                   ` Johannes Weiner
@ 2011-07-11 17:24                     ` Christoph Hellwig
  -1 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-07-11 17:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, xfs, Christoph Hellwig, linux-mm, Mel Gorman, Wu Fengguang

On Mon, Jul 11, 2011 at 07:20:50PM +0200, Johannes Weiner wrote:
> > Yet the file pages on the active list are unlikely to be dirty -
> > overwrite-in-place cache hot workloads are pretty scarce in my
> > experience. hence writeback of dirty pages from the active LRU is
> > unlikely to be a problem.
> 
> Just to clarify, I looked at this too much from the reclaim POV, where
> use-once applies to full pages, not bytes.
> 
> Even if you do not overwrite the same bytes over and over again,
> issuing two subsequent write()s that end up against the same page will
> have it activated.
> 
> Are your workloads writing in perfectly page-aligned chunks?

Many workloads do, given that we already tell them our preferred
I/O size through struct stat, which is always the page size or larger.
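
E.g. a well-behaved application can size its buffers straight from
stat (sketch only, fd is assumed to be an already open file):

	struct stat st;

	if (fstat(fd, &st) == 0) {
		size_t bufsize = st.st_blksize;	/* preferred I/O size */
		/* ... issue bufsize-sized, aligned writes ... */
	}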

That won't help with workloads having to write in small chunk sizes.
The performance-critical ones using small chunk sizes usually use
O_(D)SYNC, so pages will be clean after the write returns to userspace.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
@ 2011-07-11 17:24                     ` Christoph Hellwig
  0 siblings, 0 replies; 100+ messages in thread
From: Christoph Hellwig @ 2011-07-11 17:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Rik van Riel, xfs, linux-mm

On Mon, Jul 11, 2011 at 07:20:50PM +0200, Johannes Weiner wrote:
> > Yet the file pages on the active list are unlikely to be dirty -
> > overwrite-in-place cache hot workloads are pretty scarce in my
> > experience. hence writeback of dirty pages from the active LRU is
> > unlikely to be a problem.
> 
> Just to clarify, I looked at this too much from the reclaim POV, where
> use-once applies to full pages, not bytes.
> 
> Even if you do not overwrite the same bytes over and over again,
> issuing two subsequent write()s that end up against the same page will
> have it activated.
> 
> Are your workloads writing in perfectly page-aligned chunks?

Many workloads do, given that we already tell them our preferred
I/O size through struct stat, which is always the page size or larger.

That won't help with workloads having to write in small chunk sizes.
The performance-critical ones using small chunk sizes usually use
O_(D)SYNC, so pages will be clean after the write returns to userspace.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-11 17:20                   ` Johannes Weiner
@ 2011-07-11 19:09                     ` Rik van Riel
  -1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-07-11 19:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: xfs, Christoph Hellwig, linux-mm, Mel Gorman, Wu Fengguang

On 07/11/2011 01:20 PM, Johannes Weiner wrote:

> I CC'd Rik for reclaim-wizardry.  But if I am not completely off with
> this, there is a chance that the change that let the active list grow
> unscanned may actually have contributed to this single-page writing
> problem becoming worse?

Yes, the patch probably contributed.

However, the patch does help protect the working set in
the page cache from streaming IO, so on balance I believe
we need to keep this change.

What it changes is that the size of the inactive file list
can no longer grow unbounded, keeping it a little smaller
than it could have grown in the past.

> commit 56e49d218890f49b0057710a4b6fef31f5ffbfec
> Author: Rik van Riel<riel@redhat.com>
> Date:   Tue Jun 16 15:32:28 2009 -0700
>
>      vmscan: evict use-once pages first
>
>      When the file LRU lists are dominated by streaming IO pages, evict those
>      pages first, before considering evicting other pages.
>
>      This should be safe from deadlocks or performance problems
>      because only three things can happen to an inactive file page:
>
>      1) referenced twice and promoted to the active list
>      2) evicted by the pageout code
>      3) under IO, after which it will get evicted or promoted
>
>      The pages freed in this way can either be reused for streaming IO, or
>      allocated for something else.  If the pages are used for streaming IO,
>      this pageout pattern continues.  Otherwise, we will fall back to the
>      normal pageout pattern.
>
>      Signed-off-by: Rik van Riel<riel@redhat.com>
>      Reported-by: Elladan<elladan@eskimo.com>
>      Cc: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
>      Cc: Peter Zijlstra<peterz@infradead.org>
>      Cc: Lee Schermerhorn<lee.schermerhorn@hp.com>
>      Acked-by: Johannes Weiner<hannes@cmpxchg.org>
>      Signed-off-by: Andrew Morton<akpm@linux-foundation.org>
>      Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org>


-- 
All rights reversed

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
@ 2011-07-11 19:09                     ` Rik van Riel
  0 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-07-11 19:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman, xfs, linux-mm

On 07/11/2011 01:20 PM, Johannes Weiner wrote:

> I CC'd Rik for reclaim-wizardry.  But if I am not completely off with
> this, there is a chance that the change that let the active list grow
> unscanned may actually have contributed to this single-page writing
> problem becoming worse?

Yes, the patch probably contributed.

However, the patch does help protect the working set in
the page cache from streaming IO, so on balance I believe
we need to keep this change.

What it changes is that the size of the inactive file list
can no longer grow unbounded, keeping it a little smaller
than it could have grown in the past.

> commit 56e49d218890f49b0057710a4b6fef31f5ffbfec
> Author: Rik van Riel<riel@redhat.com>
> Date:   Tue Jun 16 15:32:28 2009 -0700
>
>      vmscan: evict use-once pages first
>
>      When the file LRU lists are dominated by streaming IO pages, evict those
>      pages first, before considering evicting other pages.
>
>      This should be safe from deadlocks or performance problems
>      because only three things can happen to an inactive file page:
>
>      1) referenced twice and promoted to the active list
>      2) evicted by the pageout code
>      3) under IO, after which it will get evicted or promoted
>
>      The pages freed in this way can either be reused for streaming IO, or
>      allocated for something else.  If the pages are used for streaming IO,
>      this pageout pattern continues.  Otherwise, we will fall back to the
>      normal pageout pattern.
>
>      Signed-off-by: Rik van Riel<riel@redhat.com>
>      Reported-by: Elladan<elladan@eskimo.com>
>      Cc: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
>      Cc: Peter Zijlstra<peterz@infradead.org>
>      Cc: Lee Schermerhorn<lee.schermerhorn@hp.com>
>      Acked-by: Johannes Weiner<hannes@cmpxchg.org>
>      Signed-off-by: Andrew Morton<akpm@linux-foundation.org>
>      Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org>


-- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 100+ messages in thread

end of thread, other threads:[~2011-07-11 19:09 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
2011-06-29 14:01 ` [PATCH 01/27] xfs: PF_FSTRANS should never be set in ->writepage Christoph Hellwig
2011-06-30  1:34   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 02/27] xfs: remove the unused ilock_nowait codepath in writepage Christoph Hellwig
2011-06-30  0:15   ` Dave Chinner
2011-06-30  1:26     ` Dave Chinner
2011-06-30  6:55     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 03/27] xfs: use write_cache_pages for writeback clustering Christoph Hellwig
2011-06-30  2:00   ` Dave Chinner
2011-06-30  2:48     ` Dave Chinner
2011-06-30  6:57     ` Christoph Hellwig
2011-07-01  2:22   ` Dave Chinner
2011-07-01  4:18     ` Dave Chinner
2011-07-01  8:59       ` Christoph Hellwig
2011-07-01  9:20         ` Dave Chinner
2011-07-01  9:33       ` Christoph Hellwig
2011-07-01  9:33         ` Christoph Hellwig
2011-07-01 14:59         ` Mel Gorman
2011-07-01 14:59           ` Mel Gorman
2011-07-01 15:15           ` Christoph Hellwig
2011-07-01 15:15             ` Christoph Hellwig
2011-07-02  2:42           ` Dave Chinner
2011-07-02  2:42             ` Dave Chinner
2011-07-05 14:10             ` Mel Gorman
2011-07-05 14:10               ` Mel Gorman
2011-07-05 15:55               ` Dave Chinner
2011-07-05 15:55                 ` Dave Chinner
2011-07-11 10:26             ` Christoph Hellwig
2011-07-11 10:26               ` Christoph Hellwig
2011-07-01 15:41         ` Wu Fengguang
2011-07-01 15:41           ` Wu Fengguang
2011-07-04  3:25           ` Dave Chinner
2011-07-04  3:25             ` Dave Chinner
2011-07-05 14:34             ` Mel Gorman
2011-07-05 14:34               ` Mel Gorman
2011-07-06  1:23               ` Dave Chinner
2011-07-06  1:23                 ` Dave Chinner
2011-07-11 11:10               ` Christoph Hellwig
2011-07-11 11:10                 ` Christoph Hellwig
2011-07-06  4:53             ` Wu Fengguang
2011-07-06  4:53               ` Wu Fengguang
2011-07-06  6:47               ` Minchan Kim
2011-07-06  6:47                 ` Minchan Kim
2011-07-06  7:17               ` Dave Chinner
2011-07-06  7:17                 ` Dave Chinner
2011-07-06 15:12             ` Johannes Weiner
2011-07-06 15:12               ` Johannes Weiner
2011-07-08  9:54               ` Dave Chinner
2011-07-08  9:54                 ` Dave Chinner
2011-07-11 17:20                 ` Johannes Weiner
2011-07-11 17:20                   ` Johannes Weiner
2011-07-11 17:24                   ` Christoph Hellwig
2011-07-11 17:24                     ` Christoph Hellwig
2011-07-11 19:09                   ` Rik van Riel
2011-07-11 19:09                     ` Rik van Riel
2011-07-01  8:51     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 04/27] xfs: cleanup xfs_add_to_ioend Christoph Hellwig
2011-06-29 22:13   ` Alex Elder
2011-06-30  2:00   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 05/27] xfs: work around bogus gcc warning in xfs_allocbt_init_cursor Christoph Hellwig
2011-06-29 22:13   ` Alex Elder
2011-06-29 14:01 ` [PATCH 06/27] xfs: split xfs_setattr Christoph Hellwig
2011-06-29 22:13   ` Alex Elder
2011-06-30  7:03     ` Christoph Hellwig
2011-06-30 12:28       ` Alex Elder
2011-06-30  2:11   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 08/27] xfs: kill xfs_itruncate_start Christoph Hellwig
2011-06-29 22:13   ` Alex Elder
2011-06-29 14:01 ` [PATCH 09/27] xfs: split xfs_itruncate_finish Christoph Hellwig
2011-06-30  2:44   ` Dave Chinner
2011-06-30  7:18     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 10/27] xfs: improve sync behaviour in the fact of aggressive dirtying Christoph Hellwig
2011-06-30  2:52   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 11/27] xfs: fix filesystsem freeze race in xfs_trans_alloc Christoph Hellwig
2011-06-30  2:59   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 12/27] xfs: remove i_transp Christoph Hellwig
2011-06-30  3:00   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 13/27] xfs: factor out xfs_dir2_leaf_find_entry Christoph Hellwig
2011-06-30  6:11   ` Dave Chinner
2011-06-30  7:34     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 14/27] xfs: cleanup shortform directory inode number handling Christoph Hellwig
2011-06-30  6:35   ` Dave Chinner
2011-06-30  7:39     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 15/27] xfs: kill struct xfs_dir2_sf Christoph Hellwig
2011-06-30  7:04   ` Dave Chinner
2011-06-30  7:09     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 16/27] xfs: cleanup the defintion of struct xfs_dir2_sf_entry Christoph Hellwig
2011-06-29 14:01 ` [PATCH 17/27] xfs: avoid usage of struct xfs_dir2_block Christoph Hellwig
2011-06-29 14:01 ` [PATCH 18/27] xfs: kill " Christoph Hellwig
2011-06-29 14:01 ` [PATCH 19/27] xfs: avoid usage of struct xfs_dir2_data Christoph Hellwig
2011-06-29 14:01 ` [PATCH 20/27] xfs: kill " Christoph Hellwig
2011-06-29 14:01 ` [PATCH 21/27] xfs: cleanup the defintion of struct xfs_dir2_data_entry Christoph Hellwig
2011-06-29 14:01 ` [PATCH 22/27] xfs: cleanup struct xfs_dir2_leaf Christoph Hellwig
2011-06-29 14:01 ` [PATCH 23/27] xfs: remove the unused xfs_bufhash structure Christoph Hellwig
2011-06-29 14:01 ` [PATCH 24/27] xfs: clean up buffer locking helpers Christoph Hellwig
2011-06-29 14:01 ` [PATCH 25/27] xfs: return the buffer locked from xfs_buf_get_uncached Christoph Hellwig
2011-06-29 14:01 ` [PATCH 26/27] xfs: cleanup I/O-related buffer flags Christoph Hellwig
2011-06-29 14:01 ` [PATCH 27/27] xfs: avoid a few disk cache flushes Christoph Hellwig
2011-06-30  6:36 ` [PATCH 00/27] patch queue for Linux 3.1 Dave Chinner
2011-06-30  6:50   ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.