* xfs: outstanding patches for 2.6.39
@ 2011-03-23  6:14 Dave Chinner
  2011-03-23  6:14 ` [PATCH 1/6] xfs: preallocation transactions do not need to be synchronous Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-23  6:14 UTC (permalink / raw)
  To: xfs; +Cc: aelder

Hi Alex,

The following is the remaining series of patches that I have ready for
2.6.39. Most of them are bug fixes; only the prealloc transaction
change and the buffer cache changes are enhancements. The
xfs_trans_read_buf() change fixes a bug found due to vmap allocation
failures, and the shrinker registration changes fix a problem
reported via IRC by arekm.

Of course, the major patch in this series is the conversion of the
buffer cache to using kmalloc and get_free_page() directly rather
than using the page cache. This has many benefits and I haven't
found any regressions due to making that change yet. I understand if
you think it too risky for .39 at this stage, so I'll defer to your
judgement as to whether it is .40 material or not.

I haven't pushed these into a git tree branch yet - I can do that if
you want once I've got reviewed-by tags for them all.

Cheers,

Dave.

* [PATCH 1/6] xfs: preallocation transactions do not need to be synchronous
  2011-03-23  6:14 xfs: outstanding patches for 2.6.39 Dave Chinner
@ 2011-03-23  6:14 ` Dave Chinner
  2011-03-24 17:18   ` brian.foster
  2011-03-25 21:00   ` [PATCH 1/6] " Alex Elder
  2011-03-23  6:14 ` [PATCH 2/6] vmap: flush vmap aliases when mapping fails Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-23  6:14 UTC (permalink / raw)
  To: xfs; +Cc: aelder

From: Dave Chinner <dchinner@redhat.com>

Preallocation and hole punch transactions are currently synchronous
and this is causing performance problems in some cases. The
transactions don't need to be synchronous as we don't need to
guarantee that the preallocation is persistent on disk until an
fdatasync, fsync or sync operation occurs. Only if the file is
opened O_SYNC or O_DSYNC should the transaction be issued
synchronously.
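
As a rough userspace illustration of the intended semantics (a
hypothetical sketch, not part of the patch; error handling omitted):

	#include <fcntl.h>
	#include <unistd.h>

	static void prealloc_example(const char *path)
	{
		/* normal open: the preallocation may stay volatile... */
		int fd = open(path, O_WRONLY);

		posix_fallocate(fd, 0, 1 << 20);
		fdatasync(fd);			/* ...until an explicit sync */
		close(fd);

		/* O_DSYNC open: preallocation must be durable on return */
		fd = open(path, O_WRONLY | O_DSYNC);
		posix_fallocate(fd, 0, 1 << 20);
		close(fd);
	}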

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/linux-2.6/xfs_file.c  |    4 ++++
 fs/xfs/linux-2.6/xfs_ioctl.c |    4 ++++
 fs/xfs/xfs_vnodeops.c        |    3 ++-
 fs/xfs/xfs_vnodeops.h        |    1 +
 4 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index ae59865..baa2cb3 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -896,6 +896,7 @@ xfs_file_fallocate(
 	xfs_flock64_t	bf;
 	xfs_inode_t	*ip = XFS_I(inode);
 	int		cmd = XFS_IOC_RESVSP;
+	int		attr_flags = XFS_ATTR_NOLOCK;
 
 	if (mode & ~(FALLOC_FL_KEEP_SIZE |
 		     FALLOC_FL_PUNCH_HOLE |
@@ -922,6 +923,9 @@ xfs_file_fallocate(
 			goto out_unlock;
 	}
 
+	if (file->f_flags & O_DSYNC)
+		attr_flags |= XFS_ATTR_SYNC;
+
 	error = -xfs_change_file_space(ip, cmd, &bf, 0, XFS_ATTR_NOLOCK);
 	if (error)
 		goto out_unlock;
diff --git a/fs/xfs/linux-2.6/xfs_ioctl.c b/fs/xfs/linux-2.6/xfs_ioctl.c
index 0ca0e3c..acca2c5 100644
--- a/fs/xfs/linux-2.6/xfs_ioctl.c
+++ b/fs/xfs/linux-2.6/xfs_ioctl.c
@@ -624,6 +624,10 @@ xfs_ioc_space(
 
 	if (filp->f_flags & (O_NDELAY|O_NONBLOCK))
 		attr_flags |= XFS_ATTR_NONBLOCK;
+
+	if (filp->f_flags & O_DSYNC)
+		attr_flags |= XFS_ATTR_SYNC;
+
 	if (ioflags & IO_INVIS)
 		attr_flags |= XFS_ATTR_DMI;
 
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 37d8146..c48b421 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -2831,7 +2831,8 @@ xfs_change_file_space(
 		ip->i_d.di_flags &= ~XFS_DIFLAG_PREALLOC;
 
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-	xfs_trans_set_sync(tp);
+	if (attr_flags & XFS_ATTR_SYNC)
+		xfs_trans_set_sync(tp);
 
 	error = xfs_trans_commit(tp, 0);
 
diff --git a/fs/xfs/xfs_vnodeops.h b/fs/xfs/xfs_vnodeops.h
index f6702927..3bcd233 100644
--- a/fs/xfs/xfs_vnodeops.h
+++ b/fs/xfs/xfs_vnodeops.h
@@ -18,6 +18,7 @@ int xfs_setattr(struct xfs_inode *ip, struct iattr *vap, int flags);
 #define	XFS_ATTR_NONBLOCK	0x02	/* return EAGAIN if operation would block */
 #define XFS_ATTR_NOLOCK		0x04	/* Don't grab any conflicting locks */
 #define XFS_ATTR_NOACL		0x08	/* Don't call xfs_acl_chmod */
+#define XFS_ATTR_SYNC		0x10	/* synchronous operation required */
 
 int xfs_readlink(struct xfs_inode *ip, char *link);
 int xfs_release(struct xfs_inode *ip);
-- 
1.7.2.3

* [PATCH 2/6] vmap: flush vmap aliases when mapping fails
  2011-03-23  6:14 xfs: outstanding patches for 2.6.39 Dave Chinner
  2011-03-23  6:14 ` [PATCH 1/6] xfs: preallocation transactions do not need to be synchronous Dave Chinner
@ 2011-03-23  6:14 ` Dave Chinner
  2011-03-25 21:00   ` Alex Elder
  2011-03-23  6:14 ` [PATCH 3/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2011-03-23  6:14 UTC (permalink / raw)
  To: xfs; +Cc: aelder

From: Dave Chinner <dchinner@redhat.com>

On 32 bit systems, vmalloc space is limited and XFS can chew through
it quickly as the vmalloc space is lazily freed. This can result in
failure to map buffers even when there are apparently large amounts
of vmalloc space available. Hence, if we fail to map a buffer, purge
the aliases that have not yet been freed to hopefully free up enough
vmalloc space to allow a retry to succeed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/linux-2.6/xfs_buf.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 5cb230f..fe51e09 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -455,9 +455,17 @@ _xfs_buf_map_pages(
 		bp->b_addr = page_address(bp->b_pages[0]) + bp->b_offset;
 		bp->b_flags |= XBF_MAPPED;
 	} else if (flags & XBF_MAPPED) {
-		bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-					-1, PAGE_KERNEL);
-		if (unlikely(bp->b_addr == NULL))
+		int retried = 0;
+
+		do {
+			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
+						-1, PAGE_KERNEL);
+			if (bp->b_addr)
+				break;
+			vm_unmap_aliases();
+		} while (retried++ <= 1);
+
+		if (!bp->b_addr)
 			return -ENOMEM;
 		bp->b_addr += bp->b_offset;
 		bp->b_flags |= XBF_MAPPED;
-- 
1.7.2.3

* [PATCH 3/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush
  2011-03-23  6:14 xfs: outstanding patches for 2.6.39 Dave Chinner
  2011-03-23  6:14 ` [PATCH 1/6] xfs: preallocation transactions do not need to be synchronous Dave Chinner
  2011-03-23  6:14 ` [PATCH 2/6] vmap: flush vmap aliases when mapping fails Dave Chinner
@ 2011-03-23  6:14 ` Dave Chinner
  2011-03-25 21:00   ` Alex Elder
  2011-03-23  6:14 ` [PATCH 4/6] xfs: xfs_trans_read_buf() should return an error on failure Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2011-03-23  6:14 UTC (permalink / raw)
  To: xfs; +Cc: aelder

From: Dave Chinner <dchinner@redhat.com>

There is an ABBA deadlock between synchronous inode flushing in
xfs_reclaim_inode and xfs_ifree_cluster. xfs_ifree_cluster locks the
buffer, then takes inode ilocks, whilst synchronous reclaim takes
the ilock followed by the buffer lock in xfs_iflush().
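
Schematically, the two lock orders look like this (simplified
sketch, not actual code):

	xfs_ifree_cluster:             xfs_reclaim_inode -> xfs_iflush:
	    lock(cluster buffer)           ilock(inode)
	    ilock(inode)     <- waits      lock(cluster buffer)  <- waits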

To avoid this deadlock, separate the inode cluster buffer locking
semantics from the synchronous inode flush semantics, allowing
callers to attempt to lock the buffer but still issue synchronous IO
if they can get the buffer lock. This requires xfs_iflush() calls
that currently use non-blocking semantics to pass SYNC_TRYLOCK
rather than 0 as the flags parameter.

This allows xfs_reclaim_inode to avoid the deadlock on the buffer
lock and detect the failure so that it can drop the inode ilock and
restart the reclaim attempt on the inode. This allows
xfs_ifree_cluster to obtain the inode lock, mark the inode stale and
release it, and hence defuse the deadlock situation. It also has the
pleasant side effect of avoiding IO in xfs_reclaim_inode when it
next tries to reclaim the inode, as it is now marked stale.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/linux-2.6/xfs_super.c |    2 +-
 fs/xfs/linux-2.6/xfs_sync.c  |   30 +++++++++++++++++++++++++++---
 fs/xfs/xfs_inode.c           |    2 +-
 fs/xfs/xfs_inode_item.c      |    6 +++---
 4 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 818c4cf..8a70b2a 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1078,7 +1078,7 @@ xfs_fs_write_inode(
 			error = 0;
 			goto out_unlock;
 		}
-		error = xfs_iflush(ip, 0);
+		error = xfs_iflush(ip, SYNC_TRYLOCK);
 	}
 
  out_unlock:
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 6c10f1d..594cd82 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -761,8 +761,10 @@ xfs_reclaim_inode(
 	struct xfs_perag	*pag,
 	int			sync_mode)
 {
-	int	error = 0;
+	int	error;
 
+restart:
+	error = 0;
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
 		if (!(sync_mode & SYNC_WAIT))
@@ -788,9 +790,31 @@ xfs_reclaim_inode(
 	if (xfs_inode_clean(ip))
 		goto reclaim;
 
-	/* Now we have an inode that needs flushing */
-	error = xfs_iflush(ip, sync_mode);
+	/*
+	 * Now we have an inode that needs flushing.
+	 *
+	 * We do a nonblocking flush here even if we are doing a SYNC_WAIT
+	 * reclaim as we can deadlock with inode cluster removal.
+	 * xfs_ifree_cluster() can lock the inode buffer before it locks the
+	 * ip->i_lock, and we are doing the exact opposite here. As a result,
+	 * doing a blocking xfs_itobp() to get the cluster buffer will result
+	 * in an ABBA deadlock with xfs_ifree_cluster().
+	 *
+	 * As xfs_ifree_cluster() must gather all inodes that are active in the
+	 * cache to mark them stale, if we hit this case we don't actually want
+	 * to do IO here - we want the inode marked stale so we can simply
+	 * reclaim it. Hence if we get an EAGAIN error on a SYNC_WAIT flush,
+	 * just unlock the inode, back off and try again. Hopefully the next
+	 * pass through will see the stale flag set on the inode.
+	 */
+	error = xfs_iflush(ip, SYNC_TRYLOCK | sync_mode);
 	if (sync_mode & SYNC_WAIT) {
+		if (error == EAGAIN) {
+			xfs_iunlock(ip, XFS_ILOCK_EXCL);
+			/* backoff longer than in xfs_ifree_cluster */
+			delay(2);
+			goto restart;
+		}
 		xfs_iflock(ip);
 		goto reclaim;
 	}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index da871f5..742c833 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2835,7 +2835,7 @@ xfs_iflush(
 	 * Get the buffer containing the on-disk inode.
 	 */
 	error = xfs_itobp(mp, NULL, ip, &dip, &bp,
-				(flags & SYNC_WAIT) ? XBF_LOCK : XBF_TRYLOCK);
+				(flags & SYNC_TRYLOCK) ? XBF_TRYLOCK : XBF_LOCK);
 	if (error || !bp) {
 		xfs_ifunlock(ip);
 		return error;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index fd4f398..46cc401 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -760,11 +760,11 @@ xfs_inode_item_push(
 	 * Push the inode to it's backing buffer. This will not remove the
 	 * inode from the AIL - a further push will be required to trigger a
 	 * buffer push. However, this allows all the dirty inodes to be pushed
-	 * to the buffer before it is pushed to disk. THe buffer IO completion
-	 * will pull th einode from the AIL, mark it clean and unlock the flush
+	 * to the buffer before it is pushed to disk. The buffer IO completion
+	 * will pull the inode from the AIL, mark it clean and unlock the flush
 	 * lock.
 	 */
-	(void) xfs_iflush(ip, 0);
+	(void) xfs_iflush(ip, SYNC_TRYLOCK);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 }
 
-- 
1.7.2.3

* [PATCH 4/6] xfs: xfs_trans_read_buf() should return an error on failure
  2011-03-23  6:14 xfs: outstanding patches for 2.6.39 Dave Chinner
                   ` (2 preceding siblings ...)
  2011-03-23  6:14 ` [PATCH 3/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush Dave Chinner
@ 2011-03-23  6:14 ` Dave Chinner
  2011-03-23 11:53   ` Christoph Hellwig
  2011-03-25 21:01   ` Alex Elder
  2011-03-23  6:14 ` [PATCH 5/6] xfs: register the inode cache shrinker before quotachecks Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-23  6:14 UTC (permalink / raw)
  To: xfs; +Cc: aelder

From: Dave Chinner <dchinner@redhat.com>

When inside a transaction and we fail to read a buffer,
xfs_trans_read_buf returns a null buffer pointer and no error.
xfs_do_da_buf() checks the error return, but not the buffer, and as
a result this read failure condition causes a panic when it attempts
to dereference the non-existent buffer.

Make xfs_trans_read_buf() return the same error for this situation
regardless of whether it is in a transaction or not. This means
callers no longer need to check both the error return and the
buffer before proceeding to use the buffer.
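
With this change the intended caller-side pattern becomes (a sketch;
it holds for blocking reads only, as trylock callers must still
handle a NULL buffer):

	error = xfs_trans_read_buf(mp, tp, target, blkno, len, flags, &bp);
	if (error)
		return error;
	/* bp is guaranteed non-NULL here for blocking reads */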

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans_buf.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 3bea661..03b3b7f 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -383,7 +383,8 @@ xfs_trans_read_buf(
 	bp = xfs_buf_read(target, blkno, len, flags | XBF_DONT_BLOCK);
 	if (bp == NULL) {
 		*bpp = NULL;
-		return 0;
+		return (flags & XBF_TRYLOCK) ?
+					0 : XFS_ERROR(ENOMEM);
 	}
 	if (XFS_BUF_GETERROR(bp) != 0) {
 	    XFS_BUF_SUPER_STALE(bp);
-- 
1.7.2.3

* [PATCH 5/6] xfs: register the inode cache shrinker before quotachecks
  2011-03-23  6:14 xfs: outstanding patches for 2.6.39 Dave Chinner
                   ` (3 preceding siblings ...)
  2011-03-23  6:14 ` [PATCH 4/6] xfs: xfs_trans_read_buf() should return an error on failure Dave Chinner
@ 2011-03-23  6:14 ` Dave Chinner
  2011-03-23 21:24   ` Arkadiusz Miskiewicz
  2011-03-25 21:01   ` Alex Elder
  2011-03-23  6:14 ` [PATCH 6/6] xfs: stop using the page cache to back the buffer cache Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-23  6:14 UTC (permalink / raw)
  To: xfs; +Cc: aelder

From: Dave Chinner <dchinner@redhat.com>

During mount, we can do a quotacheck that involves a bulkstat pass
on all inodes. If there are more inodes in the filesystem than can
be held in memory, we require the inode cache shrinker to run to
ensure that we don't run out of memory.

Unfortunately, the inode cache shrinker is not registered until we
get to the end of the superblock setup process, which is after a
quotacheck is run if it is needed. Hence we need to register the
inode cache shrinker earlier in the mount process so that we don't
OOM during mount. This requires that we also initialise the syncd
work before we register the shrinker, so we need to juggle that
around as well.

While there, make sure that we have set up the block sizes in the
VFS superblock correctly before the quotacheck is run so that any
inodes that are cached as a result of the quotacheck have their
block size fields set up correctly.

Cc: stable@kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_super.c |   34 ++++++++++++++++++++++++----------
 1 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 8a70b2a..1ba5c45 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1539,10 +1539,14 @@ xfs_fs_fill_super(
 	if (error)
 		goto out_free_sb;
 
-	error = xfs_mountfs(mp);
-	if (error)
-		goto out_filestream_unmount;
-
+	/*
+	 * we must configure the block size in the superblock before we run the
+	 * full mount process as the mount process can lookup and cache inodes.
+	 * For the same reason we must also initialise the syncd and register
+	 * the inode cache shrinker so that inodes can be reclaimed during
+	 * operations like a quotacheck that iterate all inodes in the
+	 * filesystem.
+	 */
 	sb->s_magic = XFS_SB_MAGIC;
 	sb->s_blocksize = mp->m_sb.sb_blocksize;
 	sb->s_blocksize_bits = ffs(sb->s_blocksize) - 1;
@@ -1550,6 +1554,16 @@ xfs_fs_fill_super(
 	sb->s_time_gran = 1;
 	set_posix_acl_flag(sb);
 
+	error = xfs_syncd_init(mp);
+	if (error)
+		goto out_filestream_unmount;
+
+	xfs_inode_shrinker_register(mp);
+
+	error = xfs_mountfs(mp);
+	if (error)
+		goto out_syncd_stop;
+
 	root = igrab(VFS_I(mp->m_rootip));
 	if (!root) {
 		error = ENOENT;
@@ -1565,14 +1579,11 @@ xfs_fs_fill_super(
 		goto fail_vnrele;
 	}
 
-	error = xfs_syncd_init(mp);
-	if (error)
-		goto fail_vnrele;
-
-	xfs_inode_shrinker_register(mp);
-
 	return 0;
 
+ out_syncd_stop:
+	xfs_inode_shrinker_unregister(mp);
+	xfs_syncd_stop(mp);
  out_filestream_unmount:
 	xfs_filestream_unmount(mp);
  out_free_sb:
@@ -1596,6 +1607,9 @@ xfs_fs_fill_super(
 	}
 
  fail_unmount:
+	xfs_inode_shrinker_unregister(mp);
+	xfs_syncd_stop(mp);
+
 	/*
 	 * Blow away any referenced inode in the filestreams cache.
 	 * This can and will cause log traffic as inodes go inactive
-- 
1.7.2.3

* [PATCH 6/6] xfs: stop using the page cache to back the buffer cache
  2011-03-23  6:14 xfs: outstanding patches for 2.6.39 Dave Chinner
                   ` (4 preceding siblings ...)
  2011-03-23  6:14 ` [PATCH 5/6] xfs: register the inode cache shrinker before quotachecks Dave Chinner
@ 2011-03-23  6:14 ` Dave Chinner
  2011-03-25 21:02   ` Alex Elder
  2011-03-23  7:01 ` xfs: outstanding patches for 2.6.39 Andi Kleen
  2011-03-23 11:18 ` Christoph Hellwig
  7 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2011-03-23  6:14 UTC (permalink / raw)
  To: xfs; +Cc: aelder

From: Dave Chinner <dchinner@redhat.com>

Now that the buffer cache has its own LRU, we do not need to use
the page cache to provide persistent caching and reclaim
infrastructure. Convert the buffer cache to use alloc_pages()
instead of the page cache. This removes all the overhead of page
cache management from setup and teardown of the buffers, as well as
the need to mark pages accessed as we find buffers in the buffer
cache.

By avoiding the page cache, we also remove the need to keep state in
the page_private(page) field for persistent storage across buffer
free/buffer rebuild, and so all that code can be removed. This also
fixes the long-standing problem of not having enough bits in the
page_private field to track all the state needed for a 512
sector/64k page setup: such a page has 128 sector-sized regions to
track, more than the bits available in a long.

It also removes the need for page locking during reads as the pages
are unique to the buffer and nobody else will be attempting to
access them.

Finally, it removes the buftarg address space lock as a point of
global contention on workloads that allocate and free buffers
quickly, such as when creating or removing large numbers of inodes
in parallel. This removes the 16TB limit on filesystem size on 32
bit machines, as the page index (32 bit) is no longer used for
lookups of metadata buffers - the buffer cache is now solely indexed
by disk address, which is stored in a 64 bit field in the buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |  337 ++++++++++----------------------------------
 fs/xfs/linux-2.6/xfs_buf.h |   40 +-----
 2 files changed, 81 insertions(+), 296 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index fe51e09..19b0769 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -94,75 +94,6 @@ xfs_buf_vmap_len(
 }
 
 /*
- *	Page Region interfaces.
- *
- *	For pages in filesystems where the blocksize is smaller than the
- *	pagesize, we use the page->private field (long) to hold a bitmap
- * 	of uptodate regions within the page.
- *
- *	Each such region is "bytes per page / bits per long" bytes long.
- *
- *	NBPPR == number-of-bytes-per-page-region
- *	BTOPR == bytes-to-page-region (rounded up)
- *	BTOPRT == bytes-to-page-region-truncated (rounded down)
- */
-#if (BITS_PER_LONG == 32)
-#define PRSHIFT		(PAGE_CACHE_SHIFT - 5)	/* (32 == 1<<5) */
-#elif (BITS_PER_LONG == 64)
-#define PRSHIFT		(PAGE_CACHE_SHIFT - 6)	/* (64 == 1<<6) */
-#else
-#error BITS_PER_LONG must be 32 or 64
-#endif
-#define NBPPR		(PAGE_CACHE_SIZE/BITS_PER_LONG)
-#define BTOPR(b)	(((unsigned int)(b) + (NBPPR - 1)) >> PRSHIFT)
-#define BTOPRT(b)	(((unsigned int)(b) >> PRSHIFT))
-
-STATIC unsigned long
-page_region_mask(
-	size_t		offset,
-	size_t		length)
-{
-	unsigned long	mask;
-	int		first, final;
-
-	first = BTOPR(offset);
-	final = BTOPRT(offset + length - 1);
-	first = min(first, final);
-
-	mask = ~0UL;
-	mask <<= BITS_PER_LONG - (final - first);
-	mask >>= BITS_PER_LONG - (final);
-
-	ASSERT(offset + length <= PAGE_CACHE_SIZE);
-	ASSERT((final - first) < BITS_PER_LONG && (final - first) >= 0);
-
-	return mask;
-}
-
-STATIC void
-set_page_region(
-	struct page	*page,
-	size_t		offset,
-	size_t		length)
-{
-	set_page_private(page,
-		page_private(page) | page_region_mask(offset, length));
-	if (page_private(page) == ~0UL)
-		SetPageUptodate(page);
-}
-
-STATIC int
-test_page_region(
-	struct page	*page,
-	size_t		offset,
-	size_t		length)
-{
-	unsigned long	mask = page_region_mask(offset, length);
-
-	return (mask && (page_private(page) & mask) == mask);
-}
-
-/*
  * xfs_buf_lru_add - add a buffer to the LRU.
  *
  * The LRU takes a new reference to the buffer so that it will only be freed
@@ -332,7 +263,7 @@ xfs_buf_free(
 
 	ASSERT(list_empty(&bp->b_lru));
 
-	if (bp->b_flags & (_XBF_PAGE_CACHE|_XBF_PAGES)) {
+	if (bp->b_flags & _XBF_PAGES) {
 		uint		i;
 
 		if (xfs_buf_is_vmapped(bp))
@@ -342,25 +273,22 @@ xfs_buf_free(
 		for (i = 0; i < bp->b_page_count; i++) {
 			struct page	*page = bp->b_pages[i];
 
-			if (bp->b_flags & _XBF_PAGE_CACHE)
-				ASSERT(!PagePrivate(page));
-			page_cache_release(page);
+			__free_page(page);
 		}
-	}
+	} else if (bp->b_flags & _XBF_KMEM)
+		kmem_free(bp->b_addr);
 	_xfs_buf_free_pages(bp);
 	xfs_buf_deallocate(bp);
 }
 
 /*
- *	Finds all pages for buffer in question and builds it's page list.
+ * Allocates all the pages for the buffer in question and builds its page list.
  */
 STATIC int
-_xfs_buf_lookup_pages(
+xfs_buf_allocate_memory(
 	xfs_buf_t		*bp,
 	uint			flags)
 {
-	struct address_space	*mapping = bp->b_target->bt_mapping;
-	size_t			blocksize = bp->b_target->bt_bsize;
 	size_t			size = bp->b_count_desired;
 	size_t			nbytes, offset;
 	gfp_t			gfp_mask = xb_to_gfp(flags);
@@ -369,29 +297,55 @@ _xfs_buf_lookup_pages(
 	xfs_off_t		end;
 	int			error;
 
+	/*
+	 * for buffers that are contained within a single page, just allocate
+	 * the memory from the heap - there's no need for the complexity of
+	 * page arrays to keep allocation down to order 0.
+	 */
+	if (bp->b_buffer_length < PAGE_SIZE) {
+		bp->b_addr = kmem_alloc(bp->b_buffer_length, xb_to_km(flags));
+		if (!bp->b_addr) {
+			/* low memory - use alloc_page loop instead */
+			goto use_alloc_page;
+		}
+
+		if (((unsigned long)(bp->b_addr + bp->b_buffer_length - 1) &
+								PAGE_MASK) !=
+		    ((unsigned long)bp->b_addr & PAGE_MASK)) {
+			/* b_addr spans two pages - use alloc_page instead */
+			kmem_free(bp->b_addr);
+			bp->b_addr = NULL;
+			goto use_alloc_page;
+		}
+		bp->b_offset = offset_in_page(bp->b_addr);
+		bp->b_pages = bp->b_page_array;
+		bp->b_pages[0] = virt_to_page(bp->b_addr);
+		bp->b_page_count = 1;
+		bp->b_flags |= XBF_MAPPED | _XBF_KMEM;
+		return 0;
+	}
+
+use_alloc_page:
 	end = bp->b_file_offset + bp->b_buffer_length;
 	page_count = xfs_buf_btoc(end) - xfs_buf_btoct(bp->b_file_offset);
-
 	error = _xfs_buf_get_pages(bp, page_count, flags);
 	if (unlikely(error))
 		return error;
-	bp->b_flags |= _XBF_PAGE_CACHE;
 
 	offset = bp->b_offset;
-	first = bp->b_file_offset >> PAGE_CACHE_SHIFT;
+	first = bp->b_file_offset >> PAGE_SHIFT;
+	bp->b_flags |= _XBF_PAGES;
 
 	for (i = 0; i < bp->b_page_count; i++) {
 		struct page	*page;
 		uint		retries = 0;
-
-	      retry:
-		page = find_or_create_page(mapping, first + i, gfp_mask);
+retry:
+		page = alloc_page(gfp_mask);
 		if (unlikely(page == NULL)) {
 			if (flags & XBF_READ_AHEAD) {
 				bp->b_page_count = i;
-				for (i = 0; i < bp->b_page_count; i++)
-					unlock_page(bp->b_pages[i]);
-				return -ENOMEM;
+				error = ENOMEM;
+				goto out_free_pages;
 			}
 
 			/*
@@ -412,33 +366,16 @@ _xfs_buf_lookup_pages(
 
 		XFS_STATS_INC(xb_page_found);
 
-		nbytes = min_t(size_t, size, PAGE_CACHE_SIZE - offset);
+		nbytes = min_t(size_t, size, PAGE_SIZE - offset);
 		size -= nbytes;
-
-		ASSERT(!PagePrivate(page));
-		if (!PageUptodate(page)) {
-			page_count--;
-			if (blocksize >= PAGE_CACHE_SIZE) {
-				if (flags & XBF_READ)
-					bp->b_flags |= _XBF_PAGE_LOCKED;
-			} else if (!PagePrivate(page)) {
-				if (test_page_region(page, offset, nbytes))
-					page_count++;
-			}
-		}
-
 		bp->b_pages[i] = page;
 		offset = 0;
 	}
+	return 0;
 
-	if (!(bp->b_flags & _XBF_PAGE_LOCKED)) {
-		for (i = 0; i < bp->b_page_count; i++)
-			unlock_page(bp->b_pages[i]);
-	}
-
-	if (page_count == bp->b_page_count)
-		bp->b_flags |= XBF_DONE;
-
+out_free_pages:
+	for (i = 0; i < bp->b_page_count; i++)
+		__free_page(bp->b_pages[i]);
 	return error;
 }
 
@@ -450,8 +387,9 @@ _xfs_buf_map_pages(
 	xfs_buf_t		*bp,
 	uint			flags)
 {
-	/* A single page buffer is always mappable */
+	ASSERT(bp->b_flags & _XBF_PAGES);
 	if (bp->b_page_count == 1) {
+		/* A single page buffer is always mappable */
 		bp->b_addr = page_address(bp->b_pages[0]) + bp->b_offset;
 		bp->b_flags |= XBF_MAPPED;
 	} else if (flags & XBF_MAPPED) {
@@ -576,9 +514,14 @@ found:
 		}
 	}
 
+	/*
+	 * if the buffer is stale, clear all the external state associated with
+	 * it. We need to keep flags such as how we allocated the buffer memory
+	 * intact here.
+	 */
 	if (bp->b_flags & XBF_STALE) {
 		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
-		bp->b_flags &= XBF_MAPPED;
+		bp->b_flags &= XBF_MAPPED | _XBF_KMEM | _XBF_PAGES;
 	}
 
 	trace_xfs_buf_find(bp, flags, _RET_IP_);
@@ -599,7 +542,7 @@ xfs_buf_get(
 	xfs_buf_flags_t		flags)
 {
 	xfs_buf_t		*bp, *new_bp;
-	int			error = 0, i;
+	int			error = 0;
 
 	new_bp = xfs_buf_allocate(flags);
 	if (unlikely(!new_bp))
@@ -607,7 +550,7 @@ xfs_buf_get(
 
 	bp = _xfs_buf_find(target, ioff, isize, flags, new_bp);
 	if (bp == new_bp) {
-		error = _xfs_buf_lookup_pages(bp, flags);
+		error = xfs_buf_allocate_memory(bp, flags);
 		if (error)
 			goto no_buffer;
 	} else {
@@ -616,9 +559,6 @@ xfs_buf_get(
 			return NULL;
 	}
 
-	for (i = 0; i < bp->b_page_count; i++)
-		mark_page_accessed(bp->b_pages[i]);
-
 	if (!(bp->b_flags & XBF_MAPPED)) {
 		error = _xfs_buf_map_pages(bp, flags);
 		if (unlikely(error)) {
@@ -719,7 +659,7 @@ xfs_buf_readahead(
 {
 	struct backing_dev_info *bdi;
 
-	bdi = target->bt_mapping->backing_dev_info;
+	bdi = blk_get_backing_dev_info(target->bt_bdev);
 	if (bdi_read_congested(bdi))
 		return;
 
@@ -798,10 +738,10 @@ xfs_buf_associate_memory(
 	size_t			buflen;
 	int			page_count;
 
-	pageaddr = (unsigned long)mem & PAGE_CACHE_MASK;
+	pageaddr = (unsigned long)mem & PAGE_MASK;
 	offset = (unsigned long)mem - pageaddr;
-	buflen = PAGE_CACHE_ALIGN(len + offset);
-	page_count = buflen >> PAGE_CACHE_SHIFT;
+	buflen = PAGE_ALIGN(len + offset);
+	page_count = buflen >> PAGE_SHIFT;
 
 	/* Free any previous set of page pointers */
 	if (bp->b_pages)
@@ -818,13 +758,12 @@ xfs_buf_associate_memory(
 
 	for (i = 0; i < bp->b_page_count; i++) {
 		bp->b_pages[i] = mem_to_page((void *)pageaddr);
-		pageaddr += PAGE_CACHE_SIZE;
+		pageaddr += PAGE_SIZE;
 	}
 
 	bp->b_count_desired = len;
 	bp->b_buffer_length = buflen;
 	bp->b_flags |= XBF_MAPPED;
-	bp->b_flags &= ~_XBF_PAGE_LOCKED;
 
 	return 0;
 }
@@ -931,20 +870,7 @@ xfs_buf_rele(
 
 
 /*
- *	Mutual exclusion on buffers.  Locking model:
- *
- *	Buffers associated with inodes for which buffer locking
- *	is not enabled are not protected by semaphores, and are
- *	assumed to be exclusively owned by the caller.  There is a
- *	spinlock in the buffer, used by the caller when concurrent
- *	access is possible.
- */
-
-/*
- *	Locks a buffer object, if it is not already locked.  Note that this in
- *	no way locks the underlying pages, so it is only useful for
- *	synchronizing concurrent use of buffer objects, not for synchronizing
- *	independent access to the underlying pages.
+ *	Lock a buffer object, if it is not already locked.
  *
  *	If we come across a stale, pinned, locked buffer, we know that we are
  *	being asked to lock a buffer that has been reallocated. Because it is
@@ -978,10 +904,7 @@ xfs_buf_lock_value(
 }
 
 /*
- *	Locks a buffer object.
- *	Note that this in no way locks the underlying pages, so it is only
- *	useful for synchronizing concurrent use of buffer objects, not for
- *	synchronizing independent access to the underlying pages.
+ *	Lock a buffer object.
  *
  *	If we come across a stale, pinned, locked buffer, we know that we
  *	are being asked to lock a buffer that has been reallocated. Because
@@ -998,7 +921,7 @@ xfs_buf_lock(
 	if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
 		xfs_log_force(bp->b_target->bt_mount, 0);
 	if (atomic_read(&bp->b_io_remaining))
-		blk_run_address_space(bp->b_target->bt_mapping);
+		blk_run_backing_dev(bp->b_target->bt_bdi, NULL);
 	down(&bp->b_sema);
 	XB_SET_OWNER(bp);
 
@@ -1043,7 +966,7 @@ xfs_buf_wait_unpin(
 		if (atomic_read(&bp->b_pin_count) == 0)
 			break;
 		if (atomic_read(&bp->b_io_remaining))
-			blk_run_address_space(bp->b_target->bt_mapping);
+			blk_run_backing_dev(bp->b_target->bt_bdi, NULL);
 		schedule();
 	}
 	remove_wait_queue(&bp->b_waiters, &wait);
@@ -1256,10 +1179,8 @@ _xfs_buf_ioend(
 	xfs_buf_t		*bp,
 	int			schedule)
 {
-	if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
-		bp->b_flags &= ~_XBF_PAGE_LOCKED;
+	if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
 		xfs_buf_ioend(bp, schedule);
-	}
 }
 
 STATIC void
@@ -1268,35 +1189,12 @@ xfs_buf_bio_end_io(
 	int			error)
 {
 	xfs_buf_t		*bp = (xfs_buf_t *)bio->bi_private;
-	unsigned int		blocksize = bp->b_target->bt_bsize;
-	struct bio_vec		*bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
 
 	xfs_buf_ioerror(bp, -error);
 
 	if (!error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
 		invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
 
-	do {
-		struct page	*page = bvec->bv_page;
-
-		ASSERT(!PagePrivate(page));
-		if (unlikely(bp->b_error)) {
-			if (bp->b_flags & XBF_READ)
-				ClearPageUptodate(page);
-		} else if (blocksize >= PAGE_CACHE_SIZE) {
-			SetPageUptodate(page);
-		} else if (!PagePrivate(page) &&
-				(bp->b_flags & _XBF_PAGE_CACHE)) {
-			set_page_region(page, bvec->bv_offset, bvec->bv_len);
-		}
-
-		if (--bvec >= bio->bi_io_vec)
-			prefetchw(&bvec->bv_page->flags);
-
-		if (bp->b_flags & _XBF_PAGE_LOCKED)
-			unlock_page(page);
-	} while (bvec >= bio->bi_io_vec);
-
 	_xfs_buf_ioend(bp, 1);
 	bio_put(bio);
 }
@@ -1310,7 +1208,6 @@ _xfs_buf_ioapply(
 	int			offset = bp->b_offset;
 	int			size = bp->b_count_desired;
 	sector_t		sector = bp->b_bn;
-	unsigned int		blocksize = bp->b_target->bt_bsize;
 
 	total_nr_pages = bp->b_page_count;
 	map_i = 0;
@@ -1331,29 +1228,6 @@ _xfs_buf_ioapply(
 		     (bp->b_flags & XBF_READ_AHEAD) ? READA : READ;
 	}
 
-	/* Special code path for reading a sub page size buffer in --
-	 * we populate up the whole page, and hence the other metadata
-	 * in the same page.  This optimization is only valid when the
-	 * filesystem block size is not smaller than the page size.
-	 */
-	if ((bp->b_buffer_length < PAGE_CACHE_SIZE) &&
-	    ((bp->b_flags & (XBF_READ|_XBF_PAGE_LOCKED)) ==
-	      (XBF_READ|_XBF_PAGE_LOCKED)) &&
-	    (blocksize >= PAGE_CACHE_SIZE)) {
-		bio = bio_alloc(GFP_NOIO, 1);
-
-		bio->bi_bdev = bp->b_target->bt_bdev;
-		bio->bi_sector = sector - (offset >> BBSHIFT);
-		bio->bi_end_io = xfs_buf_bio_end_io;
-		bio->bi_private = bp;
-
-		bio_add_page(bio, bp->b_pages[0], PAGE_CACHE_SIZE, 0);
-		size = 0;
-
-		atomic_inc(&bp->b_io_remaining);
-
-		goto submit_io;
-	}
 
 next_chunk:
 	atomic_inc(&bp->b_io_remaining);
@@ -1367,8 +1241,9 @@ next_chunk:
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
 
+
 	for (; size && nr_pages; nr_pages--, map_i++) {
-		int	rbytes, nbytes = PAGE_CACHE_SIZE - offset;
+		int	rbytes, nbytes = PAGE_SIZE - offset;
 
 		if (nbytes > size)
 			nbytes = size;
@@ -1383,7 +1258,6 @@ next_chunk:
 		total_nr_pages--;
 	}
 
-submit_io:
 	if (likely(bio->bi_size)) {
 		if (xfs_buf_is_vmapped(bp)) {
 			flush_kernel_vmap_range(bp->b_addr,
@@ -1393,18 +1267,7 @@ submit_io:
 		if (size)
 			goto next_chunk;
 	} else {
-		/*
-		 * if we get here, no pages were added to the bio. However,
-		 * we can't just error out here - if the pages are locked then
-		 * we have to unlock them otherwise we can hang on a later
-		 * access to the page.
-		 */
 		xfs_buf_ioerror(bp, EIO);
-		if (bp->b_flags & _XBF_PAGE_LOCKED) {
-			int i;
-			for (i = 0; i < bp->b_page_count; i++)
-				unlock_page(bp->b_pages[i]);
-		}
 		bio_put(bio);
 	}
 }
@@ -1450,7 +1313,7 @@ xfs_buf_iowait(
 	trace_xfs_buf_iowait(bp, _RET_IP_);
 
 	if (atomic_read(&bp->b_io_remaining))
-		blk_run_address_space(bp->b_target->bt_mapping);
+		blk_run_backing_dev(bp->b_target->bt_bdi, NULL);
 	wait_for_completion(&bp->b_iowait);
 
 	trace_xfs_buf_iowait_done(bp, _RET_IP_);
@@ -1468,8 +1331,8 @@ xfs_buf_offset(
 		return XFS_BUF_PTR(bp) + offset;
 
 	offset += bp->b_offset;
-	page = bp->b_pages[offset >> PAGE_CACHE_SHIFT];
-	return (xfs_caddr_t)page_address(page) + (offset & (PAGE_CACHE_SIZE-1));
+	page = bp->b_pages[offset >> PAGE_SHIFT];
+	return (xfs_caddr_t)page_address(page) + (offset & (PAGE_SIZE-1));
 }
 
 /*
@@ -1491,9 +1354,9 @@ xfs_buf_iomove(
 		page = bp->b_pages[xfs_buf_btoct(boff + bp->b_offset)];
 		cpoff = xfs_buf_poff(boff + bp->b_offset);
 		csize = min_t(size_t,
-			      PAGE_CACHE_SIZE-cpoff, bp->b_count_desired-boff);
+			      PAGE_SIZE-cpoff, bp->b_count_desired-boff);
 
-		ASSERT(((csize + cpoff) <= PAGE_CACHE_SIZE));
+		ASSERT(((csize + cpoff) <= PAGE_SIZE));
 
 		switch (mode) {
 		case XBRW_ZERO:
@@ -1606,7 +1469,6 @@ xfs_free_buftarg(
 	xfs_flush_buftarg(btp, 1);
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
 		xfs_blkdev_issue_flush(btp);
-	iput(btp->bt_mapping->host);
 
 	kthread_stop(btp->bt_task);
 	kmem_free(btp);
@@ -1630,15 +1492,6 @@ xfs_setsize_buftarg_flags(
 		return EINVAL;
 	}
 
-	if (verbose &&
-	    (PAGE_CACHE_SIZE / BITS_PER_LONG) > sectorsize) {
-		printk(KERN_WARNING
-			"XFS: %u byte sectors in use on device %s.  "
-			"This is suboptimal; %u or greater is ideal.\n",
-			sectorsize, XFS_BUFTARG_NAME(btp),
-			(unsigned int)PAGE_CACHE_SIZE / BITS_PER_LONG);
-	}
-
 	return 0;
 }
 
@@ -1653,7 +1506,7 @@ xfs_setsize_buftarg_early(
 	struct block_device	*bdev)
 {
 	return xfs_setsize_buftarg_flags(btp,
-			PAGE_CACHE_SIZE, bdev_logical_block_size(bdev), 0);
+			PAGE_SIZE, bdev_logical_block_size(bdev), 0);
 }
 
 int
@@ -1666,41 +1519,6 @@ xfs_setsize_buftarg(
 }
 
 STATIC int
-xfs_mapping_buftarg(
-	xfs_buftarg_t		*btp,
-	struct block_device	*bdev)
-{
-	struct backing_dev_info	*bdi;
-	struct inode		*inode;
-	struct address_space	*mapping;
-	static const struct address_space_operations mapping_aops = {
-		.sync_page = block_sync_page,
-		.migratepage = fail_migrate_page,
-	};
-
-	inode = new_inode(bdev->bd_inode->i_sb);
-	if (!inode) {
-		printk(KERN_WARNING
-			"XFS: Cannot allocate mapping inode for device %s\n",
-			XFS_BUFTARG_NAME(btp));
-		return ENOMEM;
-	}
-	inode->i_ino = get_next_ino();
-	inode->i_mode = S_IFBLK;
-	inode->i_bdev = bdev;
-	inode->i_rdev = bdev->bd_dev;
-	bdi = blk_get_backing_dev_info(bdev);
-	if (!bdi)
-		bdi = &default_backing_dev_info;
-	mapping = &inode->i_data;
-	mapping->a_ops = &mapping_aops;
-	mapping->backing_dev_info = bdi;
-	mapping_set_gfp_mask(mapping, GFP_NOFS);
-	btp->bt_mapping = mapping;
-	return 0;
-}
-
-STATIC int
 xfs_alloc_delwrite_queue(
 	xfs_buftarg_t		*btp,
 	const char		*fsname)
@@ -1728,12 +1546,11 @@ xfs_alloc_buftarg(
 	btp->bt_mount = mp;
 	btp->bt_dev =  bdev->bd_dev;
 	btp->bt_bdev = bdev;
+	btp->bt_bdi = blk_get_backing_dev_info(bdev);
 	INIT_LIST_HEAD(&btp->bt_lru);
 	spin_lock_init(&btp->bt_lru_lock);
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
-	if (xfs_mapping_buftarg(btp, bdev))
-		goto error;
 	if (xfs_alloc_delwrite_queue(btp, fsname))
 		goto error;
 	btp->bt_shrinker.shrink = xfs_buftarg_shrink;
@@ -1955,7 +1772,7 @@ xfsbufd(
 			count++;
 		}
 		if (count)
-			blk_run_address_space(target->bt_mapping);
+			blk_run_backing_dev(target->bt_bdi, NULL);
 
 	} while (!kthread_should_stop());
 
@@ -2003,7 +1820,7 @@ xfs_flush_buftarg(
 
 	if (wait) {
 		/* Expedite and wait for IO to complete. */
-		blk_run_address_space(target->bt_mapping);
+		blk_run_backing_dev(target->bt_bdi, NULL);
 		while (!list_empty(&wait_list)) {
 			bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index cbe6595..a9a1c45 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -61,30 +61,11 @@ typedef enum {
 #define XBF_DONT_BLOCK	(1 << 16)/* do not block in current thread */
 
 /* flags used only internally */
-#define _XBF_PAGE_CACHE	(1 << 17)/* backed by pagecache */
 #define _XBF_PAGES	(1 << 18)/* backed by refcounted pages */
 #define	_XBF_RUN_QUEUES	(1 << 19)/* run block device task queue	*/
+#define	_XBF_KMEM	(1 << 20)/* backed by heap memory */
 #define _XBF_DELWRI_Q	(1 << 21)/* buffer on delwri queue */
 
-/*
- * Special flag for supporting metadata blocks smaller than a FSB.
- *
- * In this case we can have multiple xfs_buf_t on a single page and
- * need to lock out concurrent xfs_buf_t readers as they only
- * serialise access to the buffer.
- *
- * If the FSB size >= PAGE_CACHE_SIZE case, we have no serialisation
- * between reads of the page. Hence we can have one thread read the
- * page and modify it, but then race with another thread that thinks
- * the page is not up-to-date and hence reads it again.
- *
- * The result is that the first modifcation to the page is lost.
- * This sort of AGF/AGI reading race can happen when unlinking inodes
- * that require truncation and results in the AGI unlinked list
- * modifications being lost.
- */
-#define _XBF_PAGE_LOCKED	(1 << 22)
-
 typedef unsigned int xfs_buf_flags_t;
 
 #define XFS_BUF_FLAGS \
@@ -100,12 +81,10 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_LOCK,		"LOCK" },  	/* should never be set */\
 	{ XBF_TRYLOCK,		"TRYLOCK" }, 	/* ditto */\
 	{ XBF_DONT_BLOCK,	"DONT_BLOCK" },	/* ditto */\
-	{ _XBF_PAGE_CACHE,	"PAGE_CACHE" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_RUN_QUEUES,	"RUN_QUEUES" }, \
-	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
-	{ _XBF_PAGE_LOCKED,	"PAGE_LOCKED" }
-
+	{ _XBF_KMEM,		"KMEM" }, \
+	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
 
 typedef enum {
 	XBT_FORCE_SLEEP = 0,
@@ -120,7 +99,7 @@ typedef struct xfs_bufhash {
 typedef struct xfs_buftarg {
 	dev_t			bt_dev;
 	struct block_device	*bt_bdev;
-	struct address_space	*bt_mapping;
+	struct backing_dev_info	*bt_bdi;
 	struct xfs_mount	*bt_mount;
 	unsigned int		bt_bsize;
 	unsigned int		bt_sshift;
@@ -139,17 +118,6 @@ typedef struct xfs_buftarg {
 	unsigned int		bt_lru_nr;
 } xfs_buftarg_t;
 
-/*
- *	xfs_buf_t:  Buffer structure for pagecache-based buffers
- *
- * This buffer structure is used by the pagecache buffer management routines
- * to refer to an assembly of pages forming a logical buffer.
- *
- * The buffer structure is used on a temporary basis only, and discarded when
- * released.  The real data storage is recorded in the pagecache. Buffers are
- * hashed to the block device on which the file system resides.
- */
-
 struct xfs_buf;
 typedef void (*xfs_buf_iodone_t)(struct xfs_buf *);
 
-- 
1.7.2.3

* Re: xfs: outstanding patches for 2.6.39
  2011-03-23  6:14 xfs: outstanding patches for 2.6.39 Dave Chinner
                   ` (5 preceding siblings ...)
  2011-03-23  6:14 ` [PATCH 6/6] xfs: stop using the page cache to back the buffer cache Dave Chinner
@ 2011-03-23  7:01 ` Andi Kleen
  2011-03-23 11:38   ` Dave Chinner
  2011-03-23 11:18 ` Christoph Hellwig
  7 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2011-03-23  7:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: aelder, xfs

Dave Chinner <david@fromorbit.com> writes:

> Of course, the major patch in this series is the conversion of the
> buffer cache to using kmalloc and get_free_page() directly rather
> than using the page cache. This has many benefits and I haven't
> found any regressions due to making that change yet.

One (somewhat obscure) regression will be that you won't be able to
recover from uncorrected memory errors in the buffer cache anymore.
Previously memory_failure() could just drop the affected page
transparently when such an error happened and the page was not
currently in use.  This is usually only a problem if there's
significant memory tied up in the cache.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

* Re: xfs: outstanding patches for 2.6.39
  2011-03-23  6:14 xfs: outstanding patches for 2.6.39 Dave Chinner
                   ` (6 preceding siblings ...)
  2011-03-23  7:01 ` xfs: outstanding patches for 2.6.39 Andi Kleen
@ 2011-03-23 11:18 ` Christoph Hellwig
  2011-03-23 11:38   ` Dave Chinner
  7 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2011-03-23 11:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: aelder, xfs

I think we'll need the inode OOM fixes for .39, too.  But if you need
some more time for them, it's fine to wait a while.

* Re: xfs: outstanding patches for 2.6.39
  2011-03-23  7:01 ` xfs: outstanding patches for 2.6.39 Andi Kleen
@ 2011-03-23 11:38   ` Dave Chinner
  2011-03-23 16:05     ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2011-03-23 11:38 UTC (permalink / raw)
  To: Andi Kleen; +Cc: aelder, xfs

On Wed, Mar 23, 2011 at 12:01:17AM -0700, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > Of course, the major patch in this series is the conversion of the
> > buffer cache to using kmalloc and get_free_page() directly rather
> > than using the page cache. This has many benefits and I haven't
> > found any regressions due to making that change yet.
> 
> One (somewhat obscure) regression will be that you won't be able to
> recover from uncorrected memory errors in the buffer cache anymore.

We can't do that right now, anyway.

> Previously memory_failure() could just drop the affected page
> transparently when such an error happened and the page was not
> currently in use.

If the page is not in use, we don't care about it after this patch
set is applied - the page is either active in a buffer or it has been
freed. If it is in use, then we'll shut the filesystem down if we
detect the memory corruption just like we currently do. Hence I
don't see any regression here.

As it is, there is no way for the filesystem to be notified about
such failures on active pages in buffers, so in reality we can't
reliably detect them, and there is little point in trying to recover
from such errors.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: xfs: outstanding patches for 2.6.39
  2011-03-23 11:18 ` Christoph Hellwig
@ 2011-03-23 11:38   ` Dave Chinner
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-23 11:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: aelder, xfs

On Wed, Mar 23, 2011 at 07:18:42AM -0400, Christoph Hellwig wrote:
> I think we'll need the inode OOM fixes for .39, too.  But if you need
> some more time for them, it's fine to wait a while.

They are not ready yet. When I have them sorted out, I'll post them
as 2.6.39 candidates.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 4/6] xfs: xfs_trans_read_buf() should return an error on failure
  2011-03-23  6:14 ` [PATCH 4/6] xfs: xfs_trans_read_buf() should return an error on failure Dave Chinner
@ 2011-03-23 11:53   ` Christoph Hellwig
  2011-03-25 21:01   ` Alex Elder
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2011-03-23 11:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: aelder, xfs

On Wed, Mar 23, 2011 at 05:14:28PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When inside a transaction and we fail to read a buffer,
> xfs_trans_read_buf returns a null buffer pointer and no error.
> xfs_do_da_buf() checks the error return, but not the buffer, and as
> a result this read failure condition causes a panic when it attempts
> to dereference the non-existent buffer.
> 
> Make xfs_trans_read_buf() return the same error for this situation
> regardless of whether it is in a transaction or not. This means
> callers no longer need to check both the error return and the
> buffer before proceeding to use the buffer.

Most callers seem to be fine because they always pass 0 as flags,
or handle a NULL bp return.

The exception is xfs_imap_to_bp, which can get a trylock flag via
xfs_itobp and xfs_iflush, which needs a fix for this.
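
One possible shape for that fix in xfs_imap_to_bp (an untested
sketch - the trylock caller presumably wants EAGAIN so it can back
off and retry):

	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno,
				   (int)imap->im_len, buf_flags, &bp);
	if (error)
		return error;
	if (!bp) {
		/* trylock failed to get the buffer - tell caller to retry */
		ASSERT(buf_flags & XBF_TRYLOCK);
		return EAGAIN;
	}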

* Re: xfs: outstanding patches for 2.6.39
  2011-03-23 11:38   ` Dave Chinner
@ 2011-03-23 16:05     ` Andi Kleen
  2011-03-23 22:48       ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2011-03-23 16:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: aelder, Andi Kleen, xfs

> We can't do that right now, anyway.

It seemed to work on XFS when we tested it. Undoubtedly after your
patch it won't work anymore.

> If the page is not in use, we don't care about it after this patch
> set is applied - the page is either active in a buffer or it has been
> freed. If it is in use, then we'll shut the filesystem down if we
> detect the memory corruption just like we currently do. Hence I
> don't see any regression here.

I think you're confusing the memory_failure() HWPoison path with some XFS
internal checking. I don't think XFS has any HWPoison checking on its own.

> As it is, there is no way for the filesytem to be notified about
> such failures on active pages in buffers, so in reality we can't
> reliably detect them so there is little point in trying to recover
> from such errors.

Well it works today if the page is in pagecache. memory-failure will
just remove it transparently.

In principle you can check for HWPoison manually yourself, but I'm not sure
that is a good way to do it.


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: [PATCH 5/6] xfs: register the inode cache shrinker before quotachecks
  2011-03-23  6:14 ` [PATCH 5/6] xfs: register the inode cache shrinker before quotachecks Dave Chinner
@ 2011-03-23 21:24   ` Arkadiusz Miskiewicz
  2011-03-25 12:56     ` Michael Weissenbacher
  2011-03-25 21:01   ` Alex Elder
  1 sibling, 1 reply; 27+ messages in thread
From: Arkadiusz Miskiewicz @ 2011-03-23 21:24 UTC (permalink / raw)
  To: xfs; +Cc: aelder

On Wednesday 23 of March 2011, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>

Thanks a lot for this patch. I was able to mount my two ~800GB filesystems
with quota enabled without out-of-memory/machine-reboot surprises (which were
happening without this patch every time I tried to mount with quota).

So

Tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>

and hope to see it in 2.6.39 . Test was done on patched 2.6.37.

> 
> During mount, we can do a quotacheck that involves a bulkstat pass
> on all inodes. If there are more inodes in the filesystem than can
> be held in memory, we require the inode cache shrinker to run to
> ensure that we don't run out of memory.
> 
> Unfortunately, the inode cache shrinker is not registered until we
> get to the end of the superblock setup process, which is after a
> quotacheck is run if it is needed. Hence we need to register the
> inode cache shrinker earlier in the mount process so that we don't
> OOM during mount. This requires that we also initialise the syncd
> work before we register the shrinker, so we need to juggle that
> around as well.
> 
> While there, make sure that we have set up the block sizes in the
> VFS superblock correctly before the quotacheck is run so that any
> inodes that are cached as a result of the quotacheck have their
> block size fields set up correctly.
> 
> Cc: stable@kernel.org
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_super.c |   34 ++++++++++++++++++++++++----------
>  1 files changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
> index 8a70b2a..1ba5c45 100644
> --- a/fs/xfs/linux-2.6/xfs_super.c
> +++ b/fs/xfs/linux-2.6/xfs_super.c
> @@ -1539,10 +1539,14 @@ xfs_fs_fill_super(
>  	if (error)
>  		goto out_free_sb;
> 
> -	error = xfs_mountfs(mp);
> -	if (error)
> -		goto out_filestream_unmount;
> -
> +	/*
> +	 * we must configure the block size in the superblock before we run the
> +	 * full mount process as the mount process can lookup and cache inodes.
> +	 * For the same reason we must also initialise the syncd and register
> +	 * the inode cache shrinker so that inodes can be reclaimed during
> +	 * operations like a quotacheck that iterate all inodes in the
> +	 * filesystem.
> +	 */
>  	sb->s_magic = XFS_SB_MAGIC;
>  	sb->s_blocksize = mp->m_sb.sb_blocksize;
>  	sb->s_blocksize_bits = ffs(sb->s_blocksize) - 1;
> @@ -1550,6 +1554,16 @@ xfs_fs_fill_super(
>  	sb->s_time_gran = 1;
>  	set_posix_acl_flag(sb);
> 
> +	error = xfs_syncd_init(mp);
> +	if (error)
> +		goto out_filestream_unmount;
> +
> +	xfs_inode_shrinker_register(mp);
> +
> +	error = xfs_mountfs(mp);
> +	if (error)
> +		goto out_syncd_stop;
> +
>  	root = igrab(VFS_I(mp->m_rootip));
>  	if (!root) {
>  		error = ENOENT;
> @@ -1565,14 +1579,11 @@ xfs_fs_fill_super(
>  		goto fail_vnrele;
>  	}
> 
> -	error = xfs_syncd_init(mp);
> -	if (error)
> -		goto fail_vnrele;
> -
> -	xfs_inode_shrinker_register(mp);
> -
>  	return 0;
> 
> + out_syncd_stop:
> +	xfs_inode_shrinker_unregister(mp);
> +	xfs_syncd_stop(mp);
>   out_filestream_unmount:
>  	xfs_filestream_unmount(mp);
>   out_free_sb:
> @@ -1596,6 +1607,9 @@ xfs_fs_fill_super(
>  	}
> 
>   fail_unmount:
> +	xfs_inode_shrinker_unregister(mp);
> +	xfs_syncd_stop(mp);
> +
>  	/*
>  	 * Blow away any referenced inode in the filestreams cache.
>  	 * This can and will cause log traffic as inodes go inactive


-- 
Arkadiusz Miśkiewicz        PLD/Linux Team
arekm / maven.pl            http://ftp.pld-linux.org/

* Re: xfs: outstanding patches for 2.6.39
  2011-03-23 16:05     ` Andi Kleen
@ 2011-03-23 22:48       ` Dave Chinner
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-23 22:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: aelder, xfs

On Wed, Mar 23, 2011 at 05:05:15PM +0100, Andi Kleen wrote:
> > We can't do that right now, anyway.
> 
> It seemed to work on XFS when we tested it. Undoubtedly after your
> patch it won't work anymore.

And why? We still have exactly the same number of active buffers as
before this patch, so exactly the same number of pages that you
simply can't replace.

> > If the page is not in use, we don't care about it after this patch
> > set is applied - the page is either active in a buffer or it has been
> > freed. If it is in use, then we'll shut the filesystem down if we
> > detect the memory corruption just like we currently do. Hence I
> > don't see any regression here.
> 
> I think you're confusing the memory_failure() HWPoison path with some XFS
> internal checking. I don't think XFS has any HWPoison checking on its own.

No, it doesn't. My point is that you can't tear down an active page
in a buffer without causing in-memory corruption that XFS will notice.

That is, the old lifecycle is:

	alloc buffer
	alloc page cache page
	attach page to buffer	(active reference)
	use buffer
	....
	buffer placed on buffer LRU
	....
	buffer reclaimed by shrinker
	page left in page cache LRU (no active reference)
	.....
	page removed from LRU by memory reclaim


During the time that there is an active reference, an uncorrectable
memory error will corrupt the contents of the buffer. The
memory_failure() path sets flags on the page (page error) and the
mapping, and then tries to invalidate the page in the mapping, which
may or may not fail. Either way, we still have an active reference to
the page, so we'll continue to use it regardless of the fact that the
page has been poisoned. IOWs, it does not work at all for errors in
active metadata buffers on XFS, and such memory errors will end up
causing filesystem metadata corruption.

If the page has no active buffers, then it is in the LRU, and
invalidating the page will work just fine to remove it from the
cache.

The new lifecycle is:

	alloc buffer
	alloc page
	attach page to buffer (active reference)
	use buffer
	....
	buffer placed on buffer LRU
	....
	buffer reclaimed by shrinker
	free page (no reference at all)

IOWs, the page attached to the buffer is only ever in the active
state now, which is the case for the existing code where the error
propagation simply does not work. And instead of leaving the page
on the LRU when the buffer is no longer active, we free it back to
the allocator and errors in freed pages are handled just fine by the
existing code (just like LRU pages).


> > As it is, there is no way for the filesystem to be notified about
> > such failures on active pages in buffers, so in reality we can't
> > reliably detect them, and there is little point in trying to recover
> > from such errors.
> 
> Well it works today if the page is in pagecache. memory-failure will
> just remove it transparently.

It's not transparent as it requires help from the layer above the
page cache to handle errors on pages correctly (i.e. the
error_remove_page callback). That special handling is completely
missing from the XFS buffer cache regardless of whether we use the
page cache or not.
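
For reference, that special handling is wired up through the address
space operations for data pages - roughly like this sketch (based on
the generic helper; the exact xfs_aops.c table may differ):

	/*
	 * Sketch: how *data* pages opt in to memory_failure()
	 * handling. Nothing equivalent exists for the metadata pages
	 * held by the buffer cache.
	 */
	const struct address_space_operations xfs_address_space_operations = {
		.writepage		= xfs_vm_writepage,
		.readpage		= xfs_vm_readpage,
		/* ... */
		.error_remove_page	= generic_error_remove_page,
	};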

> In principle you can check for HWPoison manually yourself, but I'm not sure
> that is a good way to do it.

It'd require a custom callback from memory_failure() and a bunch
of restructuring to be able to map a page back to the owner buffer
efficiently and then tear down the buffer (clean case) or shut
down the filesystem (dirty case).  If we wait for a check of the
page state to catch such an error rather than triggering an
immediate filesystem shutdown, then it is too late: we could
already have written the bad memory to disk (either via a
transaction into the log or via buffer writeback) and so we will
have propagated the memory error onto disk...
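
If we ever did grow such a check, it would have to look something
like this - a purely hypothetical sketch, none of which exists in
XFS today:

	/*
	 * Hypothetical pre-writeback poison check - NOT in XFS.
	 * PageHWPoison() requires CONFIG_MEMORY_FAILURE.
	 */
	STATIC int
	xfs_buf_check_poison(
		struct xfs_buf	*bp)
	{
		int		i;

		for (i = 0; i < bp->b_page_count; i++) {
			if (PageHWPoison(bp->b_pages[i])) {
				xfs_force_shutdown(bp->b_target->bt_mount,
						SHUTDOWN_CORRUPT_INCORE);
				return EIO;
			}
		}
		return 0;
	}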

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH 1/6] xfs: preallocation transactions do not need to be synchronous
  2011-03-23  6:14 ` [PATCH 1/6] xfs: preallocation transactions do not need to be synchronous Dave Chinner
@ 2011-03-24 17:18   ` brian.foster
  2011-03-24 22:53     ` [PATCH V2] " Dave Chinner
  2011-03-25 21:00   ` [PATCH 1/6] " Alex Elder
  1 sibling, 1 reply; 27+ messages in thread
From: brian.foster @ 2011-03-24 17:18 UTC (permalink / raw)
  To: david, xfs; +Cc: aelder

> -----Original Message-----
> From: xfs-bounces@oss.sgi.com [mailto:xfs-bounces@oss.sgi.com] On
> Behalf Of Dave Chinner
> Sent: Wednesday, March 23, 2011 2:14 AM
> To: xfs@oss.sgi.com
> Cc: aelder@sgi.com
> Subject: [PATCH 1/6] xfs: preallocation transactions do not need to be
> synchronous

> diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c index
> ae59865..baa2cb3 100644
> --- a/fs/xfs/linux-2.6/xfs_file.c
> +++ b/fs/xfs/linux-2.6/xfs_file.c
> @@ -896,6 +896,7 @@ xfs_file_fallocate(
>  	xfs_flock64_t	bf;
>  	xfs_inode_t	*ip = XFS_I(inode);
>  	int		cmd = XFS_IOC_RESVSP;
> +	int		attr_flags = XFS_ATTR_NOLOCK;
> 
>  	if (mode & ~(FALLOC_FL_KEEP_SIZE |
>  		     FALLOC_FL_PUNCH_HOLE |
> @@ -922,6 +923,9 @@ xfs_file_fallocate(
>  			goto out_unlock;
>  	}
> 
> +	if (file->f_flags & O_DSYNC)
> +		attr_flags |= XFS_ATTR_SYNC;
> +
>  	error = -xfs_change_file_space(ip, cmd, &bf, 0, XFS_ATTR_NOLOCK);
>  	if (error)
>  		goto out_unlock;

Where are you passing attr_flags?

I was looking at this because I noticed a nice performance improvement in some Samba tests with the xfs_trans_set_sync() call removed, but I have a follow-up question... I'd like to backport this patch to 2.6.34. The majority of the patch applies (manually), but the segment above is problematic: my tree has an xfs_vn_fallocate() inode operations handler (xfs_iops.c) that receives no file pointer, rather than the file operations handler above.

Without really knowing any of this code, I assume I can use IS_SYNC() on the inode to check for the sync mount situation? Assuming that is correct, is there a straightforward way to cover the open(..., O_SYNC) situation? I suppose I'm open to hacking the VFS to one-off this call if I have to, but would rather avoid the ugliness. Any comments are appreciated, thanks.
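
For concreteness, the check I have in mind for the 2.6.34 inode-op handler is something like this - a sketch only, assuming the old xfs_vn_fallocate() signature, and it obviously misses the O_SYNC open case:

	/* Backport sketch for 2.6.34's inode-op fallocate - with no
	 * struct file available, only the sync mount/inode case is
	 * detectable. */
	if (IS_SYNC(inode))
		attr_flags |= XFS_ATTR_SYNC;
	/* file->f_flags & O_DSYNC can't be checked here without VFS
	 * changes to pass the file pointer down. */
	error = -xfs_change_file_space(XFS_I(inode), cmd, &bf, 0,
				       attr_flags);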

Brian

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH V2] xfs: preallocation transactions do not need to be synchronous
  2011-03-24 17:18   ` brian.foster
@ 2011-03-24 22:53     ` Dave Chinner
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-24 22:53 UTC (permalink / raw)
  To: brian.foster; +Cc: aelder, xfs

On Thu, Mar 24, 2011 at 01:18:06PM -0400, brian.foster@emc.com wrote:
> > -----Original Message-----
> > From: xfs-bounces@oss.sgi.com [mailto:xfs-bounces@oss.sgi.com] On
> > Behalf Of Dave Chinner
> > Sent: Wednesday, March 23, 2011 2:14 AM
> > To: xfs@oss.sgi.com
> > Cc: aelder@sgi.com
> > Subject: [PATCH 1/6] xfs: preallocation transactions do not need to be
> > synchronous
> 
> > diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c index
> > ae59865..baa2cb3 100644
> > --- a/fs/xfs/linux-2.6/xfs_file.c
> > +++ b/fs/xfs/linux-2.6/xfs_file.c
> > @@ -896,6 +896,7 @@ xfs_file_fallocate(
> >  	xfs_flock64_t	bf;
> >  	xfs_inode_t	*ip = XFS_I(inode);
> >  	int		cmd = XFS_IOC_RESVSP;
> > +	int		attr_flags = XFS_ATTR_NOLOCK;
> > 
> >  	if (mode & ~(FALLOC_FL_KEEP_SIZE |
> >  		     FALLOC_FL_PUNCH_HOLE |
> > @@ -922,6 +923,9 @@ xfs_file_fallocate(
> >  			goto out_unlock;
> >  	}
> > 
> > +	if (file->f_flags & O_DSYNC)
> > +		attr_flags |= XFS_ATTR_SYNC;
> > +
> >  	error = -xfs_change_file_space(ip, cmd, &bf, 0, XFS_ATTR_NOLOCK);
> >  	if (error)
> >  		goto out_unlock;
> 
> Where are you passing attr_flags?

Argh. Good catch. That's the problem with having two interfaces that
do the same thing - you're not sure which one is being tested by the
test suite in any given test. Fixed patch below.

> I was looking at this because I noticed a nice performance
> improvement in some Samba tests with the xfs_trans_set_sync() call
> removed, but I have a follow-up question... I'd like to backport
> this patch to 2.6.34. The majority of the patch applies
> (manually), but the segment above is problematic: my tree has an
> xfs_vn_fallocate() inode operations handler (xfs_iops.c) that
> receives no file pointer, rather than the file operations handler
> above.

Yup, .fallocate got moved from an inode operation to a file
operation a while back in:

2fe17c1 fallocate should be a file operation

> Without really knowing any of this code, I assume I can use
> IS_SYNC() on the inode to check for the sync mount situation?
> Assuming that is correct, is there a straightforward way to cover
> the open(..., O_SYNC) situation? I suppose I'm open to hacking the
> VFS to one-off this call if I have to, but would rather avoid the
> ugliness. Any comments are appreciated, thanks.

Porting the above commit first is probably the only way to get this.
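
The crux of that commit is the signature change - roughly the
following, from memory, so check the commit itself for the exact
prototypes:

	/* Before 2fe17c1: an inode operation, no file pointer. */
	long (*fallocate)(struct inode *inode, int mode,
			  loff_t offset, loff_t len);

	/* After 2fe17c1: a file operation, so O_DSYNC is visible
	 * via file->f_flags. */
	long (*fallocate)(struct file *file, int mode,
			  loff_t offset, loff_t len);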

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: preallocation transactions do not need to be synchronous

From: Dave Chinner <dchinner@redhat.com>

Preallocation and hole punch transactions are currently synchronous
and this is causing performance problems in some cases. The
transactions don't need to be synchronous as we don't need to
guarantee the preallocation is persistent on disk until an
fdatasync, fsync or sync operation occurs. If the file is opened
O_SYNC or O_DSYNC, only then should the transaction be issued
synchronously.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/linux-2.6/xfs_file.c  |    6 +++++-
 fs/xfs/linux-2.6/xfs_ioctl.c |    4 ++++
 fs/xfs/xfs_vnodeops.c        |    3 ++-
 fs/xfs/xfs_vnodeops.h        |    1 +
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index ae59865..0c255b1 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -896,6 +896,7 @@ xfs_file_fallocate(
 	xfs_flock64_t	bf;
 	xfs_inode_t	*ip = XFS_I(inode);
 	int		cmd = XFS_IOC_RESVSP;
+	int		attr_flags = XFS_ATTR_NOLOCK;
 
 	if (mode & ~(FALLOC_FL_KEEP_SIZE |
 		     FALLOC_FL_PUNCH_HOLE |
@@ -922,7 +923,10 @@ xfs_file_fallocate(
 			goto out_unlock;
 	}
 
-	error = -xfs_change_file_space(ip, cmd, &bf, 0, XFS_ATTR_NOLOCK);
+	if (file->f_flags & O_DSYNC)
+		attr_flags |= XFS_ATTR_SYNC;
+
+	error = -xfs_change_file_space(ip, cmd, &bf, 0, attr_flags);
 	if (error)
 		goto out_unlock;
 
diff --git a/fs/xfs/linux-2.6/xfs_ioctl.c b/fs/xfs/linux-2.6/xfs_ioctl.c
index 0ca0e3c..acca2c5 100644
--- a/fs/xfs/linux-2.6/xfs_ioctl.c
+++ b/fs/xfs/linux-2.6/xfs_ioctl.c
@@ -624,6 +624,10 @@ xfs_ioc_space(
 
 	if (filp->f_flags & (O_NDELAY|O_NONBLOCK))
 		attr_flags |= XFS_ATTR_NONBLOCK;
+
+	if (filp->f_flags & O_DSYNC)
+		attr_flags |= XFS_ATTR_SYNC;
+
 	if (ioflags & IO_INVIS)
 		attr_flags |= XFS_ATTR_DMI;
 
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 37d8146..c48b421 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -2831,7 +2831,8 @@ xfs_change_file_space(
 		ip->i_d.di_flags &= ~XFS_DIFLAG_PREALLOC;
 
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-	xfs_trans_set_sync(tp);
+	if (attr_flags & XFS_ATTR_SYNC)
+		xfs_trans_set_sync(tp);
 
 	error = xfs_trans_commit(tp, 0);
 
diff --git a/fs/xfs/xfs_vnodeops.h b/fs/xfs/xfs_vnodeops.h
index f6702927..3bcd233 100644
--- a/fs/xfs/xfs_vnodeops.h
+++ b/fs/xfs/xfs_vnodeops.h
@@ -18,6 +18,7 @@ int xfs_setattr(struct xfs_inode *ip, struct iattr *vap, int flags);
 #define	XFS_ATTR_NONBLOCK	0x02	/* return EAGAIN if operation would block */
 #define XFS_ATTR_NOLOCK		0x04	/* Don't grab any conflicting locks */
 #define XFS_ATTR_NOACL		0x08	/* Don't call xfs_acl_chmod */
+#define XFS_ATTR_SYNC		0x10	/* synchronous operation required */
 
 int xfs_readlink(struct xfs_inode *ip, char *link);
 int xfs_release(struct xfs_inode *ip);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] xfs: register the inode cache shrinker before quotachecks
  2011-03-23 21:24   ` Arkadiusz Miskiewicz
@ 2011-03-25 12:56     ` Michael Weissenbacher
  2011-03-25 23:08       ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Michael Weissenbacher @ 2011-03-25 12:56 UTC (permalink / raw)
  To: Arkadiusz Miskiewicz; +Cc: aelder, xfs

> From: Arkadiusz Miskiewicz <arekm@maven.pl>
> On Wednesday 23 of March 2011, Dave Chinner wrote:
>> From: Dave Chinner <dchinner@redhat.com>
> 
> Thanks a lot for this patch. I was able to mount my two ~800GB filesystems
> with quota enabled without out-of-memory/machine-reboot surprises (which
> were happening without this patch every time I tried mounting with quota).
Seconded, this also solved my long-standing problems with quota mounts
on some 32-bit i686 and low-mem x86_64 machines.

I tested the patch with kernel 2.6.38.1-i686 on a ~500GB fs which has
refused to mount with quotas for some time now.

Is there any chance that this patch will be integrated/backported to
older kernels like 2.6.32.x or 2.6.35.x?

thanks,
Michael

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] xfs: preallocation transactions do not need to be synchronous
  2011-03-23  6:14 ` [PATCH 1/6] xfs: preallocation transactions do not need to be synchronous Dave Chinner
  2011-03-24 17:18   ` brian.foster
@ 2011-03-25 21:00   ` Alex Elder
  2011-03-25 22:02     ` Dave Chinner
  1 sibling, 1 reply; 27+ messages in thread
From: Alex Elder @ 2011-03-25 21:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2011-03-23 at 17:14 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Preallocation and hole punch transactions are currently synchronous
> and this is causing performance problems in some cases. The
> transactions don't need to be synchronous as we don't need to
> guarantee the preallocation is persistent on disk until a
> fdatasync, fsync, sync operation occurs. If the file is opened
> O_SYNC or O_DATASYNC, only then should the transaction be issued
> synchronously.

There's a minor (but important) bug in this.  Other than that
this looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/linux-2.6/xfs_file.c  |    4 ++++
>  fs/xfs/linux-2.6/xfs_ioctl.c |    4 ++++
>  fs/xfs/xfs_vnodeops.c        |    3 ++-
>  fs/xfs/xfs_vnodeops.h        |    1 +
>  4 files changed, 11 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
> index ae59865..baa2cb3 100644
> --- a/fs/xfs/linux-2.6/xfs_file.c
> +++ b/fs/xfs/linux-2.6/xfs_file.c
> @@ -896,6 +896,7 @@ xfs_file_fallocate(
>  	xfs_flock64_t	bf;
>  	xfs_inode_t	*ip = XFS_I(inode);
>  	int		cmd = XFS_IOC_RESVSP;
> +	int		attr_flags = XFS_ATTR_NOLOCK;
>  
>  	if (mode & ~(FALLOC_FL_KEEP_SIZE |
>  		     FALLOC_FL_PUNCH_HOLE |
> @@ -922,6 +923,9 @@ xfs_file_fallocate(
>  			goto out_unlock;
>  	}
>  
> +	if (file->f_flags & O_DSYNC)
> +		attr_flags |= XFS_ATTR_SYNC;
> +
>  	error = -xfs_change_file_space(ip, cmd, &bf, 0, XFS_ATTR_NOLOCK);

You need to pass attr_flags as the last argument here.

>  	if (error)
>  		goto out_unlock;

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] vmap: flush vmap aliases when mapping fails
  2011-03-23  6:14 ` [PATCH 2/6] vmap: flush vmap aliases when mapping fails Dave Chinner
@ 2011-03-25 21:00   ` Alex Elder
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Elder @ 2011-03-25 21:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2011-03-23 at 17:14 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> On 32 bit systems, vmalloc space is limited and XFS can chew through
> it quickly as the vmalloc space is lazily freed. This can result in
> failure to map buffers, even when there are apparently large amounts
> of vmalloc space available. Hence, if we fail to map a buffer, purge
> the aliases that have not yet been freed to hopefully free up enough
> vmalloc space to allow a retry to succeed.
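
For readers following along, the retry amounts to something like
this - my paraphrase of the idea, not the exact hunk in
_xfs_buf_map_pages():

	/* Paraphrased retry: purge lazily-freed vmap areas, then try
	 * the mapping once more before giving up. */
	int	retried = 0;

	do {
		bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
					-1, PAGE_KERNEL);
		if (bp->b_addr)
			break;
		vm_unmap_aliases();
	} while (retried++ == 0);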

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush
  2011-03-23  6:14 ` [PATCH 3/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush Dave Chinner
@ 2011-03-25 21:00   ` Alex Elder
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Elder @ 2011-03-25 21:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2011-03-23 at 17:14 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> There is an ABBA deadlock between synchronous inode flushing in
> xfs_reclaim_inode and xfs_icluster_free. xfs_icluster_free locks the
> buffer, then takes inode ilocks, whilst synchronous reclaim takes
> the ilock followed by the buffer lock in xfs_iflush().
> 
> To avoid this deadlock, separate the inode cluster buffer locking
> semantics from the synchronous inode flush semantics, allowing
> callers to attempt to lock the buffer but still issue synchronous IO
> if it can get the buffer. This requires xfs_iflush() calls that
> currently use non-blocking semantics to pass SYNC_TRYLOCK rather
> than 0 as the flags parameter.
> 
> This allows xfs_reclaim_inode to avoid the deadlock on the buffer
> lock and detect the failure so that it can drop the inode ilock and
> restart the reclaim attempt on the inode. This allows
> xfs_ifree_cluster to obtain the inode lock, mark the inode stale and
> release it and hence defuse the deadlock situation. It also has the
> pleasant side effect of avoiding IO in xfs_reclaim_inode when it
> next tries to reclaim the inode, as it is now marked stale.
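
In other words, callers end up with semantics like this - a sketch
of the calling convention, not the actual diff:

	/* Sketch: synchronous reclaim backs off instead of
	 * deadlocking on the cluster buffer lock. */
	error = xfs_iflush(ip, SYNC_TRYLOCK | SYNC_WAIT);
	if (error == EAGAIN) {
		/* Couldn't get the buffer lock without risking the
		 * ABBA deadlock - drop the ilock and retry this
		 * inode later (label is hypothetical). */
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		goto out_retry;
	}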

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/6] xfs: xfs_trans_read_buf() should return an error on failure
  2011-03-23  6:14 ` [PATCH 4/6] xfs: xfs_trans_read_buf() should return an error on failure Dave Chinner
  2011-03-23 11:53   ` Christoph Hellwig
@ 2011-03-25 21:01   ` Alex Elder
  1 sibling, 0 replies; 27+ messages in thread
From: Alex Elder @ 2011-03-25 21:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2011-03-23 at 17:14 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When inside a transaction and we fail to read a buffer,
> xfs_trans_read_buf returns a null buffer pointer and no error.
> xfs_do_da_buf() checks the error return, but not the buffer, and as
> a result this read failure condition causes a panic when it attempts
> to dereference the non-existant buffer.
> 
> Make xfs_trans_read_buf() return the same error for this situation
> regardless of whether it is in a transaction or not. This means
> every caller does not need to check both the error return and the
> buffer before proceeding to use the buffer.

Good idea.  Later we can clean up callers.  I see that not
all callers currently check for a null buffer return...

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_trans_buf.c |    3 ++-

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] xfs: register the inode cache shrinker before quotachecks
  2011-03-23  6:14 ` [PATCH 5/6] xfs: register the inode cache shrinker before quotachecks Dave Chinner
  2011-03-23 21:24   ` Arkadiusz Miskiewicz
@ 2011-03-25 21:01   ` Alex Elder
  1 sibling, 0 replies; 27+ messages in thread
From: Alex Elder @ 2011-03-25 21:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2011-03-23 at 17:14 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> During mount, we can do a quotacheck that involves a bulkstat pass
> on all inodes. If there are more inodes in the filesystem than can
> be held in memory, we require the inode cache shrinker to run to
> ensure that we don't run out of memory.
> 
> Unfortunately, the inode cache shrinker is not registered until we
> get to the end of the superblock setup process, which is after a
> quotacheck is run if it is needed. Hence we need to register the
> inode cache shrinker earlier in the mount process so that we don't
> OOM during mount. This requires that we also initialise the syncd
> work before we register the shrinker, so we nee dto juggle that
> around as well.
> 
> While there, make sure that we have set up the block sizes in the
> VFS superblock correctly before the quotacheck is run so that any
> inodes that are cached as a result of the quotacheck have their
> block size fields set up correctly.
> 
> Cc: stable@kernel.org
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
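
The reordering itself is conceptually simple - my sketch of the
resulting order in xfs_fs_fill_super(), with names approximate, so
see the patch for the real hunks:

	/* Approximate ordering after the patch: syncd work, then the
	 * shrinker, then the mount (which may run quotacheck). */
	error = xfs_syncd_init(mp);
	if (error)
		goto out_filestream_unmount;

	xfs_inode_shrinker_register(mp);

	error = xfs_mountfs(mp);	/* quotacheck can run in here */
	if (error)
		goto out_syncd_stop;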

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/6] xfs: stop using the page cache to back the buffer cache
  2011-03-23  6:14 ` [PATCH 6/6] xfs: stop using the page cache to back the buffer cache Dave Chinner
@ 2011-03-25 21:02   ` Alex Elder
  2011-03-25 22:04     ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Alex Elder @ 2011-03-25 21:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2011-03-23 at 17:14 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that the buffer cache has its own LRU, we do not need to use
> the page cache to provide persistent caching and reclaim
> infrastructure. Convert the buffer cache to use alloc_pages()
> instead of the page cache. This will remove all the overhead of page
> cache management from setup and teardown of the buffers, as well as
> needing to mark pages accessed as we find buffers in the buffer
> cache.
> 
> By avoiding the page cache, we also remove the need to keep state in
> the page_private(page) field for persistent storage across buffer
> free/buffer rebuild and so all that code can be removed. This also
> fixes the long-standing problem of not having enough bits in the
> page_private field to track all the state needed for a 512
> sector/64k page setup.
> 
> It also removes the need for page locking during reads as the pages
> are unique to the buffer and nobody else will be attempting to
> access them.
> 
> Finally, it removes the buftarg address space lock as a point of
> global contention on workloads that allocate and free buffers
> quickly such as when creating or removing large numbers of inodes in
> parallel. This removes the 16TB limit on filesystem size on 32 bit
> machines as the page index (32 bit) is no longer used for lookups
> of metadata buffers - the buffer cache is now solely indexed by disk
> address which is stored in a 64 bit field in the buffer.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

This is really a great change, a long time coming.

I have two comments below, one of which I think is
a real (but simple) problem.

I've been using this series all week without problems.  
This patch cleans things up so nicely I *would* like
to include it in 2.6.39 if you can update it and turn
it around with a pull request for me.

If so, I'll do some sanity testing and push it to
oss.sgi.com, then send a pull request to Linus with
it early next week.

Reviewed-by: Alex Elder <aelder@sgi.com>

PS  I'm sorry it took so long to get back to you on
    this stuff.  I've had a hard time getting my brain
    re-engaged this week after coming back from vacation
    for some reason...

> ---
>  fs/xfs/linux-2.6/xfs_buf.c |  337 ++++++++++----------------------------------
>  fs/xfs/linux-2.6/xfs_buf.h |   40 +-----
>  2 files changed, 81 insertions(+), 296 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index fe51e09..19b0769 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c

. . .

> @@ -719,7 +659,7 @@ xfs_buf_readahead(
>  {
>  	struct backing_dev_info *bdi;
>  
> -	bdi = target->bt_mapping->backing_dev_info;
> +	bdi = blk_get_backing_dev_info(target->bt_bdev);
>  	if (bdi_read_congested(bdi))
>  		return;

Why not just target->bt_bdi here?  In which case, just skip
the local variable and call:

	if (bdi_read_congested(target->bt_bdi))
		return;

. . .

> @@ -1728,12 +1546,11 @@ xfs_alloc_buftarg(
>  	btp->bt_mount = mp;
>  	btp->bt_dev =  bdev->bd_dev;
>  	btp->bt_bdev = bdev;
> +	btp->bt_bdi = blk_get_backing_dev_info(bdev);

I think you need to check for a null return value here.

	if (!btp->bt_bdi)
		goto error;

>  	INIT_LIST_HEAD(&btp->bt_lru);
>  	spin_lock_init(&btp->bt_lru_lock);
>  	if (xfs_setsize_buftarg_early(btp, bdev))
>  		goto error;
> -	if (xfs_mapping_buftarg(btp, bdev))
> -		goto error;
>  	if (xfs_alloc_delwrite_queue(btp, fsname))
>  		goto error;
>  	btp->bt_shrinker.shrink = xfs_buftarg_shrink;

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] xfs: preallocation transactions do not need to be synchronous
  2011-03-25 21:00   ` [PATCH 1/6] " Alex Elder
@ 2011-03-25 22:02     ` Dave Chinner
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-25 22:02 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Fri, Mar 25, 2011 at 04:00:36PM -0500, Alex Elder wrote:
> On Wed, 2011-03-23 at 17:14 +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Preallocation and hole punch transactions are currently synchronous
> > and this is causing performance problems in some cases. The
> > transactions don't need to be synchronous as we don't need to
> > guarantee the preallocation is persistent on disk until a
> > fdatasync, fsync, sync operation occurs. If the file is opened
> > O_SYNC or O_DATASYNC, only then should the transaction be issued
> > synchronously.
> 
> There's a minor (but important) bug in this.  Other than that
> this looks good.
> 
> Reviewed-by: Alex Elder <aelder@sgi.com>

Already fixed and reposted.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/6] xfs: stop using the page cache to back the buffer cache
  2011-03-25 21:02   ` Alex Elder
@ 2011-03-25 22:04     ` Dave Chinner
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-25 22:04 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Fri, Mar 25, 2011 at 04:02:53PM -0500, Alex Elder wrote:
> On Wed, 2011-03-23 at 17:14 +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that the buffer cache has its own LRU, we do not need to use
> > the page cache to provide persistent caching and reclaim
> > infrastructure. Convert the buffer cache to use alloc_pages()
> > instead of the page cache. This will remove all the overhead of page
> > cache management from setup and teardown of the buffers, as well as
> > needing to mark pages accessed as we find buffers in the buffer
> > cache.
> > 
> > By avoiding the page cache, we also remove the need to keep state in
> > the page_private(page) field for persistent storage across buffer
> > free/buffer rebuild and so all that code can be removed. This also
> > fixes the long-standing problem of not having enough bits in the
> > page_private field to track all the state needed for a 512
> > sector/64k page setup.
> > 
> > It also removes the need for page locking during reads as the pages
> > are unique to the buffer and nobody else will be attempting to
> > access them.
> > 
> > Finally, it removes the buftarg address space lock as a point of
> > global contention on workloads that allocate and free buffers
> > quickly such as when creating or removing large numbers of inodes in
> > parallel. This removes the 16TB limit on filesystem size on 32 bit
> > machines as the page index (32 bit) is no longer used for lookups
> > of metadata buffers - the buffer cache is now solely indexed by disk
> > address which is stored in a 64 bit field in the buffer.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> This is really a great change, a long time coming.
> 
> I have two comments below, one of which I think is
> a real (but simple) problem.

Yup, I'll fix them.

> I've been using this series all week without problems.  
> This patch cleans things up so nicely I *would* like
> to include it in 2.6.39 if you can update it and turn
> it around with a pull request for me.

Ok, I'll update the series, prep a branch, run a quick sanity check
and send a pull request.
> 
> If so, I'll do some sanity testing and push it to
> oss.sgi.com, then send a pull request to Linus with
> it early next week.
> 
> Reviewed-by: Alex Elder <aelder@sgi.com>
> 
> PS  I'm sorry it took so long to get back to you on
>     this stuff.  I've had a hard time getting my brain
>     re-engaged this week after coming back from vacation
>     for some reason...

No problems, I knew you were away for a while...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] xfs: register the inode cache shrinker before quotachecks
  2011-03-25 12:56     ` Michael Weissenbacher
@ 2011-03-25 23:08       ` Dave Chinner
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2011-03-25 23:08 UTC (permalink / raw)
  To: Michael Weissenbacher; +Cc: xfs, aelder

On Fri, Mar 25, 2011 at 01:56:22PM +0100, Michael Weissenbacher wrote:
> > From: Arkadiusz Miskiewicz <arekm@maven.pl>
> > On Wednesday 23 of March 2011, Dave Chinner wrote:
> >> From: Dave Chinner <dchinner@redhat.com>
> > 
> > Thanks a lot for this patch. I was able to mount my two ~800GB filesystems
> > with quota enabled without out-of-memory/machine-reboot surprises (which
> > were happening without this patch every time I tried mounting with quota).
> Seconded, this also solved my long-standing problems with quota mounts
> on some 32-bit i686 and low-mem x86_64 machines.
> 
> I tested the patch with Kernel 2.6.38.1-i686 on a ~500GB fs which
> refused to mount with quotas for some time now.
> 
> Is there any chance that this patch will be integrated/backported to
> older kernels like 2.6.32.x or 2.6.35.x?

The patch will be committed with a CC to stable@kernel.org, so that
should happen magically once this is in mainline.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2011-03-25 23:05 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-23  6:14 xfs: outstanding patches for 2.6.39 Dave Chinner
2011-03-23  6:14 ` [PATCH 1/6] xfs: preallocation transactions do not need to be synchronous Dave Chinner
2011-03-24 17:18   ` brian.foster
2011-03-24 22:53     ` [PATCH V2] " Dave Chinner
2011-03-25 21:00   ` [PATCH 1/6] " Alex Elder
2011-03-25 22:02     ` Dave Chinner
2011-03-23  6:14 ` [PATCH 2/6] vmap: flush vmap aliases when mapping fails Dave Chinner
2011-03-25 21:00   ` Alex Elder
2011-03-23  6:14 ` [PATCH 3/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush Dave Chinner
2011-03-25 21:00   ` Alex Elder
2011-03-23  6:14 ` [PATCH 4/6] xfs: xfs_trans_read_buf() should return an error on failure Dave Chinner
2011-03-23 11:53   ` Christoph Hellwig
2011-03-25 21:01   ` Alex Elder
2011-03-23  6:14 ` [PATCH 5/6] xfs: register the inode cache shrinker before quotachecks Dave Chinner
2011-03-23 21:24   ` Arkadiusz Miskiewicz
2011-03-25 12:56     ` Michael Weissenbacher
2011-03-25 23:08       ` Dave Chinner
2011-03-25 21:01   ` Alex Elder
2011-03-23  6:14 ` [PATCH 6/6] xfs: stop using the page cache to back the buffer cache Dave Chinner
2011-03-25 21:02   ` Alex Elder
2011-03-25 22:04     ` Dave Chinner
2011-03-23  7:01 ` xfs: outstanding patches for 2.6.39 Andi Kleen
2011-03-23 11:38   ` Dave Chinner
2011-03-23 16:05     ` Andi Kleen
2011-03-23 22:48       ` Dave Chinner
2011-03-23 11:18 ` Christoph Hellwig
2011-03-23 11:38   ` Dave Chinner
