All of lore.kernel.org
* [PATCH 0/8] xfs: outstanding patches for 3.4 merge window
@ 2012-03-22  5:15 Dave Chinner
  2012-03-22  5:15 ` [PATCH 1/8] xfs: Fix open flag handling in open_by_handle code Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-22  5:15 UTC (permalink / raw)
  To: xfs

The first 5 patches have been previously posted and reviewed, but not yet
committed to the XFS repo.

The last three patches are new. The first two fix reported bugs and
are stable kernel backport candidates, and the last is an update to
the attribute event tracing code to add some insight into the
attribute modification paths. This one might be 3.5 material, but it
can't hurt to try.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [PATCH 1/8] xfs: Fix open flag handling in open_by_handle code
  2012-03-22  5:15 [PATCH 0/8] xfs: outstanding patches for 3.4 merge window Dave Chinner
@ 2012-03-22  5:15 ` Dave Chinner
  2012-03-22  5:15 ` [PATCH 2/8] xfs: introduce an allocation workqueue Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-22  5:15 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Sparse identified some unsafe handling of open flags in the xfs open
by handle ioctl code. Update the code to use the correct access
macros to ensure that we handle the open flags correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
---
 fs/xfs/xfs_ioctl.c |   14 +++++---------
 1 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index f588320..91f8ff5 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -209,6 +209,7 @@ xfs_open_by_handle(
 	struct file		*filp;
 	struct inode		*inode;
 	struct dentry		*dentry;
+	fmode_t			fmode;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -XFS_ERROR(EPERM);
@@ -228,26 +229,21 @@ xfs_open_by_handle(
 	hreq->oflags |= O_LARGEFILE;
 #endif
 
-	/* Put open permission in namei format. */
 	permflag = hreq->oflags;
-	if ((permflag+1) & O_ACCMODE)
-		permflag++;
-	if (permflag & O_TRUNC)
-		permflag |= 2;
-
+	fmode = OPEN_FMODE(permflag);
 	if ((!(permflag & O_APPEND) || (permflag & O_TRUNC)) &&
-	    (permflag & FMODE_WRITE) && IS_APPEND(inode)) {
+	    (fmode & FMODE_WRITE) && IS_APPEND(inode)) {
 		error = -XFS_ERROR(EPERM);
 		goto out_dput;
 	}
 
-	if ((permflag & FMODE_WRITE) && IS_IMMUTABLE(inode)) {
+	if ((fmode & FMODE_WRITE) && IS_IMMUTABLE(inode)) {
 		error = -XFS_ERROR(EACCES);
 		goto out_dput;
 	}
 
 	/* Can't write directories. */
-	if (S_ISDIR(inode->i_mode) && (permflag & FMODE_WRITE)) {
+	if (S_ISDIR(inode->i_mode) && (fmode & FMODE_WRITE)) {
 		error = -XFS_ERROR(EISDIR);
 		goto out_dput;
 	}
-- 
1.7.9


* [PATCH 2/8] xfs: introduce an allocation workqueue
  2012-03-22  5:15 [PATCH 0/8] xfs: outstanding patches for 3.4 merge window Dave Chinner
  2012-03-22  5:15 ` [PATCH 1/8] xfs: Fix open flag handling in open_by_handle code Dave Chinner
@ 2012-03-22  5:15 ` Dave Chinner
  2012-03-22  5:15 ` [PATCH 3/8] xfs: initialise xfssync work before running quotachecks Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-22  5:15 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

We currently have significant issues with the amount of stack that
allocation in XFS uses, especially in the writeback path. We can
easily consume 4k of stack between mapping the page, manipulating
the bmap btree and allocating blocks from the free list. Not to
mention btree block readahead and other functionality that issues IO
in the allocation path.

As a result, we can no longer fit allocation in the writeback path
in the stack space provided on x86_64. To alleviate this problem,
introduce an allocation workqueue and move all allocations to a
separate context. This can be easily added as an interposing layer
into xfs_alloc_vextent(), which takes a single argument structure
and does not return until the allocation is complete or has failed.

To do this, add a work structure and a completion to the allocation
args structure. This allows xfs_alloc_vextent to queue the args onto
the workqueue and wait for it to be completed by the worker. This
can be done completely transparently to the caller.

The worker function needs to ensure that it sets and clears the
PF_FSTRANS flag appropriately as it is being run in an active
transaction context. Work can also be queued in a memory reclaim
context, so a rescuer is needed for the workqueue.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_alloc.c |   34 +++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_alloc.h |    5 +++++
 fs/xfs/xfs_super.c |   16 ++++++++++++++++
 3 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index ce84ffd..31e9033 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -35,6 +35,7 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 
+struct workqueue_struct *xfs_alloc_wq;
 
 #define XFS_ABSDIFF(a,b)	(((a) <= (b)) ? ((b) - (a)) : ((a) - (b)))
 
@@ -2207,7 +2208,7 @@ xfs_alloc_read_agf(
  * group or loop over the allocation groups to find the result.
  */
 int				/* error */
-xfs_alloc_vextent(
+__xfs_alloc_vextent(
 	xfs_alloc_arg_t	*args)	/* allocation argument structure */
 {
 	xfs_agblock_t	agsize;	/* allocation group size */
@@ -2417,6 +2418,37 @@ error0:
 	return error;
 }
 
+static void
+xfs_alloc_vextent_worker(
+	struct work_struct	*work)
+{
+	struct xfs_alloc_arg	*args = container_of(work,
+						struct xfs_alloc_arg, work);
+	unsigned long		pflags;
+
+	/* we are in a transaction context here */
+	current_set_flags_nested(&pflags, PF_FSTRANS);
+
+	args->result = __xfs_alloc_vextent(args);
+	complete(args->done);
+
+	current_restore_flags_nested(&pflags, PF_FSTRANS);
+}
+
+
+int				/* error */
+xfs_alloc_vextent(
+	xfs_alloc_arg_t	*args)	/* allocation argument structure */
+{
+	DECLARE_COMPLETION_ONSTACK(done);
+
+	args->done = &done;
+	INIT_WORK(&args->work, xfs_alloc_vextent_worker);
+	queue_work(xfs_alloc_wq, &args->work);
+	wait_for_completion(&done);
+	return args->result;
+}
+
 /*
  * Free an extent.
  * Just break up the extent address and hand off to xfs_free_ag_extent
diff --git a/fs/xfs/xfs_alloc.h b/fs/xfs/xfs_alloc.h
index 2f52b92..ab5d0fd 100644
--- a/fs/xfs/xfs_alloc.h
+++ b/fs/xfs/xfs_alloc.h
@@ -25,6 +25,8 @@ struct xfs_perag;
 struct xfs_trans;
 struct xfs_busy_extent;
 
+extern struct workqueue_struct *xfs_alloc_wq;
+
 /*
  * Freespace allocation types.  Argument to xfs_alloc_[v]extent.
  */
@@ -119,6 +121,9 @@ typedef struct xfs_alloc_arg {
 	char		isfl;		/* set if is freelist blocks - !acctg */
 	char		userdata;	/* set if this is user data */
 	xfs_fsblock_t	firstblock;	/* io first block allocated */
+	struct completion *done;
+	struct work_struct work;
+	int		result;
 } xfs_alloc_arg_t;
 
 /*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 912442c..2c3cc2e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1606,12 +1606,28 @@ xfs_init_workqueues(void)
 	xfs_syncd_wq = alloc_workqueue("xfssyncd", WQ_NON_REENTRANT, 0);
 	if (!xfs_syncd_wq)
 		return -ENOMEM;
+
+	/*
+	 * The allocation workqueue can be used in memory reclaim situations
+	 * (writepage path), and parallelism is only limited by the number of
+	 * AGs in all the filesystems mounted. Hence use the default large
+	 * max_active value for this workqueue.
+	 */
+	xfs_alloc_wq = alloc_workqueue("xfsalloc", WQ_MEM_RECLAIM, 0);
+	if (!xfs_alloc_wq)
+		goto out_destroy_syncd;
+
 	return 0;
+
+out_destroy_syncd:
+	destroy_workqueue(xfs_syncd_wq);
+	return -ENOMEM;
 }
 
 STATIC void
 xfs_destroy_workqueues(void)
 {
+	destroy_workqueue(xfs_alloc_wq);
 	destroy_workqueue(xfs_syncd_wq);
 }
 
-- 
1.7.9


* [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-22  5:15 [PATCH 0/8] xfs: outstanding patches for 3.4 merge window Dave Chinner
  2012-03-22  5:15 ` [PATCH 1/8] xfs: Fix open flag handling in open_by_handle code Dave Chinner
  2012-03-22  5:15 ` [PATCH 2/8] xfs: introduce an allocation workqueue Dave Chinner
@ 2012-03-22  5:15 ` Dave Chinner
  2012-03-22 15:15   ` Ben Myers
  2012-03-28 17:38   ` [PATCH 3/8] xfs: initialise xfssync work before running quotachecks Ben Myers
  2012-03-22  5:15 ` [PATCH 4/8] xfs: remove MS_ACTIVE guard from inode reclaim work Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-22  5:15 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Because the mount process can run a quotacheck and consume lots of
inodes, we need to be able to run periodic inode reclaim during the
mount process. This will prevent running the system out of memory
during quota checks.

This essentially reverts commit 2bcf6e97, which is now safe to do
because the quota sync code that caused problems during long
quotacheck executions has been removed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_super.c |   17 +++++++++--------
 1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2c3cc2e..a08c56d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1346,31 +1346,33 @@ xfs_fs_fill_super(
 	sb->s_time_gran = 1;
 	set_posix_acl_flag(sb);
 
-	error = xfs_mountfs(mp);
+	error = xfs_syncd_init(mp);
 	if (error)
 		goto out_filestream_unmount;
 
-	error = xfs_syncd_init(mp);
+	error = xfs_mountfs(mp);
 	if (error)
-		goto out_unmount;
+		goto out_syncd_stop;
 
 	root = igrab(VFS_I(mp->m_rootip));
 	if (!root) {
 		error = ENOENT;
-		goto out_syncd_stop;
+		goto out_unmount;
 	}
 	if (is_bad_inode(root)) {
 		error = EINVAL;
-		goto out_syncd_stop;
+		goto out_unmount;
 	}
 	sb->s_root = d_make_root(root);
 	if (!sb->s_root) {
 		error = ENOMEM;
-		goto out_syncd_stop;
+		goto out_unmount;
 	}
 
 	return 0;
 
+ out_syncd_stop:
+	xfs_syncd_stop(mp);
  out_filestream_unmount:
 	xfs_filestream_unmount(mp);
  out_free_sb:
@@ -1387,8 +1389,6 @@ out_destroy_workqueues:
  out:
 	return -error;
 
- out_syncd_stop:
-	xfs_syncd_stop(mp);
  out_unmount:
 	/*
 	 * Blow away any referenced inode in the filestreams cache.
@@ -1400,6 +1400,7 @@ out_destroy_workqueues:
 	xfs_flush_buftarg(mp->m_ddev_targp, 1);
 
 	xfs_unmountfs(mp);
+	xfs_syncd_stop(mp);
 	goto out_free_sb;
 }
 
-- 
1.7.9


* [PATCH 4/8] xfs: remove MS_ACTIVE guard from inode reclaim work
  2012-03-22  5:15 [PATCH 0/8] xfs: outstanding patches for 3.4 merge window Dave Chinner
                   ` (2 preceding siblings ...)
  2012-03-22  5:15 ` [PATCH 3/8] xfs: initialise xfssync work before running quotachecks Dave Chinner
@ 2012-03-22  5:15 ` Dave Chinner
  2012-03-22  5:15 ` [PATCH 5/8] xfs: don't cache inodes read through bulkstat Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-22  5:15 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

We need to be able to queue inode reclaim work during the mount
process, as quotacheck can cause large numbers of inodes to be read,
and we need to clean them up periodically because the shrinkers cannot
run until after the mount process has completed.

The reclaim work is currently protected from running during the
unmount process by a check against MS_ACTIVE. Unfortunately, this
also means that the reclaim work cannot run during mount.  The
unmount process should stop the reclaim cleanly before freeing
anything that the reclaim work depends on, so there is no need to
have this guard in place.

Also, the inode reclaim work is demand driven, so there is no need
to start it immediately during mount. It will be started the moment
an inode is queued for reclaim, so quotacheck will trigger it just
fine.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_super.c |    3 +--
 fs/xfs/xfs_sync.c  |   27 ++++++++++++++++-----------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a08c56d..3e87b02 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -965,8 +965,6 @@ xfs_fs_put_super(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
-	xfs_syncd_stop(mp);
-
 	/*
 	 * Blow away any referenced inode in the filestreams cache.
 	 * This can and will cause log traffic as inodes go inactive
@@ -977,6 +975,7 @@ xfs_fs_put_super(
 	xfs_flush_buftarg(mp->m_ddev_targp, 1);
 
 	xfs_unmountfs(mp);
+	xfs_syncd_stop(mp);
 	xfs_freesb(mp);
 	xfs_icsb_destroy_counters(mp);
 	xfs_destroy_mount_workqueues(mp);
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 205ebcb..6771fcd 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -460,7 +460,15 @@ xfs_sync_worker(
 					struct xfs_mount, m_sync_work);
 	int		error;
 
-	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
+	/*
+	 * We shouldn't write/force the log if we are in the mount/unmount
+	 * process or on a read only filesystem. The workqueue still needs to be
+	 * active in both cases, however, because it is used for inode reclaim
+	 * during these times. Hence use the MS_ACTIVE flag to avoid doing
+	 * anything in these periods.
+	 */
+	if (!(mp->m_super->s_flags & MS_ACTIVE) &&
+	    !(mp->m_flags & XFS_MOUNT_RDONLY)) {
 		/* dgc: errors ignored here */
 		if (mp->m_super->s_frozen == SB_UNFROZEN &&
 		    xfs_log_need_covered(mp))
@@ -488,14 +496,6 @@ xfs_syncd_queue_reclaim(
 	struct xfs_mount        *mp)
 {
 
-	/*
-	 * We can have inodes enter reclaim after we've shut down the syncd
-	 * workqueue during unmount, so don't allow reclaim work to be queued
-	 * during unmount.
-	 */
-	if (!(mp->m_super->s_flags & MS_ACTIVE))
-		return;
-
 	rcu_read_lock();
 	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
 		queue_delayed_work(xfs_syncd_wq, &mp->m_reclaim_work,
@@ -564,7 +564,6 @@ xfs_syncd_init(
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 
 	xfs_syncd_queue_sync(mp);
-	xfs_syncd_queue_reclaim(mp);
 
 	return 0;
 }
@@ -574,7 +573,13 @@ xfs_syncd_stop(
 	struct xfs_mount	*mp)
 {
 	cancel_delayed_work_sync(&mp->m_sync_work);
-	cancel_delayed_work_sync(&mp->m_reclaim_work);
+
+	/*
+	 * we flush any pending inode reclaim work rather than cancel it here.
+	 * This ensures that there are no clean inodes queued during unmount
+	 * left unreclaimed when we return.
+	 */
+	flush_delayed_work_sync(&mp->m_reclaim_work);
 	cancel_work_sync(&mp->m_flush_work);
 }
 
-- 
1.7.9


* [PATCH 5/8] xfs: don't cache inodes read through bulkstat
  2012-03-22  5:15 [PATCH 0/8] xfs: outstanding patches for 3.4 merge window Dave Chinner
                   ` (3 preceding siblings ...)
  2012-03-22  5:15 ` [PATCH 4/8] xfs: remove MS_ACTIVE guard from inode reclaim work Dave Chinner
@ 2012-03-22  5:15 ` Dave Chinner
  2012-03-22  5:15 ` [PATCH 6/8] xfs: Account log unmount transaction correctly Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-22  5:15 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we read inodes via bulkstat, we generally only read them once
and then throw them away - they never get used again. If we retain
them in cache, then it simply causes the working set of inodes and
other cached items to be reclaimed just so the inode cache can grow.

Avoid this problem by marking inodes read by bulkstat not to be
cached and check this flag in .drop_inode to determine whether the
inode should be added to the VFS LRU or not. If the inode lookup
hits an already cached inode, then don't set the flag. If the inode
lookup hits an inode marked with no cache flag, remove the flag and
allow it to be cached once the current reference goes away.

Inodes marked as not cached will get cleaned up by the background
inode reclaim or via memory pressure, so they will still generate
some short term cache pressure. They will, however, be reclaimed
much sooner and in preference to cache hot inodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iget.c   |    8 ++++++--
 fs/xfs/xfs_inode.h  |    4 +++-
 fs/xfs/xfs_itable.c |    3 ++-
 fs/xfs/xfs_super.c  |   17 +++++++++++++++++
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index a98cb45..bcc6c24 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -289,7 +289,7 @@ xfs_iget_cache_hit(
 	if (lock_flags != 0)
 		xfs_ilock(ip, lock_flags);
 
-	xfs_iflags_clear(ip, XFS_ISTALE);
+	xfs_iflags_clear(ip, XFS_ISTALE | XFS_IDONTCACHE);
 	XFS_STATS_INC(xs_ig_found);
 
 	return 0;
@@ -314,6 +314,7 @@ xfs_iget_cache_miss(
 	struct xfs_inode	*ip;
 	int			error;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ino);
+	int			iflags;
 
 	ip = xfs_inode_alloc(mp, ino);
 	if (!ip)
@@ -358,8 +359,11 @@ xfs_iget_cache_miss(
 	 * memory barrier that ensures this detection works correctly at lookup
 	 * time.
 	 */
+	iflags = XFS_INEW;
+	if (flags & XFS_IGET_DONTCACHE)
+		iflags |= XFS_IDONTCACHE;
 	ip->i_udquot = ip->i_gdquot = NULL;
-	xfs_iflags_set(ip, XFS_INEW);
+	xfs_iflags_set(ip, iflags);
 
 	/* insert the new inode */
 	spin_lock(&pag->pag_ici_lock);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index f123dbe..7fee338 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -387,10 +387,11 @@ xfs_set_projid(struct xfs_inode *ip,
 #define XFS_IFLOCK		(1 << __XFS_IFLOCK_BIT)
 #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
+#define XFS_IDONTCACHE		(1 << 9) /* don't cache the inode long term */
 
 /*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
- * inode lookup. Thi prevents unintended behaviour on the new inode from
+ * inode lookup. This prevents unintended behaviour on the new inode from
  * ocurring.
  */
 #define XFS_IRECLAIM_RESET_FLAGS	\
@@ -553,6 +554,7 @@ do { \
  */
 #define XFS_IGET_CREATE		0x1
 #define XFS_IGET_UNTRUSTED	0x2
+#define XFS_IGET_DONTCACHE	0x4
 
 int		xfs_inotobp(struct xfs_mount *, struct xfs_trans *,
 			    xfs_ino_t, struct xfs_dinode **,
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 9720c54..acc2bf2 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -75,7 +75,8 @@ xfs_bulkstat_one_int(
 		return XFS_ERROR(ENOMEM);
 
 	error = xfs_iget(mp, NULL, ino,
-			 XFS_IGET_UNTRUSTED, XFS_ILOCK_SHARED, &ip);
+			 (XFS_IGET_DONTCACHE | XFS_IGET_UNTRUSTED),
+			 XFS_ILOCK_SHARED, &ip);
 	if (error) {
 		*stat = BULKSTAT_RV_NOTHING;
 		goto out_free;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 3e87b02..aef50ab 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -950,6 +950,22 @@ xfs_fs_evict_inode(
 	xfs_inactive(ip);
 }
 
+/*
+ * We do an unlocked check for XFS_IDONTCACHE here because we are already
+ * serialised against cache hits here via the inode->i_lock and igrab() in
+ * xfs_iget_cache_hit(). Hence a lookup that might clear this flag will not be
+ * racing with us, and it avoids needing to grab a spinlock here for every inode
+ * we drop the final reference on.
+ */
+STATIC int
+xfs_fs_drop_inode(
+	struct inode		*inode)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	return generic_drop_inode(inode) || (ip->i_flags & XFS_IDONTCACHE);
+}
+
 STATIC void
 xfs_free_fsname(
 	struct xfs_mount	*mp)
@@ -1433,6 +1449,7 @@ static const struct super_operations xfs_super_operations = {
 	.destroy_inode		= xfs_fs_destroy_inode,
 	.dirty_inode		= xfs_fs_dirty_inode,
 	.evict_inode		= xfs_fs_evict_inode,
+	.drop_inode		= xfs_fs_drop_inode,
 	.put_super		= xfs_fs_put_super,
 	.sync_fs		= xfs_fs_sync_fs,
 	.freeze_fs		= xfs_fs_freeze,
-- 
1.7.9


* [PATCH 6/8] xfs: Account log unmount transaction correctly
  2012-03-22  5:15 [PATCH 0/8] xfs: outstanding patches for 3.4 merge window Dave Chinner
                   ` (4 preceding siblings ...)
  2012-03-22  5:15 ` [PATCH 5/8] xfs: don't cache inodes read through bulkstat Dave Chinner
@ 2012-03-22  5:15 ` Dave Chinner
  2012-03-24  8:27   ` Christoph Hellwig
  2012-03-26 22:28   ` Ben Myers
  2012-03-22  5:15 ` [PATCH 7/8] xfs: fix fstrim offset calculations Dave Chinner
  2012-03-22  5:15 ` [PATCH 8/8] xfs: add lots of attribute trace points Dave Chinner
  7 siblings, 2 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-22  5:15 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

There have been a few reports of this warning appearing recently:

XFS (dm-4): xlog_space_left: head behind tail
 tail_cycle = 129, tail_bytes = 20163072
 GH   cycle = 129, GH   bytes = 20162880

The common cause appears to be lots of freeze and unfreeze cycles,
and the output from the warnings indicates that we are leaking
around 8 bytes of log space per freeze/unfreeze cycle.

When we freeze the filesystem, we write an unmount record and that
uses xlog_write directly - a special type of transaction,
effectively. What it doesn't do, however, is correctly account for
the log space it uses. The unmount record writes an 8 byte structure
with a special magic number into the log, and the space this
consumes is not accounted for in the log ticket tracking the
operation. Hence we leak 8 bytes every unmount record that is
written.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 98a9cb5..6db1fef 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -726,8 +726,9 @@ xfs_log_unmount_write(xfs_mount_t *mp)
 				.lv_iovecp = &reg,
 			};
 
-			/* remove inited flag */
+			/* remove inited flag, and account for space used */
 			tic->t_flags = 0;
+			tic->t_curr_res -= sizeof(magic);
 			error = xlog_write(log, &vec, tic, &lsn,
 					   NULL, XLOG_UNMOUNT_TRANS);
 			/*
-- 
1.7.9


* [PATCH 7/8] xfs: fix fstrim offset calculations
  2012-03-22  5:15 [PATCH 0/8] xfs: outstanding patches for 3.4 merge window Dave Chinner
                   ` (5 preceding siblings ...)
  2012-03-22  5:15 ` [PATCH 6/8] xfs: Account log unmount transaction correctly Dave Chinner
@ 2012-03-22  5:15 ` Dave Chinner
  2012-03-24 13:36   ` Christoph Hellwig
  2012-03-27 20:48   ` Ben Myers
  2012-03-22  5:15 ` [PATCH 8/8] xfs: add lots of attribute trace points Dave Chinner
  7 siblings, 2 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-22  5:15 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_ioc_fstrim() doesn't treat the incoming offset and length
correctly. It treats them as a filesystem block address, rather than
a disk address. This is wrong because the range passed in is a
linear representation , while the filesystem block address notiation
is a sparse representation. Hence we cannot convert the range direct
to filesystem block units and then use that for calculating the
range to trim.

While this sounds dangerous, the problem is limited to calculating
which AGs need to be trimmed. The code that calculates the actual
ranges to trim gets the right result (i.e. only ever discards free
space), even though it uses the wrong ranges to limit what is
trimmed. Hence this is not a bug that endangers user data.

Fix this by treating the range as a disk address range and use the
appropriate functions to convert the range into the desired formats
for calculations.

Further, fix the first free extent lookup (the longest) to actually
find the largest free extent. Currently this lookup uses a <=
lookup, which results in finding the extent to the left of the
largest because we can never get an exact match on the largest
extent. This is because while we know its size, we don't know its
location, and so the exact match fails and we move
one record to the left to get the next largest extent. Instead, use
a >= search so that the lookup returns the largest extent regardless
of the fact we don't get an exact match on it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_alloc.c   |    2 +-
 fs/xfs/xfs_alloc.h   |    7 +++++
 fs/xfs/xfs_discard.c |   61 +++++++++++++++++++++++++++++++------------------
 3 files changed, 46 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index 31e9033..0f0df27 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -69,7 +69,7 @@ xfs_alloc_lookup_eq(
  * Lookup the first record greater than or equal to [bno, len]
  * in the btree given by cur.
  */
-STATIC int				/* error */
+int				/* error */
 xfs_alloc_lookup_ge(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
 	xfs_agblock_t		bno,	/* starting block of extent */
diff --git a/fs/xfs/xfs_alloc.h b/fs/xfs/xfs_alloc.h
index ab5d0fd..3a7e7d8 100644
--- a/fs/xfs/xfs_alloc.h
+++ b/fs/xfs/xfs_alloc.h
@@ -248,6 +248,13 @@ xfs_alloc_lookup_le(
 	xfs_extlen_t		len,	/* length of extent */
 	int			*stat);	/* success/failure */
 
+int				/* error */
+xfs_alloc_lookup_ge(
+	struct xfs_btree_cur	*cur,	/* btree cursor */
+	xfs_agblock_t		bno,	/* starting block of extent */
+	xfs_extlen_t		len,	/* length of extent */
+	int			*stat);	/* success/failure */
+
 int					/* error */
 xfs_alloc_get_rec(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 286a051..1ad3a4b 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -37,9 +37,9 @@ STATIC int
 xfs_trim_extents(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
-	xfs_fsblock_t		start,
-	xfs_fsblock_t		end,
-	xfs_fsblock_t		minlen,
+	xfs_daddr_t		start,
+	xfs_daddr_t		end,
+	xfs_daddr_t		minlen,
 	__uint64_t		*blocks_trimmed)
 {
 	struct block_device	*bdev = mp->m_ddev_targp->bt_bdev;
@@ -67,7 +67,7 @@ xfs_trim_extents(
 	/*
 	 * Look up the longest btree in the AGF and start with it.
 	 */
-	error = xfs_alloc_lookup_le(cur, 0,
+	error = xfs_alloc_lookup_ge(cur, 0,
 			    be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_longest), &i);
 	if (error)
 		goto out_del_cursor;
@@ -77,8 +77,10 @@ xfs_trim_extents(
 	 * enough to be worth discarding.
 	 */
 	while (i) {
-		xfs_agblock_t fbno;
-		xfs_extlen_t flen;
+		xfs_agblock_t	fbno;
+		xfs_extlen_t	flen;
+		xfs_daddr_t	dbno;
+		xfs_extlen_t	dlen;
 
 		error = xfs_alloc_get_rec(cur, &fbno, &flen, &i);
 		if (error)
@@ -87,9 +89,17 @@ xfs_trim_extents(
 		ASSERT(flen <= be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_longest));
 
 		/*
+		 * use daddr format for all range/len calculations as that is
+		 * the format the range/len variables are supplied in by
+		 * userspace.
+		 */
+		dbno = XFS_AGB_TO_DADDR(mp, agno, fbno);
+		dlen = XFS_FSB_TO_BB(mp, flen);
+
+		/*
 		 * Too small?  Give up.
 		 */
-		if (flen < minlen) {
+		if (dlen < minlen) {
 			trace_xfs_discard_toosmall(mp, agno, fbno, flen);
 			goto out_del_cursor;
 		}
@@ -99,8 +109,7 @@ xfs_trim_extents(
 		 * supposed to discard skip it.  Do not bother to trim
 		 * down partially overlapping ranges for now.
 		 */
-		if (XFS_AGB_TO_FSB(mp, agno, fbno) + flen < start ||
-		    XFS_AGB_TO_FSB(mp, agno, fbno) > end) {
+		if (dbno + dlen < start || dbno > end) {
 			trace_xfs_discard_exclude(mp, agno, fbno, flen);
 			goto next_extent;
 		}
@@ -115,10 +124,7 @@ xfs_trim_extents(
 		}
 
 		trace_xfs_discard_extent(mp, agno, fbno, flen);
-		error = -blkdev_issue_discard(bdev,
-				XFS_AGB_TO_DADDR(mp, agno, fbno),
-				XFS_FSB_TO_BB(mp, flen),
-				GFP_NOFS, 0);
+		error = -blkdev_issue_discard(bdev, dbno, dlen, GFP_NOFS, 0);
 		if (error)
 			goto out_del_cursor;
 		*blocks_trimmed += flen;
@@ -137,6 +143,15 @@ out_put_perag:
 	return error;
 }
 
+/*
+ * trim a range of the filesystem.
+ *
+ * Note: the parameters passed from userspace are byte ranges into the
+ * filesystem which does not match to the format we use for filesystem block
+ * addressing. FSB addressing is sparse (AGNO|AGBNO), while the incoming format
+ * is a linear address range. Hence we need to use DADDR based conversions and
+ * comparisons for determining the correct offset and regions to trim.
+ */
 int
 xfs_ioc_trim(
 	struct xfs_mount		*mp,
@@ -145,7 +160,7 @@ xfs_ioc_trim(
 	struct request_queue	*q = mp->m_ddev_targp->bt_bdev->bd_disk->queue;
 	unsigned int		granularity = q->limits.discard_granularity;
 	struct fstrim_range	range;
-	xfs_fsblock_t		start, end, minlen;
+	xfs_daddr_t		start, end, minlen;
 	xfs_agnumber_t		start_agno, end_agno, agno;
 	__uint64_t		blocks_trimmed = 0;
 	int			error, last_error = 0;
@@ -159,22 +174,22 @@ xfs_ioc_trim(
 
 	/*
 	 * Truncating down the len isn't actually quite correct, but using
-	 * XFS_B_TO_FSB would mean we trivially get overflows for values
+	 * BBTOB would mean we trivially get overflows for values
 	 * of ULLONG_MAX or slightly lower.  And ULLONG_MAX is the default
 	 * used by the fstrim application.  In the end it really doesn't
 	 * matter as trimming blocks is an advisory interface.
 	 */
-	start = XFS_B_TO_FSBT(mp, range.start);
-	end = start + XFS_B_TO_FSBT(mp, range.len) - 1;
-	minlen = XFS_B_TO_FSB(mp, max_t(u64, granularity, range.minlen));
+	start = BTOBB(range.start);
+	end = start + BTOBBT(range.len) - 1;
+	minlen = BTOBB(max_t(u64, granularity, range.minlen));
 
-	if (start >= mp->m_sb.sb_dblocks)
+	if (XFS_BB_TO_FSB(mp, start) >= mp->m_sb.sb_dblocks)
 		return -XFS_ERROR(EINVAL);
-	if (end > mp->m_sb.sb_dblocks - 1)
-		end = mp->m_sb.sb_dblocks - 1;
+	if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1)
+		end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)- 1;
 
-	start_agno = XFS_FSB_TO_AGNO(mp, start);
-	end_agno = XFS_FSB_TO_AGNO(mp, end);
+	start_agno = xfs_daddr_to_agno(mp, start);
+	end_agno = xfs_daddr_to_agno(mp, end);
 
 	for (agno = start_agno; agno <= end_agno; agno++) {
 		error = -xfs_trim_extents(mp, agno, start, end, minlen,
-- 
1.7.9


* [PATCH 8/8] xfs: add lots of attribute trace points
  2012-03-22  5:15 [PATCH 0/8] xfs: outstanding patches for 3.4 merge window Dave Chinner
                   ` (6 preceding siblings ...)
  2012-03-22  5:15 ` [PATCH 7/8] xfs: fix fstrim offset calculations Dave Chinner
@ 2012-03-22  5:15 ` Dave Chinner
  2012-03-24 13:42   ` Christoph Hellwig
  2012-03-27 21:18   ` Ben Myers
  7 siblings, 2 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-22  5:15 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_attr.c      |   16 ++++++++++++
 fs/xfs/xfs_attr_leaf.c |   40 +++++++++++++++++++++++++++---
 fs/xfs/xfs_da_btree.c  |   32 ++++++++++++++++++++++++
 fs/xfs/xfs_trace.h     |   62 ++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 144 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_attr.c b/fs/xfs/xfs_attr.c
index 08b9ac6..65d61b9 100644
--- a/fs/xfs/xfs_attr.c
+++ b/fs/xfs/xfs_attr.c
@@ -853,6 +853,8 @@ xfs_attr_shortform_addname(xfs_da_args_t *args)
 {
 	int newsize, forkoff, retval;
 
+	trace_xfs_attr_sf_addname(args);
+
 	retval = xfs_attr_shortform_lookup(args);
 	if ((args->flags & ATTR_REPLACE) && (retval == ENOATTR)) {
 		return(retval);
@@ -896,6 +898,8 @@ xfs_attr_leaf_addname(xfs_da_args_t *args)
 	xfs_dabuf_t *bp;
 	int retval, error, committed, forkoff;
 
+	trace_xfs_attr_leaf_addname(args);
+
 	/*
 	 * Read the (only) block in the attribute list in.
 	 */
@@ -920,6 +924,9 @@ xfs_attr_leaf_addname(xfs_da_args_t *args)
 			xfs_da_brelse(args->trans, bp);
 			return(retval);
 		}
+
+		trace_xfs_attr_leaf_replace(args);
+
 		args->op_flags |= XFS_DA_OP_RENAME;	/* an atomic rename */
 		args->blkno2 = args->blkno;		/* set 2nd entry info*/
 		args->index2 = args->index;
@@ -1090,6 +1097,8 @@ xfs_attr_leaf_removename(xfs_da_args_t *args)
 	xfs_dabuf_t *bp;
 	int error, committed, forkoff;
 
+	trace_xfs_attr_leaf_removename(args);
+
 	/*
 	 * Remove the attribute.
 	 */
@@ -1223,6 +1232,8 @@ xfs_attr_node_addname(xfs_da_args_t *args)
 	xfs_mount_t *mp;
 	int committed, retval, error;
 
+	trace_xfs_attr_node_addname(args);
+
 	/*
 	 * Fill in bucket of arguments/results/context to carry around.
 	 */
@@ -1249,6 +1260,9 @@ restart:
 	} else if (retval == EEXIST) {
 		if (args->flags & ATTR_CREATE)
 			goto out;
+
+		trace_xfs_attr_node_replace(args);
+
 		args->op_flags |= XFS_DA_OP_RENAME;	/* atomic rename op */
 		args->blkno2 = args->blkno;		/* set 2nd entry info*/
 		args->index2 = args->index;
@@ -1480,6 +1494,8 @@ xfs_attr_node_removename(xfs_da_args_t *args)
 	xfs_dabuf_t *bp;
 	int retval, error, committed, forkoff;
 
+	trace_xfs_attr_node_removename(args);
+
 	/*
 	 * Tie a string around our finger to remind us where we are.
 	 */
diff --git a/fs/xfs/xfs_attr_leaf.c b/fs/xfs/xfs_attr_leaf.c
index d25eafd..76d93dc 100644
--- a/fs/xfs/xfs_attr_leaf.c
+++ b/fs/xfs/xfs_attr_leaf.c
@@ -235,6 +235,8 @@ xfs_attr_shortform_create(xfs_da_args_t *args)
 	xfs_inode_t *dp;
 	xfs_ifork_t *ifp;
 
+	trace_xfs_attr_sf_create(args);
+
 	dp = args->dp;
 	ASSERT(dp != NULL);
 	ifp = dp->i_afp;
@@ -268,6 +270,8 @@ xfs_attr_shortform_add(xfs_da_args_t *args, int forkoff)
 	xfs_inode_t *dp;
 	xfs_ifork_t *ifp;
 
+	trace_xfs_attr_sf_add(args);
+
 	dp = args->dp;
 	mp = dp->i_mount;
 	dp->i_d.di_forkoff = forkoff;
@@ -337,6 +341,8 @@ xfs_attr_shortform_remove(xfs_da_args_t *args)
 	xfs_mount_t *mp;
 	xfs_inode_t *dp;
 
+	trace_xfs_attr_sf_remove(args);
+
 	dp = args->dp;
 	mp = dp->i_mount;
 	base = sizeof(xfs_attr_sf_hdr_t);
@@ -405,6 +411,8 @@ xfs_attr_shortform_lookup(xfs_da_args_t *args)
 	int i;
 	xfs_ifork_t *ifp;
 
+	trace_xfs_attr_sf_lookup(args);
+
 	ifp = args->dp->i_afp;
 	ASSERT(ifp->if_flags & XFS_IFINLINE);
 	sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data;
@@ -476,6 +484,8 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 	xfs_dabuf_t *bp;
 	xfs_ifork_t *ifp;
 
+	trace_xfs_attr_sf_to_leaf(args);
+
 	dp = args->dp;
 	ifp = dp->i_afp;
 	sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data;
@@ -775,6 +785,8 @@ xfs_attr_leaf_to_shortform(xfs_dabuf_t *bp, xfs_da_args_t *args, int forkoff)
 	char *tmpbuffer;
 	int error, i;
 
+	trace_xfs_attr_leaf_to_sf(args);
+
 	dp = args->dp;
 	tmpbuffer = kmem_alloc(XFS_LBSIZE(dp->i_mount), KM_SLEEP);
 	ASSERT(tmpbuffer != NULL);
@@ -848,6 +860,8 @@ xfs_attr_leaf_to_node(xfs_da_args_t *args)
 	xfs_dablk_t blkno;
 	int error;
 
+	trace_xfs_attr_leaf_to_node(args);
+
 	dp = args->dp;
 	bp1 = bp2 = NULL;
 	error = xfs_da_grow_inode(args, &blkno);
@@ -911,6 +925,8 @@ xfs_attr_leaf_create(xfs_da_args_t *args, xfs_dablk_t blkno, xfs_dabuf_t **bpp)
 	xfs_dabuf_t *bp;
 	int error;
 
+	trace_xfs_attr_leaf_create(args);
+
 	dp = args->dp;
 	ASSERT(dp != NULL);
 	error = xfs_da_get_buf(args->trans, args->dp, blkno, -1, &bp,
@@ -948,6 +964,8 @@ xfs_attr_leaf_split(xfs_da_state_t *state, xfs_da_state_blk_t *oldblk,
 	xfs_dablk_t blkno;
 	int error;
 
+	trace_xfs_attr_leaf_split(state->args);
+
 	/*
 	 * Allocate space for a new leaf node.
 	 */
@@ -977,10 +995,13 @@ xfs_attr_leaf_split(xfs_da_state_t *state, xfs_da_state_blk_t *oldblk,
 	 *
 	 * Insert the "new" entry in the correct block.
 	 */
-	if (state->inleaf)
+	if (state->inleaf) {
+		trace_xfs_attr_leaf_add_old(state->args);
 		error = xfs_attr_leaf_add(oldblk->bp, state->args);
-	else
+	} else {
+		trace_xfs_attr_leaf_add_new(state->args);
 		error = xfs_attr_leaf_add(newblk->bp, state->args);
+	}
 
 	/*
 	 * Update last hashval in each block since we added the name.
@@ -1001,6 +1022,8 @@ xfs_attr_leaf_add(xfs_dabuf_t *bp, xfs_da_args_t *args)
 	xfs_attr_leaf_map_t *map;
 	int tablesize, entsize, sum, tmp, i;
 
+	trace_xfs_attr_leaf_add(args);
+
 	leaf = bp->data;
 	ASSERT(leaf->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC));
 	ASSERT((args->index >= 0)
@@ -1128,8 +1151,6 @@ xfs_attr_leaf_add_work(xfs_dabuf_t *bp, xfs_da_args_t *args, int mapindex)
 	       (be32_to_cpu(entry->hashval) <= be32_to_cpu((entry+1)->hashval)));
 
 	/*
-	 * Copy the attribute name and value into the new space.
-	 *
 	 * For "remote" attribute values, simply note that we need to
 	 * allocate space for the "remote" value.  We can't actually
 	 * allocate the extents in this transaction, and we can't decide
@@ -1265,6 +1286,8 @@ xfs_attr_leaf_rebalance(xfs_da_state_t *state, xfs_da_state_blk_t *blk1,
 	ASSERT(leaf2->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC));
 	args = state->args;
 
+	trace_xfs_attr_leaf_rebalance(args);
+
 	/*
 	 * Check ordering of blocks, reverse if it makes things simpler.
 	 *
@@ -1810,6 +1833,8 @@ xfs_attr_leaf_unbalance(xfs_da_state_t *state, xfs_da_state_blk_t *drop_blk,
 	xfs_mount_t *mp;
 	char *tmpbuffer;
 
+	trace_xfs_attr_leaf_unbalance(state->args);
+
 	/*
 	 * Set up environment.
 	 */
@@ -1919,6 +1944,8 @@ xfs_attr_leaf_lookup_int(xfs_dabuf_t *bp, xfs_da_args_t *args)
 	int probe, span;
 	xfs_dahash_t hashval;
 
+	trace_xfs_attr_leaf_lookup(args);
+
 	leaf = bp->data;
 	ASSERT(leaf->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC));
 	ASSERT(be16_to_cpu(leaf->hdr.count)
@@ -2445,6 +2472,7 @@ xfs_attr_leaf_clearflag(xfs_da_args_t *args)
 	char *name;
 #endif /* DEBUG */
 
+	trace_xfs_attr_leaf_clearflag(args);
 	/*
 	 * Set up the operation.
 	 */
@@ -2509,6 +2537,8 @@ xfs_attr_leaf_setflag(xfs_da_args_t *args)
 	xfs_dabuf_t *bp;
 	int error;
 
+	trace_xfs_attr_leaf_setflag(args);
+
 	/*
 	 * Set up the operation.
 	 */
@@ -2565,6 +2595,8 @@ xfs_attr_leaf_flipflags(xfs_da_args_t *args)
 	char *name1, *name2;
 #endif /* DEBUG */
 
+	trace_xfs_attr_leaf_flipflags(args);
+
 	/*
 	 * Read the block containing the "old" attr
 	 */
diff --git a/fs/xfs/xfs_da_btree.c b/fs/xfs/xfs_da_btree.c
index 77c7425..7f1a6f5 100644
--- a/fs/xfs/xfs_da_btree.c
+++ b/fs/xfs/xfs_da_btree.c
@@ -108,6 +108,8 @@ xfs_da_node_create(xfs_da_args_t *args, xfs_dablk_t blkno, int level,
 	int error;
 	xfs_trans_t *tp;
 
+	trace_xfs_da_node_create(args);
+
 	tp = args->trans;
 	error = xfs_da_get_buf(tp, args->dp, blkno, -1, &bp, whichfork);
 	if (error)
@@ -140,6 +142,8 @@ xfs_da_split(xfs_da_state_t *state)
 	xfs_dabuf_t *bp;
 	int max, action, error, i;
 
+	trace_xfs_da_split(state->args);
+
 	/*
 	 * Walk back up the tree splitting/inserting/adjusting as necessary.
 	 * If we need to insert and there isn't room, split the node, then
@@ -178,10 +182,12 @@ xfs_da_split(xfs_da_state_t *state)
 			state->extravalid = 1;
 			if (state->inleaf) {
 				state->extraafter = 0;	/* before newblk */
+				trace_xfs_attr_leaf_split_before(state->args);
 				error = xfs_attr_leaf_split(state, oldblk,
 							    &state->extrablk);
 			} else {
 				state->extraafter = 1;	/* after newblk */
+				trace_xfs_attr_leaf_split_after(state->args);
 				error = xfs_attr_leaf_split(state, newblk,
 							    &state->extrablk);
 			}
@@ -300,6 +306,8 @@ xfs_da_root_split(xfs_da_state_t *state, xfs_da_state_blk_t *blk1,
 	xfs_mount_t *mp;
 	xfs_dir2_leaf_t *leaf;
 
+	trace_xfs_da_root_split(state->args);
+
 	/*
 	 * Copy the existing (incorrect) block from the root node position
 	 * to a free space somewhere.
@@ -380,6 +388,8 @@ xfs_da_node_split(xfs_da_state_t *state, xfs_da_state_blk_t *oldblk,
 	int newcount, error;
 	int useextra;
 
+	trace_xfs_da_node_split(state->args);
+
 	node = oldblk->bp->data;
 	ASSERT(node->hdr.info.magic == cpu_to_be16(XFS_DA_NODE_MAGIC));
 
@@ -466,6 +476,8 @@ xfs_da_node_rebalance(xfs_da_state_t *state, xfs_da_state_blk_t *blk1,
 	int count, tmp;
 	xfs_trans_t *tp;
 
+	trace_xfs_da_node_rebalance(state->args);
+
 	node1 = blk1->bp->data;
 	node2 = blk2->bp->data;
 	/*
@@ -574,6 +586,8 @@ xfs_da_node_add(xfs_da_state_t *state, xfs_da_state_blk_t *oldblk,
 	xfs_da_node_entry_t *btree;
 	int tmp;
 
+	trace_xfs_da_node_add(state->args);
+
 	node = oldblk->bp->data;
 	ASSERT(node->hdr.info.magic == cpu_to_be16(XFS_DA_NODE_MAGIC));
 	ASSERT((oldblk->index >= 0) && (oldblk->index <= be16_to_cpu(node->hdr.count)));
@@ -619,6 +633,8 @@ xfs_da_join(xfs_da_state_t *state)
 	xfs_da_state_blk_t *drop_blk, *save_blk;
 	int action, error;
 
+	trace_xfs_da_join(state->args);
+
 	action = 0;
 	drop_blk = &state->path.blk[ state->path.active-1 ];
 	save_blk = &state->altpath.blk[ state->path.active-1 ];
@@ -723,6 +739,8 @@ xfs_da_root_join(xfs_da_state_t *state, xfs_da_state_blk_t *root_blk)
 	xfs_dabuf_t *bp;
 	int error;
 
+	trace_xfs_da_root_join(state->args);
+
 	args = state->args;
 	ASSERT(args != NULL);
 	ASSERT(root_blk->magic == XFS_DA_NODE_MAGIC);
@@ -941,6 +959,8 @@ xfs_da_node_remove(xfs_da_state_t *state, xfs_da_state_blk_t *drop_blk)
 	xfs_da_node_entry_t *btree;
 	int tmp;
 
+	trace_xfs_da_node_remove(state->args);
+
 	node = drop_blk->bp->data;
 	ASSERT(drop_blk->index < be16_to_cpu(node->hdr.count));
 	ASSERT(drop_blk->index >= 0);
@@ -984,6 +1004,8 @@ xfs_da_node_unbalance(xfs_da_state_t *state, xfs_da_state_blk_t *drop_blk,
 	int tmp;
 	xfs_trans_t *tp;
 
+	trace_xfs_da_node_unbalance(state->args);
+
 	drop_node = drop_blk->bp->data;
 	save_node = save_blk->bp->data;
 	ASSERT(drop_node->hdr.info.magic == cpu_to_be16(XFS_DA_NODE_MAGIC));
@@ -1230,6 +1252,7 @@ xfs_da_blk_link(xfs_da_state_t *state, xfs_da_state_blk_t *old_blk,
 		/*
 		 * Link new block in before existing block.
 		 */
+		trace_xfs_da_link_before(args);
 		new_info->forw = cpu_to_be32(old_blk->blkno);
 		new_info->back = old_info->back;
 		if (old_info->back) {
@@ -1251,6 +1274,7 @@ xfs_da_blk_link(xfs_da_state_t *state, xfs_da_state_blk_t *old_blk,
 		/*
 		 * Link new block in after existing block.
 		 */
+		trace_xfs_da_link_after(args);
 		new_info->forw = old_info->forw;
 		new_info->back = cpu_to_be32(old_blk->blkno);
 		if (old_info->forw) {
@@ -1348,6 +1372,7 @@ xfs_da_blk_unlink(xfs_da_state_t *state, xfs_da_state_blk_t *drop_blk,
 	 * Unlink the leaf block from the doubly linked chain of leaves.
 	 */
 	if (be32_to_cpu(save_info->back) == drop_blk->blkno) {
+		trace_xfs_da_unlink_back(args);
 		save_info->back = drop_info->back;
 		if (drop_info->back) {
 			error = xfs_da_read_buf(args->trans, args->dp,
@@ -1365,6 +1390,7 @@ xfs_da_blk_unlink(xfs_da_state_t *state, xfs_da_state_blk_t *drop_blk,
 			xfs_da_buf_done(bp);
 		}
 	} else {
+		trace_xfs_da_unlink_forward(args);
 		save_info->forw = drop_info->forw;
 		if (drop_info->forw) {
 			error = xfs_da_read_buf(args->trans, args->dp,
@@ -1652,6 +1678,8 @@ xfs_da_grow_inode(
 	int			count;
 	int			error;
 
+	trace_xfs_da_grow_inode(args);
+
 	if (args->whichfork == XFS_DATA_FORK) {
 		bno = args->dp->i_mount->m_dirleafblk;
 		count = args->dp->i_mount->m_dirblkfsbs;
@@ -1690,6 +1718,8 @@ xfs_da_swap_lastblock(xfs_da_args_t *args, xfs_dablk_t *dead_blknop,
 	xfs_dir2_leaf_t *dead_leaf2;
 	xfs_dahash_t dead_hash;
 
+	trace_xfs_da_swap_lastblock(args);
+
 	dead_buf = *dead_bufp;
 	dead_blkno = *dead_blknop;
 	tp = args->trans;
@@ -1878,6 +1908,8 @@ xfs_da_shrink_inode(xfs_da_args_t *args, xfs_dablk_t dead_blkno,
 	xfs_trans_t *tp;
 	xfs_mount_t *mp;
 
+	trace_xfs_da_shrink_inode(args);
+
 	dp = args->dp;
 	w = args->whichfork;
 	tp = args->trans;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 75eb54a..270775f 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1408,7 +1408,7 @@ DEFINE_ALLOC_EVENT(xfs_alloc_vextent_noagbp);
 DEFINE_ALLOC_EVENT(xfs_alloc_vextent_loopfailed);
 DEFINE_ALLOC_EVENT(xfs_alloc_vextent_allfailed);
 
-DECLARE_EVENT_CLASS(xfs_dir2_class,
+DECLARE_EVENT_CLASS(xfs_da_class,
 	TP_PROTO(struct xfs_da_args *args),
 	TP_ARGS(args),
 	TP_STRUCT__entry(
@@ -1443,7 +1443,7 @@ DECLARE_EVENT_CLASS(xfs_dir2_class,
 )
 
 #define DEFINE_DIR2_EVENT(name) \
-DEFINE_EVENT(xfs_dir2_class, name, \
+DEFINE_EVENT(xfs_da_class, name, \
 	TP_PROTO(struct xfs_da_args *args), \
 	TP_ARGS(args))
 DEFINE_DIR2_EVENT(xfs_dir2_sf_addname);
@@ -1472,6 +1472,64 @@ DEFINE_DIR2_EVENT(xfs_dir2_node_replace);
 DEFINE_DIR2_EVENT(xfs_dir2_node_removename);
 DEFINE_DIR2_EVENT(xfs_dir2_node_to_leaf);
 
+#define DEFINE_ATTR_EVENT(name) \
+DEFINE_EVENT(xfs_da_class, name, \
+	TP_PROTO(struct xfs_da_args *args), \
+	TP_ARGS(args))
+DEFINE_ATTR_EVENT(xfs_attr_sf_add);
+DEFINE_ATTR_EVENT(xfs_attr_sf_addname);
+DEFINE_ATTR_EVENT(xfs_attr_sf_create);
+DEFINE_ATTR_EVENT(xfs_attr_sf_lookup);
+DEFINE_ATTR_EVENT(xfs_attr_sf_remove);
+DEFINE_ATTR_EVENT(xfs_attr_sf_removename);
+DEFINE_ATTR_EVENT(xfs_attr_sf_to_leaf);
+
+DEFINE_ATTR_EVENT(xfs_attr_leaf_add);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_add_old);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_add_new);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_addname);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_create);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_lookup);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_replace);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_removename);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_split);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_split_before);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_split_after);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_clearflag);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_setflag);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_flipflags);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_to_sf);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_to_node);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_rebalance);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_unbalance);
+
+DEFINE_ATTR_EVENT(xfs_attr_node_addname);
+DEFINE_ATTR_EVENT(xfs_attr_node_lookup);
+DEFINE_ATTR_EVENT(xfs_attr_node_replace);
+DEFINE_ATTR_EVENT(xfs_attr_node_removename);
+
+#define DEFINE_DA_EVENT(name) \
+DEFINE_EVENT(xfs_da_class, name, \
+	TP_PROTO(struct xfs_da_args *args), \
+	TP_ARGS(args))
+DEFINE_DA_EVENT(xfs_da_split);
+DEFINE_DA_EVENT(xfs_da_join);
+DEFINE_DA_EVENT(xfs_da_link_before);
+DEFINE_DA_EVENT(xfs_da_link_after);
+DEFINE_DA_EVENT(xfs_da_unlink_back);
+DEFINE_DA_EVENT(xfs_da_unlink_forward);
+DEFINE_DA_EVENT(xfs_da_root_split);
+DEFINE_DA_EVENT(xfs_da_root_join);
+DEFINE_DA_EVENT(xfs_da_node_add);
+DEFINE_DA_EVENT(xfs_da_node_create);
+DEFINE_DA_EVENT(xfs_da_node_split);
+DEFINE_DA_EVENT(xfs_da_node_remove);
+DEFINE_DA_EVENT(xfs_da_node_rebalance);
+DEFINE_DA_EVENT(xfs_da_node_unbalance);
+DEFINE_DA_EVENT(xfs_da_swap_lastblock);
+DEFINE_DA_EVENT(xfs_da_grow_inode);
+DEFINE_DA_EVENT(xfs_da_shrink_inode);
+
 DECLARE_EVENT_CLASS(xfs_dir2_space_class,
 	TP_PROTO(struct xfs_da_args *args, int idx),
 	TP_ARGS(args, idx),
-- 
1.7.9


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-22  5:15 ` [PATCH 3/8] xfs: initialise xfssync work before running quotachecks Dave Chinner
@ 2012-03-22 15:15   ` Ben Myers
  2012-03-22 21:07     ` Dave Chinner
  2012-03-28 17:38   ` [PATCH 3/8] xfs: initialise xfssync work before running quotachecks Ben Myers
  1 sibling, 1 reply; 31+ messages in thread
From: Ben Myers @ 2012-03-22 15:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Mar 22, 2012 at 04:15:08PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because the mount process can run a quotacheck and consume lots of
> inodes, we need to be able to run periodic inode reclaim during the
> mount process. This will prevent running the system out of memory
> during quota checks.
> 
> This essentially reverts 2bcf6e97, but that is safe to do now that
> the quota sync code that was causing problems during long quotacheck
> executions is now gone.

Dave, I've held off on #s 3 and 4 because they appear to be racy.  Being
able to run inode reclaim during quota check seems like a good idea
though.

-Ben

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_super.c |   17 +++++++++--------
>  1 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 2c3cc2e..a08c56d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1346,31 +1346,33 @@ xfs_fs_fill_super(
>  	sb->s_time_gran = 1;
>  	set_posix_acl_flag(sb);
>  
> -	error = xfs_mountfs(mp);
> +	error = xfs_syncd_init(mp);
>  	if (error)
>  		goto out_filestream_unmount;
>  
> -	error = xfs_syncd_init(mp);
> +	error = xfs_mountfs(mp);
>  	if (error)
> -		goto out_unmount;
> +		goto out_syncd_stop;
>  
>  	root = igrab(VFS_I(mp->m_rootip));
>  	if (!root) {
>  		error = ENOENT;
> -		goto out_syncd_stop;
> +		goto out_unmount;
>  	}
>  	if (is_bad_inode(root)) {
>  		error = EINVAL;
> -		goto out_syncd_stop;
> +		goto out_unmount;
>  	}
>  	sb->s_root = d_make_root(root);
>  	if (!sb->s_root) {
>  		error = ENOMEM;
> -		goto out_syncd_stop;
> +		goto out_unmount;
>  	}
>  
>  	return 0;
>  
> + out_syncd_stop:
> +	xfs_syncd_stop(mp);
>   out_filestream_unmount:
>  	xfs_filestream_unmount(mp);
>   out_free_sb:
> @@ -1387,8 +1389,6 @@ out_destroy_workqueues:
>   out:
>  	return -error;
>  
> - out_syncd_stop:
> -	xfs_syncd_stop(mp);
>   out_unmount:
>  	/*
>  	 * Blow away any referenced inode in the filestreams cache.
> @@ -1400,6 +1400,7 @@ out_destroy_workqueues:
>  	xfs_flush_buftarg(mp->m_ddev_targp, 1);
>  
>  	xfs_unmountfs(mp);
> +	xfs_syncd_stop(mp);
>  	goto out_free_sb;
>  }
>  
> -- 
> 1.7.9
> 


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-22 15:15   ` Ben Myers
@ 2012-03-22 21:07     ` Dave Chinner
  2012-03-23 13:34       ` Mark Tinguely
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2012-03-22 21:07 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Thu, Mar 22, 2012 at 10:15:48AM -0500, Ben Myers wrote:
> On Thu, Mar 22, 2012 at 04:15:08PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because the mount process can run a quotacheck and consume lots of
> > inodes, we need to be able to run periodic inode reclaim during the
> > mount process. This will prevent running the system out of memory
> > during quota checks.
> > 
> > This essentially reverts 2bcf6e97, but that is safe to do now that
> > the quota sync code that was causing problems during long quotacheck
> > executions is now gone.
> 
> Dave, I've held off on #s 3 and 4 because they appear to be racy.  Being

What race?

Cheers,

Dave
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-22 21:07     ` Dave Chinner
@ 2012-03-23 13:34       ` Mark Tinguely
  2012-03-25 23:22         ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Mark Tinguely @ 2012-03-23 13:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ben Myers, xfs

On 03/22/12 16:07, Dave Chinner wrote:
> On Thu, Mar 22, 2012 at 10:15:48AM -0500, Ben Myers wrote:
>> On Thu, Mar 22, 2012 at 04:15:08PM +1100, Dave Chinner wrote:
>>> From: Dave Chinner<dchinner@redhat.com>
>>>
>>> Because the mount process can run a quotacheck and consume lots of
>>> inodes, we need to be able to run periodic inode reclaim during the
>>> mount process. This will prevent running the system out of memory
>>> during quota checks.
>>>
>>> This essentially reverts 2bcf6e97, but that is safe to do now that
>>> the quota sync code that was causing problems during long quotacheck
>>> executions is now gone.
>>
>> Dave, I've held off on #s 3 and 4 because they appear to be racy.  Being
>
> What race?
>
> Cheers,
>
> Dave


2 of the sync workers use iterators
  xfs_inode_ag_iterator()
   xfs_perag_get()
    radix_tree_lookup(&mp->m_perag_tree, agno)

The race I was worried about was with xfs_mount() initializing the
mp->m_perag_lock and the radix tree:
  INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC).

A lock is taken and 2 or 3 unbuffered I/Os are performed in xfs_mountfs()
before the mp->m_perag_tree is initialized.

I was also looking at the xfs_perag_t being allocated in mountfs() and 
deallocated in umountfs(), but it turns out that is not important: 
xfs_perag_get() will return NULL if these have not been allocated yet or 
have been removed for the required AG.

Mark Tinguely.


* Re: [PATCH 6/8] xfs: Account log unmount transaction correctly
  2012-03-22  5:15 ` [PATCH 6/8] xfs: Account log unmount transaction correctly Dave Chinner
@ 2012-03-24  8:27   ` Christoph Hellwig
  2012-03-24 22:49     ` Dave Chinner
  2012-03-26 22:28   ` Ben Myers
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2012-03-24  8:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good.

Can we add a testcase doing lots of freeze/unfreeze cycles with no other
load to xfstests to verify the fix?


Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 7/8] xfs: fix fstrim offset calculations
  2012-03-22  5:15 ` [PATCH 7/8] xfs: fix fstrim offset calculations Dave Chinner
@ 2012-03-24 13:36   ` Christoph Hellwig
  2012-03-27 20:48   ` Ben Myers
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2012-03-24 13:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Mar 22, 2012 at 04:15:12PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_ioc_fstrim() doesn't treat the incoming offset and length
> correctly. It treats them as a filesystem block address, rather than
> a disk address. This is wrong because the range passed in is a
> linear representation, while the filesystem block address notation
> is a sparse representation. Hence we cannot convert the range directly
> to filesystem block units and then use that for calculating the
> range to trim.
> 
> While this sounds dangerous, the problem is limited to calculating
> what AGs need to be trimmed. The code that calculates the actual
> ranges to trim gets the right result (i.e. only ever discards free
> space), even though it uses the wrong ranges to limit what is
> trimmed. Hence this is not a bug that endangers user data.
> 
> Fix this by treating the range as a disk address range and use the
> appropriate functions to convert the range into the desired formats
> for calculations.
> 
> Further, fix the first free extent lookup (the longest) to actually
> find the largest free extent. Currently this lookup uses a <=
> lookup, which results in finding the extent to the left of the
> largest because we can never get an exact match on the largest
> extent. This is due to the fact that while we know its size, we
> don't know its location and so the exact match fails and we move
> one record to the left to get the next largest extent. Instead, use
> a >= search so that the lookup returns the largest extent regardless
> of the fact we don't get an exact match on it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good with the minor nitpick below.

Reviewed-by: Christoph Hellwig <hch@lst.de>

> -	error = xfs_alloc_lookup_le(cur, 0,
> +	error = xfs_alloc_lookup_ge(cur, 0,
>  			    be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_longest), &i);

After this change xfs_alloc_lookup_le can be marked static in
xfs_alloc.c


* Re: [PATCH 8/8] xfs: add lots of attribute trace points
  2012-03-22  5:15 ` [PATCH 8/8] xfs: add lots of attribute trace points Dave Chinner
@ 2012-03-24 13:42   ` Christoph Hellwig
  2012-03-27 21:18   ` Ben Myers
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2012-03-24 13:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Mar 22, 2012 at 04:15:13PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks mostly good; minor comments below:

>  	/*
> -	 * Copy the attribute name and value into the new space.
> -	 *

Why does this get removed?

>  
>  #define DEFINE_DIR2_EVENT(name) \
> -DEFINE_EVENT(xfs_dir2_class, name, \
> +DEFINE_EVENT(xfs_da_class, name, \
>  	TP_PROTO(struct xfs_da_args *args), \
>  	TP_ARGS(args))


> +#define DEFINE_ATTR_EVENT(name) \
> +DEFINE_EVENT(xfs_da_class, name, \
> +	TP_PROTO(struct xfs_da_args *args), \
> +	TP_ARGS(args))

> +#define DEFINE_DA_EVENT(name) \
> +DEFINE_EVENT(xfs_da_class, name, \
> +	TP_PROTO(struct xfs_da_args *args), \
> +	TP_ARGS(args))

We should use the DEFINE_DA_EVENT macro for all of these tracepoints,
instead of having two duplicates of it.

Also all the tracepoints should print of which fork we operate on,
which is useful at least for the ones in common code.


* Re: [PATCH 6/8] xfs: Account log unmount transaction correctly
  2012-03-24  8:27   ` Christoph Hellwig
@ 2012-03-24 22:49     ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-24 22:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Sat, Mar 24, 2012 at 04:27:00AM -0400, Christoph Hellwig wrote:
> Looks good.
> 
> Can we add a testcase doing lots of freeze/unfreeze cycles with no other
> load to xftests to verify the fix?

It's just a trivial while loop that needs about a hundred
freeze/unfreeze ops to trigger on a 10GB filesystem, so I'll add a
test for that...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-23 13:34       ` Mark Tinguely
@ 2012-03-25 23:22         ` Dave Chinner
  2012-03-26 15:10           ` Mark Tinguely
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2012-03-25 23:22 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Ben Myers, xfs

On Fri, Mar 23, 2012 at 08:34:31AM -0500, Mark Tinguely wrote:
> On 03/22/12 16:07, Dave Chinner wrote:
> >On Thu, Mar 22, 2012 at 10:15:48AM -0500, Ben Myers wrote:
> >>On Thu, Mar 22, 2012 at 04:15:08PM +1100, Dave Chinner wrote:
> >>>From: Dave Chinner<dchinner@redhat.com>
> >>>
> >>>Because the mount process can run a quotacheck and consume lots of
> >>>inodes, we need to be able to run periodic inode reclaim during the
> >>>mount process. This will prevent running the system out of memory
> >>>during quota checks.
> >>>
> >>>This essentially reverts 2bcf6e97, but that is safe to do now that
> >>>the quota sync code that was causing problems during long quotacheck
> >>>executions is now gone.
> >>
> >>Dave, I've held off on #s 3 and 4 because they appear to be racy.  Being
> >
> >What race?
> >
> >Cheers,
> >
> >Dave
> 
> 
> 2 of the sync workers use iterators
>  xfs_inode_ag_iterator()
>   xfs_perag_get()
>    radix_tree_lookup(&mp->m_perag_tree, agno)
> 
> The race I was worried about was in xfs_mount() to initialize the
> mp->m_perag_lock, and the radix tree initialization:
>  INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC)).
> 
> There is a lock and 2 or 3 unbuffered I/O are performed in xfs_mountfs()
> before the mp->m_perag_tree is initialized.

Yes, they are uncached IOs, so they do not utilise the cache that
requires the mp->m_perag_tree to be initialised.

> I was also looking at the xfs_perag_t being allocated in mountfs()
> and being deallocated in umountfs(), but it turns out that is not
> important, xfs_perag_get() will return NULL if these have not been
> allocated yet or have been removed for the required ag.

Correct. The other side of that is that if xfssyncd is doing some
form of inode cache iteration, it will simply not find any perag
structures to scani and hence won't cause problems. Same with the
reclaim worker.

Were there any other issues?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-25 23:22         ` Dave Chinner
@ 2012-03-26 15:10           ` Mark Tinguely
  2012-03-26 21:57             ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Mark Tinguely @ 2012-03-26 15:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ben Myers, xfs

On 03/25/12 18:22, Dave Chinner wrote:
> On Fri, Mar 23, 2012 at 08:34:31AM -0500, Mark Tinguely wrote:
>> >  On 03/22/12 16:07, Dave Chinner wrote:
>>> >  >On Thu, Mar 22, 2012 at 10:15:48AM -0500, Ben Myers wrote:
>>>> >  >>On Thu, Mar 22, 2012 at 04:15:08PM +1100, Dave Chinner wrote:
>>>>> >  >>>From: Dave Chinner<dchinner@redhat.com>
>>>>> >  >>>
>>>>> >  >>>Because the mount process can run a quotacheck and consume lots of
>>>>> >  >>>inodes, we need to be able to run periodic inode reclaim during the
>>>>> >  >>>mount process. This will prevent running the system out of memory
>>>>> >  >>>during quota checks.
>>>>> >  >>>
>>>>> >  >>>This essentially reverts 2bcf6e97, but that is safe to do now that
>>>>> >  >>>the quota sync code that was causing problems during long quotacheck
>>>>> >  >>>executions is now gone.
>>>> >  >>
>>>> >  >>Dave, I've held off on #s 3 and 4 because they appear to be racy.  Being
>>> >  >
>>> >  >What race?
>>> >  >
>>> >  >Cheers,
>>> >  >
>>> >  >Dave
>> >
>> >
>> >  2 of the sync workers use iterators
>> >    xfs_inode_ag_iterator()
>> >     xfs_perag_get()
>> >      radix_tree_lookup(&mp->m_perag_tree, agno)
>> >
>> >  The race I was worried about was in xfs_mount() to initialize the
>> >  mp->m_perag_lock, and the radix tree initialization:
>> >    INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC)).
>> >
>> >  There is a lock and 2 or 3 unbuffered I/O are performed in xfs_mountfs()
>> >  before the mp->m_perag_tree is initialized.
> Yes they are uncached IOs so do not utilise the cache that
> requires the mp->m_perag_tree to be initialised.

The point I was trying to make is that the sync workers use iterators.
The race is to get the mp->m_perag_tree initialized before one of the
sync workers tries to do an xfs_perag_get().

I mentioned the lock and the 2 or 3 unbuffered I/Os because they are
the potential items that can take some time between starting the sync
workers and initializing the m_perag_tree radix tree.

>> >  I was also looking at the xfs_perag_t being allocated in mountfs()
>> >  and being deallocated in umountfs(), but it turns out that is not
>> >  important, xfs_perag_get() will return NULL if these have not been
>> >  allocated yet or have been removed for the required ag.
> Correct. The other side of that is that if xfssyncd is doing some
> form of inode cache iteration, it will simply not find any perag
> structures to scan and hence won't cause problems. Same with the
> reclaim worker.
>
> Were there any other issues?
>
> Cheers,
>
> Dave.
> -- Dave Chinner david@fromorbit.com

Thanks,

--Mark Tinguely


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-26 15:10           ` Mark Tinguely
@ 2012-03-26 21:57             ` Dave Chinner
  2012-03-28 19:40               ` Ben Myers
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2012-03-26 21:57 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Ben Myers, xfs

On Mon, Mar 26, 2012 at 10:10:50AM -0500, Mark Tinguely wrote:
> On 03/25/12 18:22, Dave Chinner wrote:
> >On Fri, Mar 23, 2012 at 08:34:31AM -0500, Mark Tinguely wrote:
> >>>  On 03/22/12 16:07, Dave Chinner wrote:
> >>>>  >On Thu, Mar 22, 2012 at 10:15:48AM -0500, Ben Myers wrote:
> >>>>>  >>On Thu, Mar 22, 2012 at 04:15:08PM +1100, Dave Chinner wrote:
> >>>>>>  >>>From: Dave Chinner<dchinner@redhat.com>
> >>>>>>  >>>
> >>>>>>  >>>Because the mount process can run a quotacheck and consume lots of
> >>>>>>  >>>inodes, we need to be able to run periodic inode reclaim during the
> >>>>>>  >>>mount process. This will prevent running the system out of memory
> >>>>>>  >>>during quota checks.
> >>>>>>  >>>
> >>>>>>  >>>This essentially reverts 2bcf6e97, but that is safe to do now that
> >>>>>>  >>>the quota sync code that was causing problems during long quotacheck
> >>>>>>  >>>executions is now gone.
> >>>>>  >>
> >>>>>  >>Dave, I've held off on #s 3 and 4 because they appear to be racy.  Being
> >>>>  >
> >>>>  >What race?
> >>>>  >
> >>>>  >Cheers,
> >>>>  >
> >>>>  >Dave
> >>>
> >>>
> >>>  2 of the sync workers use iterators
> >>>    xfs_inode_ag_iterator()
> >>>     xfs_perag_get()
> >>>      radix_tree_lookup(&mp->m_perag_tree, agno)
> >>>
> >>>  The race I was worried about was in xfs_mount() to initialize the
> >>>  mp->m_perag_lock, and the radix tree initialization:
> >>>    INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC)).
> >>>
> >>>  There is a lock and 2 or 3 unbuffered I/O are performed in xfs_mountfs()
> >>>  before the mp->m_perag_tree is initialized.
> >Yes they are uncached IOs so do not utilise the cache that
> >requires the mp->m_perag_tree to be initialised.
> 
> The point I was trying to make is the sync workers use iterators.
> The race is to get the mp->m_perag_tree initialized before one of
> the sync workers tries to do a xfs_perag_get().

Firstly, xfs_sync_worker does not iterate AGs at all anymore - it
pushes the log and the AIL, and nothing else. So there are no
problems there.

Secondly xfs_flush_worker() is only triggered by ENOSPC, and that
can't happen until the filesystem is mounted and real work starts.

Finally, the reclaim worker does iterate the perag tree, but the
next patch in the series ensures that it is started on demand, not from
xfs_syncd_init(). This ensures that iteration does not occur
until after the first inode is placed into reclaim, and that must
happen after the perag tree is initialised because otherwise we
can't read in inodes, let alone put them into a reclaim state....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 6/8] xfs: Account log unmount transaction correctly
  2012-03-22  5:15 ` [PATCH 6/8] xfs: Account log unmount transaction correctly Dave Chinner
  2012-03-24  8:27   ` Christoph Hellwig
@ 2012-03-26 22:28   ` Ben Myers
  1 sibling, 0 replies; 31+ messages in thread
From: Ben Myers @ 2012-03-26 22:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Mar 22, 2012 at 04:15:11PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> There have been a few reports of this warning appearing recently:
> 
> XFS (dm-4): xlog_space_left: head behind tail
>  tail_cycle = 129, tail_bytes = 20163072
>  GH   cycle = 129, GH   bytes = 20162880
> 
> The common cause appears to be lots of freeze and unfreeze cycles,
> and the output from the warnings indicates that we are leaking
> around 8 bytes of log space per freeze/unfreeze cycle.
> 
> When we freeze the filesystem, we write an unmount record and that
> uses xlog_write directly - a special type of transaction,
> effectively. What it doesn't do, however, is correctly account for
> the log space it uses. The unmount record writes an 8 byte structure
> with a special magic number into the log, and the space this
> consumes is not accounted for in the log ticket tracking the
> operation. Hence we leak 8 bytes every unmount record that is
> written.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good.
Reviewed-by: Ben Myers <bpm@sgi.com>


* Re: [PATCH 7/8] xfs: fix fstrim offset calculations
  2012-03-22  5:15 ` [PATCH 7/8] xfs: fix fstrim offset calculations Dave Chinner
  2012-03-24 13:36   ` Christoph Hellwig
@ 2012-03-27 20:48   ` Ben Myers
  2012-03-27 21:42     ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread
From: Ben Myers @ 2012-03-27 20:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Mar 22, 2012 at 04:15:12PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_ioc_fstrim() doesn't treat the incoming offset and length
> correctly. It treats them as a filesystem block address, rather than
> a disk address. This is wrong because the range passed in is a
> linear representation, while the filesystem block address notation
> is a sparse representation. Hence we cannot convert the range directly
> to filesystem block units and then use that for calculating the
> range to trim.
> 
> While this sounds dangerous, the problem is limited to calculating
> what AGs need to be trimmed. The code that calculates the actual
> ranges to trim gets the right result (i.e. only ever discards free
> space), even though it uses the wrong ranges to limit what is
> trimmed. Hence this is not a bug that endangers user data.

Yep, I can see that the calculation of what we pass to blkdev_issue_discard()
is correct and always a free extent.  I am having a hard time seeing the
problem related to calculating which AGs to trim.  Can you give an example?

> Fix this by treating the range as a disk address range and use the
> appropriate functions to convert the range into the desired formats
> for calculations.
> 
> Further, fix the first free extent lookup (the longest) to actually
> find the largest free extent. Currently this lookup uses a <=
> lookup, which results in finding the extent to the left of the
> largest because we can never get an exact match on the largest
> extent. This is due to the fact that while we know its size, we
> don't know its location and so the exact match fails and we move
> one record to the left to get the next largest extent. Instead, use
> a >= search so that the lookup returns the largest extent regardless
> of the fact we don't get an exact match on it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Ben Myers <bpm@sgi.com>


* Re: [PATCH 8/8] xfs: add lots of attribute trace points
  2012-03-22  5:15 ` [PATCH 8/8] xfs: add lots of attribute trace points Dave Chinner
  2012-03-24 13:42   ` Christoph Hellwig
@ 2012-03-27 21:18   ` Ben Myers
  2012-03-27 21:45     ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread
From: Ben Myers @ 2012-03-27 21:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Mar 22, 2012 at 04:15:13PM +1100, Dave Chinner wrote:
> +DEFINE_ATTR_EVENT(xfs_attr_sf_removename);

This one is unused ^^^


* Re: [PATCH 7/8] xfs: fix fstrim offset calculations
  2012-03-27 20:48   ` Ben Myers
@ 2012-03-27 21:42     ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-27 21:42 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Tue, Mar 27, 2012 at 03:48:25PM -0500, Ben Myers wrote:
> On Thu, Mar 22, 2012 at 04:15:12PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > xfs_ioc_fstrim() doesn't treat the incoming offset and length
> > correctly. It treats them as a filesystem block address, rather than
> > a disk address. This is wrong because the range passed in is a
> > linear representation, while the filesystem block address notation
> > is a sparse representation. Hence we cannot convert the range directly
> > to filesystem block units and then use that for calculating the
> > range to trim.
> > 
> > While this sounds dangerous, the problem is limited to calculating
> > what AGs need to be trimmed. The code that calculates the actual
> > ranges to trim gets the right result (i.e. only ever discards free
> > space), even though it uses the wrong ranges to limit what is
> > trimmed. Hence this is not a bug that endangers user data.
> 
> Yep, I can see that the calculation of what we pass to blkdev_issue_discard()
> is correct and always a free extent.  I am having a hard time seeing the
> problem related to calculating which AGs to trim.  Can you give an example?

I don't have the debug traces anymore, but the problem is this
sort of thing. Take an 80MB filesystem with 4 AGs, each AG is 20MB,
which is ~5000 filesystem blocks. That means we need 13 bits to
store the block count per AG, i.e. agblklog = 13. Now, the FSB
addressing format is sparse, and the calculation is this:

  FSBNO = (AGNO << agblklog) | AGBNO

Note the terminology? FSBNO != FSB. FSB is just a range converted to
filesystem block units. FSBNO is the filesystem block number, an
address.

         offset                                  offset + length
            +-------------------------------------------+
  range:    0                                           80MB
  daddr:    0                                          160k
  FSB:      0                                           20k

  AG:       +----------+----------+----------+----------+
                 0          1          2          3
  AGBNO:    0         5k
                       0         5k
		                  0         5k
				             0         5k
  FSBNO:   0          5k
                       8k       13k
                                  16k      21k
				             24k       29k

IOWs, the FSBNO range looks like this:

            +----------+   +----------+   +----------+   +----------+
            0         5k   8k       13k   16k      21k   24k      29k

And there are regions that are simply invalid (the empty, sparse
bits). This is done to make all the mathematics easy within each AG
as you can convert from the FSBNO straight to the AGBNO (and vice
versa) without needing to know the address of the first block of the
AG. It means it is easy for AGs to manage their own space without
needing to care about where they exist in the larger disk address
space - that is completely abstracted away from the internal freespace
and inode management as they all use AGBNO notation to reference
blocks within the AG.

As a result, the FSBNO range of the filesystem is quite a bit larger
than the FSB range of the filesystem. So, if we trim a byte range of
0 to 80MB, but treat that as a FSBNO and then convert it to an AGNO,
80MB = 20k FSBs = AG 2.

Hence rather than trimming the entire range of AGs (0-3), we trim
0-2. Hence we need to convert the byte range to a daddr range, and
from there extract the AGNO according to FSBNO encoding.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 8/8] xfs: add lots of attribute trace points
  2012-03-27 21:18   ` Ben Myers
@ 2012-03-27 21:45     ` Dave Chinner
  2012-03-27 22:01       ` Ben Myers
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2012-03-27 21:45 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Tue, Mar 27, 2012 at 04:18:44PM -0500, Ben Myers wrote:
> On Thu, Mar 22, 2012 at 04:15:13PM +1100, Dave Chinner wrote:
> > +DEFINE_ATTR_EVENT(xfs_attr_sf_removename);
> 
> This one is unused ^^^

An oversight due to copy-n-paste enthusiasm. Just remove it, or do
you want me to send a new patch with it removed?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 8/8] xfs: add lots of attribute trace points
  2012-03-27 21:45     ` Dave Chinner
@ 2012-03-27 22:01       ` Ben Myers
  0 siblings, 0 replies; 31+ messages in thread
From: Ben Myers @ 2012-03-27 22:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Mar 28, 2012 at 08:45:01AM +1100, Dave Chinner wrote:
> On Tue, Mar 27, 2012 at 04:18:44PM -0500, Ben Myers wrote:
> > On Thu, Mar 22, 2012 at 04:15:13PM +1100, Dave Chinner wrote:
> > > +DEFINE_ATTR_EVENT(xfs_attr_sf_removename);
> > 
> > This one is unused ^^^
> 
> An oversight due to copy-n-paste enthusiasm. Just remove it, or do
> you want me to send a new patch with it removed?

I can remove it, thanks.

-Ben


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-22  5:15 ` [PATCH 3/8] xfs: initialise xfssync work before running quotachecks Dave Chinner
  2012-03-22 15:15   ` Ben Myers
@ 2012-03-28 17:38   ` Ben Myers
  2012-03-28 18:02     ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Ben Myers @ 2012-03-28 17:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Mar 22, 2012 at 04:15:08PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because the mount process can run a quotacheck and consume lots of
> inodes, we need to be able to run periodic inode reclaim during the
> mount process. This will prevent running the system out of memory
> during quota checks.
> 
> This essentially reverts 2bcf6e97, but that is safe to do now that
> the quota sync code that was causing problems during long quotacheck
> executions is now gone.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_super.c |   17 +++++++++--------
>  1 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 2c3cc2e..a08c56d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1346,31 +1346,33 @@ xfs_fs_fill_super(
>  	sb->s_time_gran = 1;
>  	set_posix_acl_flag(sb);
>  
> -	error = xfs_mountfs(mp);
> +	error = xfs_syncd_init(mp);
>  	if (error)
>  		goto out_filestream_unmount;
>  
> -	error = xfs_syncd_init(mp);
> +	error = xfs_mountfs(mp);
>  	if (error)
> -		goto out_unmount;
> +		goto out_syncd_stop;
>  
>  	root = igrab(VFS_I(mp->m_rootip));
>  	if (!root) {
>  		error = ENOENT;
> -		goto out_syncd_stop;
> +		goto out_unmount;
>  	}
>  	if (is_bad_inode(root)) {
>  		error = EINVAL;
> -		goto out_syncd_stop;
> +		goto out_unmount;
>  	}
>  	sb->s_root = d_make_root(root);
		     ^^^^ 

I haven't seen a patch to use d_make_root instead of d_alloc_root, so we'll
take the earlier version of this patch... this one doesn't apply.  ;)

-Ben


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-28 17:38   ` [PATCH 3/8] xfs: initialise xfssync work before running quotachecks Ben Myers
@ 2012-03-28 18:02     ` Christoph Hellwig
  2012-03-28 18:43       ` Ben Myers
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2012-03-28 18:02 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Wed, Mar 28, 2012 at 12:38:29PM -0500, Ben Myers wrote:
> I haven't seen a patch to use d_make_root instead of d_alloc_root, so we'll
> take the earlier version of this patch... this one doesn't apply.  ;)

d_alloc_root is gone in mainline.  Maybe get the next pull request out
without this patch, and then send it post -rc1 on a tree based upon
rc1?


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-28 18:02     ` Christoph Hellwig
@ 2012-03-28 18:43       ` Ben Myers
  0 siblings, 0 replies; 31+ messages in thread
From: Ben Myers @ 2012-03-28 18:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Mar 28, 2012 at 02:02:03PM -0400, Christoph Hellwig wrote:
> On Wed, Mar 28, 2012 at 12:38:29PM -0500, Ben Myers wrote:
> > I haven't seen a patch to use d_make_root instead of d_alloc_root, so we'll
> > take the earlier version of this patch... this one doesn't apply.  ;)
> 
> d_alloc_root is gone in mainline.  Maybe get the next pull request out
> without this patch, and then send it post -rc1 on a tree based upon
> rc1?

That sounds good to me.

-Ben


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-26 21:57             ` Dave Chinner
@ 2012-03-28 19:40               ` Ben Myers
  2012-03-29  0:29                 ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2012-03-28 19:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Mark Tinguely, xfs

Hi Dave,

On Tue, Mar 27, 2012 at 08:57:09AM +1100, Dave Chinner wrote:
> On Mon, Mar 26, 2012 at 10:10:50AM -0500, Mark Tinguely wrote:
> > On 03/25/12 18:22, Dave Chinner wrote:
> > >On Fri, Mar 23, 2012 at 08:34:31AM -0500, Mark Tinguely wrote:
> > >>>  On 03/22/12 16:07, Dave Chinner wrote:
> > >>>>  >On Thu, Mar 22, 2012 at 10:15:48AM -0500, Ben Myers wrote:
> > >>>>>  >>On Thu, Mar 22, 2012 at 04:15:08PM +1100, Dave Chinner wrote:
> > >>>>>>  >>>From: Dave Chinner<dchinner@redhat.com>
> > >>>>>>  >>>
> > >>>>>>  >>>Because the mount process can run a quotacheck and consume lots of
> > >>>>>>  >>>inodes, we need to be able to run periodic inode reclaim during the
> > >>>>>>  >>>mount process. This will prevent running the system out of memory
> > >>>>>>  >>>during quota checks.
> > >>>>>>  >>>
> > >>>>>>  >>>This essentially reverts 2bcf6e97, but that is safe to do now that
> > >>>>>>  >>>the quota sync code that was causing problems during long quotacheck
> > >>>>>>  >>>executions is now gone.
> > >>>>>  >>
> > >>>>>  >>Dave, I've held off on #s 3 and 4 because they appear to be racy.  Being
> > >>>>  >
> > >>>>  >What race?
> > >>>>  >
> > >>>>  >Cheers,
> > >>>>  >
> > >>>>  >Dave
> > >>>
> > >>>
> > >>>  2 of the sync workers use iterators
> > >>>    xfs_inode_ag_iterator()
> > >>>     xfs_perag_get()
> > >>>      radix_tree_lookup(&mp->m_perag_tree, agno)
> > >>>
> > >>>  The race I was worried about was in xfs_mount() to initialize the
> > >>>  mp->m_perag_lock, and the radix tree initialization:
> > >>>    INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC)).
> > >>>
> > >>>  There is a lock and 2 or 3 unbuffered I/O are performed in xfs_mountfs()
> > >>>  before the mp->m_perag_tree is initialized.
> > >Yes they are uncached IOs so do not utilise the cache that
> > >requires the mp->m_perag_tree to be initialised.
> > 
> > The point I was trying to make is the sync workers use iterators.
> > The race is to get the mp->m_perag_tree initialized before one of
> > the sync workers tries to do a xfs_perag_get().
> 
> Firstly, xfs_sync_worker does not iterate AGs at all anymore - it
> pushes the log and the AIL, and nothing else. So there is no
> problems there.

xfs_sync_worker forces the log and pushes the ail, and a sync is queued in
xfs_syncd_init before either the log or ail are initialized.  A sync should
not be queued before the log and ail are initialized regardless of the
value of xfs_syncd_centisecs.  In the error path of xfs_fs_fill_super a
sync could still be running after xfs_unmount is called, so there is a
window there where it could dereference m_log which had been set to NULL.

> Secondly xfs_flush_worker() is only triggered by ENOSPC, and that
> can't happen until the filesystem is mounted and real work starts.

xfs_flush_worker is triggered by xfs_flush_inodes on ENOSPC in xfs_create
and xfs_iomap_write_delay.  I agree that upon startup one would not be
able to trigger an ENOSPC event from either of these codepaths until the
filesystem has mounted.  Further, if we hit any error in fill_super you
could not possibly trigger ENOSPC because the root inode had not been
allocated yet.

> Finally, the reclaim worker does iterate the perag tree,

xfs_reclaim_worker has a similar issue with the perag tree in this patch.
It could look at the tree before it has been initialized.

> but the next patch in the series ensures that is started on demand, not
> from xfs_syncd_init().  This ensures that iteration does not occur until
> after the first inode is placed into reclaim, and that must happen after
> the perag tree is initialised because otherwise we can't read in inodes,
> let alone put them into a reclaim state....

I suggest 3 and 4 should be combined into one patch.

You've reordered xfs_syncd_stop with respect to xfs_unmount in the error
path of xfs_fs_fill_super, but not in the regular unmount path
xfs_fs_put_super.  I think for consistency they should not be reordered in
the error path of xfs_fs_fill_super.

As long as workers can run before xfs_mountfs is run, they need to protect
themselves to ensure that the structures they are using are initialized.
It looks like xfs_reclaim_worker would do this in the next patch by using
MS_ACTIVE, but FWICS xfs_sync_worker still does not protect itself.

-Ben


* Re: [PATCH 3/8] xfs: initialise xfssync work before running quotachecks
  2012-03-28 19:40               ` Ben Myers
@ 2012-03-29  0:29                 ` Dave Chinner
  2012-03-29  6:30                   ` [PATCH v2] xfs: Ensure inode reclaim can run during quotacheck Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2012-03-29  0:29 UTC (permalink / raw)
  To: Ben Myers; +Cc: Mark Tinguely, xfs

On Wed, Mar 28, 2012 at 02:40:18PM -0500, Ben Myers wrote:

[ snip repeated arguments about startup races ]

[ insert repeated responses about "fixed in next patch" ]

[ wait for circle to go around again ]

> I suggest 3 and 4 should be combined into one patch.

Why didn't you just say that's what you want? I have no problem
doing that.  I just kept them separate as there are two logical
changes to the fix - startup order changes, and MS_ACTIVE guard
changes.  Do you want me to send a new combined patch?

FWIW, arguing that a patch in a series is unacceptable because
it introduces a temporary problem that is fixed by another patch in the
series is not really helpful, as is saying "it has races" and not
explaining them. If you want the temporary problematic state removed
from the series, just *say so* - it is easy to do.

> You've reordered xfs_syncd_stop with respect to xfs_unmount in the
> error path of xfs_fs_fill_super, but not in the regular unmount
> path xfs_fs_put_super.  I think for consistency they should not be
> reordered in the error path of xfs_fs_fill_super.

[ the circle goes around again ]

The second patch makes it consistent in that xfs_syncd_stop() is
always called after xfs_unmountfs().

> As long as workers can run before xfs_mountfs is run, they need to protect
> themselves to ensure that the structures they are using are initialized.
> It looks like xfs_reclaim_worker would do this in the next patch by using
> MS_ACTIVE, but FWICS xfs_sync_worker still does not protect itself.

Go back and read the second patch, and then review what you wrote
here.  Don't waste my time by making me have to explain it again....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* [PATCH v2] xfs: Ensure inode reclaim can run during quotacheck
  2012-03-29  0:29                 ` Dave Chinner
@ 2012-03-29  6:30                   ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2012-03-29  6:30 UTC (permalink / raw)
  To: Ben Myers; +Cc: Mark Tinguely, xfs

On Thu, Mar 29, 2012 at 11:29:49AM +1100, Dave Chinner wrote:
> On Wed, Mar 28, 2012 at 02:40:18PM -0500, Ben Myers wrote:
> 
> [ snip repeated arguments about startup races ]
> 
> [ insert repeated responses about "fixed in next patch" ]
> 
> [ wait for circle to go around again ]
> 
> > I suggest 3 and 4 should be combined into one patch.
> 
> Why didn't you just say that's what you want? I have no problem
> doing that.  I just kept them separate as there are two logical
> changes to the fix - startup order changes, and MS_ACTIVE guard
> changes.  Do you want me to send a new combined patch?

Here's a new combined patch against current linus+xfsdev TOT.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: Ensure inode reclaim can run during quotacheck

From: Dave Chinner <dchinner@redhat.com>

Because the mount process can run a quotacheck and consume lots of
inodes, we need to be able to run periodic inode reclaim during the
mount process. This will prevent running the system out of memory
during quota checks.

This essentially reverts 2bcf6e97, but that is safe to do now that
the quota sync code that was causing problems during long quotacheck
executions is now gone.

The reclaim work is currently protected from running during the
unmount process by a check against MS_ACTIVE. Unfortunately, this
also means that the reclaim work cannot run during mount.  The
unmount process should stop the reclaim cleanly before freeing
anything that the reclaim work depends on, so there is no need to
have this guard in place.

Also, the inode reclaim work is demand driven, so there is no need
to start it immediately during mount. It will be started the moment
an inode is queued for reclaim, so quotacheck will trigger it just
fine.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_super.c |   20 ++++++++++----------
 fs/xfs/xfs_sync.c  |   19 +++++++++----------
 2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index dab9a5f..aef50ab 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -981,8 +981,6 @@ xfs_fs_put_super(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
-	xfs_syncd_stop(mp);
-
 	/*
 	 * Blow away any referenced inode in the filestreams cache.
 	 * This can and will cause log traffic as inodes go inactive
@@ -993,6 +991,7 @@ xfs_fs_put_super(
 	xfs_flush_buftarg(mp->m_ddev_targp, 1);
 
 	xfs_unmountfs(mp);
+	xfs_syncd_stop(mp);
 	xfs_freesb(mp);
 	xfs_icsb_destroy_counters(mp);
 	xfs_destroy_mount_workqueues(mp);
@@ -1362,31 +1361,33 @@ xfs_fs_fill_super(
 	sb->s_time_gran = 1;
 	set_posix_acl_flag(sb);
 
-	error = xfs_mountfs(mp);
+	error = xfs_syncd_init(mp);
 	if (error)
 		goto out_filestream_unmount;
 
-	error = xfs_syncd_init(mp);
+	error = xfs_mountfs(mp);
 	if (error)
-		goto out_unmount;
+		goto out_syncd_stop;
 
 	root = igrab(VFS_I(mp->m_rootip));
 	if (!root) {
 		error = ENOENT;
-		goto out_syncd_stop;
+		goto out_unmount;
 	}
 	if (is_bad_inode(root)) {
 		error = EINVAL;
-		goto out_syncd_stop;
+		goto out_unmount;
 	}
 	sb->s_root = d_make_root(root);
 	if (!sb->s_root) {
 		error = ENOMEM;
-		goto out_syncd_stop;
+		goto out_unmount;
 	}
 
 	return 0;
 
+ out_syncd_stop:
+	xfs_syncd_stop(mp);
  out_filestream_unmount:
 	xfs_filestream_unmount(mp);
  out_free_sb:
@@ -1403,8 +1404,6 @@ out_destroy_workqueues:
  out:
 	return -error;
 
- out_syncd_stop:
-	xfs_syncd_stop(mp);
  out_unmount:
 	/*
 	 * Blow away any referenced inode in the filestreams cache.
@@ -1416,6 +1415,7 @@ out_destroy_workqueues:
 	xfs_flush_buftarg(mp->m_ddev_targp, 1);
 
 	xfs_unmountfs(mp);
+	xfs_syncd_stop(mp);
 	goto out_free_sb;
 }
 
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 205ebcb..c318d8a 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -460,7 +460,15 @@ xfs_sync_worker(
 					struct xfs_mount, m_sync_work);
 	int		error;
 
-	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
+	/*
+	 * We shouldn't write/force the log if we are in the mount/unmount
+	 * process or on a read only filesystem. The workqueue still needs to be
+	 * active in both cases, however, because it is used for inode reclaim
+	 * during these times. Hence use the MS_ACTIVE flag to avoid doing
+	 * anything in these periods.
+	 */
+	if (!(mp->m_super->s_flags & MS_ACTIVE) &&
+	    !(mp->m_flags & XFS_MOUNT_RDONLY)) {
 		/* dgc: errors ignored here */
 		if (mp->m_super->s_frozen == SB_UNFROZEN &&
 		    xfs_log_need_covered(mp))
@@ -488,14 +496,6 @@ xfs_syncd_queue_reclaim(
 	struct xfs_mount        *mp)
 {
 
-	/*
-	 * We can have inodes enter reclaim after we've shut down the syncd
-	 * workqueue during unmount, so don't allow reclaim work to be queued
-	 * during unmount.
-	 */
-	if (!(mp->m_super->s_flags & MS_ACTIVE))
-		return;
-
 	rcu_read_lock();
 	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
 		queue_delayed_work(xfs_syncd_wq, &mp->m_reclaim_work,
@@ -564,7 +564,6 @@ xfs_syncd_init(
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 
 	xfs_syncd_queue_sync(mp);
-	xfs_syncd_queue_reclaim(mp);
 
 	return 0;
 }
