* [PATCH 0/10] xfs: various fixes v2
@ 2012-03-07  4:50 Dave Chinner
  2012-03-07  4:50 ` [PATCH 01/10] xfs: clean up minor sparse warnings Dave Chinner
                   ` (9 more replies)
  0 siblings, 10 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

Updates after review:
	- used a local fmode variable in patch 2
	- open coded vmalloc checks in patches 3 and 4, renamed them
	- rewrote quotacheck memory allocation fixup, now three patches
		- first patch starts xfssyncd earlier
		- second patch enables reclaim work during mount and
		  avoids problems with reclaim during unmount.
		- third patch implements inode LRU bypass for inodes
		  read in via bulkstat.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [PATCH 01/10] xfs: clean up minor sparse warnings
  2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
@ 2012-03-07  4:50 ` Dave Chinner
  2012-03-08 21:34   ` Ben Myers
  2012-03-07  4:50 ` [PATCH 02/10] xfs: Fix open flag handling in open_by_handle code Dave Chinner
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
---
 fs/xfs/xfs_dir2_block.c |    1 +
 fs/xfs/xfs_ioctl32.c    |    2 +-
 fs/xfs/xfs_iops.c       |   13 ++++++++-----
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_dir2_block.c b/fs/xfs/xfs_dir2_block.c
index 9245e02..d3b63ae 100644
--- a/fs/xfs/xfs_dir2_block.c
+++ b/fs/xfs/xfs_dir2_block.c
@@ -29,6 +29,7 @@
 #include "xfs_dinode.h"
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
+#include "xfs_dir2.h"
 #include "xfs_dir2_format.h"
 #include "xfs_dir2_priv.h"
 #include "xfs_error.h"
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index f9ccb7b..a849a54 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -293,7 +293,7 @@ xfs_compat_ioc_bulkstat(
 		int res;
 
 		error = xfs_bulkstat_one_compat(mp, inlast, bulkreq.ubuffer,
-				sizeof(compat_xfs_bstat_t), 0, &res);
+				sizeof(compat_xfs_bstat_t), NULL, &res);
 	} else if (cmd == XFS_IOC_FSBULKSTAT_32) {
 		error = xfs_bulkstat(mp, &inlast, &count,
 			xfs_bulkstat_one_compat, sizeof(compat_xfs_bstat_t),
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ab30253..2601d6b 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -103,12 +103,15 @@ xfs_mark_inode_dirty(
 }
 
 
-int xfs_initxattrs(struct inode *inode, const struct xattr *xattr_array,
-		   void *fs_info)
+static int
+xfs_initxattrs(
+	struct inode		*inode,
+	const struct xattr	*xattr_array,
+	void			*fs_info)
 {
-	const struct xattr *xattr;
-	struct xfs_inode *ip = XFS_I(inode);
-	int error = 0;
+	const struct xattr	*xattr;
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error = 0;
 
 	for (xattr = xattr_array; xattr->name != NULL; xattr++) {
 		error = xfs_attr_set(ip, xattr->name, xattr->value,
-- 
1.7.9


* [PATCH 02/10] xfs: Fix open flag handling in open_by_handle code
  2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
  2012-03-07  4:50 ` [PATCH 01/10] xfs: clean up minor sparse warnings Dave Chinner
@ 2012-03-07  4:50 ` Dave Chinner
  2012-03-12 13:27   ` Christoph Hellwig
  2012-03-13 21:15   ` Mark Tinguely
  2012-03-07  4:50 ` [PATCH 03/10] xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get Dave Chinner
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Sparse identified some unsafe handling of open flags in the xfs open
by handle ioctl code. Update the code to use the correct access
macros to ensure that we handle the open flags correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ioctl.c |   14 +++++---------
 1 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 76f3ca5..8dc5fe1 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -209,6 +209,7 @@ xfs_open_by_handle(
 	struct file		*filp;
 	struct inode		*inode;
 	struct dentry		*dentry;
+	fmode_t			fmode;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -XFS_ERROR(EPERM);
@@ -228,26 +229,21 @@ xfs_open_by_handle(
 	hreq->oflags |= O_LARGEFILE;
 #endif
 
-	/* Put open permission in namei format. */
 	permflag = hreq->oflags;
-	if ((permflag+1) & O_ACCMODE)
-		permflag++;
-	if (permflag & O_TRUNC)
-		permflag |= 2;
-
+	fmode = OPEN_FMODE(permflag);
 	if ((!(permflag & O_APPEND) || (permflag & O_TRUNC)) &&
-	    (permflag & FMODE_WRITE) && IS_APPEND(inode)) {
+	    (fmode & FMODE_WRITE) && IS_APPEND(inode)) {
 		error = -XFS_ERROR(EPERM);
 		goto out_dput;
 	}
 
-	if ((permflag & FMODE_WRITE) && IS_IMMUTABLE(inode)) {
+	if ((fmode & FMODE_WRITE) && IS_IMMUTABLE(inode)) {
 		error = -XFS_ERROR(EACCES);
 		goto out_dput;
 	}
 
 	/* Can't write directories. */
-	if (S_ISDIR(inode->i_mode) && (permflag & FMODE_WRITE)) {
+	if (S_ISDIR(inode->i_mode) && (fmode & FMODE_WRITE)) {
 		error = -XFS_ERROR(EISDIR);
 		goto out_dput;
 	}
-- 
1.7.9


* [PATCH 03/10] xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get
  2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
  2012-03-07  4:50 ` [PATCH 01/10] xfs: clean up minor sparse warnings Dave Chinner
  2012-03-07  4:50 ` [PATCH 02/10] xfs: Fix open flag handling in open_by_handle code Dave Chinner
@ 2012-03-07  4:50 ` Dave Chinner
  2012-03-12 13:27   ` Christoph Hellwig
  2012-03-14 18:04   ` Mark Tinguely
  2012-03-07  4:50 ` [PATCH 04/10] xfs: fallback to vmalloc for large buffers in xfs_getbmap Dave Chinner
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

xfsdump uses a large buffer for extended attributes, which has a
kmalloc'd shadow buffer in the kernel. This can fail after the
system has been running for some time as it is a high order
allocation. Add a fallback to vmalloc so that it doesn't require
contiguous memory and so won't randomly fail while xfsdump is
running.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ioctl.c |   14 ++++++++++----
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 8dc5fe1..91f8ff5 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -446,9 +446,12 @@ xfs_attrmulti_attr_get(
 
 	if (*len > XATTR_SIZE_MAX)
 		return EINVAL;
-	kbuf = kmalloc(*len, GFP_KERNEL);
-	if (!kbuf)
-		return ENOMEM;
+	kbuf = kmem_zalloc(*len, KM_SLEEP | KM_MAYFAIL);
+	if (!kbuf) {
+		kbuf = kmem_zalloc_large(*len);
+		if (!kbuf)
+			return ENOMEM;
+	}
 
 	error = xfs_attr_get(XFS_I(inode), name, kbuf, (int *)len, flags);
 	if (error)
@@ -458,7 +461,10 @@ xfs_attrmulti_attr_get(
 		error = EFAULT;
 
  out_kfree:
-	kfree(kbuf);
+	if (is_vmalloc_addr(kbuf))
+		kmem_free_large(kbuf);
+	else
+		kmem_free(kbuf);
 	return error;
 }
 
-- 
1.7.9


* [PATCH 04/10] xfs: fallback to vmalloc for large buffers in xfs_getbmap
  2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
                   ` (2 preceding siblings ...)
  2012-03-07  4:50 ` [PATCH 03/10] xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get Dave Chinner
@ 2012-03-07  4:50 ` Dave Chinner
  2012-03-12 13:28   ` Christoph Hellwig
  2012-03-14 18:12   ` Mark Tinguely
  2012-03-07  4:50 ` [PATCH 05/10] xfs: introduce an allocation workqueue Dave Chinner
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_getbmap uses a large buffer for extents, which is kmalloc'd.
This can fail after the system has been running for some time as it
is a high order allocation. Add a fallback to vmalloc so that it
doesn't require contiguous memory and so won't randomly fail on
files with large extent lists.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap.c |   13 ++++++++++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 188ef2f..3548c6f 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -5536,8 +5536,12 @@ xfs_getbmap(
 	if (bmv->bmv_count > ULONG_MAX / sizeof(struct getbmapx))
 		return XFS_ERROR(ENOMEM);
 	out = kmem_zalloc(bmv->bmv_count * sizeof(struct getbmapx), KM_MAYFAIL);
-	if (!out)
-		return XFS_ERROR(ENOMEM);
+	if (!out) {
+		out = kmem_zalloc_large(bmv->bmv_count *
+					sizeof(struct getbmapx));
+		if (!out)
+			return XFS_ERROR(ENOMEM);
+	}
 
 	xfs_ilock(ip, XFS_IOLOCK_SHARED);
 	if (whichfork == XFS_DATA_FORK && !(iflags & BMV_IF_DELALLOC)) {
@@ -5661,7 +5665,10 @@ xfs_getbmap(
 			break;
 	}
 
-	kmem_free(out);
+	if (is_vmalloc_addr(out))
+		kmem_free_large(out);
+	else
+		kmem_free(out);
 	return error;
 }
 
-- 
1.7.9


* [PATCH 05/10] xfs: introduce an allocation workqueue
  2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
                   ` (3 preceding siblings ...)
  2012-03-07  4:50 ` [PATCH 04/10] xfs: fallback to vmalloc for large buffers in xfs_getbmap Dave Chinner
@ 2012-03-07  4:50 ` Dave Chinner
  2012-03-12 16:16   ` Christoph Hellwig
  2012-03-19 16:47   ` Mark Tinguely
  2012-03-07  4:50 ` [PATCH 06/10] xfs: remove remaining scraps of struct xfs_iomap Dave Chinner
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

We currently have significant issues with the amount of stack that
allocation in XFS uses, especially in the writeback path. We can
easily consume 4k of stack between mapping the page, manipulating
the bmap btree and allocating blocks from the free list. Not to
mention btree block readahead and other functionality that issues IO
in the allocation path.

As a result, we can no longer fit allocation in the writeback path
in the stack space provided on x86_64. To alleviate this problem,
introduce an allocation workqueue and move all allocations to a
separate context. This can be easily added as an interposing layer
into xfs_alloc_vextent(), which takes a single argument structure
and does not return until the allocation is complete or has failed.

To do this, add a work structure and a completion to the allocation
args structure. This allows xfs_alloc_vextent to queue the args onto
the workqueue and wait for it to be completed by the worker. This
can be done completely transparently to the caller.

The worker function needs to ensure that it sets and clears the
PF_TRANS flag appropriately as it is being run in an active
transaction context. Work can also be queued in a memory reclaim
context, so a rescuer is needed for the workqueue.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_alloc.c |   34 +++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_alloc.h |    5 +++++
 fs/xfs/xfs_super.c |   16 ++++++++++++++++
 3 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index ce84ffd..31e9033 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -35,6 +35,7 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 
+struct workqueue_struct *xfs_alloc_wq;
 
 #define XFS_ABSDIFF(a,b)	(((a) <= (b)) ? ((b) - (a)) : ((a) - (b)))
 
@@ -2207,7 +2208,7 @@ xfs_alloc_read_agf(
  * group or loop over the allocation groups to find the result.
  */
 int				/* error */
-xfs_alloc_vextent(
+__xfs_alloc_vextent(
 	xfs_alloc_arg_t	*args)	/* allocation argument structure */
 {
 	xfs_agblock_t	agsize;	/* allocation group size */
@@ -2417,6 +2418,37 @@ error0:
 	return error;
 }
 
+static void
+xfs_alloc_vextent_worker(
+	struct work_struct	*work)
+{
+	struct xfs_alloc_arg	*args = container_of(work,
+						struct xfs_alloc_arg, work);
+	unsigned long		pflags;
+
+	/* we are in a transaction context here */
+	current_set_flags_nested(&pflags, PF_FSTRANS);
+
+	args->result = __xfs_alloc_vextent(args);
+	complete(args->done);
+
+	current_restore_flags_nested(&pflags, PF_FSTRANS);
+}
+
+
+int				/* error */
+xfs_alloc_vextent(
+	xfs_alloc_arg_t	*args)	/* allocation argument structure */
+{
+	DECLARE_COMPLETION_ONSTACK(done);
+
+	args->done = &done;
+	INIT_WORK(&args->work, xfs_alloc_vextent_worker);
+	queue_work(xfs_alloc_wq, &args->work);
+	wait_for_completion(&done);
+	return args->result;
+}
+
 /*
  * Free an extent.
  * Just break up the extent address and hand off to xfs_free_ag_extent
diff --git a/fs/xfs/xfs_alloc.h b/fs/xfs/xfs_alloc.h
index 2f52b92..ab5d0fd 100644
--- a/fs/xfs/xfs_alloc.h
+++ b/fs/xfs/xfs_alloc.h
@@ -25,6 +25,8 @@ struct xfs_perag;
 struct xfs_trans;
 struct xfs_busy_extent;
 
+extern struct workqueue_struct *xfs_alloc_wq;
+
 /*
  * Freespace allocation types.  Argument to xfs_alloc_[v]extent.
  */
@@ -119,6 +121,9 @@ typedef struct xfs_alloc_arg {
 	char		isfl;		/* set if is freelist blocks - !acctg */
 	char		userdata;	/* set if this is user data */
 	xfs_fsblock_t	firstblock;	/* io first block allocated */
+	struct completion *done;
+	struct work_struct work;
+	int		result;
 } xfs_alloc_arg_t;
 
 /*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 5e0d43f..13fa0cf 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1604,12 +1604,28 @@ xfs_init_workqueues(void)
 	xfs_syncd_wq = alloc_workqueue("xfssyncd", WQ_NON_REENTRANT, 0);
 	if (!xfs_syncd_wq)
 		return -ENOMEM;
+
+	/*
+	 * The allocation workqueue can be used in memory reclaim situations
+	 * (writepage path), and parallelism is only limited by the number of
+	 * AGs in all the filesystems mounted. Hence use the default large
+	 * max_active value for this workqueue.
+	 */
+	xfs_alloc_wq = alloc_workqueue("xfsalloc", WQ_MEM_RECLAIM, 0);
+	if (!xfs_alloc_wq)
+		goto out_destroy_syncd;
+
 	return 0;
+
+out_destroy_syncd:
+	destroy_workqueue(xfs_syncd_wq);
+	return -ENOMEM;
 }
 
 STATIC void
 xfs_destroy_workqueues(void)
 {
+	destroy_workqueue(xfs_alloc_wq);
 	destroy_workqueue(xfs_syncd_wq);
 }
 
-- 
1.7.9


* [PATCH 06/10] xfs: remove remaining scraps of struct xfs_iomap
  2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
                   ` (4 preceding siblings ...)
  2012-03-07  4:50 ` [PATCH 05/10] xfs: introduce an allocation workqueue Dave Chinner
@ 2012-03-07  4:50 ` Dave Chinner
  2012-03-15 16:48   ` Mark Tinguely
  2012-03-07  4:50 ` [PATCH 07/10] xfs: fix inode lookup race Dave Chinner
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_vnode.h    |    1 -
 fs/xfs/xfs_vnodeops.h |    3 ---
 2 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_vnode.h b/fs/xfs/xfs_vnode.h
index 7c220b4..db14d0c 100644
--- a/fs/xfs/xfs_vnode.h
+++ b/fs/xfs/xfs_vnode.h
@@ -22,7 +22,6 @@
 
 struct file;
 struct xfs_inode;
-struct xfs_iomap;
 struct attrlist_cursor_kern;
 
 /*
diff --git a/fs/xfs/xfs_vnodeops.h b/fs/xfs/xfs_vnodeops.h
index 0c877cb..447e146 100644
--- a/fs/xfs/xfs_vnodeops.h
+++ b/fs/xfs/xfs_vnodeops.h
@@ -10,7 +10,6 @@ struct kiocb;
 struct pipe_inode_info;
 struct uio;
 struct xfs_inode;
-struct xfs_iomap;
 
 
 int xfs_setattr_nonsize(struct xfs_inode *ip, struct iattr *vap, int flags);
@@ -49,8 +48,6 @@ int xfs_attr_set(struct xfs_inode *dp, const unsigned char *name,
 int xfs_attr_remove(struct xfs_inode *dp, const unsigned char *name, int flags);
 int xfs_attr_list(struct xfs_inode *dp, char *buffer, int bufsize,
 		int flags, struct attrlist_cursor_kern *cursor);
-int xfs_bmap(struct xfs_inode *ip, xfs_off_t offset, ssize_t count,
-		int flags, struct xfs_iomap *iomapp, int *niomaps);
 void xfs_tosspages(struct xfs_inode *inode, xfs_off_t first,
 		xfs_off_t last, int fiopt);
 int xfs_flushinval_pages(struct xfs_inode *ip, xfs_off_t first,
-- 
1.7.9


* [PATCH 07/10] xfs: fix inode lookup race
  2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
                   ` (5 preceding siblings ...)
  2012-03-07  4:50 ` [PATCH 06/10] xfs: remove remaining scraps of struct xfs_iomap Dave Chinner
@ 2012-03-07  4:50 ` Dave Chinner
  2012-03-07  4:50 ` [PATCH 08/10] xfs: initialise xfssync work before running quotachecks Dave Chinner
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we get concurrent lookups of the same inode that is not in the
per-AG inode cache, there is a race condition that triggers warnings
in unlock_new_inode() indicating that we are initialising an inode
that isn't in the correct state for a new inode.

When we do an inode lookup via a file handle or a bulkstat, we don't
serialise lookups at a higher level through the dentry cache (i.e.
pathless lookup), and so we can get concurrent lookups of the same
inode.

The race condition is between the insertion of the inode into the
cache in the case of a cache miss and a concurrent lookup:

Thread 1			Thread 2
xfs_iget()
  xfs_iget_cache_miss()
    xfs_iread()
    lock radix tree
    radix_tree_insert()
				rcu_read_lock
				radix_tree_lookup
				lock inode flags
				XFS_INEW not set
				igrab()
				unlock inode flags
				rcu_read_unlock
				use uninitialised inode
				.....
    lock inode flags
    set XFS_INEW
    unlock inode flags
    unlock radix tree
  xfs_setup_inode()
    inode flags = I_NEW
    unlock_new_inode()
      WARNING as inode flags != I_NEW

This can lead to inode corruption, inode list corruption, etc, and
is generally a bad thing to occur.

Fix this by setting XFS_INEW before inserting the inode into the
radix tree. This will ensure any concurrent lookup will find the new
inode with XFS_INEW set and that forces the lookup to wait until the
XFS_INEW flag is removed before allowing the lookup to succeed.

cc: <stable@vger.kernel.org> # for 3.0.x, 3.2.x
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
---
 fs/xfs/xfs_iget.c |   18 ++++++++++++------
 1 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index 37f22da..93fc1dc 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -350,9 +350,20 @@ xfs_iget_cache_miss(
 			BUG();
 	}
 
-	spin_lock(&pag->pag_ici_lock);
+	/*
+	 * These values must be set before inserting the inode into the radix
+	 * tree as the moment it is inserted a concurrent lookup (allowed by the
+	 * RCU locking mechanism) can find it and that lookup must see that this
+	 * is an inode currently under construction (i.e. that XFS_INEW is set).
+	 * The ip->i_flags_lock that protects the XFS_INEW flag forms the
+	 * memory barrier that ensures this detection works correctly at lookup
+	 * time.
+	 */
+	ip->i_udquot = ip->i_gdquot = NULL;
+	xfs_iflags_set(ip, XFS_INEW);
 
 	/* insert the new inode */
+	spin_lock(&pag->pag_ici_lock);
 	error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
 	if (unlikely(error)) {
 		WARN_ON(error != -EEXIST);
@@ -360,11 +371,6 @@ xfs_iget_cache_miss(
 		error = EAGAIN;
 		goto out_preload_end;
 	}
-
-	/* These values _must_ be set before releasing the radix tree lock! */
-	ip->i_udquot = ip->i_gdquot = NULL;
-	xfs_iflags_set(ip, XFS_INEW);
-
 	spin_unlock(&pag->pag_ici_lock);
 	radix_tree_preload_end();
 
-- 
1.7.9


* [PATCH 08/10] xfs: initialise xfssync work before running quotachecks
  2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
                   ` (6 preceding siblings ...)
  2012-03-07  4:50 ` [PATCH 07/10] xfs: fix inode lookup race Dave Chinner
@ 2012-03-07  4:50 ` Dave Chinner
  2012-03-12 13:28   ` Christoph Hellwig
  2012-03-16 17:07   ` Mark Tinguely
  2012-03-07  4:50 ` [PATCH 09/10] xfs: remove MS_ACTIVE guard from inode reclaim work Dave Chinner
  2012-03-07  4:50 ` [PATCH 10/10] xfs: don't cache inodes read through bulkstat Dave Chinner
  9 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Because the mount process can run a quotacheck and consume lots of
inodes, we need to be able to run periodic inode reclaim during the
mount process. This will prevent running the system out of memory
during quota checks.

This essentially reverts 2bcf6e97, but that is safe to do now that
the quota sync code that was causing problems during long quotacheck
executions is now gone.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_super.c |   15 ++++++++-------
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 13fa0cf..150d8f4 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1343,22 +1343,22 @@ xfs_fs_fill_super(
 	sb->s_time_gran = 1;
 	set_posix_acl_flag(sb);
 
-	error = xfs_mountfs(mp);
+	error = xfs_syncd_init(mp);
 	if (error)
 		goto out_filestream_unmount;
 
-	error = xfs_syncd_init(mp);
+	error = xfs_mountfs(mp);
 	if (error)
-		goto out_unmount;
+		goto out_syncd_stop;
 
 	root = igrab(VFS_I(mp->m_rootip));
 	if (!root) {
 		error = ENOENT;
-		goto out_syncd_stop;
+		goto out_unmount;
 	}
 	if (is_bad_inode(root)) {
 		error = EINVAL;
-		goto out_syncd_stop;
+		goto out_unmount;
 	}
 	sb->s_root = d_alloc_root(root);
 	if (!sb->s_root) {
@@ -1368,6 +1368,8 @@ xfs_fs_fill_super(
 
 	return 0;
 
+ out_syncd_stop:
+	xfs_syncd_stop(mp);
  out_filestream_unmount:
 	xfs_filestream_unmount(mp);
  out_free_sb:
@@ -1384,8 +1386,6 @@ xfs_fs_fill_super(
 
  out_iput:
 	iput(root);
- out_syncd_stop:
-	xfs_syncd_stop(mp);
  out_unmount:
 	/*
 	 * Blow away any referenced inode in the filestreams cache.
@@ -1397,6 +1397,7 @@ xfs_fs_fill_super(
 	xfs_flush_buftarg(mp->m_ddev_targp, 1);
 
 	xfs_unmountfs(mp);
+	xfs_syncd_stop(mp);
 	goto out_free_sb;
 }
 
-- 
1.7.9


* [PATCH 09/10] xfs: remove MS_ACTIVE guard from inode reclaim work
  2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
                   ` (7 preceding siblings ...)
  2012-03-07  4:50 ` [PATCH 08/10] xfs: initialise xfssync work before running quotachecks Dave Chinner
@ 2012-03-07  4:50 ` Dave Chinner
  2012-03-12 13:30   ` Christoph Hellwig
  2012-03-07  4:50 ` [PATCH 10/10] xfs: don't cache inodes read through bulkstat Dave Chinner
  9 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

We need to be able to queue inode reclaim work during the mount
process as quotacheck can cause large amounts of inodes to be read
and we need to clean them up periodically as the shrinkers can not
run until after the mount process has completed.

The reclaim work is currently protected from running during the
unmount process by a check against MS_ACTIVE. Unfortunately, this
also means that the reclaim work cannot run during mount.  The
unmount process should stop the reclaim cleanly before freeing
anything that the reclaim work depends on, so there is no need to
have this guard in place.

Also, the inode reclaim work is demand driven, so there is no need
to start it immediately during mount. It will be started the moment
an inode is queued for reclaim, so quotacheck will trigger it just
fine.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_super.c |    3 +--
 fs/xfs/xfs_sync.c  |   27 ++++++++++++++++-----------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 150d8f4..b1df512 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -968,8 +968,6 @@ xfs_fs_put_super(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
-	xfs_syncd_stop(mp);
-
 	/*
 	 * Blow away any referenced inode in the filestreams cache.
 	 * This can and will cause log traffic as inodes go inactive
@@ -980,6 +978,7 @@ xfs_fs_put_super(
 	xfs_flush_buftarg(mp->m_ddev_targp, 1);
 
 	xfs_unmountfs(mp);
+	xfs_syncd_stop(mp);
 	xfs_freesb(mp);
 	xfs_icsb_destroy_counters(mp);
 	xfs_close_devices(mp);
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 71bf846..08967e9 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -496,7 +496,15 @@ xfs_sync_worker(
 					struct xfs_mount, m_sync_work);
 	int		error;
 
-	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
+	/*
+	 * We shouldn't write/force the log if we are in the mount/unmount
+	 * process or on a read only filesystem. The workqueue still needs to be
+	 * active in both cases, however, because it is used for inode reclaim
+	 * during these times. hence use the MS_ACTIVE flag to avoid doing
+	 * anything in these periods.
+	 */
+	if (!(mp->m_super->s_flags & MS_ACTIVE) &&
+	    !(mp->m_flags & XFS_MOUNT_RDONLY)) {
 		/* dgc: errors ignored here */
 		if (mp->m_super->s_frozen == SB_UNFROZEN &&
 		    xfs_log_need_covered(mp))
@@ -524,14 +532,6 @@ xfs_syncd_queue_reclaim(
 	struct xfs_mount        *mp)
 {
 
-	/*
-	 * We can have inodes enter reclaim after we've shut down the syncd
-	 * workqueue during unmount, so don't allow reclaim work to be queued
-	 * during unmount.
-	 */
-	if (!(mp->m_super->s_flags & MS_ACTIVE))
-		return;
-
 	rcu_read_lock();
 	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
 		queue_delayed_work(xfs_syncd_wq, &mp->m_reclaim_work,
@@ -600,7 +600,6 @@ xfs_syncd_init(
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 
 	xfs_syncd_queue_sync(mp);
-	xfs_syncd_queue_reclaim(mp);
 
 	return 0;
 }
@@ -610,7 +609,13 @@ xfs_syncd_stop(
 	struct xfs_mount	*mp)
 {
 	cancel_delayed_work_sync(&mp->m_sync_work);
-	cancel_delayed_work_sync(&mp->m_reclaim_work);
+
+	/*
+	 * we flush any pending inode reclaim work rather than cancel it here.
+	 * This ensures that there are no clean inodes queued during unmount
+	 * left unreclaimed when we return.
+	 */
+	flush_delayed_work_sync(&mp->m_reclaim_work);
 	cancel_work_sync(&mp->m_flush_work);
 }
 
-- 
1.7.9


* [PATCH 10/10] xfs: don't cache inodes read through bulkstat
  2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
                   ` (8 preceding siblings ...)
  2012-03-07  4:50 ` [PATCH 09/10] xfs: remove MS_ACTIVE guard from inode reclaim work Dave Chinner
@ 2012-03-07  4:50 ` Dave Chinner
  2012-03-12 13:31   ` Christoph Hellwig
                     ` (2 more replies)
  9 siblings, 3 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-07  4:50 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we read inodes via bulkstat, we generally only read them once
and then throw them away - they never get used again. If we retain
them in cache, then it simply causes the working set of inodes and
other cached items to be reclaimed just so the inode cache can grow.

Avoid this problem by marking inodes read by bulkstat as not to be
cached and check this flag in .drop_inode to determine whether the
inode should be added to the VFS LRU or not. If the inode lookup
hits an already cached inode, then don't set the flag. If the inode
lookup hits an inode marked with no cache flag, remove the flag and
allow it to be cached once the current reference goes away.

Inodes marked as not cached will get cleaned up by the background
inode reclaim or via memory pressure, so they will still generate
some short term cache pressure. They will, however, be reclaimed
much sooner and in preference to cache hot inodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iget.c   |    8 ++++++--
 fs/xfs/xfs_inode.h  |    4 +++-
 fs/xfs/xfs_itable.c |    3 ++-
 fs/xfs/xfs_super.c  |   17 +++++++++++++++++
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index 93fc1dc..20ddb1e 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -290,7 +290,7 @@ xfs_iget_cache_hit(
 	if (lock_flags != 0)
 		xfs_ilock(ip, lock_flags);
 
-	xfs_iflags_clear(ip, XFS_ISTALE);
+	xfs_iflags_clear(ip, XFS_ISTALE | XFS_IDONTCACHE);
 	XFS_STATS_INC(xs_ig_found);
 
 	return 0;
@@ -315,6 +315,7 @@ xfs_iget_cache_miss(
 	struct xfs_inode	*ip;
 	int			error;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ino);
+	int			iflags;
 
 	ip = xfs_inode_alloc(mp, ino);
 	if (!ip)
@@ -359,8 +360,11 @@ xfs_iget_cache_miss(
 	 * memory barrier that ensures this detection works correctly at lookup
 	 * time.
 	 */
+	iflags = XFS_INEW;
+	if (flags & XFS_IGET_DONTCACHE)
+		iflags |= XFS_IDONTCACHE;
 	ip->i_udquot = ip->i_gdquot = NULL;
-	xfs_iflags_set(ip, XFS_INEW);
+	xfs_iflags_set(ip, iflags);
 
 	/* insert the new inode */
 	spin_lock(&pag->pag_ici_lock);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index eda4937..096b887 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -374,10 +374,11 @@ xfs_set_projid(struct xfs_inode *ip,
 #define XFS_IFLOCK		(1 << __XFS_IFLOCK_BIT)
 #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
+#define XFS_IDONTCACHE		(1 << 9) /* don't cache the inode long term */
 
 /*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
- * inode lookup. Thi prevents unintended behaviour on the new inode from
+ * inode lookup. This prevents unintended behaviour on the new inode from
  * ocurring.
  */
 #define XFS_IRECLAIM_RESET_FLAGS	\
@@ -544,6 +545,7 @@ do { \
  */
 #define XFS_IGET_CREATE		0x1
 #define XFS_IGET_UNTRUSTED	0x2
+#define XFS_IGET_DONTCACHE	0x4
 
 int		xfs_inotobp(struct xfs_mount *, struct xfs_trans *,
 			    xfs_ino_t, struct xfs_dinode **,
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 751e94f..b832c58 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -76,7 +76,8 @@ xfs_bulkstat_one_int(
 		return XFS_ERROR(ENOMEM);
 
 	error = xfs_iget(mp, NULL, ino,
-			 XFS_IGET_UNTRUSTED, XFS_ILOCK_SHARED, &ip);
+			 (XFS_IGET_DONTCACHE | XFS_IGET_UNTRUSTED),
+			 XFS_ILOCK_SHARED, &ip);
 	if (error) {
 		*stat = BULKSTAT_RV_NOTHING;
 		goto out_free;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b1df512..c162765 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -953,6 +953,22 @@ xfs_fs_evict_inode(
 	xfs_inactive(ip);
 }
 
+/*
+ * We do an unlocked check for XFS_IDONTCACHE here because we are already
+ * serialised against cache hits here via the inode->i_lock and igrab() in
+ * xfs_iget_cache_hit(). Hence a lookup that might clear this flag will not be
+ * racing with us, and it avoids needing to grab a spinlock here for every inode
+ * we drop the final reference on.
+ */
+STATIC int
+xfs_fs_drop_inode(
+	struct inode		*inode)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	return generic_drop_inode(inode) || (ip->i_flags & XFS_IDONTCACHE);
+}
+
 STATIC void
 xfs_free_fsname(
 	struct xfs_mount	*mp)
@@ -1431,6 +1447,7 @@ static const struct super_operations xfs_super_operations = {
 	.dirty_inode		= xfs_fs_dirty_inode,
 	.write_inode		= xfs_fs_write_inode,
 	.evict_inode		= xfs_fs_evict_inode,
+	.drop_inode		= xfs_fs_drop_inode,
 	.put_super		= xfs_fs_put_super,
 	.sync_fs		= xfs_fs_sync_fs,
 	.freeze_fs		= xfs_fs_freeze,
-- 
1.7.9

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH 01/10] xfs: clean up minor sparse warnings
  2012-03-07  4:50 ` [PATCH 01/10] xfs: clean up minor sparse warnings Dave Chinner
@ 2012-03-08 21:34   ` Ben Myers
  2012-03-09  0:30     ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Myers @ 2012-03-08 21:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Mar 07, 2012 at 03:50:19PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Ben Myers <bpm@sgi.com>

...

> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index ab30253..2601d6b 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -103,12 +103,15 @@ xfs_mark_inode_dirty(
>  }
>  
>  
> -int xfs_initxattrs(struct inode *inode, const struct xattr *xattr_array,
> -		   void *fs_info)
> +static int
> +xfs_initxattrs(
> +	struct inode		*inode,
> +	const struct xattr	*xattr_array,
> +	void			*fs_info)
>  {
> -	const struct xattr *xattr;
> -	struct xfs_inode *ip = XFS_I(inode);
> -	int error = 0;
> +	const struct xattr	*xattr;
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	int			error = 0;

I didn't see a sparse warning which this fixes... FWICS all you've done
here is make it static and fix the formatting.

This still looks good..  ;)


* Re: [PATCH 01/10] xfs: clean up minor sparse warnings
  2012-03-08 21:34   ` Ben Myers
@ 2012-03-09  0:30     ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-09  0:30 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Thu, Mar 08, 2012 at 03:34:10PM -0600, Ben Myers wrote:
> On Wed, Mar 07, 2012 at 03:50:19PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Ben Myers <bpm@sgi.com>
> 
> ...
> 
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index ab30253..2601d6b 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -103,12 +103,15 @@ xfs_mark_inode_dirty(
> >  }
> >  
> >  
> > -int xfs_initxattrs(struct inode *inode, const struct xattr *xattr_array,
> > -		   void *fs_info)
> > +static int
> > +xfs_initxattrs(
> > +	struct inode		*inode,
> > +	const struct xattr	*xattr_array,
> > +	void			*fs_info)
> >  {
> > -	const struct xattr *xattr;
> > -	struct xfs_inode *ip = XFS_I(inode);
> > -	int error = 0;
> > +	const struct xattr	*xattr;
> > +	struct xfs_inode	*ip = XFS_I(inode);
> > +	int			error = 0;
> 
> I didn't see a sparse warning which this fixes... FWICS all you've done
> here is make it static and fix the formatting.

It fixes this warning:

fs/xfs/xfs_iops.c:106:5: warning: symbol 'xfs_initxattrs' was not declared. Should it be static?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 02/10] xfs: Fix open flag handling in open_by_handle code
  2012-03-07  4:50 ` [PATCH 02/10] xfs: Fix open flag handling in open_by_handle code Dave Chinner
@ 2012-03-12 13:27   ` Christoph Hellwig
  2012-03-13 21:15   ` Mark Tinguely
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-03-12 13:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Mar 07, 2012 at 03:50:20PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Sparse identified some unsafe handling of open flags in the xfs open
> by handle ioctl code. Update the code to use the correct access
> macros to ensure that we handle the open flags correctly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 03/10] xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get
  2012-03-07  4:50 ` [PATCH 03/10] xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get Dave Chinner
@ 2012-03-12 13:27   ` Christoph Hellwig
  2012-03-14 18:04   ` Mark Tinguely
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-03-12 13:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Mar 07, 2012 at 03:50:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfsdump uses a large buffer for extended attributes, which has a
> kmalloc'd shadow buffer in the kernel. This can fail after the
> system has been running for some time as it is a high order
> allocation. Add a fallback to vmalloc so that it doesn't require
> contiguous memory and so won't randomly fail while xfsdump is
> running.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 04/10] xfs: fallback to vmalloc for large buffers in xfs_getbmap
  2012-03-07  4:50 ` [PATCH 04/10] xfs: fallback to vmalloc for large buffers in xfs_getbmap Dave Chinner
@ 2012-03-12 13:28   ` Christoph Hellwig
  2012-03-14 18:12   ` Mark Tinguely
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-03-12 13:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Mar 07, 2012 at 03:50:22PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_getbmap uses a large buffer for extents, which is kmalloc'd.
> This can fail after the system has been running for some time as it
> is a high order allocation. Add a fallback to vmalloc so that it
> doesn't require contiguous memory and so won't randomly fail on
> files with large extent lists.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 08/10] xfs: initialise xfssync work before running quotachecks
  2012-03-07  4:50 ` [PATCH 08/10] xfs: initialise xfssync work before running quotachecks Dave Chinner
@ 2012-03-12 13:28   ` Christoph Hellwig
  2012-03-16 17:07   ` Mark Tinguely
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-03-12 13:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 09/10] xfs: remove MS_ACTIVE guard from inode reclaim work
  2012-03-07  4:50 ` [PATCH 09/10] xfs: remove MS_ACTIVE guard from inode reclaim work Dave Chinner
@ 2012-03-12 13:30   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-03-12 13:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Mar 07, 2012 at 03:50:27PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We need to be able to queue inode reclaim work during the mount
> process as quotacheck can cause large amounts of inodes to be read
> and we need to clean them up periodically as the shrinkers can not
> run until after the mount process has completed.
> 
> The reclaim work is currently protected from running during the
> unmount process by a check against MS_ACTIVE. Unfortunately, this
> also means that the reclaim work cannot run during mount.  The
> unmount process should stop the reclaim cleanly before freeing
> anything that the reclaim work depends on, so there is no need to
> have this guard in place.
> 
> Also, the inode reclaim work is demand driven, so there is no need
> to start it immediately during mount. It will be started the moment
> an inode is queued for reclaim, so quotacheck will trigger it just
> fine.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 10/10] xfs: don't cache inodes read through bulkstat
  2012-03-07  4:50 ` [PATCH 10/10] xfs: don't cache inodes read through bulkstat Dave Chinner
@ 2012-03-12 13:31   ` Christoph Hellwig
  2012-03-14 20:44   ` Ben Myers
  2012-03-15 18:14   ` Ben Myers
  2 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-03-12 13:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 05/10] xfs: introduce an allocation workqueue
  2012-03-07  4:50 ` [PATCH 05/10] xfs: introduce an allocation workqueue Dave Chinner
@ 2012-03-12 16:16   ` Christoph Hellwig
  2012-03-19 16:47   ` Mark Tinguely
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-03-12 16:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

While I still heavily dislike the idea I can't see how else we can
work around the stack abuse in common code:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 02/10] xfs: Fix open flag handling in open_by_handle code
  2012-03-07  4:50 ` [PATCH 02/10] xfs: Fix open flag handling in open_by_handle code Dave Chinner
  2012-03-12 13:27   ` Christoph Hellwig
@ 2012-03-13 21:15   ` Mark Tinguely
  1 sibling, 0 replies; 32+ messages in thread
From: Mark Tinguely @ 2012-03-13 21:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 03/06/12 22:50, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> Sparse identified some unsafe handling of open flags in the xfs open
> by handle ioctl code. Update the code to use the correct access
> macros to ensure that we handle the open flags correctly.
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> ---

Looks good.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>


* Re: [PATCH 03/10] xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get
  2012-03-07  4:50 ` [PATCH 03/10] xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get Dave Chinner
  2012-03-12 13:27   ` Christoph Hellwig
@ 2012-03-14 18:04   ` Mark Tinguely
  1 sibling, 0 replies; 32+ messages in thread
From: Mark Tinguely @ 2012-03-14 18:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 03/06/12 22:50, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> xfsdump uses a large buffer for extended attributes, which has a
> kmalloc'd shadow buffer in the kernel. This can fail after the
> system has been running for some time as it is a high order
> allocation. Add a fallback to vmalloc so that it doesn't require
> contiguous memory and so won't randomly fail while xfsdump is
> running.
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> ---
>   fs/xfs/xfs_ioctl.c |   14 ++++++++++----
>   1 files changed, 10 insertions(+), 4 deletions(-)


Makes sense.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>



* Re: [PATCH 04/10] xfs: fallback to vmalloc for large buffers in xfs_getbmap
  2012-03-07  4:50 ` [PATCH 04/10] xfs: fallback to vmalloc for large buffers in xfs_getbmap Dave Chinner
  2012-03-12 13:28   ` Christoph Hellwig
@ 2012-03-14 18:12   ` Mark Tinguely
  1 sibling, 0 replies; 32+ messages in thread
From: Mark Tinguely @ 2012-03-14 18:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 03/06/12 22:50, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> xfs_getbmap uses a large buffer for extents, which is kmalloc'd.
> This can fail after the system has been running for some time as it
> is a high order allocation. Add a fallback to vmalloc so that it
> doesn't require contiguous memory and so won't randomly fail on
> files with large extent lists.
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> ---


Also looks good.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>


* Re: [PATCH 10/10] xfs: don't cache inodes read through bulkstat
  2012-03-07  4:50 ` [PATCH 10/10] xfs: don't cache inodes read through bulkstat Dave Chinner
  2012-03-12 13:31   ` Christoph Hellwig
@ 2012-03-14 20:44   ` Ben Myers
  2012-03-15 18:14   ` Ben Myers
  2 siblings, 0 replies; 32+ messages in thread
From: Ben Myers @ 2012-03-14 20:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Mar 07, 2012 at 03:50:28PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we read inodes via bulkstat, we generally only read them once
> and then throw them away - they never get used again. If we retain
> them in cache, then it simply causes the working set of inodes and
> other cached items to be reclaimed just so the inode cache can grow.
> 
> Avoid this problem by marking inodes read by bulkstat as not to be
> cached and check this flag in .drop_inode to determine whether the
> inode should be added to the VFS LRU or not. If the inode lookup
> hits an already cached inode, then don't set the flag. If the inode
> lookup hits an inode marked with no cache flag, remove the flag and
> allow it to be cached once the current reference goes away.
> 
> Inodes marked as not cached will get cleaned up by the background
> inode reclaim or via memory pressure, so they will still generate
> some short term cache pressure. They will, however, be reclaimed
> much sooner and in preference to cache hot inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good.

Reviewed-by: Ben Myers <bpm@sgi.com>

> ---
>  fs/xfs/xfs_iget.c   |    8 ++++++--
>  fs/xfs/xfs_inode.h  |    4 +++-
>  fs/xfs/xfs_itable.c |    3 ++-
>  fs/xfs/xfs_super.c  |   17 +++++++++++++++++
>  4 files changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index 93fc1dc..20ddb1e 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -290,7 +290,7 @@ xfs_iget_cache_hit(
>  	if (lock_flags != 0)
>  		xfs_ilock(ip, lock_flags);
>  
> -	xfs_iflags_clear(ip, XFS_ISTALE);
> +	xfs_iflags_clear(ip, XFS_ISTALE | XFS_IDONTCACHE);
>  	XFS_STATS_INC(xs_ig_found);
>  
>  	return 0;
> @@ -315,6 +315,7 @@ xfs_iget_cache_miss(
>  	struct xfs_inode	*ip;
>  	int			error;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ino);
> +	int			iflags;
>  
>  	ip = xfs_inode_alloc(mp, ino);
>  	if (!ip)
> @@ -359,8 +360,11 @@ xfs_iget_cache_miss(
>  	 * memory barrier that ensures this detection works correctly at lookup
>  	 * time.
>  	 */
> +	iflags = XFS_INEW;
> +	if (flags & XFS_IGET_DONTCACHE)
> +		iflags |= XFS_IDONTCACHE;
>  	ip->i_udquot = ip->i_gdquot = NULL;
> -	xfs_iflags_set(ip, XFS_INEW);
> +	xfs_iflags_set(ip, iflags);
>  
>  	/* insert the new inode */
>  	spin_lock(&pag->pag_ici_lock);
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index eda4937..096b887 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -374,10 +374,11 @@ xfs_set_projid(struct xfs_inode *ip,
>  #define XFS_IFLOCK		(1 << __XFS_IFLOCK_BIT)
>  #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
>  #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
> +#define XFS_IDONTCACHE		(1 << 9) /* don't cache the inode long term */
>  
>  /*
>   * Per-lifetime flags need to be reset when re-using a reclaimable inode during
> - * inode lookup. Thi prevents unintended behaviour on the new inode from
> + * inode lookup. This prevents unintended behaviour on the new inode from
>   * ocurring.
>   */
>  #define XFS_IRECLAIM_RESET_FLAGS	\
> @@ -544,6 +545,7 @@ do { \
>   */
>  #define XFS_IGET_CREATE		0x1
>  #define XFS_IGET_UNTRUSTED	0x2
> +#define XFS_IGET_DONTCACHE	0x4
>  
>  int		xfs_inotobp(struct xfs_mount *, struct xfs_trans *,
>  			    xfs_ino_t, struct xfs_dinode **,
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 751e94f..b832c58 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -76,7 +76,8 @@ xfs_bulkstat_one_int(
>  		return XFS_ERROR(ENOMEM);
>  
>  	error = xfs_iget(mp, NULL, ino,
> -			 XFS_IGET_UNTRUSTED, XFS_ILOCK_SHARED, &ip);
> +			 (XFS_IGET_DONTCACHE | XFS_IGET_UNTRUSTED),
> +			 XFS_ILOCK_SHARED, &ip);
>  	if (error) {
>  		*stat = BULKSTAT_RV_NOTHING;
>  		goto out_free;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index b1df512..c162765 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -953,6 +953,22 @@ xfs_fs_evict_inode(
>  	xfs_inactive(ip);
>  }
>  
> +/*
> + * We do an unlocked check for XFS_IDONTCACHE here because we are already
> + * serialised against cache hits here via the inode->i_lock and igrab() in
> + * xfs_iget_cache_hit(). Hence a lookup that might clear this flag will not be
> + * racing with us, and it avoids needing to grab a spinlock here for every inode
> + * we drop the final reference on.
> + */

I'll try to put this in my own words, just in case it is mystifying for
anyone else.  ;)

In this case it is ok to check ip->i_flags without holding
inode->i_flags_lock because... we have exclusion from xfs_iget_cache_hit
as follows:

The 'dropper' would have taken inode->i_lock when the inode's count went
to zero, and if the XFS_IDONTCACHE flag is set, dropper will return 1 to
iput_final which will result in iput_final skipping the inode lru and
setting I_FREEING immediately, before dropping inode->i_lock and evicting
the inode.

A 'cache hitter' must call igrab in order to get a reference on the
inode.  igrab takes the inode->i_lock, and if I_FREEING is set, it
returns NULL, then xfs_iget_cache_hit returns EAGAIN, and is restarted.

So... any 'cache hitter' who could possibly clear the XFS_IDONTCACHE
flag subsequent to 'dropper' checking it would always be unable to get a
reference due to I_FREEING having been set by the dropper.

I appreciate that you added the comment.

Regards,
	Ben

> +STATIC int
> +xfs_fs_drop_inode(
> +	struct inode		*inode)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +
> +	return generic_drop_inode(inode) || (ip->i_flags & XFS_IDONTCACHE);
> +}
> +
>  STATIC void
>  xfs_free_fsname(
>  	struct xfs_mount	*mp)
> @@ -1431,6 +1447,7 @@ static const struct super_operations xfs_super_operations = {
>  	.dirty_inode		= xfs_fs_dirty_inode,
>  	.write_inode		= xfs_fs_write_inode,
>  	.evict_inode		= xfs_fs_evict_inode,
> +	.drop_inode		= xfs_fs_drop_inode,
>  	.put_super		= xfs_fs_put_super,
>  	.sync_fs		= xfs_fs_sync_fs,
>  	.freeze_fs		= xfs_fs_freeze,
> -- 
> 1.7.9
> 

* Re: [PATCH 06/10] xfs: remove remaining scraps of struct xfs_iomap
  2012-03-07  4:50 ` [PATCH 06/10] xfs: remove remaining scraps of struct xfs_iomap Dave Chinner
@ 2012-03-15 16:48   ` Mark Tinguely
  0 siblings, 0 replies; 32+ messages in thread
From: Mark Tinguely @ 2012-03-15 16:48 UTC (permalink / raw)
  To: xfs

On 03/06/12 22:50, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig<hch@lst.de>
> ---
>   fs/xfs/xfs_vnode.h    |    1 -
>   fs/xfs/xfs_vnodeops.h |    3 ---
>   2 files changed, 0 insertions(+), 4 deletions(-)
>


Looks good.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>


* Re: [PATCH 10/10] xfs: don't cache inodes read through bulkstat
  2012-03-07  4:50 ` [PATCH 10/10] xfs: don't cache inodes read through bulkstat Dave Chinner
  2012-03-12 13:31   ` Christoph Hellwig
  2012-03-14 20:44   ` Ben Myers
@ 2012-03-15 18:14   ` Ben Myers
  2012-03-15 22:05     ` Dave Chinner
  2 siblings, 1 reply; 32+ messages in thread
From: Ben Myers @ 2012-03-15 18:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Mar 07, 2012 at 03:50:28PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we read inodes via bulkstat, we generally only read them once
> and then throw them away - they never get used again. If we retain
> them in cache, then it simply causes the working set of inodes and
> other cached items to be reclaimed just so the inode cache can grow.
> 
> Avoid this problem by marking inodes read by bulkstat as not to be
> cached and check this flag in .drop_inode to determine whether the
> inode should be added to the VFS LRU or not. If the inode lookup
> hits an already cached inode, then don't set the flag. If the inode
> lookup hits an inode marked with no cache flag, remove the flag and
> allow it to be cached once the current reference goes away.
> 
> Inodes marked as not cached will get cleaned up by the background
> inode reclaim or via memory pressure, so they will still generate
> some short term cache pressure. They will, however, be reclaimed
> much sooner and in preference to cache hot inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iget.c   |    8 ++++++--
>  fs/xfs/xfs_inode.h  |    4 +++-
>  fs/xfs/xfs_itable.c |    3 ++-
>  fs/xfs/xfs_super.c  |   17 +++++++++++++++++
>  4 files changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index 93fc1dc..20ddb1e 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -290,7 +290,7 @@ xfs_iget_cache_hit(
>  	if (lock_flags != 0)
>  		xfs_ilock(ip, lock_flags);
>  
> -	xfs_iflags_clear(ip, XFS_ISTALE);
> +	xfs_iflags_clear(ip, XFS_ISTALE | XFS_IDONTCACHE);

If XFS_IGET_DONTCACHE is set, maybe you don't want to clear
XFS_IDONTCACHE.


* Re: [PATCH 10/10] xfs: don't cache inodes read through bulkstat
  2012-03-15 18:14   ` Ben Myers
@ 2012-03-15 22:05     ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-15 22:05 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Thu, Mar 15, 2012 at 01:14:26PM -0500, Ben Myers wrote:
> On Wed, Mar 07, 2012 at 03:50:28PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When we read inodes via bulkstat, we generally only read them once
> > and then throw them away - they never get used again. If we retain
> > them in cache, then it simply causes the working set of inodes and
> > other cached items to be reclaimed just so the inode cache can grow.
> > 
> > Avoid this problem by marking inodes read by bulkstat as not to be
> > cached and check this flag in .drop_inode to determine whether the
> > inode should be added to the VFS LRU or not. If the inode lookup
> > hits an already cached inode, then don't set the flag. If the inode
> > lookup hits an inode marked with no cache flag, remove the flag and
> > allow it to be cached once the current reference goes away.
> > 
> > Inodes marked as not cached will get cleaned up by the background
> > inode reclaim or via memory pressure, so they will still generate
> > some short term cache pressure. They will, however, be reclaimed
> > much sooner and in preference to cache hot inodes.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_iget.c   |    8 ++++++--
> >  fs/xfs/xfs_inode.h  |    4 +++-
> >  fs/xfs/xfs_itable.c |    3 ++-
> >  fs/xfs/xfs_super.c  |   17 +++++++++++++++++
> >  4 files changed, 28 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> > index 93fc1dc..20ddb1e 100644
> > --- a/fs/xfs/xfs_iget.c
> > +++ b/fs/xfs/xfs_iget.c
> > @@ -290,7 +290,7 @@ xfs_iget_cache_hit(
> >  	if (lock_flags != 0)
> >  		xfs_ilock(ip, lock_flags);
> >  
> > -	xfs_iflags_clear(ip, XFS_ISTALE);
> > +	xfs_iflags_clear(ip, XFS_ISTALE | XFS_IDONTCACHE);
> 
> If XFS_IGET_DONTCACHE is set, maybe you don't want to clear
> XFS_IDONTCACHE.

I think that if we get a cache hit, regardless of the access method,
then the inode needs to stay cached for longer. I can't think of a
workload other than repeated xfsdump or xfs_fsr cycles that would
cause this, and in these cases it will only occur if the scans
happen faster than the reclaim period. That sort of workload would
be extremely unusual, but if it is happening then I think we should
treat it as a cached workload rather than an uncached workload
because caching the inodes long term results in better and more
consistent performance.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 08/10] xfs: initialise xfssync work before running quotachecks
  2012-03-07  4:50 ` [PATCH 08/10] xfs: initialise xfssync work before running quotachecks Dave Chinner
  2012-03-12 13:28   ` Christoph Hellwig
@ 2012-03-16 17:07   ` Mark Tinguely
  1 sibling, 0 replies; 32+ messages in thread
From: Mark Tinguely @ 2012-03-16 17:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 03/06/12 22:50, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> Because the mount process can run a quotacheck and consume lots of
> inodes, we need to be able to run periodic inode reclaim during the
> mount process. This will prevent running the system out of memory
> during quota checks.
>
> This essentially reverts 2bcf6e97, but that is safe to do now that
> the quota sync code that was causing problems during long quotacheck
> executions is now gone.
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> ---
>   fs/xfs/xfs_super.c |   15 ++++++++-------
>   1 files changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 13fa0cf..150d8f4 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1343,22 +1343,22 @@ xfs_fs_fill_super(

>   	/*
>   	 * Blow away any referenced inode in the filestreams cache.
> @@ -1397,6 +1397,7 @@ xfs_fs_fill_super(
>   	xfs_flush_buftarg(mp->m_ddev_targp, 1);
>
>   	xfs_unmountfs(mp);
> +	xfs_syncd_stop(mp);
>   	goto out_free_sb;
>   }
>


Shouldn't the xfs_syncd_stop() precede the xfs_unmountfs()?

xfs_unmountfs() calls xfs_free_perag(), and there is an iterator in
the inode and data sync workers.

Mark Tinguely
<tinguely@sgi.com>


* Re: [PATCH 05/10] xfs: introduce an allocation workqueue
  2012-03-07  4:50 ` [PATCH 05/10] xfs: introduce an allocation workqueue Dave Chinner
  2012-03-12 16:16   ` Christoph Hellwig
@ 2012-03-19 16:47   ` Mark Tinguely
  2012-03-19 22:20     ` Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Mark Tinguely @ 2012-03-19 16:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 03/06/12 22:50, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> We currently have significant issues with the amount of stack that
> allocation in XFS uses, especially in the writeback path. We can
> easily consume 4k of stack between mapping the page, manipulating
> the bmap btree and allocating blocks from the free list. Not to
> mention btree block readahead and other functionality that issues IO
> in the allocation path.
>
> As a result, we can no longer fit allocation in the writeback path
> in the stack space provided on x86_64. To alleviate this problem,
> introduce an allocation workqueue and move all allocations to a
> seperate context. This can be easily added as an interposing layer
> into xfs_alloc_vextent(), which takes a single argument structure
> and does not return until the allocation is complete or has failed.
>
> To do this, add a work structure and a completion to the allocation
> args structure. This allows xfs_alloc_vextent to queue the args onto
> the workqueue and wait for it to be completed by the worker. This
> can be done completely transparently to the caller.
>
> The worker function needs to ensure that it sets and clears the
> PF_TRANS flag appropriately as it is being run in an active
> transaction context. Work can also be queued in a memory reclaim
> context, so a rescuer is needed for the workqueue.
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>


#include <std/disclaimer>	# speaking for myself

As the problem is described above, it sounds like the STANDARD x86_64
configuration is in a stack crisis that needs an in-line worker to
solve the stack issue.

Adding an in-line worker to fix a "stack crisis" without any other
measures, given Linux's implementation of the kernel stack (not
configurable at compile time, and requiring a higher-order physical
allocation), sent me into a full blown rant last week. The standard
what? when? why? how? WTF? - you know the standard rant. I even
generated a couple yawns of response from people! :)


x86_64, x86_32 (and untested ARM) code can be sent to anyone who wants
to try this at home. I would say, a generic configuration has at most
3KB of the stack being used by the time xfs_alloc_vextent() is
called, and that includes the nested calls of the routine. So for
most setups, we can say the standard 8KB stack is in no danger of
depletion and will not benefit from this feature.

Let us talk about 4KB stacks.... People may need to use them because
the embedded environment has less memory, is more sensitive to the
physically contiguous nature of the multi-page stacks, and the smaller
memory amounts may cause the allocation routines to nest more. If XFS
can't optimize the stack use enough to be much help, and going to
8KB stacks is too expensive, then extraordinary operations such as
this feature may be needed - but please make it a configurable option
for 4KB stacks and not the default code.

I believe that the kernel stacks do not need to be physically
contiguous. Would 8KB stacks be used in this environment if Linux
did not implement them as physically contiguous? What is the plan
when the 8KB limits become threatened?

This feature and the related nuances are good topics for the
upcoming Linux Filesystem and MM forum next month.

Mark Tinguely


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/10] xfs: introduce an allocation workqueue
  2012-03-19 16:47   ` Mark Tinguely
@ 2012-03-19 22:20     ` Dave Chinner
  2012-03-20 16:34       ` Mark Tinguely
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2012-03-19 22:20 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

On Mon, Mar 19, 2012 at 11:47:44AM -0500, Mark Tinguely wrote:
> On 03/06/12 22:50, Dave Chinner wrote:
> >From: Dave Chinner<dchinner@redhat.com>
> >
> >We currently have significant issues with the amount of stack that
> >allocation in XFS uses, especially in the writeback path. We can
> >easily consume 4k of stack between mapping the page, manipulating
> >the bmap btree and allocating blocks from the free list. Not to
> >mention btree block readahead and other functionality that issues IO
> >in the allocation path.
> >
> >As a result, we can no longer fit allocation in the writeback path
> >in the stack space provided on x86_64. To alleviate this problem,
> >introduce an allocation workqueue and move all allocations to a
> >separate context. This can be easily added as an interposing layer
> >into xfs_alloc_vextent(), which takes a single argument structure
> >and does not return until the allocation is complete or has failed.
> >
> >To do this, add a work structure and a completion to the allocation
> >args structure. This allows xfs_alloc_vextent to queue the args onto
> >the workqueue and wait for it to be completed by the worker. This
> >can be done completely transparently to the caller.
> >
> >The worker function needs to ensure that it sets and clears the
> >PF_TRANS flag appropriately as it is being run in an active
> >transaction context. Work can also be queued in a memory reclaim
> >context, so a rescuer is needed for the workqueue.
> >
> >Signed-off-by: Dave Chinner<dchinner@redhat.com>
> 
> 
> #include <std/disclaimer>	# speaking for myself
> 
> As the problem is described above, it sounds like the STANDARD x86_64
> configuration is in a stack crisis, needing to put a worker in-line to
> solve the stack issue.
> 
> Adding an in-line worker to fix a "stack crisis" without any other
> measures, combined with Linux's implementation of the kernel stack (not
> configurable at compile time, and requiring a higher-order physical
> allocation), sent me into a full-blown rant last week.

You think I like it?

> The standard,
> what? when? why? how? WTF? - you know the standard rant. I even
> generated a couple yawns of response from people! :)

Yeah, I know. Stack usage has been a problem for years and years. I
even mentioned at last year's Kernel Summit that we needed to
consider increasing the size of the kernel stack to 16KB to support
typical storage configurations. That was met with the same old "so
what?" response: "your filesystem code is broken". I still haven't
been able to get across that it isn't the filesystems that are
causing the problems.

For example, what's a typical memory allocation failure stack look
like? Try this:


  0)     5152     256   get_page_from_freelist+0x52d/0x840
  1)     4896     272   __alloc_pages_nodemask+0x10e/0x760
  2)     4624      48   kmem_getpages+0x70/0x170
  3)     4576     112   cache_grow+0x2a9/0x2d0
  4)     4464      80   cache_alloc_refill+0x1a3/0x1ea
  5)     4384      80   kmem_cache_alloc+0x181/0x190
  6)     4304      16   mempool_alloc_slab+0x15/0x20
  7)     4288     128   mempool_alloc+0x5e/0x160
  8)     4160      16   scsi_sg_alloc+0x44/0x50
  9)     4144     112   __sg_alloc_table+0x67/0x140
 10)     4032      32   scsi_init_sgtable+0x33/0x90
 11)     4000      48   scsi_init_io+0x28/0xc0
 12)     3952      32   scsi_setup_fs_cmnd+0x63/0xa0
 13)     3920     112   sd_prep_fn+0x158/0xa70
 14)     3808      64   blk_peek_request+0xb8/0x230
 15)     3744      80   scsi_request_fn+0x54/0x3f0
 16)     3664      80   queue_unplugged+0x55/0xf0
 17)     3584     112   blk_flush_plug_list+0x1c3/0x220
 18)     3472      32   io_schedule+0x78/0xd0
 19)     3440      16   sleep_on_page+0xe/0x20
 20)     3424      80   __wait_on_bit+0x5f/0x90
 21)     3344      80   wait_on_page_bit+0x78/0x80
 22)     3264     288   shrink_page_list+0x445/0x950
 23)     2976     192   shrink_inactive_list+0x448/0x520
 24)     2784     256   shrink_mem_cgroup_zone+0x421/0x520
 25)     2528     144   do_try_to_free_pages+0x12f/0x3e0
 26)     2384     192   try_to_free_pages+0xab/0x170
 27)     2192     272   __alloc_pages_nodemask+0x4a8/0x760
 28)     1920      48   kmem_getpages+0x70/0x170
 29)     1872     112   fallback_alloc+0x1ff/0x220
 30)     1760      96   ____cache_alloc_node+0x9a/0x150
 31)     1664      32   __kmalloc+0x185/0x200
 32)     1632     112   kmem_alloc+0x67/0xe0
 33)     1520     144   xfs_log_commit_cil+0xfe/0x540
 34)     1376      80   xfs_trans_commit+0xc2/0x2a0
 35)     1296     192   xfs_dir_ialloc+0x120/0x320
 36)     1104     208   xfs_create+0x4df/0x6b0
 37)      896     112   xfs_vn_mknod+0x8f/0x1c0
 38)      784      16   xfs_vn_create+0x13/0x20
 39)      768      64   vfs_create+0xb4/0xf0
....

That's just waiting for a page flag to clear triggering a plug
flush, and that requires ~3600 bytes of stack. This is the swap
path, not a filesystem path. This is also on a single SATA drive
with no NFS, MD/DM, etc. What this says is that we cannot commit a
transaction with more than 4300 bytes of stack consumed, otherwise
we risk overflowing the stack.
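(For anyone wanting to reproduce numbers like the trace above on their own box: those listings look like output of the kernel's ftrace stack tracer, which needs CONFIG_STACK_TRACER, root, and debugfs mounted in the usual place. A rough sketch, with a fallback message when the facility isn't there:)

```shell
TRACING=/sys/kernel/debug/tracing
if [ -w /proc/sys/kernel/stack_tracer_enabled ] && [ -r "$TRACING/stack_max_size" ]; then
    # Turn the tracer on; it records the deepest stack seen from now on.
    echo 1 > /proc/sys/kernel/stack_tracer_enabled
    result="deepest stack seen: $(cat "$TRACING/stack_max_size") bytes"
    # The per-frame breakdown, in the same format as the traces quoted here:
    cat "$TRACING/stack_trace"
else
    result="stack tracer not available (needs CONFIG_STACK_TRACER, root, debugfs)"
fi
echo "$result"
```

Leave it running for a while under load before reading stack_max_size, since it only reports the worst case observed so far.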

It's when you start seeing fragments like this that you start to
realise the depth of the problem:

  2)     5136     112   get_request+0x2a5/0x560
  3)     5024     176   get_request_wait+0x32/0x240
  4)     4848      96   blk_queue_bio+0x73/0x400
  5)     4752      48   generic_make_request+0xc7/0x100
  6)     4704      96   submit_bio+0x66/0xe0
  7)     4608     112   _xfs_buf_ioapply+0x15c/0x1c0
  8)     4496      64   xfs_buf_iorequest+0x7b/0xf0
  9)     4432      32   xlog_bdstrat+0x23/0x60
 10)     4400      96   xlog_sync+0x2e4/0x520
 11)     4304      48   xlog_state_release_iclog+0xeb/0x130
 12)     4256     208   xlog_write+0x6a3/0x750
 13)     4048     192   xlog_cil_push+0x264/0x3a0
 14)     3856     144   xlog_cil_force_lsn+0x144/0x150
 15)     3712     144   _xfs_log_force+0x6a/0x280
 16)     3568      32   xfs_log_force+0x18/0x40
 17)     3536      80   xfs_buf_trylock+0x9a/0xf0

Any metadata read we do that hits a pinned buffer needs a minimum
1500 bytes of stack before we hit the driver code, which from the
above swap trace, requires around 1300 bytes to dispatch safely for
the SCSI stack. So we can't safely issue a metadata *read* without
having about 3KB of stack available. And given that if we do a
double btree split and have to read in a metadata buffer, that means
we can't start the allocation with more than about 2KB of stack
consumed. And that is questionable when we add MD/DM layers into the
picture as well....

IOWs, there is simply no way we can fit an allocation call chain
into an 8KB stack when even a small amount of stack is consumed
prior to triggering allocation. Pushing the allocation off into its
own context is, AFAICT, the only choice we have here to avoid stack
overruns because nobody else wants to acknowledge there is a
problem.

As it is, even pushing the allocation off into its own context is
questionable as to whether it will fit in the 8KB stack given the
crazy amount of stack that the memory allocation path can consume
and we can hit that path deep in the allocation stack....

> x86_64, x86_32 (and untested ARM) code can be sent to anyone who wants
> to try this at home. I would say, a generic configuration has at most
> 3KB of the stack being used by the time xfs_alloc_vextent() is
> called, and that includes the nested calls of the routine. So for
> most setups, we can say the standard 8KB stack is in no danger of
> depletion and will not benefit from this feature.

You should be able to see how easy it is to put together a call stack
that blows 8k now...

> Let us talk about 4KB stacks....

No, let's not.

> I believe that the kernel stacks do not need to be physically
> contiguous.

Sure, but the problem is that making them vmalloc'd memory will
reduce performance and no change that reduces performance will ever
be accepted. So contiguous kernel mapped stacks are here to stay.

> Would 8KB stacks be used in this environment if Linux
> did not implement them as physically contiguous? What is the plan
> when the 8KB limits become threatened?

The current plan appears to be to stick our fingers in our ears,
and then stick our heads in the sand....

> This feature and the related nuances are good topics for the
> upcoming Linux Filesystem and MM forum next month.

I'm not sure that there is much to be gained by discussing it with
people that already agree that there is a problem. I'll try, though.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 05/10] xfs: introduce an allocation workqueue
  2012-03-19 22:20     ` Dave Chinner
@ 2012-03-20 16:34       ` Mark Tinguely
  2012-03-20 22:45         ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Mark Tinguely @ 2012-03-20 16:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 03/19/12 17:20, Dave Chinner wrote:
> On Mon, Mar 19, 2012 at 11:47:44AM -0500, Mark Tinguely wrote:
>> On 03/06/12 22:50, Dave Chinner wrote:
>>> From: Dave Chinner<dchinner@redhat.com>
>>>
>>> We currently have significant issues with the amount of stack that
>>> allocation in XFS uses, especially in the writeback path. We can
>>> easily consume 4k of stack between mapping the page, manipulating
>>> the bmap btree and allocating blocks from the free list. Not to
>>> mention btree block readahead and other functionality that issues IO
>>> in the allocation path.
>>>
>>> As a result, we can no longer fit allocation in the writeback path
>>> in the stack space provided on x86_64. To alleviate this problem,
>>> introduce an allocation workqueue and move all allocations to a
>>> separate context. This can be easily added as an interposing layer
>>> into xfs_alloc_vextent(), which takes a single argument structure
>>> and does not return until the allocation is complete or has failed.
>>>
>>> To do this, add a work structure and a completion to the allocation
>>> args structure. This allows xfs_alloc_vextent to queue the args onto
>>> the workqueue and wait for it to be completed by the worker. This
>>> can be done completely transparently to the caller.
>>>
>>> The worker function needs to ensure that it sets and clears the
>>> PF_TRANS flag appropriately as it is being run in an active
>>> transaction context. Work can also be queued in a memory reclaim
>>> context, so a rescuer is needed for the workqueue.
>>>
>>> Signed-off-by: Dave Chinner<dchinner@redhat.com>
>>
>>
>> #include<std/disclaimer>	# speaking for myself
>>
>> As the problem is described above, it sounds like the STANDARD x86_64
>> configuration is in a stack crisis, needing to put a worker in-line to
>> solve the stack issue.
>>
>> Adding an in-line worker to fix a "stack crisis" without any other
>> measures, combined with Linux's implementation of the kernel stack (not
>> configurable at compile time, and requiring a higher-order physical
>> allocation), sent me into a full-blown rant last week.
>
> You think I like it?

No, not at all. Half of my ranting was about the kernel page limit.

>
>> The standard,
>> what? when? why? how? WTF? - you know the standard rant. I even
>> generated a couple yawns of response from people! :)
>
> Yeah, I know. Stack usage has been a problem for years and years. I
> even mentioned at last year's Kernel Summit that we needed to
> consider increasing the size of the kernel stack to 16KB to support
> typical storage configurations. That was met with the same old "so
> what?" response: "your filesystem code is broken". I still haven't
> been able to get across that it isn't the filesystems that are
> causing the problems.
>
> For example, what's a typical memory allocation failure stack look
> like? Try this:
>
>
>    0)     5152     256   get_page_from_freelist+0x52d/0x840
>    1)     4896     272   __alloc_pages_nodemask+0x10e/0x760
>    2)     4624      48   kmem_getpages+0x70/0x170
>    3)     4576     112   cache_grow+0x2a9/0x2d0
>    4)     4464      80   cache_alloc_refill+0x1a3/0x1ea
>    5)     4384      80   kmem_cache_alloc+0x181/0x190
>    6)     4304      16   mempool_alloc_slab+0x15/0x20
>    7)     4288     128   mempool_alloc+0x5e/0x160
>    8)     4160      16   scsi_sg_alloc+0x44/0x50
>    9)     4144     112   __sg_alloc_table+0x67/0x140
>   10)     4032      32   scsi_init_sgtable+0x33/0x90
>   11)     4000      48   scsi_init_io+0x28/0xc0
>   12)     3952      32   scsi_setup_fs_cmnd+0x63/0xa0
>   13)     3920     112   sd_prep_fn+0x158/0xa70
>   14)     3808      64   blk_peek_request+0xb8/0x230
>   15)     3744      80   scsi_request_fn+0x54/0x3f0
>   16)     3664      80   queue_unplugged+0x55/0xf0
>   17)     3584     112   blk_flush_plug_list+0x1c3/0x220
>   18)     3472      32   io_schedule+0x78/0xd0
>   19)     3440      16   sleep_on_page+0xe/0x20
>   20)     3424      80   __wait_on_bit+0x5f/0x90
>   21)     3344      80   wait_on_page_bit+0x78/0x80
>   22)     3264     288   shrink_page_list+0x445/0x950
>   23)     2976     192   shrink_inactive_list+0x448/0x520
>   24)     2784     256   shrink_mem_cgroup_zone+0x421/0x520
>   25)     2528     144   do_try_to_free_pages+0x12f/0x3e0
>   26)     2384     192   try_to_free_pages+0xab/0x170
>   27)     2192     272   __alloc_pages_nodemask+0x4a8/0x760
>   28)     1920      48   kmem_getpages+0x70/0x170
>   29)     1872     112   fallback_alloc+0x1ff/0x220
>   30)     1760      96   ____cache_alloc_node+0x9a/0x150
>   31)     1664      32   __kmalloc+0x185/0x200
>   32)     1632     112   kmem_alloc+0x67/0xe0
>   33)     1520     144   xfs_log_commit_cil+0xfe/0x540
>   34)     1376      80   xfs_trans_commit+0xc2/0x2a0
>   35)     1296     192   xfs_dir_ialloc+0x120/0x320
>   36)     1104     208   xfs_create+0x4df/0x6b0
>   37)      896     112   xfs_vn_mknod+0x8f/0x1c0
>   38)      784      16   xfs_vn_create+0x13/0x20
>   39)      768      64   vfs_create+0xb4/0xf0
> ....


Wow, that much stack to clean and allocate a page. I am glad I did not
know that last week; I would have had a stroke instead of a rant.

>
> That's just waiting for a page flag to clear triggering a plug
> flush, and that requires ~3600 bytes of stack. This is the swap
> path, not a filesystem path. This is also on a single SATA drive
> with no NFS, MD/DM, etc. What this says is that we cannot commit a
> transaction with more than 4300 bytes of stack consumed, otherwise
> we risk overflowing the stack.
>
> It's when you start seeing fragments like this that you start to
> realise the depth of the problem:
>
>    2)     5136     112   get_request+0x2a5/0x560
>    3)     5024     176   get_request_wait+0x32/0x240
>    4)     4848      96   blk_queue_bio+0x73/0x400
>    5)     4752      48   generic_make_request+0xc7/0x100
>    6)     4704      96   submit_bio+0x66/0xe0
>    7)     4608     112   _xfs_buf_ioapply+0x15c/0x1c0
>    8)     4496      64   xfs_buf_iorequest+0x7b/0xf0
>    9)     4432      32   xlog_bdstrat+0x23/0x60
>   10)     4400      96   xlog_sync+0x2e4/0x520
>   11)     4304      48   xlog_state_release_iclog+0xeb/0x130
>   12)     4256     208   xlog_write+0x6a3/0x750
>   13)     4048     192   xlog_cil_push+0x264/0x3a0
>   14)     3856     144   xlog_cil_force_lsn+0x144/0x150
>   15)     3712     144   _xfs_log_force+0x6a/0x280
>   16)     3568      32   xfs_log_force+0x18/0x40
>   17)     3536      80   xfs_buf_trylock+0x9a/0xf0

Thank-you for documenting this.

>
> Any metadata read we do that hits a pinned buffer needs a minimum
> 1500 bytes of stack before we hit the driver code, which from the
> above swap trace, requires around 1300 bytes to dispatch safely for
> the SCSI stack. So we can't safely issue a metadata *read* without
> having about 3KB of stack available. And given that if we do a
> double btree split and have to read in a metadata buffer, that means
> we can't start the allocation with more than about 2KB of stack
> consumed. And that is questionable when we add MD/DM layers into the
> picture as well....
>
> IOWs, there is simply no way we can fit an allocation call chain
> into an 8KB stack when even a small amount of stack is consumed
> prior to triggering allocation. Pushing the allocation off into its
> own context is, AFAICT, the only choice we have here to avoid stack
> overruns because nobody else wants to acknowledge there is a
> problem.

Sigh. Also part of my rant: I can't believe that this is an issue
in LINUX.

>
> As it is, even pushing the allocation off into its own context is
> questionable as to whether it will fit in the 8KB stack given the
> crazy amount of stack that the memory allocation path can consume
> and we can hit that path deep in the allocation stack....
>
>> x86_64, x86_32 (and untested ARM) code can be sent to anyone who wants
>> to try this at home. I would say, a generic configuration has at most
>> 3KB of the stack being used by the time xfs_alloc_vextent() is
>> called, and that includes the nested calls of the routine. So for
>> most setups, we can say the standard 8KB stack is in no danger of
>> depletion and will not benefit from this feature.
>
> You should be able to see how easy it is to put together a call stack
> that blows 8k now...
>
>> Let us talk about 4KB stacks....
>
> No, let's not.
>
>> I believe that the kernel stacks do not need to be physically
>> contiguous.
>
> Sure, but the problem is that making them vmalloc'd memory will
> reduce performance and no change that reduces performance will ever
> be accepted. So contiguous kernel mapped stacks are here to stay.
>
>> Would 8KB stacks be used in this environment if Linux
>> did not implement them as physically contiguous? What is the plan
>> when the 8KB limits become threatened?
>
> The current plan appears to be to stick our fingers in our ears,
> and then stick our heads in the sand....
>
>> This feature and the related nuances are good topics for the
>> upcoming Linux Filesystem and MM forum next month.
>
> I'm not sure that there is much to be gained by discussing it with
> people that already agree that there is a problem. I'll try, though.
>
> Cheers,
>
> Dave.

The other half of my rant is:

I haven't seen XFS do a stack reduction pass on new code or existing
code (splitting routines and local variables), but I know that can
only go so far.

Filesystems, network stacks, well any kernel services, can't play
"Whack-a-mole" with the stack issues for long. The problems will just
pop up somewhere else.

I suspect it will take a big group of choir-members, the companies
they work for and the customers they represent to change the situation.
Sad. How can we get everyone in a rant over this situation?

Also thank-you for not biting my head off.

--Mark Tinguely.


* Re: [PATCH 05/10] xfs: introduce an allocation workqueue
  2012-03-20 16:34       ` Mark Tinguely
@ 2012-03-20 22:45         ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2012-03-20 22:45 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

On Tue, Mar 20, 2012 at 11:34:16AM -0500, Mark Tinguely wrote:
> On 03/19/12 17:20, Dave Chinner wrote:
> >Yeah, I know. Stack usage has been a problem for years and years. I
> >even mentioned at last year's Kernel Summit that we needed to
> >consider increasing the size of the kernel stack to 16KB to support
> >typical storage configurations. That was met with the same old "so
> >what?" response: "your filesystem code is broken". I still haven't
> >been able to get across that it isn't the filesystems that are
> >causing the problems.
> >
> >For example, what's a typical memory allocation failure stack look
> >like? Try this:
> >
> >
> >   0)     5152     256   get_page_from_freelist+0x52d/0x840
.....
> >  32)     1632     112   kmem_alloc+0x67/0xe0
> >  33)     1520     144   xfs_log_commit_cil+0xfe/0x540
> >  34)     1376      80   xfs_trans_commit+0xc2/0x2a0
> >  35)     1296     192   xfs_dir_ialloc+0x120/0x320
> >  36)     1104     208   xfs_create+0x4df/0x6b0
> >  37)      896     112   xfs_vn_mknod+0x8f/0x1c0
> >  38)      784      16   xfs_vn_create+0x13/0x20
> >  39)      768      64   vfs_create+0xb4/0xf0
> >....
> 
> 
> Wow, that much stack to clean and allocate a page. I am glad I did not
> know that last week; I would have had a stroke instead of a rant.

It's not worth that much rage :P

> >Any metadata read we do that hits a pinned buffer needs a minimum
> >1500 bytes of stack before we hit the driver code, which from the
> >above swap trace, requires around 1300 bytes to dispatch safely for
> >the SCSI stack. So we can't safely issue a metadata *read* without
> >having about 3KB of stack available. And given that if we do a
> >double btree split and have to read in a metadata buffer, that means
> >we can't start the allocation with more than about 2KB of stack
> >consumed. And that is questionable when we add MD/DM layers into the
> >picture as well....
> >
> >IOWs, there is simply no way we can fit an allocation call chain
> >into an 8KB stack when even a small amount of stack is consumed
> >prior to triggering allocation. Pushing the allocation off into its
> >own context is, AFAICT, the only choice we have here to avoid stack
> >overruns because nobody else wants to acknowledge there is a
> >problem.
> 
> Sigh. Also part of my rant: I can't believe that this is an issue
> in LINUX.

The problem is more of a political/perception problem than a
technical one. The code to make the stacks discontiguous, larger or
even only fall back to discontiguous vmap()d memory when contiguous
allocation fails is relatively simple to add.
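(One shape such a fallback could take, as untested pseudocode against the
thread-stack allocation path; the real macros __get_free_pages, vmalloc,
THREAD_SIZE and THREAD_SIZE_ORDER exist, but this is a sketch, not actual
fork-path code:)

```
/* Sketch only: prefer a physically contiguous stack, and fall back to
 * discontiguous, kernel-mapped vmap()d pages only when the higher-order
 * allocation fails under memory fragmentation. */
stack = __get_free_pages(GFP_KERNEL | __GFP_NOWARN, THREAD_SIZE_ORDER);
if (!stack)
        stack = (unsigned long)vmalloc(THREAD_SIZE);
```

The cost of the vmap()d path (TLB pressure, slower alloc/free) would then only be paid when the contiguous allocation cannot be satisfied.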

> The other half of my rant is:
> 
> I haven't seen XFS do a stack reduction pass on new code or existing
> code (splitting routines and local variables), but I know that can
> only go so far.

That's because we did a massive stack reduction push back around
2005-2007 that trimmed all the low hanging fruit from the tree.
Since then, it's major code refactoring efforts that have reduced
the stack footprint, but those are hard to do and test. For example,
removing all the xfs_iomap indirection from writeback. Refactoring
xfs_bmapi() into multiple functions and splitting the read and write
mappings into separate functions. Greatly simplifying the writeback
path and reducing the amount of state held in the .writepage path.

> Filesystems, network stacks, well any kernel services, can't play
> "Whack-a-mole" with the stack issues for long. The problems will just
> pop up somewhere else.

Right, that's exactly the problem we are seeing now. We played
filesystem whack-a-mole years ago, and now the problem is that all
the subsystems above and below have grown....

> I suspect it will take a big group of choir-members, the companies
> they work for and the customers they represent to change the situation.
> Sad. How can we get everyone in a rant over this situation?

Doesn't matter about companies - the only way to get such a change
through is to convince Linus and the inner cabal that it is
necessary. They are the ones that say "your code is broken" rather
than acknowledging that the ever increasing complexity of the
storage stack is the reason we are running out of stack....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


end of thread

Thread overview: 32+ messages
2012-03-07  4:50 [PATCH 0/10] xfs: various fixes v2 Dave Chinner
2012-03-07  4:50 ` [PATCH 01/10] xfs: clean up minor sparse warnings Dave Chinner
2012-03-08 21:34   ` Ben Myers
2012-03-09  0:30     ` Dave Chinner
2012-03-07  4:50 ` [PATCH 02/10] xfs: Fix open flag handling in open_by_handle code Dave Chinner
2012-03-12 13:27   ` Christoph Hellwig
2012-03-13 21:15   ` Mark Tinguely
2012-03-07  4:50 ` [PATCH 03/10] xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get Dave Chinner
2012-03-12 13:27   ` Christoph Hellwig
2012-03-14 18:04   ` Mark Tinguely
2012-03-07  4:50 ` [PATCH 04/10] xfs: fallback to vmalloc for large buffers in xfs_getbmap Dave Chinner
2012-03-12 13:28   ` Christoph Hellwig
2012-03-14 18:12   ` Mark Tinguely
2012-03-07  4:50 ` [PATCH 05/10] xfs: introduce an allocation workqueue Dave Chinner
2012-03-12 16:16   ` Christoph Hellwig
2012-03-19 16:47   ` Mark Tinguely
2012-03-19 22:20     ` Dave Chinner
2012-03-20 16:34       ` Mark Tinguely
2012-03-20 22:45         ` Dave Chinner
2012-03-07  4:50 ` [PATCH 06/10] xfs: remove remaining scraps of struct xfs_iomap Dave Chinner
2012-03-15 16:48   ` Mark Tinguely
2012-03-07  4:50 ` [PATCH 07/10] xfs: fix inode lookup race Dave Chinner
2012-03-07  4:50 ` [PATCH 08/10] xfs: initialise xfssync work before running quotachecks Dave Chinner
2012-03-12 13:28   ` Christoph Hellwig
2012-03-16 17:07   ` Mark Tinguely
2012-03-07  4:50 ` [PATCH 09/10] xfs: remove MS_ACTIVE guard from inode reclaim work Dave Chinner
2012-03-12 13:30   ` Christoph Hellwig
2012-03-07  4:50 ` [PATCH 10/10] xfs: don't cache inodes read through bulkstat Dave Chinner
2012-03-12 13:31   ` Christoph Hellwig
2012-03-14 20:44   ` Ben Myers
2012-03-15 18:14   ` Ben Myers
2012-03-15 22:05     ` Dave Chinner
