* [PATCH 0/5] xfs: run eofblocks scan on ENOSPC
@ 2014-03-28 13:15 Brian Foster
  2014-03-28 13:15 ` [PATCH 1/5] xfs: do eofb filtering before dirty check Brian Foster
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Brian Foster @ 2014-03-28 13:15 UTC (permalink / raw)
  To: xfs

Hi all,

The purpose of this series is to help address the inefficient use of
preallocation for larger files on smaller filesystems. When the
filesystem is small, so is the 5% low free space threshold that enables
preallocation throttling. When the low free space threshold is reduced
to a couple GB or so, we can ram into ENOSPC prematurely due to larger,
active preallocations.

We resolve this condition by running an eofblocks scan from the
pre-existing ENOSPC write retry sequence. The scan resets outstanding
preallocations so that throttling is guaranteed an opportunity to manage
future preallocations gracefully as we approach ENOSPC, which gets us
much closer to 100% utilization before ENOSPC is reported to userspace.
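
For reference, the buffered write retry path ends up looking roughly like
the following after this series (simplified sketch; the real code is in
patch 4):

write_retry:
	ret = generic_file_buffered_write(...);

	/*
	 * Quota limit or ENOSPC: free lingering preallocations for any
	 * quotas on this inode that are near their limits.
	 */
	if ((ret == -EDQUOT || ret == -ENOSPC) && !scanned) {
		scanned = xfs_inode_free_quota_eofblocks(ip);
		if (scanned)
			goto write_retry;
	}

	/* still ENOSPC: run a filesystem-wide eofblocks scan once */
	if (ret == -ENOSPC && !scanned) {
		struct xfs_eofblocks eofb = { 0 };

		eofb.eof_scan_owner = ip->i_ino;	/* we hold the iolock */
		eofb.eof_flags = XFS_EOF_FLAGS_SYNC | XFS_EOF_FLAGS_FLUSH;
		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
		scanned = 1;
		goto write_retry;
	}

	/* last resort: flush dirty inodes to convert delalloc space */
	if (ret == -ENOSPC && !enospc) {
		enospc = 1;
		xfs_flush_inodes(ip->i_mount);
		goto write_retry;
	}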

Patches 1-3 make some small enhancements to the eofblocks scanner that
facilitate running a scan in the context of a write. Patch 4 adds the
actual scan-on-ENOSPC policy. Patch 5 updates the preallocation
throttling algorithm to take quota free space into account.

Thoughts, reviews, flames appreciated.

Brian

Brian Foster (5):
  xfs: do eofb filtering before dirty check
  xfs: add flush flag to xfs_eofblocks
  xfs: add scan owner field to xfs_eofblocks
  xfs: run an eofblocks scan on ENOSPC/EDQUOT
  xfs: squash prealloc while over quota free space as well

 fs/xfs/xfs_dquot.h  | 15 +++++++++
 fs/xfs/xfs_file.c   | 32 +++++++++++++++++--
 fs/xfs/xfs_fs.h     |  4 ++-
 fs/xfs/xfs_icache.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_icache.h |  3 ++
 fs/xfs/xfs_iomap.c  | 20 ++++++++----
 6 files changed, 145 insertions(+), 19 deletions(-)

-- 
1.8.3.1


* [PATCH 1/5] xfs: do eofb filtering before dirty check
  2014-03-28 13:15 [PATCH 0/5] xfs: run eofblocks scan on ENOSPC Brian Foster
@ 2014-03-28 13:15 ` Brian Foster
  2014-03-28 13:16 ` [PATCH 2/5] xfs: add flush flag to xfs_eofblocks Brian Foster
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2014-03-28 13:15 UTC (permalink / raw)
  To: xfs

Reorder xfs_inode_free_eofblocks() so the explicit eofb filtering (i.e.,
uid, etc.) occurs before the dirty mapping check. This facilitates the
addition of a flush flag.

If we want to issue a flush, we should do so after the filtering logic
but before the dirty check since a flush reduces the likelihood we skip
an inode due to a dirty mapping.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 98d3524..7ff59c9 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1223,14 +1223,6 @@ xfs_inode_free_eofblocks(
 		return 0;
 	}
 
-	/*
-	 * If the mapping is dirty the operation can block and wait for some
-	 * time. Unless we are waiting, skip it.
-	 */
-	if (!(flags & SYNC_WAIT) &&
-	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY))
-		return 0;
-
 	if (eofb) {
 		if (!xfs_inode_match_id(ip, eofb))
 			return 0;
@@ -1241,6 +1233,14 @@ xfs_inode_free_eofblocks(
 			return 0;
 	}
 
+	/*
+	 * If the mapping is dirty the operation can block and wait for some
+	 * time. Unless we are waiting, skip it.
+	 */
+	if (!(flags & SYNC_WAIT) &&
+	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY))
+		return 0;
+
 	ret = xfs_free_eofblocks(ip->i_mount, ip, true);
 
 	/* don't revisit the inode if we're not waiting */
-- 
1.8.3.1


* [PATCH 2/5] xfs: add flush flag to xfs_eofblocks
  2014-03-28 13:15 [PATCH 0/5] xfs: run eofblocks scan on ENOSPC Brian Foster
  2014-03-28 13:15 ` [PATCH 1/5] xfs: do eofb filtering before dirty check Brian Foster
@ 2014-03-28 13:16 ` Brian Foster
  2014-03-31 21:47   ` Dave Chinner
  2014-03-28 13:16 ` [PATCH 3/5] xfs: add scan owner field " Brian Foster
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-03-28 13:16 UTC (permalink / raw)
  To: xfs

The flush flag allows the caller to issue a flush for scanned inodes. In
ENOSPC conditions caused by project quotas, a flush is required to free
up reserved metadata allocations.
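
For example, a caller that wants scanned inodes flushed before the dirty
mapping check would set up the scan along these lines (this is how the
scan ends up being invoked later in the series):

	struct xfs_eofblocks	eofb = { 0 };

	/* flush scanned inodes before checking for dirty mappings */
	eofb.eof_flags = XFS_EOF_FLAGS_SYNC | XFS_EOF_FLAGS_FLUSH;
	xfs_icache_free_eofblocks(ip->i_mount, &eofb);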

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_fs.h     | 4 +++-
 fs/xfs/xfs_icache.c | 3 +++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
index c5fc116..fa3a58e 100644
--- a/fs/xfs/xfs_fs.h
+++ b/fs/xfs/xfs_fs.h
@@ -374,12 +374,14 @@ struct xfs_fs_eofblocks {
 #define XFS_EOF_FLAGS_GID		(1 << 2) /* filter by gid */
 #define XFS_EOF_FLAGS_PRID		(1 << 3) /* filter by project id */
 #define XFS_EOF_FLAGS_MINFILESIZE	(1 << 4) /* filter by min file size */
+#define XFS_EOF_FLAGS_FLUSH		(1 << 5) /* issue a flush */
 #define XFS_EOF_FLAGS_VALID	\
 	(XFS_EOF_FLAGS_SYNC |	\
 	 XFS_EOF_FLAGS_UID |	\
 	 XFS_EOF_FLAGS_GID |	\
 	 XFS_EOF_FLAGS_PRID |	\
-	 XFS_EOF_FLAGS_MINFILESIZE)
+	 XFS_EOF_FLAGS_MINFILESIZE | \
+	 XFS_EOF_FLAGS_FLUSH)
 
 
 /*
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 7ff59c9..d4e15db 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1231,6 +1231,9 @@ xfs_inode_free_eofblocks(
 		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
 		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
 			return 0;
+
+		if (eofb->eof_flags & XFS_EOF_FLAGS_FLUSH)
+			filemap_flush(VFS_I(ip)->i_mapping);
 	}
 
 	/*
-- 
1.8.3.1


* [PATCH 3/5] xfs: add scan owner field to xfs_eofblocks
  2014-03-28 13:15 [PATCH 0/5] xfs: run eofblocks scan on ENOSPC Brian Foster
  2014-03-28 13:15 ` [PATCH 1/5] xfs: do eofb filtering before dirty check Brian Foster
  2014-03-28 13:16 ` [PATCH 2/5] xfs: add flush flag to xfs_eofblocks Brian Foster
@ 2014-03-28 13:16 ` Brian Foster
  2014-03-28 13:16 ` [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT Brian Foster
  2014-03-28 13:16 ` [PATCH 5/5] xfs: squash prealloc while over quota free space as well Brian Foster
  4 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2014-03-28 13:16 UTC (permalink / raw)
  To: xfs

The scan owner field represents an optional inode number that is
responsible for the current scan. Its purpose is to identify when the
scanning inode already holds the iolock so the scan does not attempt to
take the iolock again (and deadlock) when trimming eofblocks on that
inode. This is an internal-only field.
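
For example, a caller that already holds the iolock on 'ip' marks itself
as the scan owner roughly as follows (as done later in the series):

	struct xfs_eofblocks	eofb = { 0 };

	/* we already hold ip's iolock; tell the scan not to take it again */
	eofb.eof_scan_owner = ip->i_ino;
	eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
	xfs_icache_free_eofblocks(ip->i_mount, &eofb);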

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 12 +++++++++++-
 fs/xfs/xfs_icache.h |  2 ++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d4e15db..bd0ab7d 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1215,6 +1215,7 @@ xfs_inode_free_eofblocks(
 {
 	int ret;
 	struct xfs_eofblocks *eofb = args;
+	bool need_iolock = true;
 
 	if (!xfs_can_free_eofblocks(ip, false)) {
 		/* inode could be preallocated or append-only */
@@ -1234,6 +1235,15 @@ xfs_inode_free_eofblocks(
 
 		if (eofb->eof_flags & XFS_EOF_FLAGS_FLUSH)
 			filemap_flush(VFS_I(ip)->i_mapping);
+
+		/*
+		 * A scan owner implies we already hold the iolock. Skip it in
+		 * xfs_free_eofblocks() to avoid deadlock. This also eliminates
+		 * the possibility of EAGAIN being returned.
+		 */
+		if (eofb->eof_scan_owner != NULLFSINO &&
+		    eofb->eof_scan_owner == ip->i_ino)
+			need_iolock = false;
 	}
 
 	/*
@@ -1244,7 +1254,7 @@ xfs_inode_free_eofblocks(
 	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY))
 		return 0;
 
-	ret = xfs_free_eofblocks(ip->i_mount, ip, true);
+	ret = xfs_free_eofblocks(ip->i_mount, ip, need_iolock);
 
 	/* don't revisit the inode if we're not waiting */
 	if (ret == EAGAIN && !(flags & SYNC_WAIT))
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 9ed68bb..4387b1d 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -27,6 +27,7 @@ struct xfs_eofblocks {
 	kgid_t		eof_gid;
 	prid_t		eof_prid;
 	__u64		eof_min_file_size;
+	xfs_ino_t	eof_scan_owner;
 };
 
 #define SYNC_WAIT		0x0001	/* wait for i/o to complete */
@@ -86,6 +87,7 @@ xfs_fs_eofblocks_from_user(
 	dst->eof_flags = src->eof_flags;
 	dst->eof_prid = src->eof_prid;
 	dst->eof_min_file_size = src->eof_min_file_size;
+	dst->eof_scan_owner = NULLFSINO;
 
 	dst->eof_uid = INVALID_UID;
 	if (src->eof_flags & XFS_EOF_FLAGS_UID) {
-- 
1.8.3.1


* [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-03-28 13:15 [PATCH 0/5] xfs: run eofblocks scan on ENOSPC Brian Foster
                   ` (2 preceding siblings ...)
  2014-03-28 13:16 ` [PATCH 3/5] xfs: add scan owner field " Brian Foster
@ 2014-03-28 13:16 ` Brian Foster
  2014-03-31 22:22   ` Dave Chinner
  2014-03-28 13:16 ` [PATCH 5/5] xfs: squash prealloc while over quota free space as well Brian Foster
  4 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-03-28 13:16 UTC (permalink / raw)
  To: xfs

Speculative preallocation and the associated throttling metrics
assume we're working with large files on large filesystems. Users have
reported inefficiencies in these mechanisms when we happen to be dealing
with large files on smaller filesystems. This can occur because while
prealloc throttling is aggressive under low free space conditions, it is
not active until we reach 5% free space or less.

For example, a 40GB filesystem has enough space for several files large
enough to have multi-GB preallocations at any given time. If those files
are slow growing, they might reserve preallocation for long periods of
time as well as avoid the background scanner due to frequent
modification. If a new file is written under these conditions, said file
has no access to this already reserved space and premature ENOSPC is
imminent.

To handle this scenario, modify the buffered write ENOSPC handling and
retry sequence to invoke an eofblocks scan. The eofblocks scan is
attempted prior to the inode flush because it is lighter weight; the
flush is the last resort to free space. In the smaller filesystem
scenario, the eofblocks scan resets the usage of preallocation such that
when the 5% free space threshold is met, throttling effectively takes
over to provide fair and efficient preallocation until legitimate
ENOSPC.

The eofblocks scan is selective based on the nature of the failure. For
example, an EDQUOT failure in a particular quota will use a filtered
scan for that quota. Because we don't know which quota might have caused
an allocation failure at any given time, we run a scan against each
applicable quota determined to be under low free space conditions.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_dquot.h  | 15 ++++++++++++++
 fs/xfs/xfs_file.c   | 32 ++++++++++++++++++++++++++---
 fs/xfs/xfs_icache.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.h |  1 +
 4 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index d22ed00..899b99f 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -141,6 +141,21 @@ static inline xfs_dquot_t *xfs_inode_dquot(struct xfs_inode *ip, int type)
 	}
 }
 
+/*
+ * Check whether a dquot is under low free space conditions. We assume the quota
+ * is enabled and enforced.
+ */
+static inline bool xfs_dquot_lowsp(struct xfs_dquot *dqp)
+{
+	int64_t freesp;
+
+	freesp = be64_to_cpu(dqp->q_core.d_blk_hardlimit) - dqp->q_res_bcount;
+	if (freesp < dqp->q_low_space[XFS_QLOWSP_1_PCNT])
+		return true;
+
+	return false;
+}
+
 #define XFS_DQ_IS_LOCKED(dqp)	(mutex_is_locked(&((dqp)->q_qlock)))
 #define XFS_DQ_IS_DIRTY(dqp)	((dqp)->dq_flags & XFS_DQ_DIRTY)
 #define XFS_QM_ISUDQ(dqp)	((dqp)->dq_flags & XFS_DQ_USER)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2e7989e..1ca16ce 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -38,6 +38,7 @@
 #include "xfs_trace.h"
 #include "xfs_log.h"
 #include "xfs_dinode.h"
+#include "xfs_icache.h"
 
 #include <linux/aio.h>
 #include <linux/dcache.h>
@@ -723,6 +724,7 @@ xfs_file_buffered_aio_write(
 	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			ret;
 	int			enospc = 0;
+	int			scanned = 0;
 	int			iolock = XFS_IOLOCK_EXCL;
 	size_t			count = ocount;
 
@@ -741,10 +743,34 @@ write_retry:
 			pos, &iocb->ki_pos, count, 0);
 
 	/*
-	 * If we just got an ENOSPC, try to write back all dirty inodes to
-	 * convert delalloc space to free up some of the excess reserved
-	 * metadata space.
+	 * If we hit ENOSPC or a quota limit, use the selective nature of the
+	 * eofblocks scan to try and free up some lingering speculative
+	 * preallocation delalloc blocks.
+	 *
+	 * If we hit a quota limit, only scan for files covered by the quota. We
+	 * also consider ENOSPC here because project quota failure can return
+	 * ENOSPC instead of EDQUOT. The quota scanning only sets 'scanned' if
+	 * the inode is covered by a quota with low free space. This should
+	 * minimize interference with global ENOSPC handling.
+	 *
+	 * If a scan does not free enough space, resort to the inode flush big
+	 * hammer to convert delalloc space to free up some of the excess
+	 * reserved metadata space.
 	 */
+	if ((ret == -EDQUOT || ret == -ENOSPC) && !scanned) {
+		scanned = xfs_inode_free_quota_eofblocks(ip);
+		if (scanned)
+			goto write_retry;
+	}
+	if (ret == -ENOSPC && !scanned) {
+		struct xfs_eofblocks eofb = {0,};
+
+		eofb.eof_scan_owner = ip->i_ino; /* for locking */
+		eofb.eof_flags = XFS_EOF_FLAGS_SYNC | XFS_EOF_FLAGS_FLUSH;
+		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
+		scanned = 1;
+		goto write_retry;
+	}
 	if (ret == -ENOSPC && !enospc) {
 		enospc = 1;
 		xfs_flush_inodes(ip->i_mount);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index bd0ab7d..471ccfa 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -33,6 +33,9 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_bmap_util.h"
+#include "xfs_quota.h"
+#include "xfs_dquot_item.h"
+#include "xfs_dquot.h"
 
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -1277,6 +1280,62 @@ xfs_icache_free_eofblocks(
 					 eofb, XFS_ICI_EOFBLOCKS_TAG);
 }
 
+/*
+ * Run eofblocks scans on the quotas applicable to the inode. For inodes with
+ * multiple quotas, we don't know exactly which quota caused an allocation
+ * failure. We make a best effort by running scans for each quota considered
+ * to be under low free space conditions (less than 1% available free space).
+ */
+int
+xfs_inode_free_quota_eofblocks(
+	struct xfs_inode *ip)
+{
+	int scanned = 0;
+	struct xfs_eofblocks eofb = {0,};
+	struct xfs_dquot *dq;
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
+
+	/* set the scan owner to avoid potential livelock */
+	eofb.eof_scan_owner = ip->i_ino;
+
+	if (XFS_IS_UQUOTA_ENFORCED(ip->i_mount)) {
+		dq = xfs_inode_dquot(ip, XFS_DQ_USER);
+		if (dq && xfs_dquot_lowsp(dq)) {
+			eofb.eof_uid = VFS_I(ip)->i_uid;
+			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
+					 XFS_EOF_FLAGS_UID;
+			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
+			scanned = 1;
+		}
+	}
+
+	if (XFS_IS_GQUOTA_ENFORCED(ip->i_mount)) {
+		dq = xfs_inode_dquot(ip, XFS_DQ_GROUP);
+		if (dq && xfs_dquot_lowsp(dq)) {
+			eofb.eof_gid = VFS_I(ip)->i_gid;
+			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
+					 XFS_EOF_FLAGS_GID;
+			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
+			scanned = 1;
+		}
+	}
+
+	if (XFS_IS_PQUOTA_ENFORCED(ip->i_mount)) {
+		dq = xfs_inode_dquot(ip, XFS_DQ_PROJ);
+		if (dq && xfs_dquot_lowsp(dq)) {
+			eofb.eof_prid = xfs_get_projid(ip);
+			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
+					 XFS_EOF_FLAGS_PRID|
+					 XFS_EOF_FLAGS_FLUSH;
+			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
+			scanned = 1;
+		}
+	}
+
+	return scanned;
+}
+
 void
 xfs_inode_set_eofblocks_tag(
 	xfs_inode_t	*ip)
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 4387b1d..23aa163 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -58,6 +58,7 @@ void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 void xfs_inode_set_eofblocks_tag(struct xfs_inode *ip);
 void xfs_inode_clear_eofblocks_tag(struct xfs_inode *ip);
 int xfs_icache_free_eofblocks(struct xfs_mount *, struct xfs_eofblocks *);
+int xfs_inode_free_quota_eofblocks(struct xfs_inode *ip);
 void xfs_eofblocks_worker(struct work_struct *);
 
 int xfs_inode_ag_iterator(struct xfs_mount *mp,
-- 
1.8.3.1


* [PATCH 5/5] xfs: squash prealloc while over quota free space as well
  2014-03-28 13:15 [PATCH 0/5] xfs: run eofblocks scan on ENOSPC Brian Foster
                   ` (3 preceding siblings ...)
  2014-03-28 13:16 ` [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT Brian Foster
@ 2014-03-28 13:16 ` Brian Foster
  4 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2014-03-28 13:16 UTC (permalink / raw)
  To: xfs

Commit 4d559a3b introduced heavy prealloc squashing to catch the case
of requesting too large a prealloc on smaller filesystems, leading to
repeated flush and retry cycles that occur on ENOSPC.  Now that we issue
eofblocks scans on EDQUOT/ENOSPC (with a selective flush in the case of
project quotas), squash the prealloc against the minimum available free
space across all applicable quotas as well to avoid a similar problem of
repeated eofblocks scans.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_iomap.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 22d1cbe..934983a 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -401,7 +401,8 @@ xfs_quota_calc_throttle(
 	struct xfs_inode *ip,
 	int type,
 	xfs_fsblock_t *qblocks,
-	int *qshift)
+	int *qshift,
+	int64_t	*qfreesp)
 {
 	int64_t freesp;
 	int shift = 0;
@@ -410,6 +411,7 @@ xfs_quota_calc_throttle(
 	/* over hi wmark, squash the prealloc completely */
 	if (dq->q_res_bcount >= dq->q_prealloc_hi_wmark) {
 		*qblocks = 0;
+		*qfreesp = 0;
 		return;
 	}
 
@@ -422,6 +424,9 @@ xfs_quota_calc_throttle(
 			shift += 2;
 	}
 
+	if (freesp < *qfreesp)
+		*qfreesp = freesp;
+
 	/* only overwrite the throttle values if we are more aggressive */
 	if ((freesp >> shift) < (*qblocks >> *qshift)) {
 		*qblocks = freesp;
@@ -480,15 +485,18 @@ xfs_iomap_prealloc_size(
 	}
 
 	/*
-	 * Check each quota to cap the prealloc size and provide a shift
-	 * value to throttle with.
+	 * Check each quota to cap the prealloc size, provide a shift value to
+	 * throttle with and adjust amount of available space.
 	 */
 	if (xfs_quota_need_throttle(ip, XFS_DQ_USER, alloc_blocks))
-		xfs_quota_calc_throttle(ip, XFS_DQ_USER, &qblocks, &qshift);
+		xfs_quota_calc_throttle(ip, XFS_DQ_USER, &qblocks, &qshift,
+					&freesp);
 	if (xfs_quota_need_throttle(ip, XFS_DQ_GROUP, alloc_blocks))
-		xfs_quota_calc_throttle(ip, XFS_DQ_GROUP, &qblocks, &qshift);
+		xfs_quota_calc_throttle(ip, XFS_DQ_GROUP, &qblocks, &qshift,
+					&freesp);
 	if (xfs_quota_need_throttle(ip, XFS_DQ_PROJ, alloc_blocks))
-		xfs_quota_calc_throttle(ip, XFS_DQ_PROJ, &qblocks, &qshift);
+		xfs_quota_calc_throttle(ip, XFS_DQ_PROJ, &qblocks, &qshift,
+					&freesp);
 
 	/*
 	 * The final prealloc size is set to the minimum of free space available
-- 
1.8.3.1


* Re: [PATCH 2/5] xfs: add flush flag to xfs_eofblocks
  2014-03-28 13:16 ` [PATCH 2/5] xfs: add flush flag to xfs_eofblocks Brian Foster
@ 2014-03-31 21:47   ` Dave Chinner
  2014-04-01 13:48     ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-03-31 21:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Mar 28, 2014 at 09:16:00AM -0400, Brian Foster wrote:
> The flush flag allows the caller to issue a flush for scanned inodes. In
> ENOSPC conditions caused by project quotas, a flush is required to free
> up reserved metadata allocations.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_fs.h     | 4 +++-
>  fs/xfs/xfs_icache.c | 3 +++
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
> index c5fc116..fa3a58e 100644
> --- a/fs/xfs/xfs_fs.h
> +++ b/fs/xfs/xfs_fs.h
> @@ -374,12 +374,14 @@ struct xfs_fs_eofblocks {
>  #define XFS_EOF_FLAGS_GID		(1 << 2) /* filter by gid */
>  #define XFS_EOF_FLAGS_PRID		(1 << 3) /* filter by project id */
>  #define XFS_EOF_FLAGS_MINFILESIZE	(1 << 4) /* filter by min file size */
> +#define XFS_EOF_FLAGS_FLUSH		(1 << 5) /* issue a flush */
>  #define XFS_EOF_FLAGS_VALID	\
>  	(XFS_EOF_FLAGS_SYNC |	\
>  	 XFS_EOF_FLAGS_UID |	\
>  	 XFS_EOF_FLAGS_GID |	\
>  	 XFS_EOF_FLAGS_PRID |	\
> -	 XFS_EOF_FLAGS_MINFILESIZE)
> +	 XFS_EOF_FLAGS_MINFILESIZE | \
> +	 XFS_EOF_FLAGS_FLUSH)
>  
>  
>  /*
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 7ff59c9..d4e15db 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1231,6 +1231,9 @@ xfs_inode_free_eofblocks(
>  		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
>  		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
>  			return 0;
> +
> +		if (eofb->eof_flags & XFS_EOF_FLAGS_FLUSH)
> +			filemap_flush(VFS_I(ip)->i_mapping);

So this does WB_SYNC_NONE writeback, which means the filesystem is
free to ignore it when we get to .writepage. Given that we are at
ENOSPC here, wouldn't it be better to guarantee that writeback will
occur (i.e. use filemap_fdatawrite())?
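
i.e. something like this (untested):

		if (eofb->eof_flags & XFS_EOF_FLAGS_FLUSH)
			filemap_fdatawrite(VFS_I(ip)->i_mapping);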

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-03-28 13:16 ` [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT Brian Foster
@ 2014-03-31 22:22   ` Dave Chinner
  2014-04-01 13:55     ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-03-31 22:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Mar 28, 2014 at 09:16:02AM -0400, Brian Foster wrote:
> Speculative preallocation and the associated throttling metrics
> assume we're working with large files on large filesystems. Users have
> reported inefficiencies in these mechanisms when we happen to be dealing
> with large files on smaller filesystems. This can occur because while
> prealloc throttling is aggressive under low free space conditions, it is
> not active until we reach 5% free space or less.
> 
> For example, a 40GB filesystem has enough space for several files large
> enough to have multi-GB preallocations at any given time. If those files
> are slow growing, they might reserve preallocation for long periods of
> time as well as avoid the background scanner due to frequent
> modification. If a new file is written under these conditions, said file
> has no access to this already reserved space and premature ENOSPC is
> imminent.
> 
> To handle this scenario, modify the buffered write ENOSPC handling and
> retry sequence to invoke an eofblocks scan. The eofblocks scan is
> attempted prior to the inode flush because it is lighter weight; the
> flush is the last resort to free space. In the smaller filesystem
> scenario, the eofblocks scan resets the usage of preallocation such that
> when the 5% free space threshold is met, throttling effectively takes
> over to provide fair and efficient preallocation until legitimate
> ENOSPC.
> 
> The eofblocks scan is selective based on the nature of the failure. For
> example, an EDQUOT failure in a particular quota will use a filtered
> scan for that quota. Because we don't know which quota might have caused
> an allocation failure at any given time, we run a scan against each
> applicable quota determined to be under low free space conditions.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_dquot.h  | 15 ++++++++++++++
>  fs/xfs/xfs_file.c   | 32 ++++++++++++++++++++++++++---
>  fs/xfs/xfs_icache.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_icache.h |  1 +
>  4 files changed, 104 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
> index d22ed00..899b99f 100644
> --- a/fs/xfs/xfs_dquot.h
> +++ b/fs/xfs/xfs_dquot.h
> @@ -141,6 +141,21 @@ static inline xfs_dquot_t *xfs_inode_dquot(struct xfs_inode *ip, int type)
>  	}
>  }
>  
> +/*
> + * Check whether a dquot is under low free space conditions. We assume the quota
> + * is enabled and enforced.
> + */
> +static inline bool xfs_dquot_lowsp(struct xfs_dquot *dqp)
> +{
> +	int64_t freesp;
> +
> +	freesp = be64_to_cpu(dqp->q_core.d_blk_hardlimit) - dqp->q_res_bcount;
> +	if (freesp < dqp->q_low_space[XFS_QLOWSP_1_PCNT])
> +		return true;
> +
> +	return false;
> +}
> +
>  #define XFS_DQ_IS_LOCKED(dqp)	(mutex_is_locked(&((dqp)->q_qlock)))
>  #define XFS_DQ_IS_DIRTY(dqp)	((dqp)->dq_flags & XFS_DQ_DIRTY)
>  #define XFS_QM_ISUDQ(dqp)	((dqp)->dq_flags & XFS_DQ_USER)
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 2e7989e..1ca16ce 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -38,6 +38,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
>  #include "xfs_dinode.h"
> +#include "xfs_icache.h"
>  
>  #include <linux/aio.h>
>  #include <linux/dcache.h>
> @@ -723,6 +724,7 @@ xfs_file_buffered_aio_write(
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	ssize_t			ret;
>  	int			enospc = 0;
> +	int			scanned = 0;
>  	int			iolock = XFS_IOLOCK_EXCL;
>  	size_t			count = ocount;
>  
> @@ -741,10 +743,34 @@ write_retry:
>  			pos, &iocb->ki_pos, count, 0);
>  
>  	/*
> -	 * If we just got an ENOSPC, try to write back all dirty inodes to
> -	 * convert delalloc space to free up some of the excess reserved
> -	 * metadata space.
> +	 * If we hit ENOSPC or a quota limit, use the selective nature of the
> +	 * eofblocks scan to try and free up some lingering speculative
> +	 * preallocation delalloc blocks.
> +	 *
> +	 * If we hit a quota limit, only scan for files covered by the quota. We
> +	 * also consider ENOSPC here because project quota failure can return
> +	 * ENOSPC instead of EDQUOT. The quota scanning only sets 'scanned' if
> +	 * the inode is covered by a quota with low free space. This should
> +	 * minimize interference with global ENOSPC handling.
> +	 *
> +	 * If a scan does not free enough space, resort to the inode flush big
> +	 * hammer to convert delalloc space to free up some of the excess
> +	 * reserved metadata space.
>  	 */
> +	if ((ret == -EDQUOT || ret == -ENOSPC) && !scanned) {
> +		scanned = xfs_inode_free_quota_eofblocks(ip);
> +		if (scanned)
> +			goto write_retry;
> +	}
> +	if (ret == -ENOSPC && !scanned) {
> +		struct xfs_eofblocks eofb = {0,};

IIRC, you can just use "{ 0 }" for initialisation, no "," needed.

> +
> +		eofb.eof_scan_owner = ip->i_ino; /* for locking */
> +		eofb.eof_flags = XFS_EOF_FLAGS_SYNC | XFS_EOF_FLAGS_FLUSH;
> +		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> +		scanned = 1;
> +		goto write_retry;
> +	}
>  	if (ret == -ENOSPC && !enospc) {
>  		enospc = 1;
>  		xfs_flush_inodes(ip->i_mount);

This seems overly complex and fragile. I'd much prefer that we don't
bury data writeback deep in the EOF block freeing code - we've done
a lot of work in the past to remove exactly that sort of behaviour
from XFS inode scanners.

I'd prefer to see something like this:

	if (ret == -EDQUOT && !enospc) {
		enospc = 1;
		xfs_inode_free_quota_eofblocks(ip);
		goto retry;
	} else if (ret == -ENOSPC && !enospc) {
		enospc = 1;
		xfs_flush_inodes(ip->i_mount);
		....
		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
		goto retry;
	}

This way has specific behaviours for EDQUOT vs ENOSPC, and we treat
them appropriately with a minimum of differences. And ENOSPC is
global, because we can't tell the difference here between a project
quota ENOSPC and a global ENOSPC at this point.

> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index bd0ab7d..471ccfa 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -33,6 +33,9 @@
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
>  #include "xfs_bmap_util.h"
> +#include "xfs_quota.h"
> +#include "xfs_dquot_item.h"
> +#include "xfs_dquot.h"
>  
>  #include <linux/kthread.h>
>  #include <linux/freezer.h>
> @@ -1277,6 +1280,62 @@ xfs_icache_free_eofblocks(
>  					 eofb, XFS_ICI_EOFBLOCKS_TAG);
>  }
>  
> +/*
> + * Run eofblocks scans on the quotas applicable to the inode. For inodes with
> + * multiple quotas, we don't know exactly which quota caused an allocation
> + * failure. We make a best effort by running scans for each quota considered
> + * to be under low free space conditions (less than 1% available free space).
> + */
> +int
> +xfs_inode_free_quota_eofblocks(
> +	struct xfs_inode *ip)
> +{
> +	int scanned = 0;
> +	struct xfs_eofblocks eofb = {0,};
> +	struct xfs_dquot *dq;
> +
> +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> +
> +	/* set the scan owner to avoid potential livelock */
> +	eofb.eof_scan_owner = ip->i_ino;
> +
> +	if (XFS_IS_UQUOTA_ENFORCED(ip->i_mount)) {
> +		dq = xfs_inode_dquot(ip, XFS_DQ_USER);
> +		if (dq && xfs_dquot_lowsp(dq)) {
> +			eofb.eof_uid = VFS_I(ip)->i_uid;
> +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|

whitespace.

> +					 XFS_EOF_FLAGS_UID;
> +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> +			scanned = 1;
> +		}
> +	}
> +
> +	if (XFS_IS_GQUOTA_ENFORCED(ip->i_mount)) {
> +		dq = xfs_inode_dquot(ip, XFS_DQ_GROUP);
> +		if (dq && xfs_dquot_lowsp(dq)) {
> +			eofb.eof_gid = VFS_I(ip)->i_gid;
> +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> +					 XFS_EOF_FLAGS_GID;
> +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> +			scanned = 1;
> +		}
> +	}
> +
> +	if (XFS_IS_PQUOTA_ENFORCED(ip->i_mount)) {
> +		dq = xfs_inode_dquot(ip, XFS_DQ_PROJ);
> +		if (dq && xfs_dquot_lowsp(dq)) {
> +			eofb.eof_prid = xfs_get_projid(ip);
> +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> +					 XFS_EOF_FLAGS_PRID|
> +					 XFS_EOF_FLAGS_FLUSH;
> +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> +			scanned = 1;
> +		}
> +	}

I really don't like the fact that project quota is hiding a data
flush in the "free_quota_eofblocks" logic. It just strikes me a the
wrong thing to do because if it's a real ENOSPC we're just going to
have to do this anyway...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/5] xfs: add flush flag to xfs_eofblocks
  2014-03-31 21:47   ` Dave Chinner
@ 2014-04-01 13:48     ` Brian Foster
  0 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2014-04-01 13:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 01, 2014 at 08:47:54AM +1100, Dave Chinner wrote:
> On Fri, Mar 28, 2014 at 09:16:00AM -0400, Brian Foster wrote:
> > The flush flag allows the caller to issue a flush for scanned inodes. In
> > ENOSPC conditions caused by project quotas, a flush is required to free
> > up reserved metadata allocations.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_fs.h     | 4 +++-
> >  fs/xfs/xfs_icache.c | 3 +++
> >  2 files changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
> > index c5fc116..fa3a58e 100644
> > --- a/fs/xfs/xfs_fs.h
> > +++ b/fs/xfs/xfs_fs.h
> > @@ -374,12 +374,14 @@ struct xfs_fs_eofblocks {
> >  #define XFS_EOF_FLAGS_GID		(1 << 2) /* filter by gid */
> >  #define XFS_EOF_FLAGS_PRID		(1 << 3) /* filter by project id */
> >  #define XFS_EOF_FLAGS_MINFILESIZE	(1 << 4) /* filter by min file size */
> > +#define XFS_EOF_FLAGS_FLUSH		(1 << 5) /* issue a flush */
> >  #define XFS_EOF_FLAGS_VALID	\
> >  	(XFS_EOF_FLAGS_SYNC |	\
> >  	 XFS_EOF_FLAGS_UID |	\
> >  	 XFS_EOF_FLAGS_GID |	\
> >  	 XFS_EOF_FLAGS_PRID |	\
> > -	 XFS_EOF_FLAGS_MINFILESIZE)
> > +	 XFS_EOF_FLAGS_MINFILESIZE | \
> > +	 XFS_EOF_FLAGS_FLUSH)
> >  
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 7ff59c9..d4e15db 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -1231,6 +1231,9 @@ xfs_inode_free_eofblocks(
> >  		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
> >  		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
> >  			return 0;
> > +
> > +		if (eofb->eof_flags & XFS_EOF_FLAGS_FLUSH)
> > +			filemap_flush(VFS_I(ip)->i_mapping);
> 
> So this does WB_SYNC_NONE writeback, which means the filesystem is
> free to ignore it when we get to .writepage. Given that we are are
> ENOSPC here, wouldn't it be better to guarantee that writeback will
> occur (i.e. use filemap_fdatawrite())?
> 

Yeah, I didn't catch that. That makes sense, thanks.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-03-31 22:22   ` Dave Chinner
@ 2014-04-01 13:55     ` Brian Foster
  2014-04-01 21:19       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-04-01 13:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 01, 2014 at 09:22:47AM +1100, Dave Chinner wrote:
> On Fri, Mar 28, 2014 at 09:16:02AM -0400, Brian Foster wrote:
...
> >  	/*
> > -	 * If we just got an ENOSPC, try to write back all dirty inodes to
> > -	 * convert delalloc space to free up some of the excess reserved
> > -	 * metadata space.
> > +	 * If we hit ENOSPC or a quota limit, use the selective nature of the
> > +	 * eofblocks scan to try and free up some lingering speculative
> > +	 * preallocation delalloc blocks.
> > +	 *
> > +	 * If we hit a quota limit, only scan for files covered by the quota. We
> > +	 * also consider ENOSPC here because project quota failure can return
> > +	 * ENOSPC instead of EDQUOT. The quota scanning only sets 'scanned' if
> > +	 * the inode is covered by a quota with low free space. This should
> > +	 * minimize interference with global ENOSPC handling.
> > +	 *
> > +	 * If a scan does not free enough space, resort to the inode flush big
> > +	 * hammer to convert delalloc space to free up some of the excess
> > +	 * reserved metadata space.
> >  	 */
> > +	if ((ret == -EDQUOT || ret == -ENOSPC) && !scanned) {
> > +		scanned = xfs_inode_free_quota_eofblocks(ip);
> > +		if (scanned)
> > +			goto write_retry;
> > +	}
> > +	if (ret == -ENOSPC && !scanned) {
> > +		struct xfs_eofblocks eofb = {0,};
> 
> IIRC, you can just use "{ 0 }" for initialisation, no "," needed.
> 
> > +
> > +		eofb.eof_scan_owner = ip->i_ino; /* for locking */
> > +		eofb.eof_flags = XFS_EOF_FLAGS_SYNC | XFS_EOF_FLAGS_FLUSH;
> > +		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > +		scanned = 1;
> > +		goto write_retry;
> > +	}
> >  	if (ret == -ENOSPC && !enospc) {
> >  		enospc = 1;
> >  		xfs_flush_inodes(ip->i_mount);
> 
> This seems overly complex and fragile. I'd much prefer that we don't
> bury data writeback deep in the EOF block freeing code - we've done
> a lot of work in the past to remove exactly that sort of behaviour
> from XFS inode scanners.
> 

I think the fragility comes from the fact that we can't distinguish a
particular quota failure or a project quota failure from a global
failure. IIRC from looking at this way back when, there wasn't a clear
solution to that problem. It didn't seem worth getting too far into for
the purpose of this little bit of functionality. In that sense, the
fragility will be there regardless.

It would be nice to simplify this, however. It took a little bit of
staring at this to try and make it effective and somewhat succinct.

> I'd prefer to see something like this:
> 
> 	if (ret == -EDQUOT && !enospc) {
> 		enospc = 1;
> 		xfs_inode_free_quota_eofblocks(ip);
> 		goto retry;
> 	} else if (ret == -ENOSPC && !enospc) {
> 		enospc = 1;
> 		xfs_flush_inodes(ip->i_mount);
> 		....
> 		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> 		goto retry;
> 	}
> 

What I don't like about this, in particular, is xfs_flush_inodes() could
be unnecessary. We have a max preallocation size of 8GB, so the
eofblocks scan alone can free up gigabytes of space. Now that I think of
it, this is kind of a flaw in the proposed logic as well.

> This way has specific behaviours for EDQUOT vs ENOSPC, and we treat
> them appropriately with a minimum of differences. And ENOSPC is
> global, because we can't tell the difference here between a project
> quota ENOSPC and a global ENOSPC at this point.
> 

The main difference I see is that the original logic is centered around
figuring out how to do an eofblocks scan vs. xfs_flush_inodes() only
when necessary. In other words, we always attempt to do an eofblocks
scan first then fall back to the inode flush. Where it gets a bit ugly
is where we try to determine what kind of scan to run due to lack of
information. The flush aspect of things is a little confused as well I
suppose.

Alternatively, the logic above is designed around categorization of the
possible error conditions. I agree that it is more simple, but not quite
as effective as noted above. We would continue to run a global inode
flush when eofblocks can clean up effectively for the time being or due
to a project quota failure that should ideally not impact the wider fs.

I wonder if something like the following would simplify this enough, yet
still provide nicer behavior:

	/*
	 * If this is an allocation failure, attempt an eofblocks scan
	 * before we resort to an inode flush...
	 */
	if ((ret == -EDQUOT || ret == -ENOSPC) && !scanned) {
		scanned = xfs_inode_enospc_eofblocks(ip);
		if (scanned)
			goto retry;
		scanned = 1;	/* try this once */
	}
	if (ret == -ENOSPC && !enospc) {
		enospc = 1;
		xfs_flush_inodes(ip->i_mount);
		goto retry;
	}

This consolidates the eofblocks scan logic to a separate helper to
simplify things. The helper can run the quota scan and/or the global
scan based on the data regarding the situation (i.e., the inode and
associated quota characteristics). This allows us to preserve the notion
of attempting a lightweight recovery first and a heavyweight recovery
second, reserving the inode flush for when space is truly tight.
Thoughts?

> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index bd0ab7d..471ccfa 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -33,6 +33,9 @@
...
> > +
> > +	if (XFS_IS_PQUOTA_ENFORCED(ip->i_mount)) {
> > +		dq = xfs_inode_dquot(ip, XFS_DQ_PROJ);
> > +		if (dq && xfs_dquot_lowsp(dq)) {
> > +			eofb.eof_prid = xfs_get_projid(ip);
> > +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> > +					 XFS_EOF_FLAGS_PRID|
> > +					 XFS_EOF_FLAGS_FLUSH;
> > +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > +			scanned = 1;
> > +		}
> > +	}
> 
> I really don't like the fact that project quota is hiding a data
> flush in the "free_quota_eofblocks" logic. It just strikes me a the
> wrong thing to do because if it's a real ENOSPC we're just going to
> have to do this anyway...
> 

I was under the impression that a flush was important for project quotas
based on freeing up reserved metadata space (from our past discussions
on the original eofblocks work). I could have just misunderstood the
point at the time. If so, I can just remove the flush flag on the
eofblocks scan in the logic proposed above and let the inode flush
handle this for both cases.

If we go that route, any preference as to whether to keep the support
for doing a flush in eofblocks at all? I included it primarily for the
project quota case. With that dropped, I could just drop the first
couple of patches here.

Thanks for the review.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-04-01 13:55     ` Brian Foster
@ 2014-04-01 21:19       ` Dave Chinner
  2014-04-01 23:20         ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-04-01 21:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Apr 01, 2014 at 09:55:18AM -0400, Brian Foster wrote:
> On Tue, Apr 01, 2014 at 09:22:47AM +1100, Dave Chinner wrote:
> > On Fri, Mar 28, 2014 at 09:16:02AM -0400, Brian Foster wrote:
> ...
> > >  	/*
> > > -	 * If we just got an ENOSPC, try to write back all dirty inodes to
> > > -	 * convert delalloc space to free up some of the excess reserved
> > > -	 * metadata space.
> > > +	 * If we hit ENOSPC or a quota limit, use the selective nature of the
> > > +	 * eofblocks scan to try and free up some lingering speculative
> > > +	 * preallocation delalloc blocks.
> > > +	 *
> > > +	 * If we hit a quota limit, only scan for files covered by the quota. We
> > > +	 * also consider ENOSPC here because project quota failure can return
> > > +	 * ENOSPC instead of EDQUOT. The quota scanning only sets 'scanned' if
> > > +	 * the inode is covered by a quota with low free space. This should
> > > +	 * minimize interference with global ENOSPC handling.
> > > +	 *
> > > +	 * If a scan does not free enough space, resort to the inode flush big
> > > +	 * hammer to convert delalloc space to free up some of the excess
> > > +	 * reserved metadata space.
> > >  	 */
> > > +	if ((ret == -EDQUOT || ret == -ENOSPC) && !scanned) {
> > > +		scanned = xfs_inode_free_quota_eofblocks(ip);
> > > +		if (scanned)
> > > +			goto write_retry;
> > > +	}
> > > +	if (ret == -ENOSPC && !scanned) {
> > > +		struct xfs_eofblocks eofb = {0,};
> > 
> > IIRC, you can just use "{ 0 }" for initialisation, no "," needed.
> > 
> > > +
> > > +		eofb.eof_scan_owner = ip->i_ino; /* for locking */
> > > +		eofb.eof_flags = XFS_EOF_FLAGS_SYNC | XFS_EOF_FLAGS_FLUSH;
> > > +		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > > +		scanned = 1;
> > > +		goto write_retry;
> > > +	}
> > >  	if (ret == -ENOSPC && !enospc) {
> > >  		enospc = 1;
> > >  		xfs_flush_inodes(ip->i_mount);
> > 
> > This seems overly complex and fragile. I'd much prefer that we don't
> > bury data writeback deep in the EOF block freeing code - we've done
> > a lot of work in the past to remove exactly that sort of behaviour
> > from XFS inode scanners.
> 
> I think the fragility comes from the fact that we can't detect a
> particular quota failure or a project quota failure from a global
> failure. IIRC from looking at this way back when, there wasn't a clear
> solution to that problem. It didn't seem worth getting too far into for
> the purpose of this little bit of functionality. In that sense, the
> fragility will be there regardless.
> 
> It would be nice to simplify this, however. It took a little bit of
> staring at this to try and make it effective and somewhat succinct.
> 
> > I'd prefer to see something like this:
> > 
> > 	if (ret == -EDQUOT && !enospc) {
> > 		enospc = 1;
> > 		xfs_inode_free_quota_eofblocks(ip);
> > 		goto retry;
> > 	} else if (ret == -ENOSPC && !enospc) {
> > 		enospc = 1;
> > 		xfs_flush_inodes(ip->i_mount);
> > 		....
> > 		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > 		goto retry;
> > 	}
> > 
> 
> What I don't like about this, in particular, is xfs_flush_inodes() could
> be unnecessary. We have a max preallocation size of 8GB, so the
> eofblocks scan alone can free up gigabytes of space. Now that I think of
> it, this is kind of a flaw in the proposed logic as well.

The issue is that an eofblock scan is expensive, and there's no
limiting on how many inodes can have the EOF tag set and there's no
limiting on the number of concurrent scans that can be run. i.e.
when an active project that has many processes writing to it runs
out of space, every process will enter the xfs_icache_free_eofblocks()
function and start issuing data flush requests on dirty inodes. It
causes unbound writeback concurrency, and we know that this is a bad
thing to do to your storage...

xfs_flush_inodes() only runs one scan of the currently dirty
inodes at a time because it serialises on the flusher thread. We get
optimised writeback patterns rather than random data writeback from
as many threads hitting the eofblocks scanner at once. History has
shown that this is the most efficient way to clean dirty data -
optimise once and in the correct place, then use it everywhere.
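
For reference, xfs_flush_inodes() is (roughly) just a serialised sync of
the superblock's dirty inodes through the flusher thread:

void
xfs_flush_inodes(
	struct xfs_mount	*mp)
{
	struct super_block	*sb = mp->m_super;

	/* skip if unmount is in progress; otherwise wait for writeback */
	if (down_read_trylock(&sb->s_umount)) {
		sync_inodes_sb(sb);
		up_read(&sb->s_umount);
	}
}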

Yes, it means that there is a global flush when a project quota runs
out of space, but it means that we only do it once and we don't burn
excessive CPU walking radix trees scanning inodes needlessly every
time we get a storm of processes hammering project quota ENOSPC.

FWIW, if we want to filter the list of dirty inodes to writeback, we
should look at providing a filter callback to the generic inode code
that gets called on each dirty inode. That way we can use the
optimised writeback code to quickly find and dispatch only the dirty
inodes that match the quota id we are having problems with....
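
Purely as a hypothetical sketch (no such hook exists in the generic
writeback code today, and the names below are made up), I mean something
like:

/* hypothetical per-inode filter the flusher thread would call */
typedef bool (*wb_filter_fn)(struct inode *inode, void *data);

static bool
xfs_wb_match_prid(
	struct inode	*inode,
	void		*data)
{
	/* only write back inodes owned by the project quota in trouble */
	return xfs_get_projid(XFS_I(inode)) == *(prid_t *)data;
}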

> > This way has specific behaviours for EDQUOT vs ENOSPC, and we treat
> > them appropriately with a minimum of differences. And ENOSPC is
> > global, because we can't tell the difference here between a project
> > quota ENOSPC and a global ENOSPC at this point.
> > 
> 
> The main difference I see is that the original logic is centered around
> figuring out how to do an eofblocks scan vs. xfs_flush_inodes() only
> when necessary. In other words, we always attempt to do an eofblocks
> scan first then fall back to the inode flush. Where it gets a bit ugly
> is where we try to determine what kind of scan to run due to lack of
> information. The flush aspect of things is a little confused as well I
> suppose.
> 
> Alternatively, the logic above is designed around categorization of the
> possible error conditions. I agree that it is more simple, but not quite
> as effective as noted above. We would continue to run a global inode
> flush when eofblocks can clean up effectively for the time being or due
> to a project quota failure that should ideally not impact the wider fs.

Sure, but we've found in the past that just issuing a global flush
is far better from both the CPU and IO efficiency POV than
concurrent inode cache scans to find matching dirty inodes and
flushing them.

> I wonder if something like the following would simplify this enough, yet
> still provide nicer behavior:
> 
> 	/*
> 	 * If this is an allocation failure, attempt an eofblocks scan
> 	 * before we resort to an inode flush...
> 	 */
> 	if ((ret == -EDQUOT || ret == -ENOSPC) && !scanned) {
> 		scanned = xfs_inode_enospc_eofblocks(ip);
> 		if (scanned)
> 			goto retry;
> 		scanned = 1;	/* try this once */
> 	}
> 	if (ret == -ENOSPC && !enospc) {
> 		enospc = 1;
> > 		xfs_flush_inodes(ip->i_mount);
> 		goto retry;
> 	}
> 
> This consolidates the eofblocks scan logic to a separate helper to
> simplify things. The helper can run the quota scan and/or the global
> scan based on the data regarding the situation (i.e., the inode and
> associated quota characteristics). This allows us to preserve the notion
> of attempting a lightweight recovery first and a heavyweight recovery
> second, reserving the inode flush for when space is truly tight.
> Thoughts?

It's still damn clunky, IMO. It's still trying to work around the
fact that project quotas use ENOSPC rather than EDQUOT, and that
makes the logic rather twisted. And it still has that hidden data
flush in it so it can't really be called lightweight...


> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index bd0ab7d..471ccfa 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -33,6 +33,9 @@
> ...
> > > +
> > > +	if (XFS_IS_PQUOTA_ENFORCED(ip->i_mount)) {
> > > +		dq = xfs_inode_dquot(ip, XFS_DQ_PROJ);
> > > +		if (dq && xfs_dquot_lowsp(dq)) {
> > > +			eofb.eof_prid = xfs_get_projid(ip);
> > > +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> > > +					 XFS_EOF_FLAGS_PRID|
> > > +					 XFS_EOF_FLAGS_FLUSH;
> > > +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > > +			scanned = 1;
> > > +		}
> > > +	}
> > 
> > I really don't like the fact that project quota is hiding a data
> > flush in the "free_quota_eofblocks" logic. It just strikes me a the
> > wrong thing to do because if it's a real ENOSPC we're just going to
> > have to do this anyway...
> > 
> 
> I was under the impression that a flush was important for project quotas
> based on freeing up reserved metadata space (from our past discussions
> on the original eofblocks work). I could have just misunderstood the
> point at the time.

It is important - I'm not saying that we should not flush dirty
data. What I am disagreeing with is the implementation of data
flushing in the patch....

> If so, I can just remove the flush flag on the
> eofblocks scan in the logic proposed above and let the inode flush
> handle this for both cases.
> 
> If we go that route, any preference as to whether to keep the support
> for doing a flush in eofblocks at all? I included it primarily for the
> project quota case. With that dropped, I could just drop the first
> couple of patches here.

Yes, you could do that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-04-01 21:19       ` Dave Chinner
@ 2014-04-01 23:20         ` Brian Foster
  2014-04-02  5:11           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-04-01 23:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Apr 02, 2014 at 08:19:26AM +1100, Dave Chinner wrote:
> On Tue, Apr 01, 2014 at 09:55:18AM -0400, Brian Foster wrote:
> > On Tue, Apr 01, 2014 at 09:22:47AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 28, 2014 at 09:16:02AM -0400, Brian Foster wrote:
> > ...
> > > >  	/*
> > > > -	 * If we just got an ENOSPC, try to write back all dirty inodes to
> > > > -	 * convert delalloc space to free up some of the excess reserved
> > > > -	 * metadata space.
> > > > +	 * If we hit ENOSPC or a quota limit, use the selective nature of the
> > > > +	 * eofblocks scan to try and free up some lingering speculative
> > > > +	 * preallocation delalloc blocks.
> > > > +	 *
> > > > +	 * If we hit a quota limit, only scan for files covered by the quota. We
> > > > +	 * also consider ENOSPC here because project quota failure can return
> > > > +	 * ENOSPC instead of EDQUOT. The quota scanning only sets 'scanned' if
> > > > +	 * the inode is covered by a quota with low free space. This should
> > > > +	 * minimize interference with global ENOSPC handling.
> > > > +	 *
> > > > +	 * If a scan does not free enough space, resort to the inode flush big
> > > > +	 * hammer to convert delalloc space to free up some of the excess
> > > > +	 * reserved metadata space.
> > > >  	 */
> > > > +	if ((ret == -EDQUOT || ret == -ENOSPC) && !scanned) {
> > > > +		scanned = xfs_inode_free_quota_eofblocks(ip);
> > > > +		if (scanned)
> > > > +			goto write_retry;
> > > > +	}
> > > > +	if (ret == -ENOSPC && !scanned) {
> > > > +		struct xfs_eofblocks eofb = {0,};
> > > 
> > > IIRC, you can just use "{ 0 }" for initialisation, no "," needed.
> > > 
> > > > +
> > > > +		eofb.eof_scan_owner = ip->i_ino; /* for locking */
> > > > +		eofb.eof_flags = XFS_EOF_FLAGS_SYNC | XFS_EOF_FLAGS_FLUSH;
> > > > +		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > > > +		scanned = 1;
> > > > +		goto write_retry;
> > > > +	}
> > > >  	if (ret == -ENOSPC && !enospc) {
> > > >  		enospc = 1;
> > > >  		xfs_flush_inodes(ip->i_mount);
> > > 
> > > This seems overly complex and fragile. I'd much prefer that we don't
> > > bury data writeback deep in the EOF block freeing code - we've done
> > > a lot of work in the past to remove exactly that sort of behaviour
> > > from XFS inode scanners.
> > 
> > I think the fragility comes from the fact that we can't detect a
> > particular quota failure or a project quota failure from a global
> > failure. IIRC from looking at this way back when, there wasn't a clear
> > solution to that problem. It didn't seem worth getting too far into for
> > the purpose of this little bit of functionality. In that sense, the
> > fragility will be there regardless.
> > 
> > It would be nice to simplify this, however. It took a little bit of
> > staring at this to try and make it effective and somewhat succinct.
> > 
> > > I'd prefer to see something like this:
> > > 
> > > 	if (ret == -EDQUOT && !enospc) {
> > > 		enospc = 1;
> > > 		xfs_inode_free_quota_eofblocks(ip);
> > > 		goto retry;
> > > 	} else if (ret == -ENOSPC && !enospc) {
> > > 		enospc = 1;
> > > 		xfs_flush_inodes(ip->i_mount);
> > > 		....
> > > 		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > > 		goto retry;
> > > 	}
> > > 
> > 
> > What I don't like about this, in particular, is xfs_flush_inodes() could
> > be unnecessary. We have a max preallocation size of 8GB, so the
> > eofblocks scan alone can free up gigabytes of space. Now that I think of
> > it, this is kind of a flaw in the proposed logic as well.
> 
> The issue is that an eofblock scan is expensive, and there's no
> limiting on how many inodes can have the EOF tag set and there's no
> limiting on the number of concurrent scans that can be run. i.e.
> when an active project that has many processes writing to it runs
> out of space, every process will enter xfs_icache_free_eofblocks()
> function and start issuing data flush requests on dirty inodes. It
> causes unbound writeback concurrency, and we know that this is a bad
> thing to do to your storage...
> 
> xfs_flush_inodes() only runs one scan of the currently dirty
> inodes at a time because it serialises on the flusher thread. We get
> optimised writeback patterns rather than random data writeback from
> as many threads hitting the eofblocks scanner at once. History has
> shown that this is the most efficient way to clean dirty data -
> optimise once and in the correct place, then use it everywhere.
> 

Ok. I've been waffling on whether the eofblocks scan does/needs to do
the flush, including whether there is any value of it being "selective."
This suggests otherwise. In particular, that one big flush is better
than many smaller/selective flushes. So we can just toss the project
quota eofblocks flush. I'll drop the first couple patches for the flush
mechanism.

> Yes, it means that there is a global flush when a project quota runs
> out of space, but it means that we only do it once and we don't burn
> excessive CPU walking radix trees scanning inodes needlessly every
> time we get a storm of processes hammering project quota ENOSPC.
> 

It's not clear to me that excess scanning is an issue, but it seems
orthogonal to how we organize the enospc logic (at least once the flush
is out of the equation). IOW, we'll invoke the scanner with the same
frequency either way. Or maybe you are just referring to scanning
specifically for the purpose of doing flushes as a waste..?

That said, I'll do some testing in this area to try and detect any
potential issues.

> FWIW, if we want to filter the list of dirty inodes to writeback, we
> should look at providing a filter callback to the generic inode code
> that gets called on each dirty inode. That way we can use the
> optimised writeback code to quickly find and dispatch only the dirty
> inodes that match the quota id we are having problems with....
> 
> > > This way has specific behaviours for EDQUOT vs ENOSPC, and we treat
> > > them appropriately with a minimum of differences. And ENOSPC is
> > > global, because we can't tell the difference here between a project
> > > quota ENOSPC and a global ENOSPC at this point.
> > > 
> > 
> > The main difference I see is that the original logic is centered around
> > figuring out how to do an eofblocks scan vs. xfs_flush_inodes() only
> > when necessary. In other words, we always attempt to do an eofblocks
> > scan first then fall back to the inode flush. Where it gets a bit ugly
> > is where we try to determine what kind of scan to run due to lack of
> > information. The flush aspect of things is a little confused as well I
> > suppose.
> > 
> > Alternatively, the logic above is designed around categorization of the
> > possible error conditions. I agree that it is simpler, but not quite
> > as effective as noted above. We would continue to run a global inode
> > flush when eofblocks can clean up effectively for the time being or due
> > to a project quota failure that should ideally not impact the wider fs.
> 
> Sure, but we've found in the past that just issuing a global flush
> is far better from both the CPU and IO efficiency POV than
> concurrent inode cache scans to find matching dirty inodes and
> flushing them.
> 

Makes sense...

> > I wonder if something like the following would simplify this enough, yet
> > still provide nicer behavior:
> > 
> > 	/*
> > 	 * If this is an allocation failure, attempt an eofblocks scan
> > 	 * before we resort to an inode flush...
> > 	 */
> > 	if ((ret == -EDQUOT || ret == -ENOSPC) && !scanned) {
> > 		scanned = xfs_inode_enospc_eofblocks(ip);
> > 		if (scanned)
> > 			goto retry;
> > 		scanned = 1;	/* try this once */
> > 	}
> > 	if (ret == -ENOSPC && !enospc) {
> > 		enospc = 1;
> > 		xfs_flush_inodes(ip->i_mount);
> > 		goto retry;
> > 	}
> > 
> > This consolidates the eofblocks scan logic to a separate helper to
> > simplify things. The helper can run the quota scan and/or the global
> > scan based on the data regarding the situation (i.e., the inode and
> > associated quota characteristics). This allows us to preserve the notion
> > of attempting a lightweight recovery first and a heavyweight recovery
> > second, reserving the inode flush for when space is truly tight.
> > Thoughts?
> 
> It's still damn clunky, IMO. It's still trying to work around the
> fact that project quotas use ENOSPC rather than EDQUOT, and that
> makes the logic rather twisted. And it still has that hidden data
> flush in it so it can't really be called lightweight...
> 

Ok. Well if the high level logic is the issue, we could still turn that
inside out a bit to use your EDQUOT/ENOSPC logic, yet still preserve the
standalone eofblocks scan. It might be a few more lines, but perhaps
more clear. E.g.:

	if (ret == -EDQUOT && !enospc) {
		enospc = 1;
		xfs_inode_free_quota_eofblocks(ip);
		goto retry;
	} else if (ret == -ENOSPC && !enospc) {
		if (!scanned) {
			struct xfs_eofblocks eofb = {0};
			...
			scanned = 1;
			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
			goto retry;
		}

		enospc = 1;
		xfs_flush_inodes(ip->i_mount);
		goto retry;
	}

Thoughts on that? We'll handle project quota ENOSPC just as we handle
global ENOSPC with respect to the scan and the flush, but we'll still
have the opportunity to free up a decent amount of space without
initiating I/O. The project quota scan down in free_quota_eofblocks()
might be rendered pointless as well.

Brian

> 
> > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > index bd0ab7d..471ccfa 100644
> > > > --- a/fs/xfs/xfs_icache.c
> > > > +++ b/fs/xfs/xfs_icache.c
> > > > @@ -33,6 +33,9 @@
> > ...
> > > > +
> > > > +	if (XFS_IS_PQUOTA_ENFORCED(ip->i_mount)) {
> > > > +		dq = xfs_inode_dquot(ip, XFS_DQ_PROJ);
> > > > +		if (dq && xfs_dquot_lowsp(dq)) {
> > > > +			eofb.eof_prid = xfs_get_projid(ip);
> > > > +			eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> > > > +					 XFS_EOF_FLAGS_PRID|
> > > > +					 XFS_EOF_FLAGS_FLUSH;
> > > > +			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > > > +			scanned = 1;
> > > > +		}
> > > > +	}
> > > 
> > > I really don't like the fact that project quota is hiding a data
> > > flush in the "free_quota_eofblocks" logic. It just strikes me as the
> > > wrong thing to do because if it's a real ENOSPC we're just going to
> > > have to do this anyway...
> > > 
> > 
> > I was under the impression that a flush was important for project quotas
> > based on freeing up reserved metadata space (from our past discussions
> > on the original eofblocks work). I could have just misunderstood the
> > point at the time.
> 
> It is important - I'm not saying that we should not flush dirty
> data. What I am disagreeing with is the implementation of data
> flushing in the patch....
> 
> > If so, I can just remove the flush flag on the
> > eofblocks scan in the logic proposed above and let the inode flush
> > handle this for both cases.
> > 
> > If we go that route, any preference as to whether to keep the support
> > for doing a flush in eofblocks at all? I included it primarily for the
> > project quota case. With that dropped, I could just drop the first
> > couple of patches here.
> 
> Yes, you could do that.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-04-01 23:20         ` Brian Foster
@ 2014-04-02  5:11           ` Dave Chinner
  2014-04-02 20:11             ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-04-02  5:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Apr 01, 2014 at 07:20:23PM -0400, Brian Foster wrote:
> On Wed, Apr 02, 2014 at 08:19:26AM +1100, Dave Chinner wrote:
> > On Tue, Apr 01, 2014 at 09:55:18AM -0400, Brian Foster wrote:
> > Yes, it means that there is a global flush when a project quota runs
> > out of space, but it means that we only do it once and we don't burn
> > excessive CPU walking radix trees scanning inodes needlessly every
> > time we get a storm of processes hammering project quota ENOSPC.
> > 
> 
> It's not clear to me that excess scanning is an issue, but it seems

Have 100 threads hit ENOSPC on the same project quota at the same
time on a filesystem with a couple of thousand AGs with a few
million cached inodes, and then you'll see the problem....

> orthogonal to how we organize the enospc logic (at least once the flush
> is out of the equation). IOW, we'll invoke the scanner with the same
> frequency either way. Or maybe you are just referring to scanning
> specifically for the purpose of doing flushes as a waste..?

Well, lots of concurrent scanning for the same purpose is highly
inefficient - look at the amount of additional serialisation in the
inode reclaim walker that is aimed at reducing concurrency to one
reclaim thread per AG at a time...

I expect that if the serialisation on xfs_flush_inodes() isn't
sufficient to throttle eofblocks scanning concurrency in cases like
the above then we'll know about it pretty quickly.

> > > simplify things. The helper can run the quota scan and/or the global
> > > scan based on the data regarding the situation (i.e., the inode and
> > > associated quota characteristics). This allows us to preserve the notion
> > > of attempting a lightweight recovery first and a heavyweight recovery
> > > second, reserving the inode flush for when space is truly tight.
> > > Thoughts?
> > 
> > It's still damn clunky, IMO. It's still trying to work around the
> > fact that project quotas use ENOSPC rather than EDQUOT, and that
> > makes the logic rather twisted. And it still has that hidden data
> > flush in it so it can't really be called lightweight...
> > 
> 
> Ok. Well if the high level logic is the issue, we could still turn that
> inside out a bit to use your EDQUOT/ENOSPC logic, yet still preserve the
> standalone eofblocks scan. It might be a few more lines, but perhaps
> more clear. E.g.:
> 
> 	if (ret == -EDQUOT && !enospc) {
> 		enospc = 1;
> 		xfs_inode_free_quota_eofblocks(ip);
> 		goto retry;
> 	} else if (ret == -ENOSPC && !enospc) {
> 		if (!scanned) {
> 			struct xfs_eofblocks eofb = {0};
> 			...
> 			scanned = 1;
> 			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> 			goto retry;
> 		}
> 
> 		enospc = 1;
> 		xfs_flush_inodes(ip->i_mount);
> 		goto retry;
> 	}
> 
> Thoughts on that?

Even more convoluted. :/

Look at it this way - I've never been a fan of this "retry write on
enospc once" code. It's a necessary evil due to the reality of
locking orders and having to back out of the write far enough to
safely trigger dirty inode writeback so we *might* be able to write
this data. Making this code more finicky and tricky is the last
thing I think we should be doing.

I don't have any better ideas at this point - I'm still tired and
jetlagged and need to sort out everything for a merge window pull of
the current tree, so time for me to think up a creative solution is
limited. I'm hoping that you might be able to think of a different
way entirely of doing this "write retry on ENOSPC" that solves all
these problems simply and easily ;)

> We'll handle project quota ENOSPC just as we handle
> global ENOSPC with respect to the scan and the flush, but we'll still
> have the opportunity to free up a decent amount of space without
> initiating I/O. The project quota scan down in free_quota_eofblocks()
> might be rendered pointless as well.

Well, therein lies an avenue of investigation: for EOF block
trimming, why would we need to do a flush first, even for project
quotas? And if we aren't going to flush data, why does it need to be
done at this level? Can it be done deep in the quota code where we
know exactly what quota ran out of space?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-04-02  5:11           ` Dave Chinner
@ 2014-04-02 20:11             ` Brian Foster
  2014-04-03 22:18               ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-04-02 20:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Apr 02, 2014 at 04:11:03PM +1100, Dave Chinner wrote:
> On Tue, Apr 01, 2014 at 07:20:23PM -0400, Brian Foster wrote:
> > On Wed, Apr 02, 2014 at 08:19:26AM +1100, Dave Chinner wrote:
> > > On Tue, Apr 01, 2014 at 09:55:18AM -0400, Brian Foster wrote:
> > > Yes, it means that there is a global flush when a project quota runs
> > > out of space, but it means that we only do it once and we don't burn
> > > excessive CPU walking radix trees scanning inodes needlessly every
> > > time we get a storm of processes hammering project quota ENOSPC.
> > > 
> > 
> > It's not clear to me that excess scanning is an issue, but it seems
> 
> Have 100 threads hit ENOSPC on the same project quota at the same
> time on a filesystem with a couple of thousand AGs with a few
> million cached inodes, and then you'll see the problem....
> 

Right, that certainly sounds like a mess if we kicked off a bunch of the
mini-flush scans. I'm curious how much overhead there would be with the
flush out of the picture, where a scan can actually go ahead and free up
some space (hopefully) quickly. That's something I'll have to test.

> > orthogonal to how we organize the enospc logic (at least once the flush
> > is out of the equation). IOW, we'll invoke the scanner with the same
> > frequency either way. Or maybe you are just referring to scanning
> > specifically for the purpose of doing flushes as a waste..?
> 
> Well, lots of concurrent scanning for the same purpose is highly
> inefficient - look at the amount of additional serialisation in the
> inode reclaim walker that is aimed at reducing concurrency to one
> reclaim thread per AG at a time...
> 

Interesting, and I think this touches a bit on what I mean by the
scanning being somewhat orthogonal to the purpose of this patch. If the
scanning does turn out to be a real problem in this particular context,
why not try to improve the efficiency of the scan? We could include this
kind of per-ag locking for internal scans, or perhaps create a new scan
mode that exits after freeing a specified amount of space, etc. 
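
A purely hypothetical sketch of the latter, just to illustrate the idea -
the eof_target/eof_freed fields and the early-exit convention below are
made up and don't exist in the posted patches:

	/* in xfs_inode_free_eofblocks(), after a successful trim */
	if (eofb && eofb->eof_target) {
		eofb->eof_freed += freed;	/* hypothetical accounting */
		if (eofb->eof_freed >= eofb->eof_target)
			return EAGAIN;		/* hypothetical: stop the walk */
	}

That way a write-path caller could ask for just enough blocks to retry
the allocation rather than paying for a full scan every time.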

> I expect that if the serialisation on xfs_flush_inodes() isn't
> sufficient to throttle eofblocks scanning concurrency in cases like
> the above then we'll know about it pretty quickly.
> 

How much space should we expect xfs_flush_inodes() to free up? Using
your example of the unconditional flush followed by the global scan - it
seems like it could throttle things temporarily by allowing some set of
writers to serialize on a flush and hopefully soon after that other
writers can allocate space again. Once we're past that, those flushing
threads head into the scan just the same.

So I guess the question is, will the flush satisfy concurrent writers
long enough for one of the flush inducing threads to get into the scan
and prevent others from queuing further? If so, it seems like a
potential positive if the overhead of the flush is less than the
overhead of the "unbounded" scan in the same scenario. If not, it seems
like it could also be a potential net negative because we'll effectively
queue more threads on the flush that could have avoided allocation
failure were a scan already running and freeing space. I guess that all
depends on how long the flush takes, how much data is in cache, storage
hardware, etc. Something else I'll have to experiment with a little
more I suppose... :)

> > > > simplify things. The helper can run the quota scan and/or the global
> > > > scan based on the data regarding the situation (i.e., the inode and
> > > > associated quota characteristics). This allows us to preserve the notion
> > > > of attempting a lightweight recovery first and a heavyweight recovery
> > > > second, reserving the inode flush for when space is truly tight.
> > > > Thoughts?
> > > 
> > > It's still damn clunky, IMO. It's still trying to work around the
> > > fact that project quotas use ENOSPC rather than EDQUOT, and that
> > > makes the logic rather twisted. And it still has that hidden data
> > > flush in it so it can't really be called lightweight...
> > > 
> > 
> > Ok. Well if the high level logic is the issue, we could still turn that
> > inside out a bit to use your EDQUOT/ENOSPC logic, yet still preserve the
> > standalone eofblocks scan. It might be a few more lines, but perhaps
> > more clear. E.g.:
> > 
> > 	if (ret == -EDQUOT && !enospc) {
> > 		enospc = 1;
> > 		xfs_inode_free_quota_eofblocks(ip);
> > 		goto retry;
> > 	} else if (ret == -ENOSPC && !enospc) {
> > 		if (!scanned) {
> > 			struct xfs_eofblocks eofb = {0};
> > 			...
> > 			scanned = 1;
> > 			xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> > 			goto retry;
> > 		}
> > 
> > 		enospc = 1;
> > 		xfs_flush_inodes(ip->i_mount);
> > 		goto retry;
> > 	}
> > 
> > Thoughts on that?
> 
> Even more convoluted. :/
> 

That's pretty close to your example with the exception of adding the
retry after the scan and the scan/flush order. It loses the EDQUOT/ENOSPC
confusion.

> Look at it this way - I've never been a fan of this "retry write on
> enospc once" code. It's a necessary evil due to the reality of
> locking orders and having to back out of the write far enough to
> safely trigger dirty inode writeback so we *might* be able to write
> this data. Making this code more finicky and tricky is the last
> thing I think we should be doing.
> 
> I don't have any better ideas at this point - I'm still tired and
> jetlagged and need to sort out everything for a merge window pull of
> the current tree, so time for me to think up a creative solution is
> limited. I'm hoping that you might be able to think of a different
> way entirely of doing this "write retry on ENOSPC" that solves all
> these problems simply and easily ;)
> 

Fair enough, that's certainly more helpful. ;) We're probably closer on
this than I thought. I'm not arguing because I think this code is great,
but rather because the behavior we have seems kind of obtuse in
particular situations and can be improved with some reasonably small
changes.

> > We'll handle project quota ENOSPC just as we handle
> > global ENOSPC with respect to the scan and the flush, but we'll still
> > have the opportunity to free up a decent amount of space without
> > initiating I/O. The project quota scan down in free_quota_eofblocks()
> > might be rendered pointless as well.
> 
> Well, therein lies an avenue of investigation: for EOF block
> trimming, why would we need to do a flush first, even for project
> quotas? And if we aren't going to flush data, why does it need to be
> done at this level? Can it be done deep in the quota code where we
> know exactly what quota ran out of space?
> 

This is something I considered way back when first looking at this
(http://oss.sgi.com/archives/xfs/2012-12/msg00112.html). I don't recall
specifically what made that difficult, perhaps lack of enough
information to run a scan in a single context. I'll reset and have
another look at it.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT
  2014-04-02 20:11             ` Brian Foster
@ 2014-04-03 22:18               ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-04-03 22:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Wed, Apr 02, 2014 at 04:11:10PM -0400, Brian Foster wrote:
> On Wed, Apr 02, 2014 at 04:11:03PM +1100, Dave Chinner wrote:
> > On Tue, Apr 01, 2014 at 07:20:23PM -0400, Brian Foster wrote:
> > > On Wed, Apr 02, 2014 at 08:19:26AM +1100, Dave Chinner wrote:
> > > > On Tue, Apr 01, 2014 at 09:55:18AM -0400, Brian Foster wrote:
> > > > Yes, it means that there is a global flush when a project quota runs
> > > > out of space, but it means that we only do it once and we don't burn
> > > > excessive CPU walking radix trees scanning inodes needlessly every
> > > > time we get a storm of processes hammering project quota ENOSPC.
> > > > 
> > > 
> > > It's not clear to me that excess scanning is an issue, but it seems
> > 
> > Have 100 threads hit ENOSPC on the same project quota at the same
> > time on a filesystem with a couple of thousand AGs with a few
> > million cached inodes, and then you'll see the problem....
> > 
> 
> Right, that certainly sounds like a mess if we kicked off a bunch of the
> mini-flush scans. I'm curious how much overhead there would be with the
> flush out of the picture and a scan can actually go ahead and free up
> some space (hopefully) quickly. That's something I'll have to test.
> 
> > > orthogonal to how we organize the enospc logic (at least once the flush
> > > is out of the equation). IOW, we'll invoke the scanner with the same
> > > frequency either way. Or maybe you are just referring to scanning
> > > specifically for the purpose of doing flushes as a waste..?
> > 
> > Well, lots of concurrent scanning for the same purpose is highly
> > inefficient - look at the amount of additional serialisation in the
> > inode reclaim walker that is aimed at reducing concurrency to one
> > reclaim thread per AG at a time...
> > 
> 
> Interesting, and I think this touches a bit on what I mean by the
> scanning being somewhat orthogonal to the purpose of this patch. If the
> scanning does turn out to be a real problem in this particular context,
> why not try to improve the efficiency of the scan? We could include this
> kind of per-ag locking for internal scans, or perhaps create a new scan
> mode that exits after freeing a specified amount of space, etc. 

Well, xfs_flush_inodes() serialises callers, so if we flush before
scan we already effectively have a method of throttling the EOF
block scans ;)
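
i.e. something like this ordering in the retry path (untested sketch,
flag names as in the posted patches):

	} else if (ret == -ENOSPC && !enospc) {
		struct xfs_eofblocks eofb = {0};

		enospc = 1;
		xfs_flush_inodes(ip->i_mount);	/* serialises on the flusher */
		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
		goto retry;
	}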

The idea of the generic scan code, however, is not to do any
serialisation at all. i.e. to be as fast and concurrent as possible.
If a caller needs throttling, then we do that at the caller.

The reclaim code has /interesting/ serialisation requirements - it
has to be as concurrent as possible (because shrinkers can be called
concurrently) but not waste CPU scanning the same inodes
unnecessarily (hence the cursors). But it also needs to throttle
inode allocation to the rate at which we can clean and free them,
and hence the mutexes and synchronous inode writeback to throttle
and back up excessive reclaim concurrency without wasting CPU.
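
For reference, the shape of that serialisation is roughly this
(paraphrased from the reclaim walker, not the exact code):

	/* one reclaim pass per AG at a time, resuming from a cursor */
	if (trylock) {
		if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
			skipped++;	/* someone else is reclaiming this AG */
			continue;
		}
		first_index = pag->pag_ici_reclaim_cursor;
	} else {
		mutex_lock(&pag->pag_ici_reclaim_lock);
	}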

> > I expect that if the serialisation on xfs_flush_inodes() isn't
> > sufficient to throttle eofblocks scanning concurrency in cases like
> > the above then we'll know about it pretty quickly.
> > 
> 
> How much space should we expect xfs_flush_inodes() to free up? Using

Depends on the amount of delalloc space. Assume an average indirect
block reservation of 4 blocks per delalloc region that is flushed.
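
As a rough worked example (numbers invented for illustration): flushing
10,000 dirty delalloc regions at ~4 blocks of indirect reservation each
returns on the order of 40,000 blocks - roughly 160MB on a 4k block size
filesystem - before an eofblocks scan trims any speculative preallocation
beyond EOF on top of that.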

> your example of the unconditional flush followed by the global scan - it
> seems like it could throttle things temporarily by allowing some set of
> writers to serialize on a flush and hopefully soon after that other
> writers can allocate space again. Once we're past that, those flushing
> threads head into the scan just the same.
>
> So I guess the question is, will the flush satisfy concurrent writers
> long enough for one of the flush inducing threads to get into the scan
> and prevent others from queuing further? If so, it seems like a
> potential positive if the overhead of the flush is less than the
> overhead of the "unbounded" scan in the same scenario. If not, it seems
> like it could also be a potential net negative because we'll effectively
> queue more threads on the flush that could have avoided allocation
> failure were a scan already running and freeing space. I guess that all
> depends on how long the flush takes, how much data is in cache, storage
> hardware, etc. Something else I'll have to experiment with a little
> more I suppose... :)

Right, but in general I think that the FIFO serialisation behaviour
of xfs_flush_inodes() will be sufficient to limit the scanning
concurrency, especially if each flush causes a few IOs to be done
and so metes out the start of each new scan to 10-20 new scans
per second....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2014-04-03 22:18 UTC | newest]

Thread overview: 15+ messages
2014-03-28 13:15 [PATCH 0/5] xfs: run eofblocks scan on ENOSPC Brian Foster
2014-03-28 13:15 ` [PATCH 1/5] xfs: do eofb filtering before dirty check Brian Foster
2014-03-28 13:16 ` [PATCH 2/5] xfs: add flush flag to xfs_eofblocks Brian Foster
2014-03-31 21:47   ` Dave Chinner
2014-04-01 13:48     ` Brian Foster
2014-03-28 13:16 ` [PATCH 3/5] xfs: add scan owner field " Brian Foster
2014-03-28 13:16 ` [PATCH 4/5] xfs: run an eofblocks scan on ENOSPC/EDQUOT Brian Foster
2014-03-31 22:22   ` Dave Chinner
2014-04-01 13:55     ` Brian Foster
2014-04-01 21:19       ` Dave Chinner
2014-04-01 23:20         ` Brian Foster
2014-04-02  5:11           ` Dave Chinner
2014-04-02 20:11             ` Brian Foster
2014-04-03 22:18               ` Dave Chinner
2014-03-28 13:16 ` [PATCH 5/5] xfs: squash prealloc while over quota free space as well Brian Foster
