All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 00/11] xfs: introduce the free inode btree
@ 2013-09-03 18:24 Brian Foster
  2013-09-03 18:24 ` [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
                   ` (11 more replies)
  0 siblings, 12 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:24 UTC (permalink / raw)
  To: xfs

Hi all,

This is an RFC for the kernel work to support a free inode btree. The free inode
btree (finobt) is equivalent to the existing inode allocation btree with the
exception that the finobt only tracks inode chunks with at least one free inode.
The purpose is to improve lookups for free inode clusters for inode allocation.

Newly allocated inode chunks by definition contain free inodes and are thus
inserted into the finobt immediately. The record for a previously full inode
chunk is inserted to the finobt when the first inode is freed. A record is
removed from the finobt when the last free inode has been allocated or the
entire chunk is completely deallocated.

Patches 1-3 refactor some ialloc btree code to introduce the new finobt type and
feature bit. Patches 4-7 fix up the transaction handling for inode allocation
and deallocation to account for the new tree. Patches 8-10 add the finobt
management code to insert, remove and modify records as appropriate. Patch 11
fixes growfs to support the finobt.

Thoughts, reviews, flames appreciated.

NOTES:
- This is RFC because it is lightly tested and I don't have much userspace code
at the moment (though most of this code should apply).
- I'm not totally sure about the scope of change in patch 6 (use correct
transaction reservations in xfs_inactive()). I started off minimally altering
the reserved blocks value, but the existing transaction management seemed
strange enough that I ended up with the current patch. I figured I'd start here
and pare it back as necessary if the changes are bogus or lead to other
problems.
- I've yet to do any performance testing to measure the benefit of patch 9.

Brian

Brian Foster (11):
  xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
  xfs: reserve v5 superblock read-only compat. feature bit for finobt
  xfs: support the XFS_BTNUM_FINOBT free inode btree type
  xfs: update inode allocation transaction reservations for finobt
  xfs: update ifree transaction reservations for finobt
  xfs: use correct transaction reservations in xfs_inactive()
  xfs: retry trans reservation on ENOSPC in xfs_inactive()
  xfs: insert newly allocated inode chunks into the finobt
  xfs: use and update the finobt on inode allocation
  xfs: update the finobt on inode free
  xfs: add finobt support to growfs

 fs/xfs/xfs_ag.h           |   7 +-
 fs/xfs/xfs_btree.c        |   6 +-
 fs/xfs/xfs_btree.h        |   3 +
 fs/xfs/xfs_fsops.c        |  32 +++++
 fs/xfs/xfs_ialloc.c       | 343 ++++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_ialloc_btree.c |  37 +++--
 fs/xfs/xfs_ialloc_btree.h |  17 ++-
 fs/xfs/xfs_inode.c        |  82 ++++++-----
 fs/xfs/xfs_itable.c       |   6 +-
 fs/xfs/xfs_log_recover.c  |   2 +
 fs/xfs/xfs_sb.h           |  10 +-
 fs/xfs/xfs_stats.h        |  18 ++-
 fs/xfs/xfs_trans_resv.c   |  14 +-
 fs/xfs/xfs_trans_space.h  |   4 +-
 fs/xfs/xfs_types.h        |   2 +-
 15 files changed, 500 insertions(+), 83 deletions(-)

-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
@ 2013-09-03 18:24 ` Brian Foster
  2013-09-05  0:36   ` Dave Chinner
  2013-09-03 18:24 ` [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:24 UTC (permalink / raw)
  To: xfs

The introduction of the free inode btree (finobt) requires that
xfs_ialloc_btree.c handle multiple trees. Refactor xfs_ialloc_btree.c
so the caller specifies the btree type on cursor initialization to
prepare for addition of the finobt.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c       | 8 ++++----
 fs/xfs/xfs_ialloc_btree.c | 8 +++++---
 fs/xfs/xfs_ialloc_btree.h | 3 ++-
 fs/xfs/xfs_itable.c       | 6 ++++--
 4 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index ccf2fb1..524aa88 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -455,7 +455,7 @@ xfs_ialloc_ag_alloc(
 	/*
 	 * Insert records describing the new inode chunk into the btree.
 	 */
-	cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno, XFS_BTNUM_INO);
 	for (thisino = newino;
 	     thisino < newino + newlen;
 	     thisino += XFS_INODES_PER_CHUNK) {
@@ -701,7 +701,7 @@ xfs_dialloc_ag(
 	ASSERT(pag->pagi_freecount > 0);
 
  restart_pagno:
-	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
 	/*
 	 * If pagino is 0 (this is the root inode allocation) use newino.
 	 * This must work because we've just allocated some.
@@ -1163,7 +1163,7 @@ xfs_difree(
 	/*
 	 * Initialize the cursor.
 	 */
-	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
 
 	error = xfs_check_agi_freecount(cur, agi);
 	if (error)
@@ -1294,7 +1294,7 @@ xfs_imap_lookup(
 	 * we have a record, we need to ensure it contains the inode number
 	 * we are looking up.
 	 */
-	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
 	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &i);
 	if (!error) {
 		if (i)
diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
index 5448eb6..0cdb88b 100644
--- a/fs/xfs/xfs_ialloc_btree.c
+++ b/fs/xfs/xfs_ialloc_btree.c
@@ -50,7 +50,8 @@ xfs_inobt_dup_cursor(
 	struct xfs_btree_cur	*cur)
 {
 	return xfs_inobt_init_cursor(cur->bc_mp, cur->bc_tp,
-			cur->bc_private.a.agbp, cur->bc_private.a.agno);
+			cur->bc_private.a.agbp, cur->bc_private.a.agno,
+			cur->bc_btnum);
 }
 
 STATIC void
@@ -324,7 +325,8 @@ xfs_inobt_init_cursor(
 	struct xfs_mount	*mp,		/* file system mount point */
 	struct xfs_trans	*tp,		/* transaction pointer */
 	struct xfs_buf		*agbp,		/* buffer for agi structure */
-	xfs_agnumber_t		agno)		/* allocation group number */
+	xfs_agnumber_t		agno,		/* allocation group number */
+	xfs_btnum_t		btnum)		/* ialloc or free ino btree */
 {
 	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
 	struct xfs_btree_cur	*cur;
@@ -334,7 +336,7 @@ xfs_inobt_init_cursor(
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
-	cur->bc_btnum = XFS_BTNUM_INO;
+	cur->bc_btnum = btnum;
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 
 	cur->bc_ops = &xfs_inobt_ops;
diff --git a/fs/xfs/xfs_ialloc_btree.h b/fs/xfs/xfs_ialloc_btree.h
index 3ac36b76..ce7a62b 100644
--- a/fs/xfs/xfs_ialloc_btree.h
+++ b/fs/xfs/xfs_ialloc_btree.h
@@ -107,7 +107,8 @@ typedef __be32 xfs_inobt_ptr_t;
 		 ((index) - 1) * sizeof(xfs_inobt_ptr_t)))
 
 extern struct xfs_btree_cur *xfs_inobt_init_cursor(struct xfs_mount *,
-		struct xfs_trans *, struct xfs_buf *, xfs_agnumber_t);
+		struct xfs_trans *, struct xfs_buf *, xfs_agnumber_t,
+		xfs_btnum_t);
 extern int xfs_inobt_maxrecs(struct xfs_mount *, int, int);
 
 extern const struct xfs_buf_ops xfs_inobt_buf_ops;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index b93e14b..e720f8b 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -275,7 +275,8 @@ xfs_bulkstat(
 		/*
 		 * Allocate and initialize a btree cursor for ialloc btree.
 		 */
-		cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno);
+		cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
+					    XFS_BTNUM_INO);
 		irbp = irbuf;
 		irbufend = irbuf + nirbuf;
 		end_of_ag = 0;
@@ -625,7 +626,8 @@ xfs_inumbers(
 				agino = 0;
 				continue;
 			}
-			cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno);
+			cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
+						    XFS_BTNUM_INO);
 			error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_GE,
 						 &tmp);
 			if (error) {
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
  2013-09-03 18:24 ` [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
@ 2013-09-03 18:24 ` Brian Foster
  2013-09-05  0:39   ` Dave Chinner
  2013-09-03 18:25 ` [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:24 UTC (permalink / raw)
  To: xfs

Reserve a v5 read-only compatibility feature bit for the finobt and
create the xfs_sb_version_hasfinobt() helper to determine whether
an fs has the feature enabled.

The finobt does not change existing on-disk structures, but must
remain consistent with the ialloc btree. Modifications from older
kernels would violate that constrant. Therefore, we restrict older
kernels to read-only mounts of finobt-enabled filesystems.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_sb.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_sb.h b/fs/xfs/xfs_sb.h
index 6835b44..c48d95d 100644
--- a/fs/xfs/xfs_sb.h
+++ b/fs/xfs/xfs_sb.h
@@ -585,7 +585,9 @@ xfs_sb_has_compat_feature(
 	return (sbp->sb_features_compat & feature) != 0;
 }
 
-#define XFS_SB_FEAT_RO_COMPAT_ALL 0
+#define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
+#define XFS_SB_FEAT_RO_COMPAT_ALL \
+		(XFS_SB_FEAT_RO_COMPAT_FINOBT)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
@@ -639,6 +641,12 @@ static inline int xfs_sb_version_hasftype(struct xfs_sb *sbp)
 		 (sbp->sb_features2 & XFS_SB_VERSION2_FTYPE));
 }
 
+static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
+{
+	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
+		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
+}
+
 /*
  * end of superblock version macros
  */
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
  2013-09-03 18:24 ` [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
  2013-09-03 18:24 ` [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
  2013-09-05  0:54   ` Dave Chinner
  2013-09-03 18:25 ` [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt Brian Foster
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
  To: xfs

Define the AGI fields for the finobt root/level and add magic
numbers. Update the btree code to add support for the new
XFS_BTNUM_FINOBT inode btree.

The finobt root block is reserved immediately following the inobt
root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
starting AG data block based on whether finobt support is enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ag.h           |  7 ++++++-
 fs/xfs/xfs_btree.c        |  6 ++++--
 fs/xfs/xfs_btree.h        |  3 +++
 fs/xfs/xfs_ialloc.c       |  2 ++
 fs/xfs/xfs_ialloc_btree.c | 29 +++++++++++++++++++++++------
 fs/xfs/xfs_ialloc_btree.h | 14 +++++++++++++-
 fs/xfs/xfs_log_recover.c  |  2 ++
 fs/xfs/xfs_stats.h        | 18 +++++++++++++++++-
 fs/xfs/xfs_types.h        |  2 +-
 9 files changed, 71 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 1cb740a..b85585d 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -166,6 +166,9 @@ typedef struct xfs_agi {
 	__be32		agi_pad32;
 	__be64		agi_lsn;	/* last write sequence */
 
+	__be32		agi_free_root; /* root of the free inode btree */
+	__be32		agi_free_level;/* levels in free inode btree */
+
 	/* structure must be padded to 64 bit alignment */
 } xfs_agi_t;
 
@@ -180,7 +183,9 @@ typedef struct xfs_agi {
 #define	XFS_AGI_NEWINO		0x00000100
 #define	XFS_AGI_DIRINO		0x00000200
 #define	XFS_AGI_UNLINKED	0x00000400
-#define	XFS_AGI_NUM_BITS	11
+#define	XFS_AGI_FREE_ROOT	0x00000800
+#define	XFS_AGI_FREE_LEVEL	0x00001000
+#define	XFS_AGI_NUM_BITS	13
 #define	XFS_AGI_ALL_BITS	((1 << XFS_AGI_NUM_BITS) - 1)
 
 /* disk block (xfs_daddr_t) in the AG */
diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index 7a2b4da..0090e3f 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -45,9 +45,10 @@ kmem_zone_t	*xfs_btree_cur_zone;
  * Btree magic numbers.
  */
 static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
-	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, XFS_BMAP_MAGIC, XFS_IBT_MAGIC },
+	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
+	  XFS_FIBT_MAGIC },
 	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC,
-	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC }
+	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC }
 };
 #define xfs_btree_magic(cur) \
 	xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
@@ -1102,6 +1103,7 @@ xfs_btree_set_refs(
 		xfs_buf_set_ref(bp, XFS_ALLOC_BTREE_REF);
 		break;
 	case XFS_BTNUM_INO:
+	case XFS_BTNUM_FINO:
 		xfs_buf_set_ref(bp, XFS_INO_BTREE_REF);
 		break;
 	case XFS_BTNUM_BMAP:
diff --git a/fs/xfs/xfs_btree.h b/fs/xfs/xfs_btree.h
index c8473c7..939002d 100644
--- a/fs/xfs/xfs_btree.h
+++ b/fs/xfs/xfs_btree.h
@@ -37,6 +37,7 @@ extern kmem_zone_t	*xfs_btree_cur_zone;
 #define	XFS_BTNUM_CNT	((xfs_btnum_t)XFS_BTNUM_CNTi)
 #define	XFS_BTNUM_BMAP	((xfs_btnum_t)XFS_BTNUM_BMAPi)
 #define	XFS_BTNUM_INO	((xfs_btnum_t)XFS_BTNUM_INOi)
+#define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
 
 /*
  * Generic btree header.
@@ -144,6 +145,7 @@ do {    \
 	case XFS_BTNUM_CNT: __XFS_BTREE_STATS_INC(abtc, stat); break;	\
 	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_INC(bmbt, stat); break;	\
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(ibt, stat); break;	\
+	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(fibt, stat); break;	\
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break;	\
 	}       \
 } while (0)
@@ -157,6 +159,7 @@ do {    \
 	case XFS_BTNUM_CNT: __XFS_BTREE_STATS_ADD(abtc, stat, val); break; \
 	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_ADD(bmbt, stat, val); break; \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_ADD(ibt, stat, val); break; \
+	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_ADD(fibt, stat, val); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break;	\
 	}       \
 } while (0)
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 524aa88..5ced506 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -1505,6 +1505,8 @@ xfs_ialloc_log_agi(
 		offsetof(xfs_agi_t, agi_newino),
 		offsetof(xfs_agi_t, agi_dirino),
 		offsetof(xfs_agi_t, agi_unlinked),
+		offsetof(xfs_agi_t, agi_free_root),
+		offsetof(xfs_agi_t, agi_free_level),
 		sizeof(xfs_agi_t)
 	};
 #ifdef DEBUG
diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
index 0cdb88b..7923292 100644
--- a/fs/xfs/xfs_ialloc_btree.c
+++ b/fs/xfs/xfs_ialloc_btree.c
@@ -62,10 +62,18 @@ xfs_inobt_set_root(
 {
 	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
 	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
-
-	agi->agi_root = nptr->s;
-	be32_add_cpu(&agi->agi_level, inc);
-	xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
+	int			fields;
+
+	if (cur->bc_btnum == XFS_BTNUM_INO) {
+		agi->agi_root = nptr->s;
+		be32_add_cpu(&agi->agi_level, inc);
+		fields = XFS_AGI_ROOT | XFS_AGI_LEVEL;
+	} else {
+		agi->agi_free_root = nptr->s;
+		be32_add_cpu(&agi->agi_free_level, inc);
+		fields = XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
+	}
+	xfs_ialloc_log_agi(cur->bc_tp, agbp, fields);
 }
 
 STATIC int
@@ -172,7 +180,10 @@ xfs_inobt_init_ptr_from_cur(
 
 	ASSERT(cur->bc_private.a.agno == be32_to_cpu(agi->agi_seqno));
 
-	ptr->s = agi->agi_root;
+	if (cur->bc_btnum == XFS_BTNUM_INO)
+		ptr->s = agi->agi_root;
+	else
+		ptr->s = agi->agi_free_root;
 }
 
 STATIC __int64_t
@@ -205,6 +216,7 @@ xfs_inobt_verify(
 	 */
 	switch (block->bb_magic) {
 	case cpu_to_be32(XFS_IBT_CRC_MAGIC):
+	case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
 		if (!xfs_sb_version_hascrc(&mp->m_sb))
 			return false;
 		if (!uuid_equal(&block->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid))
@@ -216,6 +228,7 @@ xfs_inobt_verify(
 			return false;
 		/* fall through */
 	case cpu_to_be32(XFS_IBT_MAGIC):
+	case cpu_to_be32(XFS_FIBT_MAGIC):
 		break;
 	default:
 		return 0;
@@ -335,8 +348,12 @@ xfs_inobt_init_cursor(
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
-	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
 	cur->bc_btnum = btnum;
+	if (btnum == XFS_BTNUM_INO)
+		cur->bc_nlevels = be32_to_cpu(agi->agi_level);
+	else
+		cur->bc_nlevels = be32_to_cpu(agi->agi_free_level);
+
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 
 	cur->bc_ops = &xfs_inobt_ops;
diff --git a/fs/xfs/xfs_ialloc_btree.h b/fs/xfs/xfs_ialloc_btree.h
index ce7a62b..33d6dd4 100644
--- a/fs/xfs/xfs_ialloc_btree.h
+++ b/fs/xfs/xfs_ialloc_btree.h
@@ -31,6 +31,8 @@ struct xfs_mount;
  */
 #define	XFS_IBT_MAGIC		0x49414254	/* 'IABT' */
 #define	XFS_IBT_CRC_MAGIC	0x49414233	/* 'IAB3' */
+#define	XFS_FIBT_MAGIC		0x46494254	/* 'FIBT' */
+#define	XFS_FIBT_CRC_MAGIC	0x46494233	/* 'FIB3' */
 
 typedef	__uint64_t	xfs_inofree_t;
 #define	XFS_INODES_PER_CHUNK		(NBBY * sizeof(xfs_inofree_t))
@@ -73,7 +75,17 @@ typedef __be32 xfs_inobt_ptr_t;
  * block numbers in the AG.
  */
 #define	XFS_IBT_BLOCK(mp)		((xfs_agblock_t)(XFS_CNT_BLOCK(mp) + 1))
-#define	XFS_PREALLOC_BLOCKS(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
+#define	XFS_FIBT_BLOCK(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
+
+/*
+ * The first data block of an AG depends on whether the filesystem was formatted
+ * with the finobt feature. If so, account for the finobt reserved root btree
+ * block.
+ */
+#define XFS_PREALLOC_BLOCKS(mp) \
+	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
+	 XFS_FIBT_BLOCK(mp) + 1 : \
+	 XFS_IBT_BLOCK(mp) + 1)
 
 /*
  * Btree block header size depends on a superblock flag.
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 7c0c1fd..b8f16d2 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2088,7 +2088,9 @@ xlog_recover_validate_buf_type(
 			bp->b_ops = &xfs_allocbt_buf_ops;
 			break;
 		case XFS_IBT_CRC_MAGIC:
+		case XFS_FIBT_CRC_MAGIC:
 		case XFS_IBT_MAGIC:
+		case XFS_FIBT_MAGIC:
 			bp->b_ops = &xfs_inobt_buf_ops;
 			break;
 		case XFS_BMAP_CRC_MAGIC:
diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index c03ad38..c8f238b 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -183,7 +183,23 @@ struct xfsstats {
 	__uint32_t		xs_ibt_2_alloc;
 	__uint32_t		xs_ibt_2_free;
 	__uint32_t		xs_ibt_2_moves;
-#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_IBT_V2+6)
+#define XFSSTAT_END_FIBT_V2		(XFSSTAT_END_IBT_V2+15)
+	__uint32_t		xs_fibt_2_lookup;
+	__uint32_t		xs_fibt_2_compare;
+	__uint32_t		xs_fibt_2_insrec;
+	__uint32_t		xs_fibt_2_delrec;
+	__uint32_t		xs_fibt_2_newroot;
+	__uint32_t		xs_fibt_2_killroot;
+	__uint32_t		xs_fibt_2_increment;
+	__uint32_t		xs_fibt_2_decrement;
+	__uint32_t		xs_fibt_2_lshift;
+	__uint32_t		xs_fibt_2_rshift;
+	__uint32_t		xs_fibt_2_split;
+	__uint32_t		xs_fibt_2_join;
+	__uint32_t		xs_fibt_2_alloc;
+	__uint32_t		xs_fibt_2_free;
+	__uint32_t		xs_fibt_2_moves;
+#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_FIBT_V2+6)
 	__uint32_t		xs_qm_dqreclaims;
 	__uint32_t		xs_qm_dqreclaim_misses;
 	__uint32_t		xs_qm_dquot_dups;
diff --git a/fs/xfs/xfs_types.h b/fs/xfs/xfs_types.h
index 82bbc34..65c6e66 100644
--- a/fs/xfs/xfs_types.h
+++ b/fs/xfs/xfs_types.h
@@ -134,7 +134,7 @@ typedef enum {
 
 typedef enum {
 	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_BMAPi, XFS_BTNUM_INOi,
-	XFS_BTNUM_MAX
+	XFS_BTNUM_FINOi, XFS_BTNUM_MAX
 } xfs_btnum_t;
 
 struct xfs_name {
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (2 preceding siblings ...)
  2013-09-03 18:25 ` [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
  2013-09-05  0:59   ` Dave Chinner
  2013-09-03 18:25 ` [RFC PATCH 05/11] xfs: update ifree " Brian Foster
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
  To: xfs

Update inode allocation transaction reservations for the finobt. A
create via record modification requires a log reservation for the
additional finobt record. Any such allocation could result in an
finobt removal if the inode chunk has become fully allocated, thus
we include a reservation for a finobt btree merge as well.
Allocation of a new inode chunk must account for splits in the
finobt as well as the existing ialloc tree.

Also update XFS_IALLOC_SPACE_RES() to reserve data blocks for
finobt split/merge scenarios.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trans_resv.c  | 9 ++++++++-
 fs/xfs/xfs_trans_space.h | 2 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
index a65a3cc..3040dad 100644
--- a/fs/xfs/xfs_trans_resv.c
+++ b/fs/xfs/xfs_trans_resv.c
@@ -272,6 +272,8 @@ xfs_calc_remove_reservation(
  *    the parent directory inode: inode size
  *    the new inode: inode size
  *    the inode btree entry: block size
+ *    the free inode btree entry: block size
+ *    the free inode btree: max depth * block size
  *    the superblock for the nlink flag: sector size
  *    the directory btree: (max depth + v2) * dir block size
  *    the directory inode's bmap btree: (max depth + v2) * block size
@@ -282,7 +284,8 @@ xfs_calc_create_resv_modify(
 {
 	return xfs_calc_inode_res(mp, 2) +
 		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		(uint)XFS_FSB_TO_B(mp, 1) +
+		(uint)XFS_FSB_TO_B(mp, 2) +
+		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
 }
 
@@ -292,6 +295,7 @@ xfs_calc_create_resv_modify(
  *    the superblock for the nlink flag: sector size
  *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
  *    the inode btree: max depth * blocksize
+ *    the free inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
  */
 STATIC uint
@@ -302,6 +306,7 @@ xfs_calc_create_resv_alloc(
 		mp->m_sb.sb_sectsize +
 		xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
 				 XFS_FSB_TO_B(mp, 1));
 }
@@ -320,6 +325,7 @@ __xfs_calc_create_reservation(
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
  *    the superblock for the nlink flag: sector size
  *    the inode btree: max depth * blocksize
+ *    the free inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
  */
 STATIC uint
@@ -329,6 +335,7 @@ xfs_calc_icreate_resv_alloc(
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		mp->m_sb.sb_sectsize +
 		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
 				 XFS_FSB_TO_B(mp, 1));
 }
diff --git a/fs/xfs/xfs_trans_space.h b/fs/xfs/xfs_trans_space.h
index 7d2c920..5dca732 100644
--- a/fs/xfs/xfs_trans_space.h
+++ b/fs/xfs/xfs_trans_space.h
@@ -47,7 +47,7 @@
 #define	XFS_DIRREMOVE_SPACE_RES(mp)	\
 	XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK)
 #define	XFS_IALLOC_SPACE_RES(mp)	\
-	(XFS_IALLOC_BLOCKS(mp) + (mp)->m_in_maxlevels - 1)
+	(XFS_IALLOC_BLOCKS(mp) + 2 * ((mp)->m_in_maxlevels - 1))
 
 /*
  * Space reservation values for various transactions.
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 05/11] xfs: update ifree transaction reservations for finobt
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (3 preceding siblings ...)
  2013-09-03 18:25 ` [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
  2013-09-05  1:00   ` Dave Chinner
  2013-09-03 18:25 ` [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive() Brian Foster
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
  To: xfs

Update the ifree transaction log reservations to support the
finobt. An inode free can now mean an extra record modification,
record removal (i.e., inode chunk being freed) or a record
insertion (i.e., a previously full inode chunk).

Define XFS_IFREE_SPACE_RES() for the inactive transaction to
reserve data blocks for possible finobt merge/split situations.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trans_resv.c  | 5 ++++-
 fs/xfs/xfs_trans_space.h | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
index 3040dad..bbd393a 100644
--- a/fs/xfs/xfs_trans_resv.c
+++ b/fs/xfs/xfs_trans_resv.c
@@ -388,8 +388,10 @@ xfs_calc_symlink_reservation(
  *    the super block free inode counter: sector size
  *    the agi hash list and counters: sector size
  *    the inode btree entry: block size
+ *    the free inode btree entry: block size
  *    the on disk inode before ours in the agi hash list: inode cluster size
  *    the inode btree: max depth * blocksize
+ *    the free inode btree: max depth * block size
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
  */
 STATIC uint
@@ -399,12 +401,13 @@ xfs_calc_ifree_reservation(
 	return XFS_DQUOT_LOGRES(mp) +
 		xfs_calc_inode_res(mp, 1) +
 		xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
-		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(2, XFS_FSB_TO_B(mp, 1)) +
 		MAX((__uint16_t)XFS_FSB_TO_B(mp, 1),
 		    XFS_INODE_CLUSTER_SIZE(mp)) +
 		xfs_calc_buf_res(1, 0) +
 		xfs_calc_buf_res(2 + XFS_IALLOC_BLOCKS(mp) +
 				 mp->m_in_maxlevels, 0) +
+		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
 				 XFS_FSB_TO_B(mp, 1));
 }
diff --git a/fs/xfs/xfs_trans_space.h b/fs/xfs/xfs_trans_space.h
index 5dca732..4fc8c10 100644
--- a/fs/xfs/xfs_trans_space.h
+++ b/fs/xfs/xfs_trans_space.h
@@ -82,5 +82,7 @@
 	(XFS_DIRREMOVE_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl))
 #define	XFS_SYMLINK_SPACE_RES(mp,nl,b)	\
 	(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl) + (b))
+#define XFS_IFREE_SPACE_RES(mp)		\
+	(mp)->m_in_maxlevels
 
 #endif	/* __XFS_TRANS_SPACE_H__ */
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive()
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (4 preceding siblings ...)
  2013-09-03 18:25 ` [RFC PATCH 05/11] xfs: update ifree " Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
  2013-09-05  1:35   ` Dave Chinner
  2013-09-03 18:25 ` [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC " Brian Foster
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
  To: xfs

The transaction allocated in xfs_inactive() can be passed down into
xfs_inactive_symlink() or xfs_itruncate_extents(), both of which
can commit and reallocate a new transaction. This leads to
reservation issues if the transaction is subsequently passed into
xfs_ifree(), which requires a larger reservation to manage the
finobt.

Reorganize xfs_inactive() to commit any transaction handed back
from symlink or truncate processing and unconditionally allocate
a new transaction for xfs_ifree() with the appropriate reservation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c | 71 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 37 insertions(+), 34 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e3d7538..56cbf63 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1679,7 +1679,6 @@ xfs_inactive(
 	int			committed;
 	struct xfs_trans	*tp;
 	struct xfs_mount	*mp;
-	struct xfs_trans_res	*resp;
 	int			error;
 	int			truncate = 0;
 
@@ -1724,33 +1723,39 @@ xfs_inactive(
 	if (error)
 		return VN_INACTIVE_CACHE;
 
-	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
-	resp = (truncate || S_ISLNK(ip->i_d.di_mode)) ?
-		&M_RES(mp)->tr_itruncate : &M_RES(mp)->tr_ifree;
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
 
-	error = xfs_trans_reserve(tp, resp, 0, 0);
-	if (error) {
-		ASSERT(XFS_FORCED_SHUTDOWN(mp));
-		xfs_trans_cancel(tp, 0);
-		return VN_INACTIVE_CACHE;
-	}
+	if (truncate || S_ISLNK(ip->i_d.di_mode)) {
+		tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
 
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, 0);
+		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
+		if (error) {
+			ASSERT(XFS_FORCED_SHUTDOWN(mp));
+			xfs_trans_cancel(tp, 0);
+			return VN_INACTIVE_CACHE;
+		}
 
-	if (S_ISLNK(ip->i_d.di_mode)) {
-		error = xfs_inactive_symlink(ip, &tp);
-		if (error)
-			goto out_cancel;
-	} else if (truncate) {
-		ip->i_d.di_size = 0;
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		xfs_trans_ijoin(tp, ip, 0);
+
+		if (S_ISLNK(ip->i_d.di_mode)) {
+			error = xfs_inactive_symlink(ip, &tp);
+			if (error)
+				goto out_cancel;
+		} else if (truncate) {
+			ip->i_d.di_size = 0;
+			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+			error = xfs_itruncate_extents(&tp, ip, XFS_DATA_FORK,
+						      0);
+			if (error)
+				goto out_cancel;
 
-		error = xfs_itruncate_extents(&tp, ip, XFS_DATA_FORK, 0);
+			ASSERT(ip->i_d.di_nextents == 0);
+		}
+
+		error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES);
 		if (error)
 			goto out_cancel;
-
-		ASSERT(ip->i_d.di_nextents == 0);
 	}
 
 	/*
@@ -1762,27 +1767,25 @@ xfs_inactive(
 	if (ip->i_d.di_anextents > 0) {
 		ASSERT(ip->i_d.di_forkoff != 0);
 
-		error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES);
-		if (error)
-			goto out_unlock;
-
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 		error = xfs_attr_inactive(ip);
 		if (error)
 			goto out;
 
-		tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
-		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
-		if (error) {
-			xfs_trans_cancel(tp, 0);
-			goto out;
-		}
-
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		xfs_trans_ijoin(tp, ip, 0);
 	}
 
+	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
+	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
+				  XFS_IFREE_SPACE_RES(mp), 0);
+	if (error) {
+		xfs_trans_cancel(tp, 0);
+		goto out_unlock;
+	}
+
+	xfs_trans_ijoin(tp, ip, 0);
+
 	if (ip->i_afp)
 		xfs_idestroy_fork(ip, XFS_ATTR_FORK);
 
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC in xfs_inactive()
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (5 preceding siblings ...)
  2013-09-03 18:25 ` [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive() Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
  2013-09-05  1:40   ` Dave Chinner
  2013-09-03 18:25 ` [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
  To: xfs

An ifree data block reservation can fail with ENOSPC. Flush inodes
to try and free up space or attempt without a data block
reservation to avoid failing out of xfs_inactive().

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 56cbf63..92de4b7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1779,7 +1779,18 @@ xfs_inactive(
 	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
 	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
 				  XFS_IFREE_SPACE_RES(mp), 0);
+	if (error == ENOSPC) {
+		/* flush outstanding delalloc blocks and retry */
+		xfs_flush_inodes(mp);
+		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
+					  XFS_IFREE_SPACE_RES(mp), 0);
+	}
+	if (error == ENOSPC) {
+		/* resort to a no alloc. reservation */
+		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
+	}
 	if (error) {
+		ASSERT(XFS_FORCED_SHUTDOWN(mp));
 		xfs_trans_cancel(tp, 0);
 		goto out_unlock;
 	}
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (6 preceding siblings ...)
  2013-09-03 18:25 ` [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC " Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
  2013-09-05  2:10   ` Dave Chinner
  2013-09-03 18:25 ` [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation Brian Foster
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
  To: xfs

A newly allocated inode chunk, by definition, has at least one
free inode, so a record is always inserted into the finobt.

Create the xfs_inobt_insert() helper from existing code to insert
a record in an inobt based on the provided BTNUM. Update
xfs_ialloc_ag_alloc() to invoke the helper for the existing
XFS_BTNUM_INO tree and XFS_BTNUM_FINO tree, if enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 79 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 56 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 5ced506..e64a728 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -152,6 +152,52 @@ xfs_check_agi_freecount(
 #endif
 
 /*
+ * Insert records describing a newly allocated inode chunk into the inobt.
+ */
+STATIC int
+xfs_inobt_insert(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_agino_t		newino,
+	xfs_agino_t		newlen,
+	xfs_btnum_t		btnum)
+{
+	struct xfs_btree_cur	*cur;
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t		agno = be32_to_cpu(agi->agi_seqno);
+	xfs_agino_t		thisino;
+	int			i;
+	int			error;
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
+
+	for (thisino = newino;
+	     thisino < newino + newlen;
+	     thisino += XFS_INODES_PER_CHUNK) {
+		cur->bc_rec.i.ir_startino = thisino;
+		cur->bc_rec.i.ir_freecount = XFS_INODES_PER_CHUNK;
+		cur->bc_rec.i.ir_free = XFS_INOBT_ALL_FREE;
+		error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, &i);
+		if (error) {
+			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+			return error;
+		}
+		ASSERT(i == 0);
+		error = xfs_btree_insert(cur, &i);
+		if (error) {
+			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+			return error;
+		}
+		ASSERT(i == 1);
+	}
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	return 0;
+}
+
+/*
  * Initialise a new set of inodes. When called without a transaction context
  * (e.g. from recovery) we initiate a delayed write of the inode buffers rather
  * than logging them (which in a transaction context puts them into the AIL
@@ -309,13 +355,10 @@ xfs_ialloc_ag_alloc(
 {
 	xfs_agi_t	*agi;		/* allocation group header */
 	xfs_alloc_arg_t	args;		/* allocation argument structure */
-	xfs_btree_cur_t	*cur;		/* inode btree cursor */
 	xfs_agnumber_t	agno;
 	int		error;
-	int		i;
 	xfs_agino_t	newino;		/* new first inode's number */
 	xfs_agino_t	newlen;		/* new number of inodes */
-	xfs_agino_t	thisino;	/* current inode number, for loop */
 	int		isaligned = 0;	/* inode allocation at stripe unit */
 					/* boundary */
 	struct xfs_perag *pag;
@@ -453,29 +496,19 @@ xfs_ialloc_ag_alloc(
 	agi->agi_newino = cpu_to_be32(newino);
 
 	/*
-	 * Insert records describing the new inode chunk into the btree.
+	 * Insert records describing the new inode chunk into the btrees.
 	 */
-	cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno, XFS_BTNUM_INO);
-	for (thisino = newino;
-	     thisino < newino + newlen;
-	     thisino += XFS_INODES_PER_CHUNK) {
-		cur->bc_rec.i.ir_startino = thisino;
-		cur->bc_rec.i.ir_freecount = XFS_INODES_PER_CHUNK;
-		cur->bc_rec.i.ir_free = XFS_INOBT_ALL_FREE;
-		error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, &i);
-		if (error) {
-			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
-			return error;
-		}
-		ASSERT(i == 0);
-		error = xfs_btree_insert(cur, &i);
-		if (error) {
-			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
+				 XFS_BTNUM_INO);
+	if (error)
+		return error;
+
+	if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
+		error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
+					 XFS_BTNUM_FINO);
+		if (error)
 			return error;
-		}
-		ASSERT(i == 1);
 	}
-	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 	/*
 	 * Log allocation group header fields
 	 */
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (7 preceding siblings ...)
  2013-09-03 18:25 ` [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
  2013-09-05  2:27   ` Dave Chinner
  2013-09-03 18:25 ` [RFC PATCH 10/11] xfs: update the finobt on inode free Brian Foster
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
  To: xfs

Replace xfs_dialloc_ag() with an implementation that looks for a
record in the finobt. The finobt only tracks records with at least
one free inode. This eliminates the need for the intra-ag scan in
the original algorithm. Once the inode is allocated, update the
finobt appropriately (possibly removing the record) as well as the
inobt.

Move the original xfs_dialloc_ag() algorithm to
xfs_dialloc_ag_slow() and fall back as such if finobt support is
not enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 135 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index e64a728..516f4af 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -708,7 +708,7 @@ xfs_ialloc_get_rec(
  * available.
  */
 STATIC int
-xfs_dialloc_ag(
+xfs_dialloc_ag_slow(
 	struct xfs_trans	*tp,
 	struct xfs_buf		*agbp,
 	xfs_ino_t		parent,
@@ -966,6 +966,140 @@ error0:
 	return error;
 }
 
+STATIC int
+xfs_dialloc_ag(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_ino_t		parent,
+	xfs_ino_t		*inop)
+{
+	struct xfs_mount		*mp = tp->t_mountp;
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	xfs_agino_t			pagino = XFS_INO_TO_AGINO(mp, parent);
+	struct xfs_perag		*pag;
+	struct xfs_btree_cur		*fcur;
+	struct xfs_btree_cur		*icur;
+	struct xfs_inobt_rec_incore	frec;
+	struct xfs_inobt_rec_incore	irec;
+	xfs_ino_t			ino;
+	int				error;
+	int				offset;
+	int				i;
+
+	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+		return xfs_dialloc_ag_slow(tp, agbp, parent, inop);
+
+	pag = xfs_perag_get(mp, agno);
+
+	/*
+	 * If pagino is 0 (this is the root inode allocation) use newino.
+	 * This must work because we've just allocated some.
+	 */
+	if (!pagino)
+		pagino = be32_to_cpu(agi->agi_newino);
+
+	fcur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
+	icur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
+
+	error = xfs_check_agi_freecount(fcur, agi);
+	if (error)
+		goto error;
+	error = xfs_check_agi_freecount(icur, agi);
+	if (error)
+		goto error;
+
+	/*
+	 * Search the finobt.
+	 */
+	error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_LE, &i);
+	if (error)
+		goto error;
+	if (i == 0) {
+		error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_GE, &i);
+		if (error)
+			goto error;
+		XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+	}
+
+	error = xfs_inobt_get_rec(fcur, &frec, &i);
+	if (error)
+		goto error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+
+	offset = xfs_lowbit64(frec.ir_free);
+	ASSERT(offset >= 0);
+	ASSERT(offset < XFS_INODES_PER_CHUNK);
+	ASSERT((XFS_AGINO_TO_OFFSET(mp, frec.ir_startino) %
+				   XFS_INODES_PER_CHUNK) == 0);
+	ino = XFS_AGINO_TO_INO(mp, agno, frec.ir_startino + offset);
+
+	/*
+	 * Modify or remove the finobt record.
+	 */
+	frec.ir_free &= ~XFS_INOBT_MASK(offset);
+	frec.ir_freecount--;
+	if (frec.ir_freecount) 
+		error = xfs_inobt_update(fcur, &frec);
+	else
+		error = xfs_btree_delete(fcur, &i);
+	if (error)
+		goto error;
+
+	/*
+	 * Lookup and modify the equivalent record in the inobt.
+	 */
+	error = xfs_inobt_lookup(icur, frec.ir_startino, XFS_LOOKUP_EQ, &i);
+	if (error)
+		goto error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+
+	error = xfs_inobt_get_rec(icur, &irec, &i);
+	if (error)
+		goto error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+	ASSERT((XFS_AGINO_TO_OFFSET(mp, irec.ir_startino) %
+				   XFS_INODES_PER_CHUNK) == 0);
+
+	irec.ir_free &= ~XFS_INOBT_MASK(offset);
+	irec.ir_freecount--;
+
+	XFS_WANT_CORRUPTED_GOTO((frec.ir_free == irec.ir_free) &&
+				(frec.ir_freecount == irec.ir_freecount),
+				error);
+
+	error = xfs_inobt_update(icur, &irec);
+	if (error)
+		goto error;
+
+	/*
+	 * Update the perag and superblock.
+	 */
+	be32_add_cpu(&agi->agi_freecount, -1);
+	xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
+	pag->pagi_freecount--;
+
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
+	xfs_perag_put(pag);
+
+	error = xfs_check_agi_freecount(fcur, agi);
+	if (error)
+		goto error;
+	error = xfs_check_agi_freecount(icur, agi);
+	if (error)
+		goto error;
+
+	xfs_btree_del_cursor(icur, XFS_BTREE_NOERROR);
+	xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
+	*inop = ino;
+	return 0;
+error:
+	xfs_perag_put(pag);
+	xfs_btree_del_cursor(icur, XFS_BTREE_ERROR);
+	xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
+	return error;
+}
+
 /*
  * Allocate an inode on disk.
  *
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 10/11] xfs: update the finobt on inode free
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (8 preceding siblings ...)
  2013-09-03 18:25 ` [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
  2013-09-05  2:54   ` Dave Chinner
  2013-09-03 18:25 ` [RFC PATCH 11/11] xfs: add finobt support to growfs Brian Foster
  2013-09-05 21:17 ` [RFC PATCH 00/11] xfs: introduce the free inode btree Michael L. Semon
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
  To: xfs

An inode free operation can have several effects on the finobt. If
all inodes have been freed and the chunk deallocated, we remove the
finobt record. If the inode chunk was previously full, we must
insert a new record based on the existing inobt record. Otherwise,
we modify the record in place.

Create the xfs_ifree_finobt() function to identify the potential
scenarios and update the finobt appropriately.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 120 insertions(+)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 516f4af..96f71b5 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -198,6 +198,117 @@ xfs_inobt_insert(
 }
 
 /*
+ * Free an inode in the free inode btree.
+ */
+STATIC int
+xfs_ifree_finobt(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	struct xfs_buf			*agbp,
+	struct xfs_inobt_rec_incore	*ibtrec,/* inobt record */
+	int				offset)	/* offset of inode */
+{
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	struct xfs_btree_cur		*cur;
+	struct xfs_inobt_rec_incore	rec;
+	int				error;
+	int				i;
+
+	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+		return 0;
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
+
+	error = xfs_inobt_lookup(cur, ibtrec->ir_startino, XFS_LOOKUP_EQ, &i);
+	if (error)
+		goto error;
+
+	if (i == 1) {
+		int j;
+		/*
+		 * Read and update the existing record.
+		 */
+		error = xfs_inobt_get_rec(cur, &rec, &j);
+		if (error)
+			goto error;
+		XFS_WANT_CORRUPTED_GOTO(j == 1, error);
+
+		rec.ir_free |= XFS_INOBT_MASK(offset);
+		rec.ir_freecount++;
+
+		XFS_WANT_CORRUPTED_GOTO((rec.ir_free == ibtrec->ir_free) &&
+					(rec.ir_freecount == ibtrec->ir_freecount),
+					error);
+	}
+
+	/*
+	 * The content of inobt records should always match between the inobt
+	 * and finobt. The lifecycle of records in the finobt is different from
+	 * the inobt in that the finobt only tracks records with at least one
+	 * free inode. This is to optimize lookup for inode allocation purposes.
+	 * The following checks fix up the finobt appropriately based on the
+	 * state of the record subsequent to the current operation.
+	 */
+
+	if ((i == 1) &&
+	    (rec.ir_freecount == XFS_IALLOC_INODES(mp) &&
+	     !(mp->m_flags & XFS_MOUNT_IKEEP))) {
+		/*
+		 * We have an existing finobt record. If all inodes are free
+		 * and we're in !ikeep mode, the entire inode chunk has been
+		 * deallocated. Remove the record from the finobt.
+		 */
+		error = xfs_btree_delete(cur, &i);
+		if (error)
+			goto error;
+		ASSERT(i == 1);
+	} else if ((i == 0) && (ibtrec->ir_freecount == 1)) {
+		/*
+		 * No existing finobt record and the inobt record has a single
+		 * free inode. This means we've freed an inode in a previously
+		 * fully allocated chunk. Insert a new record into the finobt
+		 * based on the current inobt record.
+		 */
+		cur->bc_rec.i.ir_startino = ibtrec->ir_startino;
+		cur->bc_rec.i.ir_free = ibtrec->ir_free;
+		cur->bc_rec.i.ir_freecount = ibtrec->ir_freecount;
+		error = xfs_btree_insert(cur, &i);
+		if (error)
+			goto error;
+		ASSERT(i == 1);
+	} else if (i == 1) {
+		/*
+		 * The existing finobt record was modified and has a combination
+		 * of allocated and free inodes or is completely free and ikeep
+		 * is enabled. Update the record.
+		 */
+		error = xfs_inobt_update(cur, &rec);
+		if (error)
+			goto error;
+	} else {
+		/* somehow out of sync */
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
+				     agbp->b_addr);
+		ASSERT(0);
+
+		error = XFS_ERROR(EFSCORRUPTED);
+		goto error;
+	}
+
+	error = xfs_check_agi_freecount(cur, agi);
+	if (error)
+		goto error;
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+
+error:
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return error;
+}
+
+/*
  * Initialise a new set of inodes. When called without a transaction context
  * (e.g. from recovery) we initiate a delayed write of the inode buffers rather
  * than logging them (which in a transaction context puts them into the AIL
@@ -1422,6 +1533,15 @@ xfs_difree(
 	if (error)
 		goto error0;
 
+	/*
+	 * Fix up the free inode btree.
+	 */
+	if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+		error = xfs_ifree_finobt(mp, tp, agbp, &rec, off);
+		if (error)
+			goto error0;
+	}
+
 	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 	return 0;
 
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 11/11] xfs: add finobt support to growfs
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (9 preceding siblings ...)
  2013-09-03 18:25 ` [RFC PATCH 10/11] xfs: update the finobt on inode free Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
  2013-09-05  2:55   ` Dave Chinner
  2013-09-05 21:17 ` [RFC PATCH 00/11] xfs: introduce the free inode btree Michael L. Semon
  11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
  To: xfs

Add finobt support to growfs. Initialize the agi root/level fields
and the root finobt block.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_fsops.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index e64ee52..2e6ef6d 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -309,6 +309,10 @@ xfs_growfs_data_private(
 		agi->agi_dirino = cpu_to_be32(NULLAGINO);
 		if (xfs_sb_version_hascrc(&mp->m_sb))
 			uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_uuid);
+		if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+			agi->agi_free_root = cpu_to_be32(XFS_FIBT_BLOCK(mp));
+			agi->agi_free_level = cpu_to_be32(1);
+		}
 		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++)
 			agi->agi_unlinked[bucket] = cpu_to_be32(NULLAGINO);
 
@@ -400,6 +404,34 @@ xfs_growfs_data_private(
 		xfs_buf_relse(bp);
 		if (error)
 			goto error0;
+
+		/*
+		 * FINO btree root block
+		 */
+		if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+			bp = xfs_growfs_get_hdr_buf(mp,
+				XFS_AGB_TO_DADDR(mp, agno, XFS_FIBT_BLOCK(mp)),
+				BTOBB(mp->m_sb.sb_blocksize), 0,
+				&xfs_inobt_buf_ops);
+			if (!bp) {
+				error = ENOMEM;
+				goto error0;
+			}
+
+			if (xfs_sb_version_hascrc(&mp->m_sb))
+				xfs_btree_init_block(mp, bp, XFS_FIBT_CRC_MAGIC,
+						     0, 0, agno,
+						     XFS_BTREE_CRC_BLOCKS);
+			else
+				xfs_btree_init_block(mp, bp, XFS_FIBT_MAGIC, 0,
+						     0, agno, 0);
+
+			error = xfs_bwrite(bp);
+			xfs_buf_relse(bp);
+			if (error)
+				goto error0;
+		}
+
 	}
 	xfs_trans_agblocks_delta(tp, nfree);
 	/*
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
  2013-09-03 18:24 ` [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
@ 2013-09-05  0:36   ` Dave Chinner
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  0:36 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:24:58PM -0400, Brian Foster wrote:
> The introduction of the free inode btree (finobt) requires that
> xfs_ialloc_btree.c handle multiple trees. Refactor xfs_ialloc_btree.c
> so the caller specifies the btree type on cursor initialization to
> prepare for addition of the finobt.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Simple enough, obvious place to start. :)

Reviewed-by: Dave Chinner <david@fromorbit.com>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt
  2013-09-03 18:24 ` [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
@ 2013-09-05  0:39   ` Dave Chinner
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  0:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:24:59PM -0400, Brian Foster wrote:
> Reserve a v5 read-only compatibility feature bit for the finobt and
> create the xfs_sb_version_hasfinobt() helper to determine whether
> an fs has the feature enabled.
> 
> The finobt does not change existing on-disk structures, but must
> remain consistent with the ialloc btree. Modifications from older
> kernels would violate that constrant. Therefore, we restrict older
> kernels to read-only mounts of finobt-enabled filesystems.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_sb.h | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_sb.h b/fs/xfs/xfs_sb.h
> index 6835b44..c48d95d 100644
> --- a/fs/xfs/xfs_sb.h
> +++ b/fs/xfs/xfs_sb.h
> @@ -585,7 +585,9 @@ xfs_sb_has_compat_feature(
>  	return (sbp->sb_features_compat & feature) != 0;
>  }
>  
> -#define XFS_SB_FEAT_RO_COMPAT_ALL 0
> +#define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
> +#define XFS_SB_FEAT_RO_COMPAT_ALL \
> +		(XFS_SB_FEAT_RO_COMPAT_FINOBT)
>  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL

The only thing I'd suggest here is that the last patch in the series
should add the XFS_SB_FEAT_RO_COMPAT_FINOBT bit to the
XFS_SB_FEAT_RO_COMPAT_ALL mask.

Otherwise we can have the problem of bisects landing in the middle
of the series and thinking that the feature is fully supported when
it isn't. So it's fine to add the xfs_sb_version_hasfinobt() helper
here and define the bit but don't add it to the supported mask
until all the changes for the feature are complete.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
  2013-09-03 18:25 ` [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
@ 2013-09-05  0:54   ` Dave Chinner
  2013-09-05 16:17     ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  0:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:25:00PM -0400, Brian Foster wrote:
> Define the AGI fields for the finobt root/level and add magic
> numbers. Update the btree code to add support for the new
> XFS_BTNUM_FINOBT inode btree.
> 
> The finobt root block is reserved immediately following the inobt
> root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
> starting AG data block based on whether finobt support is enabled.

A few minor things...

> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_ag.h           |  7 ++++++-
>  fs/xfs/xfs_btree.c        |  6 ++++--
>  fs/xfs/xfs_btree.h        |  3 +++
>  fs/xfs/xfs_ialloc.c       |  2 ++
>  fs/xfs/xfs_ialloc_btree.c | 29 +++++++++++++++++++++++------
>  fs/xfs/xfs_ialloc_btree.h | 14 +++++++++++++-
>  fs/xfs/xfs_log_recover.c  |  2 ++
>  fs/xfs/xfs_stats.h        | 18 +++++++++++++++++-
>  fs/xfs/xfs_types.h        |  2 +-
>  9 files changed, 71 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
> index 1cb740a..b85585d 100644
> --- a/fs/xfs/xfs_ag.h
> +++ b/fs/xfs/xfs_ag.h
> @@ -166,6 +166,9 @@ typedef struct xfs_agi {
>  	__be32		agi_pad32;
>  	__be64		agi_lsn;	/* last write sequence */
>  
> +	__be32		agi_free_root; /* root of the free inode btree */
> +	__be32		agi_free_level;/* levels in free inode btree */
> +
>  	/* structure must be padded to 64 bit alignment */
>  } xfs_agi_t;

That's fine, but...
>  
> @@ -180,7 +183,9 @@ typedef struct xfs_agi {
>  #define	XFS_AGI_NEWINO		0x00000100
>  #define	XFS_AGI_DIRINO		0x00000200
>  #define	XFS_AGI_UNLINKED	0x00000400
> -#define	XFS_AGI_NUM_BITS	11
> +#define	XFS_AGI_FREE_ROOT	0x00000800
> +#define	XFS_AGI_FREE_LEVEL	0x00001000
> +#define	XFS_AGI_NUM_BITS	13
>  #define	XFS_AGI_ALL_BITS	((1 << XFS_AGI_NUM_BITS) - 1)

This is a bit of a problem, because the range logging bits will now
cause the entire AGI to be logged (including all the unlinked list
hash) because these only define the first/last offsets to be
logged...

> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index 524aa88..5ced506 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -1505,6 +1505,8 @@ xfs_ialloc_log_agi(
>  		offsetof(xfs_agi_t, agi_newino),
>  		offsetof(xfs_agi_t, agi_dirino),
>  		offsetof(xfs_agi_t, agi_unlinked),
> +		offsetof(xfs_agi_t, agi_free_root),
> +		offsetof(xfs_agi_t, agi_free_level),
>  		sizeof(xfs_agi_t)
>  	};

Because of how this table works.

What we really need here is for xfs_ialloc_log_agi to consider that
there are two distinct regions for range logging - the first spaces
from offset 0 to offset of agi_unlinked, and the second is from the
the offset of agi_free_root to the end of the xfs_agi_t....

It's abit messy, I know, but we couldn't easily add new padding to
the AGI in the existing range logging area like was done for the AGF
because of the unlinked list hash table already defining the end of
the range logging region....

>  #ifdef DEBUG
> diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
> index 0cdb88b..7923292 100644
> --- a/fs/xfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/xfs_ialloc_btree.c
> @@ -62,10 +62,18 @@ xfs_inobt_set_root(
>  {
>  	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
>  	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
> -
> -	agi->agi_root = nptr->s;
> -	be32_add_cpu(&agi->agi_level, inc);
> -	xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
> +	int			fields;
> +
> +	if (cur->bc_btnum == XFS_BTNUM_INO) {
> +		agi->agi_root = nptr->s;
> +		be32_add_cpu(&agi->agi_level, inc);
> +		fields = XFS_AGI_ROOT | XFS_AGI_LEVEL;
> +	} else {
> +		agi->agi_free_root = nptr->s;
> +		be32_add_cpu(&agi->agi_free_level, inc);
> +		fields = XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
> +	}
> +	xfs_ialloc_log_agi(cur->bc_tp, agbp, fields);
>  }

I suspect that it would be better to have separate functions for
these differences i.e. xfs_inobt_set_root() and
xfs_finobt_set_root(), and set up separate btree ops structure
forthe two different trees. Most of the code is still identical,
but the differences in root structures can easily be handled without
putting switches in the code everywhere.

>  
>  STATIC int
> @@ -172,7 +180,10 @@ xfs_inobt_init_ptr_from_cur(
>  
>  	ASSERT(cur->bc_private.a.agno == be32_to_cpu(agi->agi_seqno));
>  
> -	ptr->s = agi->agi_root;
> +	if (cur->bc_btnum == XFS_BTNUM_INO)
> +		ptr->s = agi->agi_root;
> +	else
> +		ptr->s = agi->agi_free_root;
>  }

Like this...

>  
>  STATIC __int64_t
> @@ -205,6 +216,7 @@ xfs_inobt_verify(
>  	 */
>  	switch (block->bb_magic) {
>  	case cpu_to_be32(XFS_IBT_CRC_MAGIC):
> +	case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
>  		if (!xfs_sb_version_hascrc(&mp->m_sb))
>  			return false;
>  		if (!uuid_equal(&block->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid))
> @@ -216,6 +228,7 @@ xfs_inobt_verify(
>  			return false;
>  		/* fall through */
>  	case cpu_to_be32(XFS_IBT_MAGIC):
> +	case cpu_to_be32(XFS_FIBT_MAGIC):
>  		break;
>  	default:
>  		return 0;
> @@ -335,8 +348,12 @@ xfs_inobt_init_cursor(
>  
>  	cur->bc_tp = tp;
>  	cur->bc_mp = mp;
> -	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
>  	cur->bc_btnum = btnum;
> +	if (btnum == XFS_BTNUM_INO)
> +		cur->bc_nlevels = be32_to_cpu(agi->agi_level);
> +	else
> +		cur->bc_nlevels = be32_to_cpu(agi->agi_free_level);
> +
>  	cur->bc_blocklog = mp->m_sb.sb_blocklog;
>  
>  	cur->bc_ops = &xfs_inobt_ops;

And this is where you do the check on the btnum and set the
appropriate ops structure....

>  #define	XFS_INODES_PER_CHUNK		(NBBY * sizeof(xfs_inofree_t))
> @@ -73,7 +75,17 @@ typedef __be32 xfs_inobt_ptr_t;
>   * block numbers in the AG.
>   */
>  #define	XFS_IBT_BLOCK(mp)		((xfs_agblock_t)(XFS_CNT_BLOCK(mp) + 1))
> -#define	XFS_PREALLOC_BLOCKS(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
> +#define	XFS_FIBT_BLOCK(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
> +
> +/*
> + * The first data block of an AG depends on whether the filesystem was formatted
> + * with the finobt feature. If so, account for the finobt reserved root btree
> + * block.
> + */
> +#define XFS_PREALLOC_BLOCKS(mp) \
> +	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
> +	 XFS_FIBT_BLOCK(mp) + 1 : \
> +	 XFS_IBT_BLOCK(mp) + 1)

Yup, that looks sensible, with a nice comment to explain it :)

> diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
> index c03ad38..c8f238b 100644
> --- a/fs/xfs/xfs_stats.h
> +++ b/fs/xfs/xfs_stats.h
> @@ -183,7 +183,23 @@ struct xfsstats {
>  	__uint32_t		xs_ibt_2_alloc;
>  	__uint32_t		xs_ibt_2_free;
>  	__uint32_t		xs_ibt_2_moves;
> -#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_IBT_V2+6)
> +#define XFSSTAT_END_FIBT_V2		(XFSSTAT_END_IBT_V2+15)
> +	__uint32_t		xs_fibt_2_lookup;
> +	__uint32_t		xs_fibt_2_compare;
> +	__uint32_t		xs_fibt_2_insrec;
> +	__uint32_t		xs_fibt_2_delrec;
> +	__uint32_t		xs_fibt_2_newroot;
> +	__uint32_t		xs_fibt_2_killroot;
> +	__uint32_t		xs_fibt_2_increment;
> +	__uint32_t		xs_fibt_2_decrement;
> +	__uint32_t		xs_fibt_2_lshift;
> +	__uint32_t		xs_fibt_2_rshift;
> +	__uint32_t		xs_fibt_2_split;
> +	__uint32_t		xs_fibt_2_join;
> +	__uint32_t		xs_fibt_2_alloc;
> +	__uint32_t		xs_fibt_2_free;
> +	__uint32_t		xs_fibt_2_moves;
> +#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_FIBT_V2+6)
>  	__uint32_t		xs_qm_dqreclaims;
>  	__uint32_t		xs_qm_dqreclaim_misses;
>  	__uint32_t		xs_qm_dquot_dups;

I didn't see an equivalent change to add these new stats to the proc
file output (ie. in xfs_stat_proc_show()). maybe I just missed it,
but if I didn't, can you add it?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt
  2013-09-03 18:25 ` [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt Brian Foster
@ 2013-09-05  0:59   ` Dave Chinner
  2013-09-05 16:17     ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  0:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:25:01PM -0400, Brian Foster wrote:
> Update inode allocation transaction reservations for the finobt. A
> create via record modification requires a log reservation for the
> additional finobt record. Any such allocation could result in an
> finobt removal if the inode chunk has become fully allocated, thus
> we include a reservation for a finobt btree merge as well.
> Allocation of a new inode chunk must account for splits in the
> finobt as well as the existing ialloc tree.

These transaction reservation changes are only necessary for
filesystems with free inode btrees, otherwise they just use more log
space than is necessary.

Can you add helper functions for the free inode btree reservations,
and have them return 0 when the feature is not enabled? That way the
code stays pretty clean, is self documenting and doesn't take
unnecessary space when the feature is not enabled....

> Also update XFS_IALLOC_SPACE_RES() to reserve data blocks for
> finobt split/merge scenarios.

Needs to handle the enabled/disabled case, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 05/11] xfs: update ifree transaction reservations for finobt
  2013-09-03 18:25 ` [RFC PATCH 05/11] xfs: update ifree " Brian Foster
@ 2013-09-05  1:00   ` Dave Chinner
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  1:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:25:02PM -0400, Brian Foster wrote:
> Update the ifree transaction log reservations to support the
> finobt. An inode free can now mean an extra record modification,
> record removal (i.e., inode chunk being freed) or a record
> insertion (i.e., a previously full inode chunk).
> 
> Define XFS_IFREE_SPACE_RES() for the inactive transaction to
> reserve data blocks for possible finobt merge/split situations.

Same comment about doing this with helper functions...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive()
  2013-09-03 18:25 ` [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive() Brian Foster
@ 2013-09-05  1:35   ` Dave Chinner
  2013-09-05 16:18     ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  1:35 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:25:03PM -0400, Brian Foster wrote:
> The transaction allocated in xfs_inactive() can be passed down into
> xfs_inactive_symlink() or xfs_itruncate_extents(), both of which
> can commit and reallocate a new transaction. This leads to
> reservation issues if the transaction is subsequently passed into
> xfs_ifree(), which requires a larger reservation to manage the
> finobt.
> 
> Reorganize xfs_inactive() to commit any transaction handed back
> from symlink or truncate processing and unconditionally allocate
> a new transaction for xfs_ifree() with the appropriate reservation.

Ok, I've had a bit of a look at this now, and I like how the code
turns out. However, I don't think it goes far enough, or fix the
problem that causes all the transaction nastiness in xfs_inactive().

Firstly, we are not doing rolling transactions here - there is no
need for all the changes to be atomic because the inode is on the
unlinked list if it is going to be freed. Hence we don't need to
pass transaction pointers around.

xfs_inactive_symlink() can do a transaction completely internally,
and, well, it doesn't even log the inode if the symlink is in-line
and so may not even need a transaction. Hence really only
xfs_inactive_symlink_rmt() needs to run a transaction, and it can do
that internally just fine.

For the xfs_itruncate_extents() data fork transaction, just add a
new wrapper called xfs_inactive_truncate() that holds the
transaction context internally - that moves the only other
transaction context that you need to commit out of xfs_inactive()
altogether, as the attr fork already uses a private transaction
context.

And, finally, you can then factor the xfs_ifree() and it's
transaction context into a helper function as well, so there aren't
any transaction contexts left in xfs_inactive() at all.

That would leave us with:

	if (ISLNK) {
		error = xfs_inactive_symlink(ip);
	} else if (truncate)
		error = xfs_inactive_truncate(ip);
	}
	if (error)
		goto out;
	if (ip->i_d.di_anextents > 0)
		error = xfs_attr_inactive(ip);
	if (error)
		goto out;

	error = xfs_inactive_ifree(ip);

	xfs_qm_dqdetach(ip);
out:
	return;

This gives us a natural separation of the different transaction
reservations and contexts needed to perform the operations, and does
result in any extraneous work being done because we don't know what
the transaction context passed to us contains at all...

FWIW, there are other reasons for suggesting this structure - have a
read of "[RFD 14/17] xfs: separate inode freeing from inactivation"
and you'll see that what I've suggested above sets the code up for
implementing the optimisations documented in the RFD.

http://oss.sgi.com/archives/xfs/2013-08/msg00345.html

It might be best to put this as 3-4 patches at the start of the
series, rather than in the middle of it as it's really a separate
piece of cleanup work....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC in xfs_inactive()
  2013-09-03 18:25 ` [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC " Brian Foster
@ 2013-09-05  1:40   ` Dave Chinner
  2013-09-05 16:18     ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  1:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:25:04PM -0400, Brian Foster wrote:
> An ifree data block reservation can fail with ENOSPC. Flush inodes
> to try and free up space or attempt without a data block
> reservation to avoid failing out of xfs_inactive().
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 56cbf63..92de4b7 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1779,7 +1779,18 @@ xfs_inactive(
>  	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
>  	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
>  				  XFS_IFREE_SPACE_RES(mp), 0);
> +	if (error == ENOSPC) {
> +		/* flush outstanding delalloc blocks and retry */
> +		xfs_flush_inodes(mp);
> +		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> +					  XFS_IFREE_SPACE_RES(mp), 0);
> +	}

We don't want to be blocking for inode flushes here. We might be in
a shrinker context, for example, and blocking those for a filesystem
sync is going to be unfriendly.

If this really is a problem, then the right thing to do is to allow
this transaction to dip into the reserve block pool so the
transaction can complete and make progress - other write operations
will trigger the flushing of the filesystem, and freeing of whole
inode chunks should return more free space than we need for the
finobt modifications in the removing lots of zero length inodes
at ENOSPC case....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt
  2013-09-03 18:25 ` [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
@ 2013-09-05  2:10   ` Dave Chinner
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  2:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:25:05PM -0400, Brian Foster wrote:
> A newly allocated inode chunk, by definition, has at least one
> free inode, so a record is always inserted into the finobt.
> 
> Create the xfs_inobt_insert() helper from existing code to insert
> a record in an inobt based on the provided BTNUM. Update
> xfs_ialloc_ag_alloc() to invoke the helper for the existing
> XFS_BTNUM_INO tree and XFS_BTNUM_FINO tree, if enabled.

Factoring is good...

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_ialloc.c | 79 +++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 56 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index 5ced506..e64a728 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -152,6 +152,52 @@ xfs_check_agi_freecount(
>  #endif
>  
>  /*
> + * Insert records describing a newly allocated inode chunk into the inobt.
> + */
> +STATIC int
> +xfs_inobt_insert(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	struct xfs_buf		*agbp,
> +	xfs_agino_t		newino,
> +	xfs_agino_t		newlen,
> +	xfs_btnum_t		btnum)
> +{
> +	struct xfs_btree_cur	*cur;
> +	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
> +	xfs_agnumber_t		agno = be32_to_cpu(agi->agi_seqno);
> +	xfs_agino_t		thisino;
> +	int			i;
> +	int			error;
> +
> +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
> +
> +	for (thisino = newino;
> +	     thisino < newino + newlen;
> +	     thisino += XFS_INODES_PER_CHUNK) {
> +		cur->bc_rec.i.ir_startino = thisino;
> +		cur->bc_rec.i.ir_freecount = XFS_INODES_PER_CHUNK;
> +		cur->bc_rec.i.ir_free = XFS_INOBT_ALL_FREE;
> +		error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, &i);
> +		if (error) {
> +			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +			return error;
> +		}

I'm wondering if we'd do better to pass freecount/free to
xfs_inobt_lookup() and call that instead. i.e. we don't need to
expose the btree cursor internals here...

Maybe, also, move this function up to the top of the file with the
other xfs_inobt_* functions.

Otherwise it all looks OK.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation
  2013-09-03 18:25 ` [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation Brian Foster
@ 2013-09-05  2:27   ` Dave Chinner
  2013-09-05 16:18     ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  2:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:25:06PM -0400, Brian Foster wrote:
> Replace xfs_dialloc_ag() with an implementation that looks for a
> record in the finobt. The finobt only tracks records with at least
> one free inode. This eliminates the need for the intra-ag scan in
> the original algorithm. Once the inode is allocated, update the
> finobt appropriately (possibly removing the record) as well as the
> inobt.
> 
> Move the original xfs_dialloc_ag() algorithm to
> xfs_dialloc_ag_slow() and fall back as such if finobt support is
> not enabled.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_ialloc.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 135 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index e64a728..516f4af 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -708,7 +708,7 @@ xfs_ialloc_get_rec(
>   * available.
>   */
>  STATIC int
> -xfs_dialloc_ag(
> +xfs_dialloc_ag_slow(
>  	struct xfs_trans	*tp,
>  	struct xfs_buf		*agbp,
>  	xfs_ino_t		parent,
> @@ -966,6 +966,140 @@ error0:
>  	return error;
>  }
>  
> +STATIC int
> +xfs_dialloc_ag(
> +	struct xfs_trans	*tp,
> +	struct xfs_buf		*agbp,
> +	xfs_ino_t		parent,
> +	xfs_ino_t		*inop)
> +{
> +	struct xfs_mount		*mp = tp->t_mountp;
> +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> +	xfs_agino_t			pagino = XFS_INO_TO_AGINO(mp, parent);
> +	struct xfs_perag		*pag;
> +	struct xfs_btree_cur		*fcur;
> +	struct xfs_btree_cur		*icur;
> +	struct xfs_inobt_rec_incore	frec;
> +	struct xfs_inobt_rec_incore	irec;
> +	xfs_ino_t			ino;
> +	int				error;
> +	int				offset;
> +	int				i;
> +
> +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> +		return xfs_dialloc_ag_slow(tp, agbp, parent, inop);

I'm starting to think that we really, really need the iops vector
mentioned in "[RFD 15/17] xfs: introduce a method vector for unlinked
list operations" so we don't need to have these sorts of switches in
the code...

> +
> +	pag = xfs_perag_get(mp, agno);
> +
> +	/*
> +	 * If pagino is 0 (this is the root inode allocation) use newino.
> +	 * This must work because we've just allocated some.
> +	 */
> +	if (!pagino)
> +		pagino = be32_to_cpu(agi->agi_newino);
> +
> +	fcur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
> +	icur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
> +
> +	error = xfs_check_agi_freecount(fcur, agi);
> +	if (error)
> +		goto error;
> +	error = xfs_check_agi_freecount(icur, agi);
> +	if (error)
> +		goto error;

Why do we need to initialise both cursors at once? We only do the
operations one at a time, and you should actually use 2 cursors
for the finobt lookup.....

> +
> +	/*
> +	 * Search the finobt.
> +	 */
> +	error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_LE, &i);
> +	if (error)
> +		goto error;
> +	if (i == 0) {
> +		error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_GE, &i);
> +		if (error)
> +			goto error;
> +		XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> +	}

.... because this biases allocation to lower inode numbers than the
target. i.e we only ever search for higher numbers if here are none
lower. That's quite different to the current algorithm which first
searches for the *closest* free inode.

That is, we should be using two cursors for the free inode search,
one for LE, the other for GT. If they both return records then, like
the "slow" algorithm, calculate the diff between them and the target
inode, and select the closer one (smallest diff). Destroy the cursor
you don't need.


> +	error = xfs_inobt_get_rec(fcur, &frec, &i);
> +	if (error)
> +		goto error;
> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> +
> +	offset = xfs_lowbit64(frec.ir_free);
> +	ASSERT(offset >= 0);
> +	ASSERT(offset < XFS_INODES_PER_CHUNK);
> +	ASSERT((XFS_AGINO_TO_OFFSET(mp, frec.ir_startino) %
> +				   XFS_INODES_PER_CHUNK) == 0);
> +	ino = XFS_AGINO_TO_INO(mp, agno, frec.ir_startino + offset);
> +
> +	/*
> +	 * Modify or remove the finobt record.
> +	 */
> +	frec.ir_free &= ~XFS_INOBT_MASK(offset);
> +	frec.ir_freecount--;
> +	if (frec.ir_freecount) 
> +		error = xfs_inobt_update(fcur, &frec);
> +	else
> +		error = xfs_btree_delete(fcur, &i);
> +	if (error)
> +		goto error;

Yup, good.

Now you can re-initialise the second cursor to point at the inobt
and:

> +
> +	/*
> +	 * Lookup and modify the equivalent record in the inobt.
> +	 */
> +	error = xfs_inobt_lookup(icur, frec.ir_startino, XFS_LOOKUP_EQ, &i);
> +	if (error)
> +		goto error;
> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> +
> +	error = xfs_inobt_get_rec(icur, &irec, &i);
> +	if (error)
> +		goto error;
> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> +	ASSERT((XFS_AGINO_TO_OFFSET(mp, irec.ir_startino) %
> +				   XFS_INODES_PER_CHUNK) == 0);
> +
> +	irec.ir_free &= ~XFS_INOBT_MASK(offset);
> +	irec.ir_freecount--;
> +
> +	XFS_WANT_CORRUPTED_GOTO((frec.ir_free == irec.ir_free) &&
> +				(frec.ir_freecount == irec.ir_freecount),
> +				error);

Good, I like that check - they should be the same!

> +
> +	error = xfs_inobt_update(icur, &irec);
> +	if (error)
> +		goto error;
> +
> +	/*
> +	 * Update the perag and superblock.
> +	 */
> +	be32_add_cpu(&agi->agi_freecount, -1);
> +	xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
> +	pag->pagi_freecount--;
> +
> +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
> +	xfs_perag_put(pag);
> +
> +	error = xfs_check_agi_freecount(fcur, agi);
> +	if (error)
> +		goto error;
> +	error = xfs_check_agi_freecount(icur, agi);
> +	if (error)
> +		goto error;

Failures here will result in 2 calls to xfs_perag_put(pag);

> +
> +	xfs_btree_del_cursor(icur, XFS_BTREE_NOERROR);
> +	xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
> +	*inop = ino;
> +	return 0;
> +error:
> +	xfs_perag_put(pag);
> +	xfs_btree_del_cursor(icur, XFS_BTREE_ERROR);
> +	xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
> +	return error;
> +}

Otherwise it looks good.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
  2013-09-03 18:25 ` [RFC PATCH 10/11] xfs: update the finobt on inode free Brian Foster
@ 2013-09-05  2:54   ` Dave Chinner
  2013-09-05 16:19     ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  2:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
> An inode free operation can have several effects on the finobt. If
> all inodes have been freed and the chunk deallocated, we remove the
> finobt record. If the inode chunk was previously full, we must
> insert a new record based on the existing inobt record. Otherwise,
> we modify the record in place.
> 
> Create the xfs_ifree_finobt() function to identify the potential
> scenarios and update the finobt appropriately.

The first thing I'd do is factor all the inobt manipulation
code xfs_difree() into a xfs_difree_inobt() helper function. have it
return the record and offset that is then passed to your new helper
xfs_difree_finobt(). That way xfs_difree() really becomes the setup
function for the two btree operations rather than containing one set
of modifications and calling a function to do the other...

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_ialloc.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 120 insertions(+)
> 
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index 516f4af..96f71b5 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -198,6 +198,117 @@ xfs_inobt_insert(
>  }
>  
>  /*
> + * Free an inode in the free inode btree.
> + */
> +STATIC int
> +xfs_ifree_finobt(
> +	struct xfs_mount		*mp,
> +	struct xfs_trans		*tp,
> +	struct xfs_buf			*agbp,
> +	struct xfs_inobt_rec_incore	*ibtrec,/* inobt record */
> +	int				offset)	/* offset of inode */
> +{
> +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_inobt_rec_incore	rec;
> +	int				error;
> +	int				i;
> +
> +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> +		return 0;

There's that vector thing again...

> +
> +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
> +
> +	error = xfs_inobt_lookup(cur, ibtrec->ir_startino, XFS_LOOKUP_EQ, &i);
> +	if (error)
> +		goto error;
> +
> +	if (i == 1) {
> +		int j;
> +		/*
> +		 * Read and update the existing record.
> +		 */
> +		error = xfs_inobt_get_rec(cur, &rec, &j);
> +		if (error)
> +			goto error;
> +		XFS_WANT_CORRUPTED_GOTO(j == 1, error);
> +
> +		rec.ir_free |= XFS_INOBT_MASK(offset);
> +		rec.ir_freecount++;
> +
> +		XFS_WANT_CORRUPTED_GOTO((rec.ir_free == ibtrec->ir_free) &&
> +					(rec.ir_freecount == ibtrec->ir_freecount),
> +					error);
> +	}

I can't say I'm a great fan of the layout of the logic. Yes, there's
lots of cases to handle. It looks like:

	lookup()
	if (found)
		modify in place
	if (found && full && deleting chunks)
		delete record
	else if (!found && no record)
		insert record
	else if (found)
		update record
	else
		corruption!

I think it woul dbe better to get then "!found" case out of the way
at the start. ie

	if (i == 0) {
		if (ibtrec->ir_freecount == 1)
			insert record
		else
			CORRUPTION
		goto out;
	}

	/* found a record, no need to check i == 1 anymore */
	ASSERT(i == 1);

	/* read and update */

	if (full && deleting chunks)
		delete record
	else
		update record

	
> +
> +	/*
> +	 * The content of inobt records should always match between the inobt
> +	 * and finobt. The lifecycle of records in the finobt is different from
> +	 * the inobt in that the finobt only tracks records with at least one
> +	 * free inode. This is to optimize lookup for inode allocation purposes.
> +	 * The following checks fix up the finobt appropriately based on the
> +	 * state of the record subsequent to the current operation.
> +	 */
> +
> +	if ((i == 1) &&
> +	    (rec.ir_freecount == XFS_IALLOC_INODES(mp) &&
> +	     !(mp->m_flags & XFS_MOUNT_IKEEP))) {
> +		/*
> +		 * We have an existing finobt record. If all inodes are free
> +		 * and we're in !ikeep mode, the entire inode chunk has been
> +		 * deallocated. Remove the record from the finobt.
> +		 */
> +		error = xfs_btree_delete(cur, &i);
> +		if (error)
> +			goto error;
> +		ASSERT(i == 1);
> +	} else if ((i == 0) && (ibtrec->ir_freecount == 1)) {
> +		/*
> +		 * No existing finobt record and the inobt record has a single
> +		 * free inode. This means we've freed an inode in a previously
> +		 * fully allocated chunk. Insert a new record into the finobt
> +		 * based on the current inobt record.
> +		 */
> +		cur->bc_rec.i.ir_startino = ibtrec->ir_startino;
> +		cur->bc_rec.i.ir_free = ibtrec->ir_free;
> +		cur->bc_rec.i.ir_freecount = ibtrec->ir_freecount;
> +		error = xfs_btree_insert(cur, &i);
> +		if (error)
> +			goto error;
> +		ASSERT(i == 1);

That's rather similar to the code in xfs_inobt_insert(). Indeed,
is you write a helper - xfs_inobt_insert_rec() - for this, then rather than modifying
xfs_inobt_lookup() to take extra parameters like I wondered for the
previous patch, leave it alonge and pass the parameters to
xfs_inobt_insert_rec() instead.

Then this code is functionally identical to xfs_inobt_insert() done
during allocation....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 11/11] xfs: add finobt support to growfs
  2013-09-03 18:25 ` [RFC PATCH 11/11] xfs: add finobt support to growfs Brian Foster
@ 2013-09-05  2:55   ` Dave Chinner
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-05  2:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Sep 03, 2013 at 02:25:08PM -0400, Brian Foster wrote:
> Add finobt support to growfs. Initialize the agi root/level fields
> and the root finobt block.

Looks OK.

Reviewed-by: Dave Chinner <david@fromorbit.com>

growfs is crying out for a table driven initialisation loop,
though...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
  2013-09-05  0:54   ` Dave Chinner
@ 2013-09-05 16:17     ` Brian Foster
  2013-09-06  0:07       ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 09/04/2013 08:54 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:00PM -0400, Brian Foster wrote:
>> Define the AGI fields for the finobt root/level and add magic
>> numbers. Update the btree code to add support for the new
>> XFS_BTNUM_FINOBT inode btree.
>>
>> The finobt root block is reserved immediately following the inobt
>> root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
>> starting AG data block based on whether finobt support is enabled.
> 
> A few minor things...
> 
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>>  fs/xfs/xfs_ag.h           |  7 ++++++-
>>  fs/xfs/xfs_btree.c        |  6 ++++--
>>  fs/xfs/xfs_btree.h        |  3 +++
>>  fs/xfs/xfs_ialloc.c       |  2 ++
>>  fs/xfs/xfs_ialloc_btree.c | 29 +++++++++++++++++++++++------
>>  fs/xfs/xfs_ialloc_btree.h | 14 +++++++++++++-
>>  fs/xfs/xfs_log_recover.c  |  2 ++
>>  fs/xfs/xfs_stats.h        | 18 +++++++++++++++++-
>>  fs/xfs/xfs_types.h        |  2 +-
>>  9 files changed, 71 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
>> index 1cb740a..b85585d 100644
>> --- a/fs/xfs/xfs_ag.h
>> +++ b/fs/xfs/xfs_ag.h
>> @@ -166,6 +166,9 @@ typedef struct xfs_agi {
>>  	__be32		agi_pad32;
>>  	__be64		agi_lsn;	/* last write sequence */
>>  
>> +	__be32		agi_free_root; /* root of the free inode btree */
>> +	__be32		agi_free_level;/* levels in free inode btree */
>> +
>>  	/* structure must be padded to 64 bit alignment */
>>  } xfs_agi_t;
> 
> That's fine, but...
>>  
>> @@ -180,7 +183,9 @@ typedef struct xfs_agi {
>>  #define	XFS_AGI_NEWINO		0x00000100
>>  #define	XFS_AGI_DIRINO		0x00000200
>>  #define	XFS_AGI_UNLINKED	0x00000400
>> -#define	XFS_AGI_NUM_BITS	11
>> +#define	XFS_AGI_FREE_ROOT	0x00000800
>> +#define	XFS_AGI_FREE_LEVEL	0x00001000
>> +#define	XFS_AGI_NUM_BITS	13
>>  #define	XFS_AGI_ALL_BITS	((1 << XFS_AGI_NUM_BITS) - 1)
> 
> This is a bit of a problem, because the range logging bits will now
> cause the entire AGI to be logged (including all the unlinked list
> hash) because these only define the first/last offsets to be
> logged...
> 

Ok, I see what you mean here...

>> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
>> index 524aa88..5ced506 100644
>> --- a/fs/xfs/xfs_ialloc.c
>> +++ b/fs/xfs/xfs_ialloc.c
>> @@ -1505,6 +1505,8 @@ xfs_ialloc_log_agi(
>>  		offsetof(xfs_agi_t, agi_newino),
>>  		offsetof(xfs_agi_t, agi_dirino),
>>  		offsetof(xfs_agi_t, agi_unlinked),
>> +		offsetof(xfs_agi_t, agi_free_root),
>> +		offsetof(xfs_agi_t, agi_free_level),
>>  		sizeof(xfs_agi_t)
>>  	};
> 
> Because of how this table works.
> 
> What we really need here is for xfs_ialloc_log_agi to consider that
> there are two distinct regions for range logging - the first spaces
> from offset 0 to offset of agi_unlinked, and the second is from the
> the offset of agi_free_root to the end of the xfs_agi_t....
> 
> It's abit messy, I know, but we couldn't easily add new padding to
> the AGI in the existing range logging area like was done for the AGF
> because of the unlinked list hash table already defining the end of
> the range logging region....
> 

... but where would that ever happen? The existing invocations of
xfs_ialloc_log_agi() seem to log either the agi inode count values or
the btree root/level values (i.e., never the range across both). I think
I've introduced at least a couple new invocations throughout this set,
but I've not changed that model (i.e., an XFS_AGI_FREECOUNT instance in
the new lookup code and an XFS_AGI_FREE_ROOT|*_LEVEL instance in the new
btree code).

My understanding of this code is that the range to log is defined at
invocation time to xfs_iallog_log_agi(), so if the callers never specify
a range that includes the unlinked bits in a single call, we won't set
that range in the buffer log item code. In other words, even if we
ultimately happen to log both ranges in the agi, the lower level code
will never expand the logged region. Therefore, this shouldn't happen
unless a single invocation that specifies one of the
XFS_AGI_FREE_ROOT/LEVEL bits also specifies one of the existing agi bits.

I could see breaking this up as a matter of preparation for future
fields or calls that would introduce logging that kind of range, but at
the same time (and assuming my interpretation of above is correct),
that's a bit of code that serves no purpose for the foreseeable future.
Perhaps a comment in xfs_ialloc_log_agi() and the one caller that uses
the AGI_FREE_* bits to document this restriction is sufficient?

>>  #ifdef DEBUG
>> diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
>> index 0cdb88b..7923292 100644
>> --- a/fs/xfs/xfs_ialloc_btree.c
>> +++ b/fs/xfs/xfs_ialloc_btree.c
>> @@ -62,10 +62,18 @@ xfs_inobt_set_root(
>>  {
>>  	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
>>  	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
>> -
>> -	agi->agi_root = nptr->s;
>> -	be32_add_cpu(&agi->agi_level, inc);
>> -	xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
>> +	int			fields;
>> +
>> +	if (cur->bc_btnum == XFS_BTNUM_INO) {
>> +		agi->agi_root = nptr->s;
>> +		be32_add_cpu(&agi->agi_level, inc);
>> +		fields = XFS_AGI_ROOT | XFS_AGI_LEVEL;
>> +	} else {
>> +		agi->agi_free_root = nptr->s;
>> +		be32_add_cpu(&agi->agi_free_level, inc);
>> +		fields = XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
>> +	}
>> +	xfs_ialloc_log_agi(cur->bc_tp, agbp, fields);
>>  }
> 
> I suspect that it would be better to have separate functions for
> these differences i.e. xfs_inobt_set_root() and
> xfs_finobt_set_root(), and set up separate btree ops structure
> forthe two different trees. Most of the code is still identical,
> but the differences in root structures can easily be handled without
> putting switches in the code everywhere.
> 

Ok, I'm assuming the suggestion is to only create new functions for the
implementations that differ.

>>  
>>  STATIC int
>> @@ -172,7 +180,10 @@ xfs_inobt_init_ptr_from_cur(
>>  
>>  	ASSERT(cur->bc_private.a.agno == be32_to_cpu(agi->agi_seqno));
>>  
>> -	ptr->s = agi->agi_root;
>> +	if (cur->bc_btnum == XFS_BTNUM_INO)
>> +		ptr->s = agi->agi_root;
>> +	else
>> +		ptr->s = agi->agi_free_root;
>>  }
> 
> Like this...
> 
>>  
>>  STATIC __int64_t
>> @@ -205,6 +216,7 @@ xfs_inobt_verify(
>>  	 */
>>  	switch (block->bb_magic) {
>>  	case cpu_to_be32(XFS_IBT_CRC_MAGIC):
>> +	case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
>>  		if (!xfs_sb_version_hascrc(&mp->m_sb))
>>  			return false;
>>  		if (!uuid_equal(&block->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid))
>> @@ -216,6 +228,7 @@ xfs_inobt_verify(
>>  			return false;
>>  		/* fall through */
>>  	case cpu_to_be32(XFS_IBT_MAGIC):
>> +	case cpu_to_be32(XFS_FIBT_MAGIC):
>>  		break;
>>  	default:
>>  		return 0;
>> @@ -335,8 +348,12 @@ xfs_inobt_init_cursor(
>>  
>>  	cur->bc_tp = tp;
>>  	cur->bc_mp = mp;
>> -	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
>>  	cur->bc_btnum = btnum;
>> +	if (btnum == XFS_BTNUM_INO)
>> +		cur->bc_nlevels = be32_to_cpu(agi->agi_level);
>> +	else
>> +		cur->bc_nlevels = be32_to_cpu(agi->agi_free_level);
>> +
>>  	cur->bc_blocklog = mp->m_sb.sb_blocklog;
>>  
>>  	cur->bc_ops = &xfs_inobt_ops;
> 
> And this is where you do the check on the btnum and set the
> appropriate ops structure....
> 

Ok.

>>  #define	XFS_INODES_PER_CHUNK		(NBBY * sizeof(xfs_inofree_t))
>> @@ -73,7 +75,17 @@ typedef __be32 xfs_inobt_ptr_t;
>>   * block numbers in the AG.
>>   */
>>  #define	XFS_IBT_BLOCK(mp)		((xfs_agblock_t)(XFS_CNT_BLOCK(mp) + 1))
>> -#define	XFS_PREALLOC_BLOCKS(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
>> +#define	XFS_FIBT_BLOCK(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
>> +
>> +/*
>> + * The first data block of an AG depends on whether the filesystem was formatted
>> + * with the finobt feature. If so, account for the finobt reserved root btree
>> + * block.
>> + */
>> +#define XFS_PREALLOC_BLOCKS(mp) \
>> +	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
>> +	 XFS_FIBT_BLOCK(mp) + 1 : \
>> +	 XFS_IBT_BLOCK(mp) + 1)
> 
> Yup, that looks sensible, with a nice comment to explain it :)
> 
>> diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
>> index c03ad38..c8f238b 100644
>> --- a/fs/xfs/xfs_stats.h
>> +++ b/fs/xfs/xfs_stats.h
>> @@ -183,7 +183,23 @@ struct xfsstats {
>>  	__uint32_t		xs_ibt_2_alloc;
>>  	__uint32_t		xs_ibt_2_free;
>>  	__uint32_t		xs_ibt_2_moves;
>> -#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_IBT_V2+6)
>> +#define XFSSTAT_END_FIBT_V2		(XFSSTAT_END_IBT_V2+15)
>> +	__uint32_t		xs_fibt_2_lookup;
>> +	__uint32_t		xs_fibt_2_compare;
>> +	__uint32_t		xs_fibt_2_insrec;
>> +	__uint32_t		xs_fibt_2_delrec;
>> +	__uint32_t		xs_fibt_2_newroot;
>> +	__uint32_t		xs_fibt_2_killroot;
>> +	__uint32_t		xs_fibt_2_increment;
>> +	__uint32_t		xs_fibt_2_decrement;
>> +	__uint32_t		xs_fibt_2_lshift;
>> +	__uint32_t		xs_fibt_2_rshift;
>> +	__uint32_t		xs_fibt_2_split;
>> +	__uint32_t		xs_fibt_2_join;
>> +	__uint32_t		xs_fibt_2_alloc;
>> +	__uint32_t		xs_fibt_2_free;
>> +	__uint32_t		xs_fibt_2_moves;
>> +#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_FIBT_V2+6)
>>  	__uint32_t		xs_qm_dqreclaims;
>>  	__uint32_t		xs_qm_dqreclaim_misses;
>>  	__uint32_t		xs_qm_dquot_dups;
> 
> I didn't see an equivalent change to add these new stats to the proc
> file output (ie. in xfs_stat_proc_show()). maybe I just missed it,
> but if I didn't, can you add it?
> 

Yeah, that's missing. Good catch. I think I added these bits on some
kind of compilation failure or warning and I hadn't gone back to check
that they were used anywhere. :P

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt
  2013-09-05  0:59   ` Dave Chinner
@ 2013-09-05 16:17     ` Brian Foster
  2013-09-06  0:11       ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 09/04/2013 08:59 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:01PM -0400, Brian Foster wrote:
>> Update inode allocation transaction reservations for the finobt. A
>> create via record modification requires a log reservation for the
>> additional finobt record. Any such allocation could result in an
>> finobt removal if the inode chunk has become fully allocated, thus
>> we include a reservation for a finobt btree merge as well.
>> Allocation of a new inode chunk must account for splits in the
>> finobt as well as the existing ialloc tree.
> 
> These transaction reservation changes are only necessary for
> filesystems with free inode btrees, otherwise they just use more log
> space than is necessary.
> 

Yeah, good point.

> Can you add helper functions for the free inode btree reservations,
> and have them return 0 when the feature is not enabled? That way the
> code stays pretty clean, is self documenting and doesn't take
> unnecessary space when the feature is not enabled....
> 

So not new functions that duplicate the entire calculations for the
finobt enabled cases, but new functions that define the additional
requirements for the finobt on top of the existing reservation
calculations for particular operations (i.e., similar to the recent
patch to fix the inode log size, if I recall). Sounds good.

Brian

>> Also update XFS_IALLOC_SPACE_RES() to reserve data blocks for
>> finobt split/merge scenarios.
> 
> Needs to handle the enabled/disabled case, too.
> 
> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive()
  2013-09-05  1:35   ` Dave Chinner
@ 2013-09-05 16:18     ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 09/04/2013 09:35 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:03PM -0400, Brian Foster wrote:
>> The transaction allocated in xfs_inactive() can be passed down into
>> xfs_inactive_symlink() or xfs_itruncate_extents(), both of which
>> can commit and reallocate a new transaction. This leads to
>> reservation issues if the transaction is subsequently passed into
>> xfs_ifree(), which requires a larger reservation to manage the
>> finobt.
>>
>> Reorganize xfs_inactive() to commit any transaction handed back
>> from symlink or truncate processing and unconditionally allocate
>> a new transaction for xfs_ifree() with the appropriate reservation.
> 
> Ok, I've had a bit of a look at this now, and I like how the code
> turns out. However, I don't think it goes far enough, or fix the
> problem that causes all the transaction nastiness in xfs_inactive().
> 
> Firstly, we are not doing rolling transactions here - there is no
> need for all the changes to be atomic because the inode is on the
> unlinked list if it is going to be freed. Hence we don't need to
> pass transaction pointers around.
> 

I was wondering about this when I was passing through the code. If I
recall, I ended up just trying to work around the lower level calls as
opposed to messing with them because xfs_truncate_extents() had other
callers (but the discussion below addresses that).

> xfs_inactive_symlink() can do a transaction completely internally,
> and, well, it doesn't even log the inode if the symlink is in-line
> and so may not even need a transaction. Hence really only
> xfs_inactive_symlink_rmt() needs to run a transaction, and it can do
> that internally just fine.
> 

Ok.

> For the xfs_itruncate_extents() data fork transaction, just add a
> new wrapper called xfs_inactive_truncate() that holds the
> transaction context internally - that moves the only other
> transaction context that you need to commit out of xfs_inactive()
> altogether, as the attr fork already uses a private transaction
> context.
> 

Sounds good.

> And, finally, you can then factor the xfs_ifree() and it's
> transaction context into a helper function as well, so there aren't
> any transaction contexts left in xfs_inactive() at all.
> 
> That would leave us with:
> 
> 	if (ISLNK) {
> 		error = xfs_inactive_symlink(ip);
> 	} else if (truncate)
> 		error = xfs_inactive_truncate(ip);
> 	}
> 	if (error)
> 		goto out;
> 	if (ip->i_d.di_anextents > 0)
> 		error = xfs_attr_inactive(ip);
> 	if (error)
> 		goto out;
> 
> 	error = xfs_inactive_ifree(ip);
> 
> 	xfs_qm_dqdetach(ip);
> out:
> 	return;
> 
> This gives us a natural separation of the different transaction
> reservations and contexts needed to perform the operations, and does
> result in any extraneous work being done because we don't know what
> the transaction context passed to us contains at all...
> 

Yeah, I agree. That cleans things up nicely.

> FWIW, there are other reasons for suggesting this structure - have a
> read of "[RFD 14/17] xfs: separate inode freeing from inactivation"
> and you'll see that what I've suggested above sets the code up for
> implementing the optimisations documented in the RFD.
> 
> http://oss.sgi.com/archives/xfs/2013-08/msg00345.html
> 
> It might be best to put this as 3-4 patches at the start of the
> series, rather than in the middle of it as it's really a separate
> piece of cleanup work....
> 

Ok, I've skimmed through most of the writeups in that set. If it turns
into a handful of patches, I might just test and post this as an
independent, prerequisite set for personal ease of management reasons if
nothing else. I'm more confident in the changes with the review feedback
and that I obviously now know what the finobt requirements are. Thanks.

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC in xfs_inactive()
  2013-09-05  1:40   ` Dave Chinner
@ 2013-09-05 16:18     ` Brian Foster
  2013-09-06  0:17       ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 09/04/2013 09:40 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:04PM -0400, Brian Foster wrote:
>> An ifree data block reservation can fail with ENOSPC. Flush inodes
>> to try and free up space or attempt without a data block
>> reservation to avoid failing out of xfs_inactive().
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>>  fs/xfs/xfs_inode.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 56cbf63..92de4b7 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -1779,7 +1779,18 @@ xfs_inactive(
>>  	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
>>  	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
>>  				  XFS_IFREE_SPACE_RES(mp), 0);
>> +	if (error == ENOSPC) {
>> +		/* flush outstanding delalloc blocks and retry */
>> +		xfs_flush_inodes(mp);
>> +		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
>> +					  XFS_IFREE_SPACE_RES(mp), 0);
>> +	}
> 
> We don't want to be blocking for inode flushes here. We might be in
> a shrinker context, for example, and blocking those for a filesystem
> sync is going to be unfriendly.
> 

Ok.

> If this really is a problem, then the right thing to do is to allow
> this transaction to dip into the reserve block pool so the
> transaction can complete and make progress - other write operations
> will trigger the flushing of the filesystem, and freeing of whole
> inode chunks should return more free space than we need for the
> finobt modifications in the removing lots of zero length inodes
> at ENOSPC case....
> 

I did have one of the enospc xfstests lead to this situation, though I
don't have the particular test in my notes. It initially manifested as
an assert failure due to the fs not being shutdown after an
xfs_trans_reserve() ENOSPC failure. Subsequent to avoiding that, I
believe there were inconsistent fs issues called out due to the unlinked
lists being populated after umount.

Taking a further look, I missed the XFS_TRANS_RESERVE flag and whole
m_resblks mechanism. I'll take a closer look at that and see if that
works to resolve the problem instead of the flush.

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation
  2013-09-05  2:27   ` Dave Chinner
@ 2013-09-05 16:18     ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 09/04/2013 10:27 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:06PM -0400, Brian Foster wrote:
>> Replace xfs_dialloc_ag() with an implementation that looks for a
>> record in the finobt. The finobt only tracks records with at least
>> one free inode. This eliminates the need for the intra-ag scan in
>> the original algorithm. Once the inode is allocated, update the
>> finobt appropriately (possibly removing the record) as well as the
>> inobt.
>>
>> Move the original xfs_dialloc_ag() algorithm to
>> xfs_dialloc_ag_slow() and fall back as such if finobt support is
>> not enabled.
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>>  fs/xfs/xfs_ialloc.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 135 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
>> index e64a728..516f4af 100644
>> --- a/fs/xfs/xfs_ialloc.c
>> +++ b/fs/xfs/xfs_ialloc.c
>> @@ -708,7 +708,7 @@ xfs_ialloc_get_rec(
>>   * available.
>>   */
...
> 
> Why do we need to initialise both cursors at once? We only do the
> operations one at a time, and you should actually use 2 cursors
> for the finobt lookup.....
> 

No good reason. I probably just did that to simplify error handling.

>> +
>> +	/*
>> +	 * Search the finobt.
>> +	 */
>> +	error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_LE, &i);
>> +	if (error)
>> +		goto error;
>> +	if (i == 0) {
>> +		error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_GE, &i);
>> +		if (error)
>> +			goto error;
>> +		XFS_WANT_CORRUPTED_GOTO(i == 1, error);
>> +	}
> 
> .... because this biases allocation to lower inode numbers than the
> target. i.e we only ever search for higher numbers if here are none
> lower. That's quite different to the current algorithm which first
> searches for the *closest* free inode.
> 
> That is, we should be using two cursors for the free inode search,
> one for LE, the other for GT. If they both return records then, like
> the "slow" algorithm, calculate the diff between them and the target
> inode, and select the closer one (smallest diff). Destroy the cursor
> you don't need.
> 

Ah, Ok. I hadn't taken a close enough look at the existing algorithm
yet, to be honest. I'll do so and incorporate the closest free inode
heuristic.

...
>> +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
>> +	xfs_perag_put(pag);
>> +
>> +	error = xfs_check_agi_freecount(fcur, agi);
>> +	if (error)
>> +		goto error;
>> +	error = xfs_check_agi_freecount(icur, agi);
>> +	if (error)
>> +		goto error;
> 
> Failures here will result in 2 calls to xfs_perag_put(pag);
> 

Yeah, thanks.

Brian

>> +
>> +	xfs_btree_del_cursor(icur, XFS_BTREE_NOERROR);
>> +	xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
>> +	*inop = ino;
>> +	return 0;
>> +error:
>> +	xfs_perag_put(pag);
>> +	xfs_btree_del_cursor(icur, XFS_BTREE_ERROR);
>> +	xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
>> +	return error;
>> +}
> 
> Otherwise it looks good.
> 
> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
  2013-09-05  2:54   ` Dave Chinner
@ 2013-09-05 16:19     ` Brian Foster
  2013-09-06  0:28       ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 09/04/2013 10:54 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
>> An inode free operation can have several effects on the finobt. If
>> all inodes have been freed and the chunk deallocated, we remove the
>> finobt record. If the inode chunk was previously full, we must
>> insert a new record based on the existing inobt record. Otherwise,
>> we modify the record in place.
>>
>> Create the xfs_ifree_finobt() function to identify the potential
>> scenarios and update the finobt appropriately.
> 
> The first thing I'd do is factor all the inobt manipulation
> code xfs_difree() into a xfs_difree_inobt() helper function. have it
> return the record and offset that is then passed to your new helper
> xfs_difree_finobt(). That way xfs_difree() really becomes the setup
> function for the two btree operations rather than containing one set
> of modifications and calling a function to do the other...
> 

Sounds logical.

>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>>  fs/xfs/xfs_ialloc.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 120 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
>> index 516f4af..96f71b5 100644
>> --- a/fs/xfs/xfs_ialloc.c
>> +++ b/fs/xfs/xfs_ialloc.c
>> @@ -198,6 +198,117 @@ xfs_inobt_insert(
>>  }
>>  
>>  /*
>> + * Free an inode in the free inode btree.
>> + */
>> +STATIC int
>> +xfs_ifree_finobt(
...
> 
> I can't say I'm a great fan of the layout of the logic. Yes, there's
> lots of cases to handle. It looks like:
> 

Yeah, I've shuffled this code around quite a bit myself.

> 	lookup()
> 	if (found)
> 		modify in place
> 	if (found && full && deleting chunks)
> 		delete record
> 	else if (!found && no record)
> 		insert record
> 	else if (found)
> 		update record
> 	else
> 		corruption!
> 
> I think it woul dbe better to get then "!found" case out of the way
> at the start. ie
> 
> 	if (i == 0) {
> 		if (ibtrec->ir_freecount == 1)
> 			insert record
> 		else
> 			CORRUPTION
> 		goto out;
> 	}
> 
> 	/* found a record, no need to check i == 1 anymore */
> 	ASSERT(i == 1);
> 
> 	/* read and update */
> 
> 	if (full && deleting chunks)
> 		delete record
> 	else
> 		update record
> 

Ok, I'll try to pull that logic up and see what falls out.

...
>> +	} else if ((i == 0) && (ibtrec->ir_freecount == 1)) {
>> +		/*
>> +		 * No existing finobt record and the inobt record has a single
>> +		 * free inode. This means we've freed an inode in a previously
>> +		 * fully allocated chunk. Insert a new record into the finobt
>> +		 * based on the current inobt record.
>> +		 */
>> +		cur->bc_rec.i.ir_startino = ibtrec->ir_startino;
>> +		cur->bc_rec.i.ir_free = ibtrec->ir_free;
>> +		cur->bc_rec.i.ir_freecount = ibtrec->ir_freecount;
>> +		error = xfs_btree_insert(cur, &i);
>> +		if (error)
>> +			goto error;
>> +		ASSERT(i == 1);
> 
> That's rather similar to the code in xfs_inobt_insert(). Indeed,
> is you write a helper - xfs_inobt_insert_rec() - for this, then rather than modifying
> xfs_inobt_lookup() to take extra parameters like I wondered for the
> previous patch, leave it alonge and pass the parameters to
> xfs_inobt_insert_rec() instead.
> 
> Then this code is functionally identical to xfs_inobt_insert() done
> during allocation....
> 

I think I'm parsing you after having another look at the code.
xfs_inobt_lookup() remains as is and is potentially used from
xfs_inobt_insert(). xfs_inobt_insert_rec() is introduced to set the
cursor fields and do the insert and is used here and from
xfs_inobt_insert().

At that point, this looks close to xfs_inobt_insert(), but I think using
that here would introduce a duplicate lookup. Regardless, we'll see what
the whole thing looks like at that point. Thanks for the reviews. :)

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/11] xfs: introduce the free inode btree
  2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (10 preceding siblings ...)
  2013-09-03 18:25 ` [RFC PATCH 11/11] xfs: add finobt support to growfs Brian Foster
@ 2013-09-05 21:17 ` Michael L. Semon
  2013-09-06 11:17   ` Brian Foster
  2013-09-06 21:35   ` Dave Chinner
  11 siblings, 2 replies; 46+ messages in thread
From: Michael L. Semon @ 2013-09-05 21:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On 09/03/2013 02:24 PM, Brian Foster wrote:
> Hi all,
> 
> This is an RFC for the kernel work to support a free inode btree. The free inode
> btree (finobt) is equivalent to the existing inode allocation btree with the
> exception that the finobt only tracks inode chunks with at least one free inode.
> The purpose is to improve lookups for free inode clusters for inode allocation.
> 
> Newly allocated inode chunks by definition contain free inodes and are thus
> inserted into the finobt immediately. The record for a previously full inode
> chunk is inserted to the finobt when the first inode is freed. A record is
> removed from the finobt when the last free inode has been allocated or the
> entire chunk is completely deallocated.
> 
> Patches 1-3 refactor some ialloc btree code to introduce the new finobt type and
> feature bit. Patches 4-7 fix up the transaction handling for inode allocation
> and deallocation to account for the new tree. Patches 8-10 add the finobt
> management code to insert, remove and modify records as appropriate. Patch 11
> fixes growfs to support the finobt.
> 
> Thoughts, reviews, flames appreciated.

I'm looking for Dave's judgement call on whether I should run this code 
full-time.  The patchset applied well on top of Dave's latest work--only 
a "trailing whitespace" warning on Patch #9 (I think)--and the code 
compiled without error.  There was a lockdep while running xfstests, 
before generic/013 (I think), so I switched back to my normal git branch 
and have your patchset in a separate branch.

My setup here is slow--x86, old IDE hardware, write cache off, debug 
kernel--but the patchset made things seem a little slower.  At the 
right time--not necessarily now--performance numbers might be nice.  
I didn't time it but did a copy of 3 kernel gits to v5 1k-block-size 
XFS and just felt something was off.  The copy did complete, though.  
Will try timing this on another day.

Anyway, good work so far!  No additional stack traces were caused by 
your code in limited testing, and the filesystems were are still 
intact.

Thanks!

Michael

[lockdep from xfstests generic/0-ten-something follows:]

[  763.993429] XFS (sdb4): Mounting Filesystem
[  764.258701] XFS (sdb4): Ending clean mount
[  768.798390] XFS (sdb5): Mounting Filesystem
[  769.061280] XFS (sdb5): Ending clean mount
[  770.030277] XFS (sdb4): Mounting Filesystem
[  770.313502] XFS (sdb4): Ending clean mount
[  788.932588] XFS (sdb4): Mounting Filesystem
[  789.256815] XFS (sdb4): Ending clean mount
[  792.639933] XFS (sdb4): Mounting Filesystem
[  792.965477] XFS (sdb4): Ending clean mount
[  795.166220] XFS (sdb4): Mounting Filesystem
[  795.507372] XFS (sdb4): Ending clean mount
[  802.870263] XFS (sdb4): Mounting Filesystem
[  803.516422] XFS (sdb4): Ending clean mount
[  814.376620] XFS (sdb4): Mounting Filesystem
[  815.050778] XFS (sdb4): Ending clean mount
[  823.169368] 
[  823.170932] ======================================================
[  823.172146] [ INFO: possible circular locking dependency detected ]
[  823.172146] 3.11.0+ #5 Not tainted
[  823.172146] -------------------------------------------------------
[  823.172146] dirstress/5276 is trying to acquire lock:
[  823.172146]  (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
[  823.172146] 
[  823.172146] but task is already holding lock:
[  823.172146]  (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
[  823.172146] 
[  823.172146] which lock already depends on the new lock.
[  823.172146] 
[  823.172146] 
[  823.172146] the existing dependency chain (in reverse order) is:
[  823.172146] 
[  823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}:
[  823.172146]        [<c1070a11>] __lock_acquire+0x345/0xa11
[  823.172146]        [<c1071c45>] lock_acquire+0x88/0x17e
[  823.172146]        [<c14bff98>] _raw_spin_lock+0x47/0x74
[  823.172146]        [<c1116247>] __mark_inode_dirty+0x171/0x38c
[  823.172146]        [<c111acab>] __set_page_dirty+0x5f/0x95
[  823.172146]        [<c111b93e>] mark_buffer_dirty+0x58/0x12b
[  823.172146]        [<c111baff>] __block_commit_write.isra.17+0x64/0x82
[  823.172146]        [<c111c197>] block_write_end+0x2b/0x53
[  823.172146]        [<c111c201>] generic_write_end+0x42/0x9a
[  823.172146]        [<c11a42d5>] xfs_vm_write_end+0x60/0xbe
[  823.172146]        [<c10bd47a>] generic_file_buffered_write+0x140/0x20f
[  823.172146]        [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205
[  823.172146]        [<c11b24ee>] xfs_file_aio_write+0xad/0xec
[  823.172146]        [<c10f0c5f>] do_sync_write+0x60/0x87
[  823.172146]        [<c10f0e0f>] vfs_write+0x9c/0x189
[  823.172146]        [<c10f0fc6>] SyS_write+0x49/0x81
[  823.172146]        [<c14c14bb>] sysenter_do_call+0x12/0x32
[  823.172146] 
[  823.172146] -> #0 (sb_internal){.+.+.+}:
[  823.172146]        [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
[  823.172146]        [<c1070a11>] __lock_acquire+0x345/0xa11
[  823.172146]        [<c1071c45>] lock_acquire+0x88/0x17e
[  823.172146]        [<c10f36eb>] __sb_start_write+0xad/0x177
[  823.172146]        [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
[  823.172146]        [<c120a823>] xfs_inactive+0x129/0x4a3
[  823.172146]        [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
[  823.172146]        [<c1106678>] evict+0x8e/0x15d
[  823.172146]        [<c1107126>] iput+0xc4/0x138
[  823.172146]        [<c1103504>] dput+0x1b2/0x257
[  823.172146]        [<c10f1a30>] __fput+0x140/0x1eb
[  823.172146]        [<c10f1b0f>] ____fput+0xd/0xf
[  823.172146]        [<c1048477>] task_work_run+0x67/0x90
[  823.172146]        [<c1001ea5>] do_notify_resume+0x61/0x63
[  823.172146]        [<c14c0cfa>] work_notifysig+0x1f/0x25
[  823.172146] 
[  823.172146] other info that might help us debug this:
[  823.172146] 
[  823.172146]  Possible unsafe locking scenario:
[  823.172146] 
[  823.172146]        CPU0                    CPU1
[  823.172146]        ----                    ----
[  823.172146]   lock(&(&ip->i_lock)->mr_lock);
[  823.172146]                                lock(sb_internal);
[  823.172146]                                lock(&(&ip->i_lock)->mr_lock);
[  823.172146]   lock(sb_internal);
[  823.172146] 
[  823.172146]  *** DEADLOCK ***
[  823.172146] 
[  823.172146] 1 lock held by dirstress/5276:
[  823.172146]  #0:  (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
[  823.172146] 
[  823.172146] stack backtrace:
[  823.172146] CPU: 0 PID: 5276 Comm: dirstress Not tainted 3.11.0+ #5
[  823.172146] Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
[  823.172146]  c1c26060 c1c26060 da34fd58 c14ba216 da34fd78 c14b7317 c15f171b da34fdb4
[  823.172146]  dcaa1440 00000001 dcaa18b0 00000000 da34fde4 c106e972 dcaa1888 00000001
[  823.172146]  da34fdb4 c1057e0f 00000000 00003f61 c1c28660 00000000 dcaa1888 dcaa18b0
[  823.172146] Call Trace:
[  823.172146]  [<c14ba216>] dump_stack+0x16/0x18
[  823.172146]  [<c14b7317>] print_circular_bug+0x1b8/0x1c2
[  823.172146]  [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
[  823.172146]  [<c1057e0f>] ? sched_clock_local.constprop.3+0x39/0x131
[  823.172146]  [<c1057fd4>] ? sched_clock_cpu+0x8f/0xe2
[  823.172146]  [<c1070a11>] __lock_acquire+0x345/0xa11
[  823.172146]  [<c1070a36>] ? __lock_acquire+0x36a/0xa11
[  823.172146]  [<c1071c45>] lock_acquire+0x88/0x17e
[  823.172146]  [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
[  823.172146]  [<c10f36eb>] __sb_start_write+0xad/0x177
[  823.172146]  [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
[  823.172146]  [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
[  823.172146]  [<c1206cfb>] ? xfs_ilock+0x100/0x1f1
[  823.172146]  [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
[  823.172146]  [<c120a823>] xfs_inactive+0x129/0x4a3
[  823.172146]  [<c106f21f>] ? trace_hardirqs_on+0xb/0xd
[  823.172146]  [<c14c01e5>] ? _raw_spin_unlock_irq+0x27/0x36
[  823.172146]  [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
[  823.172146]  [<c1106678>] evict+0x8e/0x15d
[  823.172146]  [<c1107126>] iput+0xc4/0x138
[  823.172146]  [<c1103504>] dput+0x1b2/0x257
[  823.172146]  [<c10f1a30>] __fput+0x140/0x1eb
[  823.172146]  [<c10f1b0f>] ____fput+0xd/0xf
[  823.172146]  [<c1048477>] task_work_run+0x67/0x90
[  823.172146]  [<c1001ea5>] do_notify_resume+0x61/0x63
[  823.172146]  [<c14c0cfa>] work_notifysig+0x1f/0x25
[  824.015366] Clocksource tsc unstable (delta = 486645129 ns)
[  825.324019] XFS (sdb4): Mounting Filesystem
[  825.743317] XFS (sdb4): Ending clean mount
[  827.223193] XFS (sdb4): Mounting Filesystem
[  827.668493] XFS (sdb4): Ending clean mount
[  837.524673] XFS (sdb4): Mounting Filesystem
[  837.986097] XFS (sdb4): Ending clean mount



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
  2013-09-05 16:17     ` Brian Foster
@ 2013-09-06  0:07       ` Dave Chinner
  2013-09-06 11:25         ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-06  0:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Sep 05, 2013 at 12:17:04PM -0400, Brian Foster wrote:
> On 09/04/2013 08:54 PM, Dave Chinner wrote:
> > On Tue, Sep 03, 2013 at 02:25:00PM -0400, Brian Foster wrote:
> >> Define the AGI fields for the finobt root/level and add magic
> >> numbers. Update the btree code to add support for the new
> >> XFS_BTNUM_FINOBT inode btree.
.....
> >> @@ -180,7 +183,9 @@ typedef struct xfs_agi {
> >>  #define	XFS_AGI_NEWINO		0x00000100
> >>  #define	XFS_AGI_DIRINO		0x00000200
> >>  #define	XFS_AGI_UNLINKED	0x00000400
> >> -#define	XFS_AGI_NUM_BITS	11
> >> +#define	XFS_AGI_FREE_ROOT	0x00000800
> >> +#define	XFS_AGI_FREE_LEVEL	0x00001000
> >> +#define	XFS_AGI_NUM_BITS	13
> >>  #define	XFS_AGI_ALL_BITS	((1 << XFS_AGI_NUM_BITS) - 1)
> > 
> > This is a bit of a problem, because the range logging bits will now
> > cause the entire AGI to be logged (including all the unlinked list
> > hash) because these only define the first/last offsets to be
> > logged...
> > 
> 
> Ok, I see what you mean here...
> 
> >> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> >> index 524aa88..5ced506 100644
> >> --- a/fs/xfs/xfs_ialloc.c
> >> +++ b/fs/xfs/xfs_ialloc.c
> >> @@ -1505,6 +1505,8 @@ xfs_ialloc_log_agi(
> >>  		offsetof(xfs_agi_t, agi_newino),
> >>  		offsetof(xfs_agi_t, agi_dirino),
> >>  		offsetof(xfs_agi_t, agi_unlinked),
> >> +		offsetof(xfs_agi_t, agi_free_root),
> >> +		offsetof(xfs_agi_t, agi_free_level),
> >>  		sizeof(xfs_agi_t)
> >>  	};
> > 
> > Because of how this table works.
> > 
> > What we really need here is for xfs_ialloc_log_agi to consider that
> > there are two distinct regions for range logging - the first spaces
> > from offset 0 to offset of agi_unlinked, and the second is from the
> > the offset of agi_free_root to the end of the xfs_agi_t....
> > 
> > It's abit messy, I know, but we couldn't easily add new padding to
> > the AGI in the existing range logging area like was done for the AGF
> > because of the unlinked list hash table already defining the end of
> > the range logging region....
> > 
> 
> ... but where would that ever happen? The existing invocations of
> xfs_ialloc_log_agi() seem to log either the agi inode count values or
> the btree root/level values (i.e., never the range across both). I think
> I've introduced at least a couple new invocations throughout this set,
> but I've not changed that model (i.e., an XFS_AGI_FREECOUNT instance in
> the new lookup code and an XFS_AGI_FREE_ROOT|*_LEVEL instance in the new
> btree code).

Right, we don't current log across the range because of the way the
code is currently written, but there's no rule that says that
logging fields must be done this way.

I can see that there may be reason for logging
XFS_AGI_FREE_ROOT|*_LEVEL and XFS_AGI_NEW_INODE all in one go -
pointing new inode allocation at recently freed inodes is not
unreasonable, and if we split the finobt and update agi_newino in
the one update, we will log across this gap.

> My understanding of this code is that the range to log is defined at
> invocation time to xfs_iallog_log_agi(), so if the callers never specify
> a range that includes the unlinked bits in a single call, we won't set
> that range in the buffer log item code. In other words, even if we
> ultimately happen to log both ranges in the agi, the lower level code
> will never expand the logged region. Therefore, this shouldn't happen
> unless a single invocation that specifies one of the
> XFS_AGI_FREE_ROOT/LEVEL bits also specifies one of the existing agi bits.

Yes, we can avoid that by logging te regions separately, but that
then puts the onus on the future developers and reviewers to
remember this landmine and avoid it.

> I could see breaking this up as a matter of preparation for future
> fields or calls that would introduce logging that kind of range, but at
> the same time (and assuming my interpretation of above is correct),
> that's a bit of code that serves no purpose for the foreseeable future.
> Perhaps a comment in xfs_ialloc_log_agi() and the one caller that uses
> the AGI_FREE_* bits to document this restriction is sufficient?

Given it is only a few lines of extra code in xfs_ialloc_log_agi(),
I'd prefer that the code documents and deals with the different
regions internally. That way we can forget about it completely for the
future and never have to worry about it again.

Keep in mind that I'm looking at this from a code maintenance
timescale of at least 5-10 years into the future here - a few
minutes extra fixing this right now could save us a lot of hassle
years down the track. ;)

> >>  #ifdef DEBUG
> >> diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
> >> index 0cdb88b..7923292 100644
> >> --- a/fs/xfs/xfs_ialloc_btree.c
> >> +++ b/fs/xfs/xfs_ialloc_btree.c
> >> @@ -62,10 +62,18 @@ xfs_inobt_set_root(
> >>  {
> >>  	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
> >>  	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
> >> -
> >> -	agi->agi_root = nptr->s;
> >> -	be32_add_cpu(&agi->agi_level, inc);
> >> -	xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
> >> +	int			fields;
> >> +
> >> +	if (cur->bc_btnum == XFS_BTNUM_INO) {
> >> +		agi->agi_root = nptr->s;
> >> +		be32_add_cpu(&agi->agi_level, inc);
> >> +		fields = XFS_AGI_ROOT | XFS_AGI_LEVEL;
> >> +	} else {
> >> +		agi->agi_free_root = nptr->s;
> >> +		be32_add_cpu(&agi->agi_free_level, inc);
> >> +		fields = XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
> >> +	}
> >> +	xfs_ialloc_log_agi(cur->bc_tp, agbp, fields);
> >>  }
> > 
> > I suspect that it would be better to have separate functions for
> > these differences i.e. xfs_inobt_set_root() and
> > xfs_finobt_set_root(), and set up separate btree ops structure
> > forthe two different trees. Most of the code is still identical,
> > but the differences in root structures can easily be handled without
> > putting switches in the code everywhere.
> > 
> 
> Ok, I'm assuming the suggestion is to only create new functions for the
> implementations that differ.

Exactly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt
  2013-09-05 16:17     ` Brian Foster
@ 2013-09-06  0:11       ` Dave Chinner
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-06  0:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Sep 05, 2013 at 12:17:23PM -0400, Brian Foster wrote:
> On 09/04/2013 08:59 PM, Dave Chinner wrote:
> > On Tue, Sep 03, 2013 at 02:25:01PM -0400, Brian Foster wrote:
> >> Update inode allocation transaction reservations for the finobt. A
> >> create via record modification requires a log reservation for the
> >> additional finobt record. Any such allocation could result in an
> >> finobt removal if the inode chunk has become fully allocated, thus
> >> we include a reservation for a finobt btree merge as well.
> >> Allocation of a new inode chunk must account for splits in the
> >> finobt as well as the existing ialloc tree.
> > 
> > These transaction reservation changes are only necessary for
> > filesystems with free inode btrees, otherwise they just use more log
> > space than is necessary.
> > 
> 
> Yeah, good point.
> 
> > Can you add helper functions for the free inode btree reservations,
> > and have them return 0 when the feature is not enabled? That way the
> > code stays pretty clean, is self documenting and doesn't take
> > unnecessary space when the feature is not enabled....
> > 
> 
> So not new functions that duplicate the entire calculations for the
> finobt enabled cases, but new functions that define the additional
> requirements for the finobt on top of the existing reservation
> calculations for particular operations (i.e., similar to the recent
> patch to fix the inode log size, if I recall). Sounds good.

That's exactly what I'm thinking - just like the
xfs_calc_inode_res() helper. This really helps document the
transaction reservation calculations - the comments help, but they
are no substitute for being able to say "*that line* is what
reserves space for the inodes modified in the transaction"...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC in xfs_inactive()
  2013-09-05 16:18     ` Brian Foster
@ 2013-09-06  0:17       ` Dave Chinner
  2013-09-06 11:30         ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-06  0:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Sep 05, 2013 at 12:18:31PM -0400, Brian Foster wrote:
> On 09/04/2013 09:40 PM, Dave Chinner wrote:
> > On Tue, Sep 03, 2013 at 02:25:04PM -0400, Brian Foster wrote:
> >> An ifree data block reservation can fail with ENOSPC. Flush inodes
> >> to try and free up space or attempt without a data block
> >> reservation to avoid failing out of xfs_inactive().
> >>
> >> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >> ---
> >>  fs/xfs/xfs_inode.c | 11 +++++++++++
> >>  1 file changed, 11 insertions(+)
> >>
> >> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> >> index 56cbf63..92de4b7 100644
> >> --- a/fs/xfs/xfs_inode.c
> >> +++ b/fs/xfs/xfs_inode.c
> >> @@ -1779,7 +1779,18 @@ xfs_inactive(
> >>  	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
> >>  	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> >>  				  XFS_IFREE_SPACE_RES(mp), 0);
> >> +	if (error == ENOSPC) {
> >> +		/* flush outstanding delalloc blocks and retry */
> >> +		xfs_flush_inodes(mp);
> >> +		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> >> +					  XFS_IFREE_SPACE_RES(mp), 0);
> >> +	}
> > 
> > We don't want to be blocking for inode flushes here. We might be in
> > a shrinker context, for example, and blocking those for a filesystem
> > sync is going to be unfriendly.
> > 
> 
> Ok.
> 
> > If this really is a problem, then the right thing to do is to allow
> > this transaction to dip into the reserve block pool so the
> > transaction can complete and make progress - other write operations
> > will trigger the flushing of the filesystem, and freeing of whole
> > inode chunks should return more free space than we need for the
> > finobt modifications in the removing lots of zero length inodes
> > at ENOSPC case....
> > 
> 
> I did have one of the enospc xfstests lead to this situation, though I
> don't have the particular test in my notes. It initially manifested as
> an assert failure due to the fs not being shutdown after an
> xfs_trans_reserve() ENOSPC failure.

Ok. I can see how ENOSPC might occur here :)

> Subsequent to avoiding that, I
> believe there were inconsistent fs issues called out due to the unlinked
> lists being populated after umount.

That sounds like a recovery failure, not so much an ENOSPC failure.
i.e. that recovery only looks at the log to see if it's clean, and
only recovers unlinked lists if it's dirty. There is the
*possibility* of having a clean log with inodes on the unlinked
list, and log recovery doesn't run the unlinked list processing in
that case.

This is one of the issues we'll need to fix for O_TMPFILE support
as it will actively use inodes on unlinked list for potentially long
periods of time.

> Taking a further look, I missed the XFS_TRANS_RESERVE flag and whole
> m_resblks mechanism. I'll take a closer look at that and see if that
> works to resolve the problem instead of the flush.

It should - the only time it won't is if we exhaust the pool, but
that doesn't happen in normal ENOSPC situations and any blocks we do
end up freeing will immediately refill the reserve pool...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
  2013-09-05 16:19     ` Brian Foster
@ 2013-09-06  0:28       ` Dave Chinner
  2013-09-06 11:39         ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-06  0:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
> On 09/04/2013 10:54 PM, Dave Chinner wrote:
> > On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
> >> An inode free operation can have several effects on the finobt. If
> >> all inodes have been freed and the chunk deallocated, we remove the
> >> finobt record. If the inode chunk was previously full, we must
> >> insert a new record based on the existing inobt record. Otherwise,
> >> we modify the record in place.
.....
> >> +	} else if ((i == 0) && (ibtrec->ir_freecount == 1)) {
> >> +		/*
> >> +		 * No existing finobt record and the inobt record has a single
> >> +		 * free inode. This means we've freed an inode in a previously
> >> +		 * fully allocated chunk. Insert a new record into the finobt
> >> +		 * based on the current inobt record.
> >> +		 */
> >> +		cur->bc_rec.i.ir_startino = ibtrec->ir_startino;
> >> +		cur->bc_rec.i.ir_free = ibtrec->ir_free;
> >> +		cur->bc_rec.i.ir_freecount = ibtrec->ir_freecount;
> >> +		error = xfs_btree_insert(cur, &i);
> >> +		if (error)
> >> +			goto error;
> >> +		ASSERT(i == 1);
> > 
> > That's rather similar to the code in xfs_inobt_insert(). Indeed,
> > is you write a helper - xfs_inobt_insert_rec() - for this, then rather than modifying
> > xfs_inobt_lookup() to take extra parameters like I wondered for the
> > previous patch, leave it alonge and pass the parameters to
> > xfs_inobt_insert_rec() instead.
> > 
> > Then this code is functionally identical to xfs_inobt_insert() done
> > during allocation....
> > 
> 
> I think I'm parsing you after having another look at the code.
> xfs_inobt_lookup() remains as is and is potentially used from
> xfs_inobt_insert(). xfs_inobt_insert_rec() is introduced to set the
> cursor fields and do the insert and is used here and from
> xfs_inobt_insert().

Effectively. xfs_inobt_insert() becomes:

	for (each inode chunk) {
		xfs_inobt_lookup(cur, startino)
		xfs_inobt_insert_rec(cur, startino, free, free_count)
	}

And this code becomes:

	xfs_inobt_lookup(cur, startino);
	if (!found) {
		if (free_count == 1)
			xfs_inobt_insert_rec(cur, startino, free, free_count)
		else
			CORRUPTION
		goto out;
	}

> At that point, this looks close to xfs_inobt_insert(), but I think using
> that here would introduce a duplicate lookup.

Yes, it would. I think just using helpers like this is sufficient
for the two different cases, especially as xfs_inobt_insert() needs
to be able to handle multiple chunk insertion and we don't have that
here...

> Regardless, we'll see what
> the whole thing looks like at that point. Thanks for the reviews. :)

No worries. BTW, can you post your rudimentary userspace support so
we can run tests that use this code, too?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/11] xfs: introduce the free inode btree
  2013-09-05 21:17 ` [RFC PATCH 00/11] xfs: introduce the free inode btree Michael L. Semon
@ 2013-09-06 11:17   ` Brian Foster
  2013-09-06 21:35   ` Dave Chinner
  1 sibling, 0 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-06 11:17 UTC (permalink / raw)
  To: Michael L. Semon; +Cc: xfs

On 09/05/2013 05:17 PM, Michael L. Semon wrote:
> On 09/03/2013 02:24 PM, Brian Foster wrote:
>> Hi all,
>>
>> This is an RFC for the kernel work to support a free inode btree. The free inode
>> btree (finobt) is equivalent to the existing inode allocation btree with the
>> exception that the finobt only tracks inode chunks with at least one free inode.
>> The purpose is to improve lookups for free inode clusters for inode allocation.
>>
>> Newly allocated inode chunks by definition contain free inodes and are thus
>> inserted into the finobt immediately. The record for a previously full inode
>> chunk is inserted to the finobt when the first inode is freed. A record is
>> removed from the finobt when the last free inode has been allocated or the
>> entire chunk is completely deallocated.
>>
>> Patches 1-3 refactor some ialloc btree code to introduce the new finobt type and
>> feature bit. Patches 4-7 fix up the transaction handling for inode allocation
>> and deallocation to account for the new tree. Patches 8-10 add the finobt
>> management code to insert, remove and modify records as appropriate. Patch 11
>> fixes growfs to support the finobt.
>>
>> Thoughts, reviews, flames appreciated.
> 
> I'm looking for Dave's judgement call on whether I should run this code 
> full-time.  The patchset applied well on top of Dave's latest work--only 
> a "trailing whitespace" warning on Patch #9 (I think)--and the code 
> compiled without error.  There was a lockdep while running xfstests, 
> before generic/013 (I think), so I switched back to my normal git branch 
> and have your patchset in a separate branch.
> 

Hi Michael,

Thanks for giving this a burn. I actually haven't tested with lockdep
yet, so I've made a note to do so and see if I can reproduce that or any
other problems.

With regard to running "full-time," I'd suggest to hold off at least
until I post a non-RFC version (the next version will probably be a real
v1). The review feedback has shown that a few areas need some
non-trivial rework that change the behavior enough (i.e., the flush on
inactive thing, transaction reservation management, etc.) that any
issues at this point might be worth retesting for on the later version
anyways.

> My setup here is slow--x86, old IDE hardware, write cache off, debug 
> kernel--but the patchset made things seem a little slower.  At the 
> right time--not necessarily now--performance numbers might be nice.  
> I didn't time it but did a copy of 3 kernel gits to v5 1k-block-size 
> XFS and just felt something was off.  The copy did complete, though.  
> Will try timing this on another day.
> 

That's interesting. I had a chat with Dave with regard to the existing
lookup allocation algorithm and the best way to test the improved
lookup, and I am able to reproduce a nice inode allocation improvement
under certain conditions. I could understand seeing some general effect
in performance via the addition of the new tree (there's more work to do
now to manage it, after all), but my expectation is that for finobt=0
filesystems, there should be no effect whatsoever.

> Anyway, good work so far!  No additional stack traces were caused by 
> your code in limited testing, and the filesystems were are still 
> intact.
> 

Thanks again for testing. Most of my testing so far has been with finobt
enabled, so that's a good bit of news! :)

Brian

> Thanks!
> 
> Michael
> 
> [lockdep from xfstests generic/0-ten-something follows:]
> 
> [  763.993429] XFS (sdb4): Mounting Filesystem
> [  764.258701] XFS (sdb4): Ending clean mount
> [  768.798390] XFS (sdb5): Mounting Filesystem
> [  769.061280] XFS (sdb5): Ending clean mount
> [  770.030277] XFS (sdb4): Mounting Filesystem
> [  770.313502] XFS (sdb4): Ending clean mount
> [  788.932588] XFS (sdb4): Mounting Filesystem
> [  789.256815] XFS (sdb4): Ending clean mount
> [  792.639933] XFS (sdb4): Mounting Filesystem
> [  792.965477] XFS (sdb4): Ending clean mount
> [  795.166220] XFS (sdb4): Mounting Filesystem
> [  795.507372] XFS (sdb4): Ending clean mount
> [  802.870263] XFS (sdb4): Mounting Filesystem
> [  803.516422] XFS (sdb4): Ending clean mount
> [  814.376620] XFS (sdb4): Mounting Filesystem
> [  815.050778] XFS (sdb4): Ending clean mount
> [  823.169368] 
> [  823.170932] ======================================================
> [  823.172146] [ INFO: possible circular locking dependency detected ]
> [  823.172146] 3.11.0+ #5 Not tainted
> [  823.172146] -------------------------------------------------------
> [  823.172146] dirstress/5276 is trying to acquire lock:
> [  823.172146]  (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
> [  823.172146] 
> [  823.172146] but task is already holding lock:
> [  823.172146]  (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
> [  823.172146] 
> [  823.172146] which lock already depends on the new lock.
> [  823.172146] 
> [  823.172146] 
> [  823.172146] the existing dependency chain (in reverse order) is:
> [  823.172146] 
> [  823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}:
> [  823.172146]        [<c1070a11>] __lock_acquire+0x345/0xa11
> [  823.172146]        [<c1071c45>] lock_acquire+0x88/0x17e
> [  823.172146]        [<c14bff98>] _raw_spin_lock+0x47/0x74
> [  823.172146]        [<c1116247>] __mark_inode_dirty+0x171/0x38c
> [  823.172146]        [<c111acab>] __set_page_dirty+0x5f/0x95
> [  823.172146]        [<c111b93e>] mark_buffer_dirty+0x58/0x12b
> [  823.172146]        [<c111baff>] __block_commit_write.isra.17+0x64/0x82
> [  823.172146]        [<c111c197>] block_write_end+0x2b/0x53
> [  823.172146]        [<c111c201>] generic_write_end+0x42/0x9a
> [  823.172146]        [<c11a42d5>] xfs_vm_write_end+0x60/0xbe
> [  823.172146]        [<c10bd47a>] generic_file_buffered_write+0x140/0x20f
> [  823.172146]        [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205
> [  823.172146]        [<c11b24ee>] xfs_file_aio_write+0xad/0xec
> [  823.172146]        [<c10f0c5f>] do_sync_write+0x60/0x87
> [  823.172146]        [<c10f0e0f>] vfs_write+0x9c/0x189
> [  823.172146]        [<c10f0fc6>] SyS_write+0x49/0x81
> [  823.172146]        [<c14c14bb>] sysenter_do_call+0x12/0x32
> [  823.172146] 
> [  823.172146] -> #0 (sb_internal){.+.+.+}:
> [  823.172146]        [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
> [  823.172146]        [<c1070a11>] __lock_acquire+0x345/0xa11
> [  823.172146]        [<c1071c45>] lock_acquire+0x88/0x17e
> [  823.172146]        [<c10f36eb>] __sb_start_write+0xad/0x177
> [  823.172146]        [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
> [  823.172146]        [<c120a823>] xfs_inactive+0x129/0x4a3
> [  823.172146]        [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
> [  823.172146]        [<c1106678>] evict+0x8e/0x15d
> [  823.172146]        [<c1107126>] iput+0xc4/0x138
> [  823.172146]        [<c1103504>] dput+0x1b2/0x257
> [  823.172146]        [<c10f1a30>] __fput+0x140/0x1eb
> [  823.172146]        [<c10f1b0f>] ____fput+0xd/0xf
> [  823.172146]        [<c1048477>] task_work_run+0x67/0x90
> [  823.172146]        [<c1001ea5>] do_notify_resume+0x61/0x63
> [  823.172146]        [<c14c0cfa>] work_notifysig+0x1f/0x25
> [  823.172146] 
> [  823.172146] other info that might help us debug this:
> [  823.172146] 
> [  823.172146]  Possible unsafe locking scenario:
> [  823.172146] 
> [  823.172146]        CPU0                    CPU1
> [  823.172146]        ----                    ----
> [  823.172146]   lock(&(&ip->i_lock)->mr_lock);
> [  823.172146]                                lock(sb_internal);
> [  823.172146]                                lock(&(&ip->i_lock)->mr_lock);
> [  823.172146]   lock(sb_internal);
> [  823.172146] 
> [  823.172146]  *** DEADLOCK ***
> [  823.172146] 
> [  823.172146] 1 lock held by dirstress/5276:
> [  823.172146]  #0:  (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
> [  823.172146] 
> [  823.172146] stack backtrace:
> [  823.172146] CPU: 0 PID: 5276 Comm: dirstress Not tainted 3.11.0+ #5
> [  823.172146] Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
> [  823.172146]  c1c26060 c1c26060 da34fd58 c14ba216 da34fd78 c14b7317 c15f171b da34fdb4
> [  823.172146]  dcaa1440 00000001 dcaa18b0 00000000 da34fde4 c106e972 dcaa1888 00000001
> [  823.172146]  da34fdb4 c1057e0f 00000000 00003f61 c1c28660 00000000 dcaa1888 dcaa18b0
> [  823.172146] Call Trace:
> [  823.172146]  [<c14ba216>] dump_stack+0x16/0x18
> [  823.172146]  [<c14b7317>] print_circular_bug+0x1b8/0x1c2
> [  823.172146]  [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
> [  823.172146]  [<c1057e0f>] ? sched_clock_local.constprop.3+0x39/0x131
> [  823.172146]  [<c1057fd4>] ? sched_clock_cpu+0x8f/0xe2
> [  823.172146]  [<c1070a11>] __lock_acquire+0x345/0xa11
> [  823.172146]  [<c1070a36>] ? __lock_acquire+0x36a/0xa11
> [  823.172146]  [<c1071c45>] lock_acquire+0x88/0x17e
> [  823.172146]  [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
> [  823.172146]  [<c10f36eb>] __sb_start_write+0xad/0x177
> [  823.172146]  [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
> [  823.172146]  [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
> [  823.172146]  [<c1206cfb>] ? xfs_ilock+0x100/0x1f1
> [  823.172146]  [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
> [  823.172146]  [<c120a823>] xfs_inactive+0x129/0x4a3
> [  823.172146]  [<c106f21f>] ? trace_hardirqs_on+0xb/0xd
> [  823.172146]  [<c14c01e5>] ? _raw_spin_unlock_irq+0x27/0x36
> [  823.172146]  [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
> [  823.172146]  [<c1106678>] evict+0x8e/0x15d
> [  823.172146]  [<c1107126>] iput+0xc4/0x138
> [  823.172146]  [<c1103504>] dput+0x1b2/0x257
> [  823.172146]  [<c10f1a30>] __fput+0x140/0x1eb
> [  823.172146]  [<c10f1b0f>] ____fput+0xd/0xf
> [  823.172146]  [<c1048477>] task_work_run+0x67/0x90
> [  823.172146]  [<c1001ea5>] do_notify_resume+0x61/0x63
> [  823.172146]  [<c14c0cfa>] work_notifysig+0x1f/0x25
> [  824.015366] Clocksource tsc unstable (delta = 486645129 ns)
> [  825.324019] XFS (sdb4): Mounting Filesystem
> [  825.743317] XFS (sdb4): Ending clean mount
> [  827.223193] XFS (sdb4): Mounting Filesystem
> [  827.668493] XFS (sdb4): Ending clean mount
> [  837.524673] XFS (sdb4): Mounting Filesystem
> [  837.986097] XFS (sdb4): Ending clean mount
> 
> 
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
  2013-09-06  0:07       ` Dave Chinner
@ 2013-09-06 11:25         ` Brian Foster
  2013-09-06 21:22           ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-06 11:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 09/05/2013 08:07 PM, Dave Chinner wrote:
> On Thu, Sep 05, 2013 at 12:17:04PM -0400, Brian Foster wrote:
>> On 09/04/2013 08:54 PM, Dave Chinner wrote:
>>> On Tue, Sep 03, 2013 at 02:25:00PM -0400, Brian Foster wrote:
...
>>>
>>> What we really need here is for xfs_ialloc_log_agi to consider that
>>> there are two distinct regions for range logging - the first spaces
>>> from offset 0 to offset of agi_unlinked, and the second is from the
>>> the offset of agi_free_root to the end of the xfs_agi_t....
>>>
>>> It's abit messy, I know, but we couldn't easily add new padding to
>>> the AGI in the existing range logging area like was done for the AGF
>>> because of the unlinked list hash table already defining the end of
>>> the range logging region....
>>>
>>
>> ... but where would that ever happen? The existing invocations of
>> xfs_ialloc_log_agi() seem to log either the agi inode count values or
>> the btree root/level values (i.e., never the range across both). I think
>> I've introduced at least a couple new invocations throughout this set,
>> but I've not changed that model (i.e., an XFS_AGI_FREECOUNT instance in
>> the new lookup code and an XFS_AGI_FREE_ROOT|*_LEVEL instance in the new
>> btree code).
> 
> Right, we don't current log across the range because of the way the
> code is currently written, but there's no rule that says that
> logging fields must be done this way.
> 
> I can see that there may be reason for logging
> XFS_AGI_FREE_ROOT|*_LEVEL and XFS_AGI_NEW_INODE all in one go -
> pointing new inode allocation at recently freed inodes is not
> unreasonable, and if we split the finobt and update agi_newino in
> the one update, we will log across this gap.
> 

For the sake of argument, it seems a little strange to me to set an
inode level value in the agi in the context of a btree operation, such
as a split...

>> My understanding of this code is that the range to log is defined at
>> invocation time to xfs_iallog_log_agi(), so if the callers never specify
>> a range that includes the unlinked bits in a single call, we won't set
>> that range in the buffer log item code. In other words, even if we
>> ultimately happen to log both ranges in the agi, the lower level code
>> will never expand the logged region. Therefore, this shouldn't happen
>> unless a single invocation that specifies one of the
>> XFS_AGI_FREE_ROOT/LEVEL bits also specifies one of the existing agi bits.
> 
> Yes, we can avoid that by logging te regions separately, but that
> then puts the onus on the future developers and reviewers to
> remember this landmine and avoid it.
> 

... but I agree with the general premise. It's certainly a landmine.

>> I could see breaking this up as a matter of preparation for future
>> fields or calls that would introduce logging that kind of range, but at
>> the same time (and assuming my interpretation of above is correct),
>> that's a bit of code that serves no purpose for the foreseeable future.
>> Perhaps a comment in xfs_ialloc_log_agi() and the one caller that uses
>> the AGI_FREE_* bits to document this restriction is sufficient?
> 
> Given it is only a few lines of extra code in xfs_ialloc_log_agi(),
> I'd prefer that the code documents and deals with the different
> regions internally. That way we can forget about it completely for the
> future and never have to worry about it again.
> 
> Keep in mind that I'm looking at this from a code maintenance
> timescale of at least 5-10 years into the future here - a few
> minutes extra fixing this right now could save us a lot of hassle
> years down the track. ;)
> 

Fair enough. If it's at least a reasonably likely scenario, then I'm on
board with the better safe than sorry approach. ;)

Brian

>>>>  #ifdef DEBUG
>>>> diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
>>>> index 0cdb88b..7923292 100644
>>>> --- a/fs/xfs/xfs_ialloc_btree.c
>>>> +++ b/fs/xfs/xfs_ialloc_btree.c
>>>> @@ -62,10 +62,18 @@ xfs_inobt_set_root(
>>>>  {
>>>>  	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
>>>>  	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
>>>> -
>>>> -	agi->agi_root = nptr->s;
>>>> -	be32_add_cpu(&agi->agi_level, inc);
>>>> -	xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
>>>> +	int			fields;
>>>> +
>>>> +	if (cur->bc_btnum == XFS_BTNUM_INO) {
>>>> +		agi->agi_root = nptr->s;
>>>> +		be32_add_cpu(&agi->agi_level, inc);
>>>> +		fields = XFS_AGI_ROOT | XFS_AGI_LEVEL;
>>>> +	} else {
>>>> +		agi->agi_free_root = nptr->s;
>>>> +		be32_add_cpu(&agi->agi_free_level, inc);
>>>> +		fields = XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
>>>> +	}
>>>> +	xfs_ialloc_log_agi(cur->bc_tp, agbp, fields);
>>>>  }
>>>
>>> I suspect that it would be better to have separate functions for
>>> these differences i.e. xfs_inobt_set_root() and
>>> xfs_finobt_set_root(), and set up separate btree ops structure
>>> forthe two different trees. Most of the code is still identical,
>>> but the differences in root structures can easily be handled without
>>> putting switches in the code everywhere.
>>>
>>
>> Ok, I'm assuming the suggestion is to only create new functions for the
>> implementations that differ.
> 
> Exactly.
> 
> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC in xfs_inactive()
  2013-09-06  0:17       ` Dave Chinner
@ 2013-09-06 11:30         ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-06 11:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 09/05/2013 08:17 PM, Dave Chinner wrote:
> On Thu, Sep 05, 2013 at 12:18:31PM -0400, Brian Foster wrote:
>> On 09/04/2013 09:40 PM, Dave Chinner wrote:
>>> On Tue, Sep 03, 2013 at 02:25:04PM -0400, Brian Foster wrote:
>>>> An ifree data block reservation can fail with ENOSPC. Flush inodes
>>>> to try and free up space or attempt without a data block
>>>> reservation to avoid failing out of xfs_inactive().
>>>>
>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>> ---
>>>>  fs/xfs/xfs_inode.c | 11 +++++++++++
>>>>  1 file changed, 11 insertions(+)
>>>>
...
> 
>> Subsequent to avoiding that, I
>> believe there were inconsistent fs issues called out due to the unlinked
>> lists being populated after umount.
> 
> That sounds like a recovery failure, not so much an ENOSPC failure.
> i.e. that recovery only looks at the log to see if it's clean, and
> only recovers unlinked lists if it's dirty. There is the
> *possibility* of having a clean log with inodes on the unlinked
> list, and log recovery doesn't run the unlinked list processing in
> that case.
> 

Interesting, I'll have a closer look when I rework the inactive
transaction reservation bits. Thanks.

Brian

> This is one of the issues we'll need to fix for O_TMPFILE support
> as it will actively use inodes on unlinked list for potentially long
> periods of time.
> 
>> Taking a further look, I missed the XFS_TRANS_RESERVE flag and whole
>> m_resblks mechanism. I'll take a closer look at that and see if that
>> works to resolve the problem instead of the flush.
> 
> It should - the only time it won't is if we exhaust the pool, but
> that doesn't happen in normal ENOSPC situations and any blocks we do
> end up freeing will immediately refill the reserve pool...
> 
> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
  2013-09-06  0:28       ` Dave Chinner
@ 2013-09-06 11:39         ` Brian Foster
  2013-09-06 21:24           ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-06 11:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 09/05/2013 08:28 PM, Dave Chinner wrote:
> On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
>> On 09/04/2013 10:54 PM, Dave Chinner wrote:
>>> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
...
>>
>> I think I'm parsing you after having another look at the code.
>> xfs_inobt_lookup() remains as is and is potentially used from
>> xfs_inobt_insert(). xfs_inobt_insert_rec() is introduced to set the
>> cursor fields and do the insert and is used here and from
>> xfs_inobt_insert().
> 
> Effectively. xfs_inobt_insert() becomes:
> 
> 	for (each inode chunk) {
> 		xfs_inobt_lookup(cur, startino)
> 		xfs_inobt_insert_rec(cur, startino, free, free_count)
> 	}
> 
> And this code becomes:
> 
> 	xfs_inobt_lookup(cur, startino);
> 	if (!found) {
> 		if (free_count == 1)
> 			xfs_inobt_insert_rec(cur, startino, free, free_count)
> 		else
> 			CORRUPTION
> 		goto out;
> 	}
> 
>> At that point, this looks close to xfs_inobt_insert(), but I think using
>> that here would introduce a duplicate lookup.
> 
> Yes, it would. I think just using helpers like this is sufficient
> for the two different cases, especially as xfs_inobt_insert() needs
> to be able to handle multiple chunk insertion and we don't have that
> here...
> 

Ok, that was my thinking as well.

>> Regardless, we'll see what
>> the whole thing looks like at that point. Thanks for the reviews. :)
> 
> No worries. BTW, can you post your rudimentary userspace support so
> we can run tests that use this code, too?
> 

Sure. My xfsprogs branch currently is the application of a slightly
older version of this set (pre-cleanups I made to make this post-worthy)
with some hacks to make it apply/compile and a few other patches on top
of that for mkfs, xfs_db and xfs_repair to work through some basic
things I ran into when running xfstests.

Would you prefer I drop the whole thing on the list?

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
  2013-09-06 11:25         ` Brian Foster
@ 2013-09-06 21:22           ` Dave Chinner
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-06 21:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Sep 06, 2013 at 07:25:58AM -0400, Brian Foster wrote:
> On 09/05/2013 08:07 PM, Dave Chinner wrote:
> > On Thu, Sep 05, 2013 at 12:17:04PM -0400, Brian Foster wrote:
> >> On 09/04/2013 08:54 PM, Dave Chinner wrote:
> >>> On Tue, Sep 03, 2013 at 02:25:00PM -0400, Brian Foster wrote:
> ...
> >>>
> >>> What we really need here is for xfs_ialloc_log_agi to consider that
> >>> there are two distinct regions for range logging - the first spaces
> >>> from offset 0 to offset of agi_unlinked, and the second is from the
> >>> the offset of agi_free_root to the end of the xfs_agi_t....
> >>>
> >>> It's abit messy, I know, but we couldn't easily add new padding to
> >>> the AGI in the existing range logging area like was done for the AGF
> >>> because of the unlinked list hash table already defining the end of
> >>> the range logging region....
> >>>
> >>
> >> ... but where would that ever happen? The existing invocations of
> >> xfs_ialloc_log_agi() seem to log either the agi inode count values or
> >> the btree root/level values (i.e., never the range across both). I think
> >> I've introduced at least a couple new invocations throughout this set,
> >> but I've not changed that model (i.e., an XFS_AGI_FREECOUNT instance in
> >> the new lookup code and an XFS_AGI_FREE_ROOT|*_LEVEL instance in the new
> >> btree code).
> > 
> > Right, we don't current log across the range because of the way the
> > code is currently written, but there's no rule that says that
> > logging fields must be done this way.
> > 
> > I can see that there may be reason for logging
> > XFS_AGI_FREE_ROOT|*_LEVEL and XFS_AGI_NEW_INODE all in one go -
> > pointing new inode allocation at recently freed inodes is not
> > unreasonable, and if we split the finobt and update agi_newino in
> > the one update, we will log across this gap.
> > 
> 
> For the sake of argument, it seems a little strange to me to set an
> inode level value in the agi in the context of a btree operation, such
> as a split...

Like we do with the AGF to record changes to the longest
extent in the btree? It's not a stretch to think we might update
the "allocation from here" target in the AGI when we make a specific
type of btree record change.... ;)

True, that is still an isolated logging event, but my point is that
specific btree operations may drive other logging events in the
header than just root/level...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
  2013-09-06 11:39         ` Brian Foster
@ 2013-09-06 21:24           ` Dave Chinner
  2013-09-07 12:30             ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-06 21:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Sep 06, 2013 at 07:39:18AM -0400, Brian Foster wrote:
> On 09/05/2013 08:28 PM, Dave Chinner wrote:
> > On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
> >> On 09/04/2013 10:54 PM, Dave Chinner wrote:
> >>> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
> ...
> >>
> >> I think I'm parsing you after having another look at the code.
> >> xfs_inobt_lookup() remains as is and is potentially used from
> >> xfs_inobt_insert(). xfs_inobt_insert_rec() is introduced to set the
> >> cursor fields and do the insert and is used here and from
> >> xfs_inobt_insert().
> > 
> > Effectively. xfs_inobt_insert() becomes:
> > 
> > 	for (each inode chunk) {
> > 		xfs_inobt_lookup(cur, startino)
> > 		xfs_inobt_insert_rec(cur, startino, free, free_count)
> > 	}
> > 
> > And this code becomes:
> > 
> > 	xfs_inobt_lookup(cur, startino);
> > 	if (!found) {
> > 		if (free_count == 1)
> > 			xfs_inobt_insert_rec(cur, startino, free, free_count)
> > 		else
> > 			CORRUPTION
> > 		goto out;
> > 	}
> > 
> >> At that point, this looks close to xfs_inobt_insert(), but I think using
> >> that here would introduce a duplicate lookup.
> > 
> > Yes, it would. I think just using helpers like this is sufficient
> > for the two different cases, especially as xfs_inobt_insert() needs
> > to be able to handle multiple chunk insertion and we don't have that
> > here...
> > 
> 
> Ok, that was my thinking as well.
> 
> >> Regardless, we'll see what
> >> the whole thing looks like at that point. Thanks for the reviews. :)
> > 
> > No worries. BTW, can you post your rudimentary userspace support so
> > we can run tests that use this code, too?
> > 
> 
> Sure. My xfsprogs branch currently is the application of a slightly
> older version of this set (pre-cleanups I made to make this post-worthy)
> with some hacks to make it apply/compile and a few other patches on top
> of that for mkfs, xfs_db and xfs_repair to work through some basic
> things I ran into when running xfstests.
> 
> Would you prefer I drop the whole thing on the list?

Drop it on the list, maybe just a as tarball rather than a patchset
if it's not ready for review yet.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/11] xfs: introduce the free inode btree
  2013-09-05 21:17 ` [RFC PATCH 00/11] xfs: introduce the free inode btree Michael L. Semon
  2013-09-06 11:17   ` Brian Foster
@ 2013-09-06 21:35   ` Dave Chinner
  2013-09-07 12:31     ` Brian Foster
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-06 21:35 UTC (permalink / raw)
  To: Michael L. Semon; +Cc: Brian Foster, xfs

On Thu, Sep 05, 2013 at 05:17:10PM -0400, Michael L. Semon wrote:
....
> [  814.376620] XFS (sdb4): Mounting Filesystem
> [  815.050778] XFS (sdb4): Ending clean mount
> [  823.169368] 
> [  823.170932] ======================================================
> [  823.172146] [ INFO: possible circular locking dependency detected ]
> [  823.172146] 3.11.0+ #5 Not tainted
> [  823.172146] -------------------------------------------------------
> [  823.172146] dirstress/5276 is trying to acquire lock:
> [  823.172146]  (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
> [  823.172146] 
> [  823.172146] but task is already holding lock:
> [  823.172146]  (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
> [  823.172146] 
> [  823.172146] which lock already depends on the new lock.
> [  823.172146] 
> [  823.172146] 
> [  823.172146] the existing dependency chain (in reverse order) is:
> [  823.172146] 
> [  823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}:
> [  823.172146]        [<c1070a11>] __lock_acquire+0x345/0xa11
> [  823.172146]        [<c1071c45>] lock_acquire+0x88/0x17e
> [  823.172146]        [<c14bff98>] _raw_spin_lock+0x47/0x74
> [  823.172146]        [<c1116247>] __mark_inode_dirty+0x171/0x38c
> [  823.172146]        [<c111acab>] __set_page_dirty+0x5f/0x95
> [  823.172146]        [<c111b93e>] mark_buffer_dirty+0x58/0x12b
> [  823.172146]        [<c111baff>] __block_commit_write.isra.17+0x64/0x82
> [  823.172146]        [<c111c197>] block_write_end+0x2b/0x53
> [  823.172146]        [<c111c201>] generic_write_end+0x42/0x9a
> [  823.172146]        [<c11a42d5>] xfs_vm_write_end+0x60/0xbe
> [  823.172146]        [<c10bd47a>] generic_file_buffered_write+0x140/0x20f
> [  823.172146]        [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205
> [  823.172146]        [<c11b24ee>] xfs_file_aio_write+0xad/0xec
> [  823.172146]        [<c10f0c5f>] do_sync_write+0x60/0x87
> [  823.172146]        [<c10f0e0f>] vfs_write+0x9c/0x189
> [  823.172146]        [<c10f0fc6>] SyS_write+0x49/0x81
> [  823.172146]        [<c14c14bb>] sysenter_do_call+0x12/0x32
> [  823.172146] 
> [  823.172146] -> #0 (sb_internal){.+.+.+}:
> [  823.172146]        [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
> [  823.172146]        [<c1070a11>] __lock_acquire+0x345/0xa11
> [  823.172146]        [<c1071c45>] lock_acquire+0x88/0x17e
> [  823.172146]        [<c10f36eb>] __sb_start_write+0xad/0x177
> [  823.172146]        [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
> [  823.172146]        [<c120a823>] xfs_inactive+0x129/0x4a3
> [  823.172146]        [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
> [  823.172146]        [<c1106678>] evict+0x8e/0x15d
> [  823.172146]        [<c1107126>] iput+0xc4/0x138
> [  823.172146]        [<c1103504>] dput+0x1b2/0x257
> [  823.172146]        [<c10f1a30>] __fput+0x140/0x1eb
> [  823.172146]        [<c10f1b0f>] ____fput+0xd/0xf
> [  823.172146]        [<c1048477>] task_work_run+0x67/0x90
> [  823.172146]        [<c1001ea5>] do_notify_resume+0x61/0x63
> [  823.172146]        [<c14c0cfa>] work_notifysig+0x1f/0x25
> [  823.172146] 
> [  823.172146] other info that might help us debug this:
> [  823.172146] 
> [  823.172146]  Possible unsafe locking scenario:
> [  823.172146] 
> [  823.172146]        CPU0                    CPU1
> [  823.172146]        ----                    ----
> [  823.172146]   lock(&(&ip->i_lock)->mr_lock);
> [  823.172146]                                lock(sb_internal);
> [  823.172146]                                lock(&(&ip->i_lock)->mr_lock);
> [  823.172146]   lock(sb_internal);

Ah, now there's something I missed in all the xfs_inactive
transaction rework - you can't call
xfs_trans_alloc()/xfs-trans_reserve with the XFS_ILOCK_??? held.
It's not the freeze locks you really have to worry about deadlocking
if you do, it's deadlocking against log space that is much more
likely.

i.e. if you hold the ILOCK, the AIL can't get it to flush the inode
to disk. That means if the inode you hold locked is pinning the tail
of the log and there is no logspace for the transaction you are
about to run, xfs_trans_reserve() will block forever waiting for the
inode to be flushed and the tail of the log to move forward. This
will end up blocking all further reservations and hence deadlock the
filesystem...

Brian, if you rewrite xfs_inactive in the way that I suggested, this
problem goes away ;)

Thanks for reporting this, Michael.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
  2013-09-06 21:24           ` Dave Chinner
@ 2013-09-07 12:30             ` Brian Foster
  2013-09-08 20:08               ` Michael L. Semon
  2013-09-09  2:34               ` Better numbers " Michael L. Semon
  0 siblings, 2 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-07 12:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

[-- Attachment #1: Type: text/plain, Size: 1321 bytes --]

On 09/06/2013 05:24 PM, Dave Chinner wrote:
> On Fri, Sep 06, 2013 at 07:39:18AM -0400, Brian Foster wrote:
>> On 09/05/2013 08:28 PM, Dave Chinner wrote:
>>> On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
>>>> On 09/04/2013 10:54 PM, Dave Chinner wrote:
>>>>> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
>> ...
>>>>
...
>>> No worries. BTW, can you post your rudimentary userspace support so
>>> we can run tests that use this code, too?
>>>
>>
>> Sure. My xfsprogs branch currently is the application of a slightly
>> older version of this set (pre-cleanups I made to make this post-worthy)
>> with some hacks to make it apply/compile and a few other patches on top
>> of that for mkfs, xfs_db and xfs_repair to work through some basic
>> things I ran into when running xfstests.
>>
>> Would you prefer I drop the whole thing on the list?
> 
> Drop it on the list, maybe just a as tarball rather than a patchset
> if it's not ready for review yet.
> 

Ok, attached. This applies on top of the following commit in the
xfsprogs tree:

982e5c7e xfs_db: add header to freesp -d output

Use the following command to format a finobt enabled fs:

mkfs.xfs -m crc=1,finobt=1 <dev>

... and otherwise there are a couple random fixes for xfs_db and xfs_repair.

Brian


> Cheers,
> 
> Dave.
> 



[-- Attachment #2: xfsprogs_finobt_rfc.tar.bz2 --]
[-- Type: application/x-bzip, Size: 8958 bytes --]

[-- Attachment #3: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/11] xfs: introduce the free inode btree
  2013-09-06 21:35   ` Dave Chinner
@ 2013-09-07 12:31     ` Brian Foster
  2013-09-08  1:04       ` Michael L. Semon
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-07 12:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Michael L. Semon, xfs

On 09/06/2013 05:35 PM, Dave Chinner wrote:
> On Thu, Sep 05, 2013 at 05:17:10PM -0400, Michael L. Semon wrote:
> ....
>> [  814.376620] XFS (sdb4): Mounting Filesystem
>> [  815.050778] XFS (sdb4): Ending clean mount
>> [  823.169368] 
>> [  823.170932] ======================================================
>> [  823.172146] [ INFO: possible circular locking dependency detected ]
>> [  823.172146] 3.11.0+ #5 Not tainted
>> [  823.172146] -------------------------------------------------------
>> [  823.172146] dirstress/5276 is trying to acquire lock:
>> [  823.172146]  (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
>> [  823.172146] 
>> [  823.172146] but task is already holding lock:
>> [  823.172146]  (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
>> [  823.172146] 
>> [  823.172146] which lock already depends on the new lock.
>> [  823.172146] 
>> [  823.172146] 
>> [  823.172146] the existing dependency chain (in reverse order) is:
>> [  823.172146] 
>> [  823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}:
>> [  823.172146]        [<c1070a11>] __lock_acquire+0x345/0xa11
>> [  823.172146]        [<c1071c45>] lock_acquire+0x88/0x17e
>> [  823.172146]        [<c14bff98>] _raw_spin_lock+0x47/0x74
>> [  823.172146]        [<c1116247>] __mark_inode_dirty+0x171/0x38c
>> [  823.172146]        [<c111acab>] __set_page_dirty+0x5f/0x95
>> [  823.172146]        [<c111b93e>] mark_buffer_dirty+0x58/0x12b
>> [  823.172146]        [<c111baff>] __block_commit_write.isra.17+0x64/0x82
>> [  823.172146]        [<c111c197>] block_write_end+0x2b/0x53
>> [  823.172146]        [<c111c201>] generic_write_end+0x42/0x9a
>> [  823.172146]        [<c11a42d5>] xfs_vm_write_end+0x60/0xbe
>> [  823.172146]        [<c10bd47a>] generic_file_buffered_write+0x140/0x20f
>> [  823.172146]        [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205
>> [  823.172146]        [<c11b24ee>] xfs_file_aio_write+0xad/0xec
>> [  823.172146]        [<c10f0c5f>] do_sync_write+0x60/0x87
>> [  823.172146]        [<c10f0e0f>] vfs_write+0x9c/0x189
>> [  823.172146]        [<c10f0fc6>] SyS_write+0x49/0x81
>> [  823.172146]        [<c14c14bb>] sysenter_do_call+0x12/0x32
>> [  823.172146] 
>> [  823.172146] -> #0 (sb_internal){.+.+.+}:
>> [  823.172146]        [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
>> [  823.172146]        [<c1070a11>] __lock_acquire+0x345/0xa11
>> [  823.172146]        [<c1071c45>] lock_acquire+0x88/0x17e
>> [  823.172146]        [<c10f36eb>] __sb_start_write+0xad/0x177
>> [  823.172146]        [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
>> [  823.172146]        [<c120a823>] xfs_inactive+0x129/0x4a3
>> [  823.172146]        [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
>> [  823.172146]        [<c1106678>] evict+0x8e/0x15d
>> [  823.172146]        [<c1107126>] iput+0xc4/0x138
>> [  823.172146]        [<c1103504>] dput+0x1b2/0x257
>> [  823.172146]        [<c10f1a30>] __fput+0x140/0x1eb
>> [  823.172146]        [<c10f1b0f>] ____fput+0xd/0xf
>> [  823.172146]        [<c1048477>] task_work_run+0x67/0x90
>> [  823.172146]        [<c1001ea5>] do_notify_resume+0x61/0x63
>> [  823.172146]        [<c14c0cfa>] work_notifysig+0x1f/0x25
>> [  823.172146] 
>> [  823.172146] other info that might help us debug this:
>> [  823.172146] 
>> [  823.172146]  Possible unsafe locking scenario:
>> [  823.172146] 
>> [  823.172146]        CPU0                    CPU1
>> [  823.172146]        ----                    ----
>> [  823.172146]   lock(&(&ip->i_lock)->mr_lock);
>> [  823.172146]                                lock(sb_internal);
>> [  823.172146]                                lock(&(&ip->i_lock)->mr_lock);
>> [  823.172146]   lock(sb_internal);
> 
> Ah, now there's something I missed in all the xfs_inactive
> transaction rework - you can't call
> xfs_trans_alloc()/xfs-trans_reserve with the XFS_ILOCK_??? held.
> It's not the freeze locks you really have to worry about deadlocking
> if you do, it's deadlocking against log space that is much more
> likely.
> 
> i.e. if you hold the ILOCK, the AIL can't get it to flush the inode
> to disk. That means if the inode you hold locked is pinning the tail
> of the log and there is no logspace for the transaction you are
> about to run, xfs_trans_reserve() will block forever waiting for the
> inode to be flushed and the tail of the log to move forward. This
> will end up blocking all further reservations and hence deadlock the
> filesystem...
> 
> Brian, if you rewrite xfs_inactive in the way that I suggested, this
> problem goes away ;)
> 
> Thanks for reporting this, Michael.
> 

Oh, very interesting scenario. Thanks again for catching this, Michael,
and for the analysis, Dave. I'll get this cleaned up in the next
revision as well.

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/11] xfs: introduce the free inode btree
  2013-09-07 12:31     ` Brian Foster
@ 2013-09-08  1:04       ` Michael L. Semon
  0 siblings, 0 replies; 46+ messages in thread
From: Michael L. Semon @ 2013-09-08  1:04 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On 09/07/2013 08:31 AM, Brian Foster wrote:
> On 09/06/2013 05:35 PM, Dave Chinner wrote:
>> On Thu, Sep 05, 2013 at 05:17:10PM -0400, Michael L. Semon wrote:
>> ....
>>> [  814.376620] XFS (sdb4): Mounting Filesystem
>>> [  815.050778] XFS (sdb4): Ending clean mount
>>> [  823.169368] 
>>> [  823.170932] ======================================================
>>> [  823.172146] [ INFO: possible circular locking dependency detected ]
>>> [  823.172146] 3.11.0+ #5 Not tainted
>>> [  823.172146] -------------------------------------------------------
>>> [  823.172146] dirstress/5276 is trying to acquire lock:
>>> [  823.172146]  (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
>>> [  823.172146] 
>>> [  823.172146] but task is already holding lock:
>>> [  823.172146]  (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
>>> [  823.172146] 
>>> [  823.172146] which lock already depends on the new lock.
>>> [  823.172146] 
>>> [  823.172146] 
>>> [  823.172146] the existing dependency chain (in reverse order) is:
>>> [  823.172146] 
>>> [  823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}:
>>> [  823.172146]        [<c1070a11>] __lock_acquire+0x345/0xa11
>>> [  823.172146]        [<c1071c45>] lock_acquire+0x88/0x17e
>>> [  823.172146]        [<c14bff98>] _raw_spin_lock+0x47/0x74
>>> [  823.172146]        [<c1116247>] __mark_inode_dirty+0x171/0x38c
>>> [  823.172146]        [<c111acab>] __set_page_dirty+0x5f/0x95
>>> [  823.172146]        [<c111b93e>] mark_buffer_dirty+0x58/0x12b
>>> [  823.172146]        [<c111baff>] __block_commit_write.isra.17+0x64/0x82
>>> [  823.172146]        [<c111c197>] block_write_end+0x2b/0x53
>>> [  823.172146]        [<c111c201>] generic_write_end+0x42/0x9a
>>> [  823.172146]        [<c11a42d5>] xfs_vm_write_end+0x60/0xbe
>>> [  823.172146]        [<c10bd47a>] generic_file_buffered_write+0x140/0x20f
>>> [  823.172146]        [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205
>>> [  823.172146]        [<c11b24ee>] xfs_file_aio_write+0xad/0xec
>>> [  823.172146]        [<c10f0c5f>] do_sync_write+0x60/0x87
>>> [  823.172146]        [<c10f0e0f>] vfs_write+0x9c/0x189
>>> [  823.172146]        [<c10f0fc6>] SyS_write+0x49/0x81
>>> [  823.172146]        [<c14c14bb>] sysenter_do_call+0x12/0x32
>>> [  823.172146] 
>>> [  823.172146] -> #0 (sb_internal){.+.+.+}:
>>> [  823.172146]        [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
>>> [  823.172146]        [<c1070a11>] __lock_acquire+0x345/0xa11
>>> [  823.172146]        [<c1071c45>] lock_acquire+0x88/0x17e
>>> [  823.172146]        [<c10f36eb>] __sb_start_write+0xad/0x177
>>> [  823.172146]        [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
>>> [  823.172146]        [<c120a823>] xfs_inactive+0x129/0x4a3
>>> [  823.172146]        [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
>>> [  823.172146]        [<c1106678>] evict+0x8e/0x15d
>>> [  823.172146]        [<c1107126>] iput+0xc4/0x138
>>> [  823.172146]        [<c1103504>] dput+0x1b2/0x257
>>> [  823.172146]        [<c10f1a30>] __fput+0x140/0x1eb
>>> [  823.172146]        [<c10f1b0f>] ____fput+0xd/0xf
>>> [  823.172146]        [<c1048477>] task_work_run+0x67/0x90
>>> [  823.172146]        [<c1001ea5>] do_notify_resume+0x61/0x63
>>> [  823.172146]        [<c14c0cfa>] work_notifysig+0x1f/0x25
>>> [  823.172146] 
>>> [  823.172146] other info that might help us debug this:
>>> [  823.172146] 
>>> [  823.172146]  Possible unsafe locking scenario:
>>> [  823.172146] 
>>> [  823.172146]        CPU0                    CPU1
>>> [  823.172146]        ----                    ----
>>> [  823.172146]   lock(&(&ip->i_lock)->mr_lock);
>>> [  823.172146]                                lock(sb_internal);
>>> [  823.172146]                                lock(&(&ip->i_lock)->mr_lock);
>>> [  823.172146]   lock(sb_internal);
>>
>> Ah, now there's something I missed in all the xfs_inactive
>> transaction rework - you can't call
>> xfs_trans_alloc()/xfs-trans_reserve with the XFS_ILOCK_??? held.
>> It's not the freeze locks you really have to worry about deadlocking
>> if you do, it's deadlocking against log space that is much more
>> likely.
>>
>> i.e. if you hold the ILOCK, the AIL can't get it to flush the inode
>> to disk. That means if the inode you hold locked is pinning the tail
>> of the log and there is no logspace for the transaction you are
>> about to run, xfs_trans_reserve() will block forever waiting for the
>> inode to be flushed and the tail of the log to move forward. This
>> will end up blocking all further reservations and hence deadlock the
>> filesystem...
>>
>> Brian, if you rewrite xfs_inactive in the way that I suggested, this
>> problem goes away ;)
>>
>> Thanks for reporting this, Michael.
>>
> 
> Oh, very interesting scenario. Thanks again for catching this, Michael,
> and for the analysis, Dave. I'll get this cleaned up in the next
> revision as well.
> 
> Brian
> 
>> Cheers,
>>
>> Dave.
>>
> 

No problem, Brian.  I'll try out your userspace as well.  I had worked a 
bit on getting some sane numbers that are better than "results for copying 
3 kernel gits to a 1k-blocksize FS with write cache turned off."  Here's 
my attempt, as a more formal report.  Thanks!

Michael

[REPORT FOLLOWS]

Lockdep threw off the debug numbers for your patchset--a new lockdep 
is at the very end--so these tests were done with a fairly non-debug
setup.  The write cache is on for these tests as well.

Casual "user command" benchmark using built 3.11.0+ kernel git 
tarball.  The idea behind it:

1) Unpack a tarball, and
2) do something with its contents.

The total files are among these:

$TEST_DIR/a/linux/ ...
$TEST_DIR/b/linux/ ...
$TEST_DIR/kernel-git-built-2013-08-05.tar.gz

The file systems are as follows:

v4: mkfs.xfs -l logdev=$TEST_LOGDEV $TEST_DEV
v4dirent: mkfs.xfs -n ftype=1 -l logdev=$TEST_LOGDEV $TEST_DEV
v4d512bi: mkfs.xfs -n ftype=1 -i log=9 -l logdev=$TEST_LOGDEV $TEST_DEV
v5: mkfs.xfs -m crc=1 -l logdev=$TEST_LOGDEV $TEST_DEV

Dave had a benchmark set to break down v5 performance changes into
a 512-byte-inode component and a CRC component.  This is my edition
of the benchmark, done with old spinning rust on x86 hardware, and
updated to reflect your patchset.

Patchset notation: "normal" is the normal xfs-oss/master with Dave's
latest patches on top; "itree" adds the inode btree code.

This is non-debug XFS.  Tracers, lockdep, and almost all other "Kernel 
Hacking" kernel config items are not enabled.  It's still a 
CONFIG_KERNEL_DEBUG=y kernel, though.

======================= REAL ========================
command    patch     v4    v4dirent v4d512bi    v5
==========|======|========|========|========|========
tar -xf    normal  103.202  104.951  101.771  104.486
tar -xf    itree   104.610  101.705   98.784  101.919
----------+------+--------+--------+--------+--------
sha256sum  normal  227.456  228.321  231.947  234.127
sha256sum  itree   230.233  229.375  231.509  233.253
----------+------+--------+--------+--------+--------
cp -r a b  normal  239.714  242.754  248.994  249.584
cp -r a b  itree   241.273  243.216  248.531  254.501
----------+------+--------+--------+--------+--------
find .     normal   11.894   12.370   12.324   12.397
find .     itree    12.043   12.310   12.736   13.216
----------+------+--------+--------+--------+--------
rm -r b    normal    8.556    8.744   11.298   11.774
rm -r b    itree     8.904    8.981   10.590   12.057
----------+------+--------+--------+--------+--------
cp -r b a  normal  262.065  256.448  272.290  272.221
cp -r b a  itree   264.116  265.875  267.346  270.811
----------+------+--------+--------+--------+--------
rm -r b    normal    8.585    9.258    8.791   10.058
rm -r b    itree     9.061    8.345    9.909    9.273
----------+------+--------+--------+--------+--------
stat       normal  161.853  162.772  163.555  165.046
stat       itree   162.641  163.148  163.698  164.015
----------+------+--------+--------+--------+--------
sha check  normal  133.938  133.016  141.352  142.921
sha check  itree   133.885  133.399  138.013  143.315
----------+------+--------+--------+--------+--------
cp tarball normal   44.102   42.812   43.603   43.722
cp tarball itree    43.724   44.187   44.339   42.761
==========|======|========|========|========|========
TOTAL      normal 1201.365 1201.446 1235.925 1246.336
TOTAL      itree  1210.490 1210.541 1225.455 1245.121

======================= USER ========================
command    patch     v4    v4dirent v4d512bi    v5
==========|======|========|========|========|========
tar -xf    normal   59.223   59.473   58.817   59.640
tar -xf    itree    59.420   59.473   58.953   59.893
----------+------+--------+--------+--------+--------
sha256sum  normal   49.877   49.877   49.787   49.730
sha256sum  itree    49.437   49.863   49.583   49.673
----------+------+--------+--------+--------+--------
cp -r a b  normal    0.697    0.707    0.743    0.800
cp -r a b  itree     0.657    0.710    0.677    0.703
----------+------+--------+--------+--------+--------
find .     normal    0.257    0.237    0.233    0.223
find .     itree     0.283    0.223    0.223    0.203
----------+------+--------+--------+--------+--------
rm -r b    normal    0.170    0.120    0.147    0.160
rm -r b    itree     0.160    0.163    0.130    0.137
----------+------+--------+--------+--------+--------
cp -r b a  normal    0.817    0.763    0.817    0.763
cp -r b a  itree     0.737    0.740    0.787    0.670
----------+------+--------+--------+--------+--------
rm -r b    normal    0.170    0.153    0.140    0.133
rm -r b    itree     0.140    0.157    0.143    0.163
----------+------+--------+--------+--------+--------
stat       normal    1.660    1.653    1.570    1.720
stat       itree     1.737    1.727    1.700    1.630
----------+------+--------+--------+--------+--------
sha check  normal   58.467   58.603   58.550   58.370
sha check  itree    58.157   58.183   58.620   58.343
----------+------+--------+--------+--------+--------
cp tarball normal    0.023    0.027    0.033    0.037
cp tarball itree     0.017    0.020    0.020    0.020
==========|======|========|========|========|========
TOTAL      normal  171.361  171.613  170.837  171.576
TOTAL      itree   170.745  171.259  170.836  171.435

======================== SYS ========================
command    patch     v4    v4dirent v4d512bi    v5
==========|======|========|========|========|========
tar -xf    normal   19.770   19.800   19.960   20.770
tar -xf    itree    19.550   19.930   20.067   20.963
----------+------+--------+--------+--------+--------
sha256sum  normal   17.157   14.607   14.393   16.053
sha256sum  itree    17.277   14.813   14.550   15.007
----------+------+--------+--------+--------+--------
cp -r a b  normal   18.697   18.973   18.687   19.253
cp -r a b  itree    19.033   18.993   18.783   19.703
----------+------+--------+--------+--------+--------
find .     normal    0.820    0.573    0.537    0.597
find .     itree     0.793    0.593    0.547    0.610
----------+------+--------+--------+--------+--------
rm -r b    normal    3.883    3.827    3.800    3.967
rm -r b    itree     4.053    3.937    4.003    4.143
----------+------+--------+--------+--------+--------
cp -r b a  normal   19.043   19.083   18.753   19.503
cp -r b a  itree    19.203   19.100   19.040   19.680
----------+------+--------+--------+--------+--------
rm -r b    normal    4.097    3.947    3.900    4.123
rm -r b    itree     4.287    4.067    4.093    4.227
----------+------+--------+--------+--------+--------
stat       normal   11.337   10.730   10.727   10.680
stat       itree    11.080   10.827   10.800   10.457
----------+------+--------+--------+--------+--------
sha check  normal    8.970    8.920    8.980    9.507
sha check  itree     9.053    9.143    8.540    9.420
----------+------+--------+--------+--------+--------
cp tarball normal    5.537    5.397    5.470    5.373
cp tarball itree     5.390    5.313    5.460    5.343
==========|======|========|========|========|========
TOTAL      normal  109.311  105.857  105.207  109.826
TOTAL      itree   109.719  106.716  105.883  109.553

The rest of this report is supplementary noise and the lockdep.

# XFS kernel configuration:

CONFIG_XFS_FS=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
# CONFIG_XFS_WARN is not set
# CONFIG_XFS_DEBUG is not set

# `uname -a` output:
Linux plbearer 3.11.0+ #4 Sat Sep 7 13:04:53 EDT 2013 
i686 Intel(R) Pentium(R) 4 CPU 1.80GHz GenuineIntel GNU/Linux

RAM: 512 MB, less 9 MB for capture kernel

# Hard drive used for $TEST_DEV:
Model Family:     Western Digital Caviar
Device Model:     WDC WD600BB-75CAA0
User Capacity:    60,022,480,896 bytes [60.0 GB]

# Hard drive used for $TEST_LOGDEV and kernel-git tarball:
Model Family:     Seagate Barracuda 7200.9
Device Model:     ST3120814A
User Capacity:    120,034,123,776 bytes [120 GB]
root@plbearer:~/results# uname -a

# Sample xfs_info output for $TEST_DEV, to show how XFS is using the partition:
meta-data=/dev/sdb4              isize=512    agcount=4, agsize=720896 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0
data     =                       bsize=4096   blocks=2883584, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =external               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# cat /proc/partitions
major minor  #blocks  name
   8        0  117220824 sda
   8        1      98304 sda1 # shared /boot
   8        2          1 sda2
   8        5      65536 sda5
   8        6     131072 sda6 # $TEST_LOGDEV
   8        7     131072 sda7
   8        8     616448 sda8
   8        9   11275264 sda9 # inactive root (v5 XFS), holds tarball
   8       10  104895960 sda10

  11        0    1048575 sr0

   8       16   58615704 sdb
   8       17    3406848 sdb1 # active root partition (JFS)
   8       18     786432 sdb2
   8       19   20971520 sdb3
   8       20   11534336 sdb4 # $TEST_DEV
   8       21    4128768 sdb5
   8       22     786432 sdb6
   8       23     524288 sdb7
   8       24     524288 sdb8
   8       25    1048576 sdb9
   8       26   10708871 sdb10
   8       27    4194304 sdb11

Lockdep that kept tar jobs from completing.  It was found during several 
other tests before I gave up on the debug benchmark idea.

=================================
[ INFO: inconsistent lock state ]
3.11.0+ #2 Not tainted
---------------------------------
inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
tar/287 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&(&ip->i_lock)->mr_lock){++++?-}, at: [<c120cc1b>] xfs_ilock+0x100/0x1f1
{IN-RECLAIM_FS-W} state was registered at:
  [<c1070697>] __lock_acquire+0x63b/0x1953
  [<c1072515>] lock_acquire+0x88/0x17e
  [<c104fc0f>] down_write_nested+0x4f/0x9d
  [<c120cc1b>] xfs_ilock+0x100/0x1f1
  [<c11bd00d>] xfs_reclaim_inode+0xf4/0x30a
  [<c11bd4d4>] xfs_reclaim_inodes_ag+0x2b1/0x488
  [<c11bd729>] xfs_reclaim_inodes_nr+0x2d/0x33
  [<c11c7e1e>] xfs_fs_free_cached_objects+0x13/0x15
  [<c10f3778>] prune_super+0xd1/0x15c
  [<c10c99da>] shrink_slab+0x143/0x3d8
  [<c10cc768>] kswapd+0x45d/0x835
  [<c104b617>] kthread+0xa7/0xa9
  [<c14e26b7>] ret_from_kernel_thread+0x1b/0x28
irq event stamp: 23333689
hardirqs last  enabled at (23333689): [<c14e1375>] _raw_spin_unlock_irq+0x27/0x36
hardirqs last disabled at (23333688): [<c14e11ea>] _raw_spin_lock_irq+0x15/0x7a
softirqs last  enabled at (23333678): [<c1031d2d>] __do_softirq+0x142/0x2ce
softirqs last disabled at (23333649): [<c1031fec>] irq_exit+0x6d/0x73

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&ip->i_lock)->mr_lock);
  <Interrupt>
    lock(&(&ip->i_lock)->mr_lock);

 *** DEADLOCK ***

3 locks held by tar/287:
 #0:  (sb_writers#8){.+.+.+}, at: [<c110d858>] mnt_want_write+0x1e/0x3e
 #1:  (&(&ip->i_lock)->mr_lock){++++?-}, at: [<c120cc1b>] xfs_ilock+0x100/0x1f1
 #2:  (sb_internal){.+.+.+}, at: [<c11cb6d0>] xfs_trans_alloc+0x1f/0x35

stack backtrace:
CPU: 0 PID: 287 Comm: tar Not tainted 3.11.0+ #2
Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
 de296540 de296540 de3e7d5c c14db319 de3e7d98 c14d8708 c1618f4e c1619343
 0000011f 00000000 00000000 00000000 00000000 00000001 00000001 c1619343
 0000000a de2969b0 00000400 de3e7dcc c106e6c4 0000000a de3e7e1c c10703d1
Call Trace:
 [<c14db319>] dump_stack+0x16/0x18
 [<c14d8708>] print_usage_bug+0x1dc/0x1e6
 [<c106e6c4>] mark_lock+0x28c/0x2c1
 [<c10703d1>] ? __lock_acquire+0x375/0x1953
 [<c106dc88>] ? print_shortest_lock_dependencies+0x18f/0x18f
 [<c106e77a>] mark_held_locks+0x81/0xe7
 [<c106f136>] lockdep_trace_alloc+0xa1/0xe3
 [<c10ec3ed>] kmem_cache_alloc+0x28/0x1f2
 [<c11cd53f>] ? kmem_zone_alloc+0x55/0xd0
 [<c11cd53f>] kmem_zone_alloc+0x55/0xd0
 [<c11cb6d0>] ? xfs_trans_alloc+0x1f/0x35
 [<c11cd5cb>] kmem_zone_zalloc+0x11/0x36
 [<c11cb663>] _xfs_trans_alloc+0x2e/0x7c
 [<c11cb6de>] xfs_trans_alloc+0x2d/0x35
 [<c1210743>] xfs_inactive+0x129/0x4a3
 [<c106ebaf>] ? trace_hardirqs_on+0xb/0xd
 [<c14e1375>] ? _raw_spin_unlock_irq+0x27/0x36
 [<c11c807d>] xfs_fs_evict_inode+0x6c/0x114
 [<c1108268>] evict+0x8e/0x15d
 [<c1108d16>] iput+0xc4/0x138
 [<c10fec3c>] do_unlinkat+0x127/0x17f
 [<c102547e>] ? vmalloc_sync_all+0x133/0x133
 [<c10fecb7>] SyS_unlinkat+0x23/0x3a
 [<c14e273b>] sysenter_do_call+0x12/0x32

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
  2013-09-07 12:30             ` Brian Foster
@ 2013-09-08 20:08               ` Michael L. Semon
  2013-09-09  2:34               ` Better numbers " Michael L. Semon
  1 sibling, 0 replies; 46+ messages in thread
From: Michael L. Semon @ 2013-09-08 20:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Sat, Sep 7, 2013 at 8:30 AM, Brian Foster <bfoster@redhat.com> wrote:
> On 09/06/2013 05:24 PM, Dave Chinner wrote:
>> On Fri, Sep 06, 2013 at 07:39:18AM -0400, Brian Foster wrote:
>>> On 09/05/2013 08:28 PM, Dave Chinner wrote:
>>>> On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
>>>>> On 09/04/2013 10:54 PM, Dave Chinner wrote:
>>>>>> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
>>> ...
>>>>>
> ...
>>>> No worries. BTW, can you post your rudimentary userspace support so
>>>> we can run tests that use this code, too?
>>>>
>>>
>>> Sure. My xfsprogs branch currently is the application of a slightly
>>> older version of this set (pre-cleanups I made to make this post-worthy)
>>> with some hacks to make it apply/compile and a few other patches on top
>>> of that for mkfs, xfs_db and xfs_repair to work through some basic
>>> things I ran into when running xfstests.
>>>
>>> Would you prefer I drop the whole thing on the list?
>>
>> Drop it on the list, maybe just a as tarball rather than a patchset
>> if it's not ready for review yet.
>>
>
> Ok, attached. This applies on top of the following commit in the
> xfsprogs tree:
>
> 982e5c7e xfs_db: add header to freesp -d output
>
> Use the following command to format a finobt enabled fs:
>
> mkfs.xfs -m crc=1,finobt=1 <dev>
>
> ... and otherwise there are a couple random fixes for xfs_db and xfs_repair.
>
> Brian

OK, I gave it a try and messed it up.  I have updated numbers that look a
little better, and they'll be posted once I get to a Linux PC.

My merge went badly because it went on top of Dave's latest work and Mark's
v4 dirent work.  It's like some of the features2 stuff takes one extra argument
in places.

Also, when patching some code, it looks like some constants have been
updated to use cpu_to_be32(), and I don't know why they've been updated.

So it's not a surprise that I can make a v4-finobt filesystem with -finobt=1,
mount it, get all the v5 warnings from the kernel, but there's not enough
v5 code in use to choke on 256-byte v4 inodes.  Afterwards, `xfs_repair -n`
suggests that it will correct some areas from each AG.

It's also not a surprise that I don't know what issues are your fault,
which ones
are my fault, and which ones require Mark to update his v4-dirent patches.
Therefore, there's no conclusion to be placed on your userspace code yet.  It's
an RFC, anyway...

My "user commands" benchmark succeeded for the 4 XFS filesystem
variants considered previously.

More later...

Michael

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Better numbers Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
  2013-09-07 12:30             ` Brian Foster
  2013-09-08 20:08               ` Michael L. Semon
@ 2013-09-09  2:34               ` Michael L. Semon
  1 sibling, 0 replies; 46+ messages in thread
From: Michael L. Semon @ 2013-09-09  2:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On 09/07/2013 08:30 AM, Brian Foster wrote:
> On 09/06/2013 05:24 PM, Dave Chinner wrote:
>> On Fri, Sep 06, 2013 at 07:39:18AM -0400, Brian Foster wrote:
>>> On 09/05/2013 08:28 PM, Dave Chinner wrote:
>>>> On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
>>>>> On 09/04/2013 10:54 PM, Dave Chinner wrote:
>>>>>> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
>>> ...
>>>>>
> ...
>>>> No worries. BTW, can you post your rudimentary userspace support so
>>>> we can run tests that use this code, too?
>>>>
>>>
>>> Sure. My xfsprogs branch currently is the application of a slightly
>>> older version of this set (pre-cleanups I made to make this post-worthy)
>>> with some hacks to make it apply/compile and a few other patches on top
>>> of that for mkfs, xfs_db and xfs_repair to work through some basic
>>> things I ran into when running xfstests.
>>>
>>> Would you prefer I drop the whole thing on the list?
>>
>> Drop it on the list, maybe just a as tarball rather than a patchset
>> if it's not ready for review yet.
>>
> 
> Ok, attached. This applies on top of the following commit in the
> xfsprogs tree:
> 
> 982e5c7e xfs_db: add header to freesp -d output
> 
> Use the following command to format a finobt enabled fs:
> 
> mkfs.xfs -m crc=1,finobt=1 <dev>
> 
> ... and otherwise there are a couple random fixes for xfs_db and xfs_repair.
> 
> Brian
> 
> 
>> Cheers,
>>
>> Dave.

Hi!  Here's a better set of numbers, using the finobt-enabled userspace. 
Some cosmetic errors in my chart-generating Perl script were also fixed.

Revised legend:

v4: v4 XFS
v4dirent: v4 with Mark's dirent patches
v4d512bi: same as above, with 512-byte inodes
v5: v5/CRC XFS, no finobt except for "finobt" patch case

Again, a kernel git dir was built, then turned into a source tar.gz 
file for these tests.

Files involved in find/sha256/cp/stat tests:
$TEST_DIR/a/linux/ (kernel + xfs-oss/master, built)
$TEST_DIR/b/linux/ (same)

                         REAL
command     patch     v4    v4dirent v4d512bi    v5
----------+-------+--------+--------+--------+--------
tar -xf    normal   103.202  104.951  101.771  104.486
tar -xf    patched  104.610  101.705   98.784  101.919
tar -xf    finobt   103.753  102.949  102.609   99.847
----------+-------+--------+--------+--------+--------
sha256sum  normal   227.456  228.321  231.947  234.127
sha256sum  patched  230.233  229.375  231.509  233.253
sha256sum  finobt   228.108  228.873  232.464  233.498
----------+-------+--------+--------+--------+--------
cp -r a b  normal   239.714  242.754  248.994  249.584
cp -r a b  patched  241.273  243.216  248.531  254.501
cp -r a b  finobt   240.779  242.838  244.760  253.203
----------+-------+--------+--------+--------+--------
find .     normal    11.894   12.370   12.324   12.397
find .     patched   12.043   12.310   12.736   13.216
find .     finobt    11.882   12.376   13.501   13.827
----------+-------+--------+--------+--------+--------
rm -r a    normal     8.556    8.744   11.298   11.774
rm -r a    patched    8.904    8.981   10.590   12.057
rm -r a    finobt     8.588    9.007   10.841   12.168
----------+-------+--------+--------+--------+--------
cp -r b a  normal   262.065  256.448  272.290  272.221
cp -r b a  patched  264.116  265.875  267.346  270.811
cp -r b a  finobt   265.255  259.111  262.312  268.700
----------+-------+--------+--------+--------+--------
rm -r b    normal     8.585    9.258    8.791   10.058
rm -r b    patched    9.061    8.345    9.909    9.273
rm -r b    finobt     8.326    8.231   10.078   10.464
----------+-------+--------+--------+--------+--------
stat       normal   161.853  162.772  163.555  165.046
stat       patched  162.641  163.148  163.698  164.015
stat       finobt   163.366  162.707  163.256  165.051
----------+-------+--------+--------+--------+--------
sha check  normal   133.938  133.016  141.352  142.921
sha check  patched  133.885  133.399  138.013  143.315
sha check  finobt   135.128  134.900  142.158  141.094
----------+-------+--------+--------+--------+--------
cp tarball normal    44.102   42.812   43.603   43.722
cp tarball patched   43.724   44.187   44.339   42.761
cp tarball finobt    43.930   43.236   42.736   44.000
----------+-------+--------+--------+--------+--------
TOTAL      normal  1201.365 1201.446 1235.925 1246.336
TOTAL      patched 1210.490 1210.541 1225.455 1245.121
TOTAL      finobt  1209.115 1204.228 1224.715 1241.852

                         USER
command     patch     v4    v4dirent v4d512bi    v5
----------+-------+--------+--------+--------+--------
tar -xf    normal    59.223   59.473   58.817   59.640
tar -xf    patched   59.420   59.473   58.953   59.893
tar -xf    finobt    59.153   59.850   59.643   59.153
----------+-------+--------+--------+--------+--------
sha256sum  normal    49.877   49.877   49.787   49.730
sha256sum  patched   49.437   49.863   49.583   49.673
sha256sum  finobt    49.577   49.580   49.743   49.597
----------+-------+--------+--------+--------+--------
cp -r a b  normal     0.697    0.707    0.743    0.800
cp -r a b  patched    0.657    0.710    0.677    0.703
cp -r a b  finobt     0.737    0.777    0.777    0.780
----------+-------+--------+--------+--------+--------
find .     normal     0.257    0.237    0.233    0.223
find .     patched    0.283    0.223    0.223    0.203
find .     finobt     0.263    0.253    0.237    0.273
----------+-------+--------+--------+--------+--------
rm -r a    normal     0.170    0.120    0.147    0.160
rm -r a    patched    0.160    0.163    0.130    0.137
rm -r a    finobt     0.173    0.157    0.123    0.150
----------+-------+--------+--------+--------+--------
cp -r b a  normal     0.817    0.763    0.817    0.763
cp -r b a  patched    0.737    0.740    0.787    0.670
cp -r b a  finobt     0.783    0.747    0.737    0.687
----------+-------+--------+--------+--------+--------
rm -r b    normal     0.170    0.153    0.140    0.133
rm -r b    patched    0.140    0.157    0.143    0.163
rm -r b    finobt     0.173    0.127    0.193    0.153
----------+-------+--------+--------+--------+--------
stat       normal     1.660    1.653    1.570    1.720
stat       patched    1.737    1.727    1.700    1.630
stat       finobt     1.767    1.640    1.557    1.763
----------+-------+--------+--------+--------+--------
sha check  normal    58.467   58.603   58.550   58.370
sha check  patched   58.157   58.183   58.620   58.343
sha check  finobt    58.530   58.107   58.367   58.300
----------+-------+--------+--------+--------+--------
cp tarball normal     0.023    0.027    0.033    0.037
cp tarball patched    0.017    0.020    0.020    0.020
cp tarball finobt     0.020    0.037    0.033    0.020
----------+-------+--------+--------+--------+--------
TOTAL      normal   171.361  171.613  170.837  171.576
TOTAL      patched  170.745  171.259  170.836  171.435
TOTAL      finobt   171.176  171.275  171.410  170.876

                         SYS
command     patch     v4    v4dirent v4d512bi    v5
----------+-------+--------+--------+--------+--------
tar -xf    normal    19.770   19.800   19.960   20.770
tar -xf    patched   19.550   19.930   20.067   20.963
tar -xf    finobt    20.010   19.707   19.707   21.397
----------+-------+--------+--------+--------+--------
sha256sum  normal    17.157   14.607   14.393   16.053
sha256sum  patched   17.277   14.813   14.550   15.007
sha256sum  finobt    17.123   14.920   14.667   15.133
----------+-------+--------+--------+--------+--------
cp -r a b  normal    18.697   18.973   18.687   19.253
cp -r a b  patched   19.033   18.993   18.783   19.703
cp -r a b  finobt    19.093   18.863   18.877   19.363
----------+-------+--------+--------+--------+--------
find .     normal     0.820    0.573    0.537    0.597
find .     patched    0.793    0.593    0.547    0.610
find .     finobt     0.800    0.553    0.533    0.543
----------+-------+--------+--------+--------+--------
rm -r a    normal     3.883    3.827    3.800    3.967
rm -r a    patched    4.053    3.937    4.003    4.143
rm -r a    finobt     4.010    4.020    3.983    4.290
----------+-------+--------+--------+--------+--------
cp -r b a  normal    19.043   19.083   18.753   19.503
cp -r b a  patched   19.203   19.100   19.040   19.680
cp -r b a  finobt    19.133   18.973   18.950   19.607
----------+-------+--------+--------+--------+--------
rm -r b    normal     4.097    3.947    3.900    4.123
rm -r b    patched    4.287    4.067    4.093    4.227
rm -r b    finobt     4.223    4.140    4.013    4.480
----------+-------+--------+--------+--------+--------
stat       normal    11.337   10.730   10.727   10.680
stat       patched   11.080   10.827   10.800   10.457
stat       finobt    11.393   10.720   10.680   10.797
----------+-------+--------+--------+--------+--------
sha check  normal     8.970    8.920    8.980    9.507
sha check  patched    9.053    9.143    8.540    9.420
sha check  finobt     8.653    9.020    8.863    9.103
----------+-------+--------+--------+--------+--------
cp tarball normal     5.537    5.397    5.470    5.373
cp tarball patched    5.390    5.313    5.460    5.343
cp tarball finobt     5.520    5.357    5.333    5.603
----------+-------+--------+--------+--------+--------
TOTAL      normal   109.311  105.857  105.207  109.826
TOTAL      patched  109.719  106.716  105.883  109.553
TOTAL      finobt   109.958  106.273  105.606  110.316

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2013-09-09  2:35 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
2013-09-03 18:24 ` [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
2013-09-05  0:36   ` Dave Chinner
2013-09-03 18:24 ` [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
2013-09-05  0:39   ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
2013-09-05  0:54   ` Dave Chinner
2013-09-05 16:17     ` Brian Foster
2013-09-06  0:07       ` Dave Chinner
2013-09-06 11:25         ` Brian Foster
2013-09-06 21:22           ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt Brian Foster
2013-09-05  0:59   ` Dave Chinner
2013-09-05 16:17     ` Brian Foster
2013-09-06  0:11       ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 05/11] xfs: update ifree " Brian Foster
2013-09-05  1:00   ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive() Brian Foster
2013-09-05  1:35   ` Dave Chinner
2013-09-05 16:18     ` Brian Foster
2013-09-03 18:25 ` [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC " Brian Foster
2013-09-05  1:40   ` Dave Chinner
2013-09-05 16:18     ` Brian Foster
2013-09-06  0:17       ` Dave Chinner
2013-09-06 11:30         ` Brian Foster
2013-09-03 18:25 ` [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
2013-09-05  2:10   ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation Brian Foster
2013-09-05  2:27   ` Dave Chinner
2013-09-05 16:18     ` Brian Foster
2013-09-03 18:25 ` [RFC PATCH 10/11] xfs: update the finobt on inode free Brian Foster
2013-09-05  2:54   ` Dave Chinner
2013-09-05 16:19     ` Brian Foster
2013-09-06  0:28       ` Dave Chinner
2013-09-06 11:39         ` Brian Foster
2013-09-06 21:24           ` Dave Chinner
2013-09-07 12:30             ` Brian Foster
2013-09-08 20:08               ` Michael L. Semon
2013-09-09  2:34               ` Better numbers " Michael L. Semon
2013-09-03 18:25 ` [RFC PATCH 11/11] xfs: add finobt support to growfs Brian Foster
2013-09-05  2:55   ` Dave Chinner
2013-09-05 21:17 ` [RFC PATCH 00/11] xfs: introduce the free inode btree Michael L. Semon
2013-09-06 11:17   ` Brian Foster
2013-09-06 21:35   ` Dave Chinner
2013-09-07 12:31     ` Brian Foster
2013-09-08  1:04       ` Michael L. Semon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.