* [PATCH v2 00/11] xfs: introduce the free inode btree
@ 2013-11-13 14:36 Brian Foster
From: Brian Foster @ 2013-11-13 14:36 UTC
  To: xfs

Hi all,

The free inode btree adds a new inode btree to XFS with the intent to
track only inode chunks with at least one free inode. Patches 1-3 add
the necessary support for the new XFS_BTNUM_FINOBT type and introduce a
read-only v5 superblock flag. Patch 4 updates the transaction
reservations for inode allocation operations to account for the finobt.
Patches 5-9 add support to manage the finobt on inode chunk allocation,
inode allocation, inode free (and chunk deletion) and growfs. Patch 10
adds support to report finobt status in the fs geometry. Patch 11 adds
the feature bit to the associated mask. Thoughts, reviews, flames
appreciated.

Brian

v2:
- Rebase to latest xfs tree (minor shifting around of some header bits).
- Added "xfs: report finobt status in fs geometry" patch to series.
v1:
- Separate patch to enable rw finobt support at end of series.
- Rework xfs_ialloc_log_agi() to log the agi in two distinct regions.
- Rework xfs_ialloc_btree.c changes to use separate finobt handlers
  where appropriate.
- Fix bug to show fibt2 stats data in stat proc file.
- Move finobt log reservation calculations into separate helper, made
  conditional and merged to a single patch.
- Use reserved block pool in xfs_inactive() codepath instead of flush.
- Moved and cleaned up xfs_inobt_insert() to use inobt helpers.
- Enhanced lookup algorithm for allocation (xfs_dialloc_ag()).
- Refactored xfs_difree() to use xfs_difree_inobt() and
  xfs_difree_finobt(), cleaned up the latter.

Brian Foster (11):
  xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
  xfs: reserve v5 superblock read-only compat. feature bit for finobt
  xfs: support the XFS_BTNUM_FINOBT free inode btree type
  xfs: update inode allocation/free transaction reservations for finobt
  xfs: insert newly allocated inode chunks into the finobt
  xfs: use and update the finobt on inode allocation
  xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
  xfs: update the finobt on inode free
  xfs: add finobt support to growfs
  xfs: report finobt status in fs geometry
  xfs: enable the finobt feature on v5 superblocks

 fs/xfs/xfs_ag.h           |  32 ++-
 fs/xfs/xfs_btree.c        |   6 +-
 fs/xfs/xfs_btree.h        |   3 +
 fs/xfs/xfs_format.h       |  14 +-
 fs/xfs/xfs_fs.h           |   1 +
 fs/xfs/xfs_fsops.c        |  36 ++-
 fs/xfs/xfs_ialloc.c       | 616 ++++++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_ialloc_btree.c |  68 ++++-
 fs/xfs/xfs_ialloc_btree.h |   3 +-
 fs/xfs/xfs_inode.c        |   4 +-
 fs/xfs/xfs_itable.c       |   6 +-
 fs/xfs/xfs_log_recover.c  |   2 +
 fs/xfs/xfs_sb.h           |  10 +-
 fs/xfs/xfs_stats.c        |   1 +
 fs/xfs/xfs_stats.h        |  18 +-
 fs/xfs/xfs_trans_resv.c   |  47 +++-
 fs/xfs/xfs_trans_space.h  |   7 +-
 fs/xfs/xfs_types.h        |   2 +-
 18 files changed, 746 insertions(+), 130 deletions(-)

-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* [PATCH v2 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
@ 2013-11-13 14:36 ` Brian Foster
From: Brian Foster @ 2013-11-13 14:36 UTC
  To: xfs

The introduction of the free inode btree (finobt) requires that
xfs_ialloc_btree.c handle multiple trees. Refactor xfs_ialloc_btree.c
so the caller specifies the btree type on cursor initialization to
prepare for addition of the finobt.
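
For illustration only, a minimal sketch of the calling convention this
refactor introduces: the caller names the tree when the cursor is set up,
and a duplicated cursor inherits that choice. The sketch_* names and the
reduced cursor structure are hypothetical stand-ins, not the kernel
definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins for xfs_btnum_t and struct xfs_btree_cur. */
enum sketch_btnum { SKETCH_BTNUM_INO, SKETCH_BTNUM_FINO, SKETCH_BTNUM_MAX };

struct sketch_cursor {
	enum sketch_btnum	btnum;	/* which inode btree this cursor walks */
	unsigned int		agno;	/* allocation group number */
};

/* The caller, not the initializer, decides which tree is used. */
static void
sketch_init_cursor(struct sketch_cursor *cur, unsigned int agno,
		   enum sketch_btnum btnum)
{
	assert(btnum < SKETCH_BTNUM_MAX);
	cur->agno = agno;
	cur->btnum = btnum;
}

/* Duplicating a cursor carries the tree type along with it. */
static void
sketch_dup_cursor(const struct sketch_cursor *src, struct sketch_cursor *dst)
{
	sketch_init_cursor(dst, src->agno, src->btnum);
}
```

Passing the type explicitly keeps a single init routine for both trees,
at the cost of touching every existing caller once.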

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_ialloc.c       | 8 ++++----
 fs/xfs/xfs_ialloc_btree.c | 8 +++++---
 fs/xfs/xfs_ialloc_btree.h | 3 ++-
 fs/xfs/xfs_itable.c       | 6 ++++--
 4 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index e87719c..e9c870f 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -456,7 +456,7 @@ xfs_ialloc_ag_alloc(
 	/*
 	 * Insert records describing the new inode chunk into the btree.
 	 */
-	cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno, XFS_BTNUM_INO);
 	for (thisino = newino;
 	     thisino < newino + newlen;
 	     thisino += XFS_INODES_PER_CHUNK) {
@@ -702,7 +702,7 @@ xfs_dialloc_ag(
 	ASSERT(pag->pagi_freecount > 0);
 
  restart_pagno:
-	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
 	/*
 	 * If pagino is 0 (this is the root inode allocation) use newino.
 	 * This must work because we've just allocated some.
@@ -1164,7 +1164,7 @@ xfs_difree(
 	/*
 	 * Initialize the cursor.
 	 */
-	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
 
 	error = xfs_check_agi_freecount(cur, agi);
 	if (error)
@@ -1295,7 +1295,7 @@ xfs_imap_lookup(
 	 * we have a record, we need to ensure it contains the inode number
 	 * we are looking up.
 	 */
-	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
 	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &i);
 	if (!error) {
 		if (i)
diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
index c8fa5bb..2d1a398 100644
--- a/fs/xfs/xfs_ialloc_btree.c
+++ b/fs/xfs/xfs_ialloc_btree.c
@@ -49,7 +49,8 @@ xfs_inobt_dup_cursor(
 	struct xfs_btree_cur	*cur)
 {
 	return xfs_inobt_init_cursor(cur->bc_mp, cur->bc_tp,
-			cur->bc_private.a.agbp, cur->bc_private.a.agno);
+			cur->bc_private.a.agbp, cur->bc_private.a.agno,
+			cur->bc_btnum);
 }
 
 STATIC void
@@ -323,7 +324,8 @@ xfs_inobt_init_cursor(
 	struct xfs_mount	*mp,		/* file system mount point */
 	struct xfs_trans	*tp,		/* transaction pointer */
 	struct xfs_buf		*agbp,		/* buffer for agi structure */
-	xfs_agnumber_t		agno)		/* allocation group number */
+	xfs_agnumber_t		agno,		/* allocation group number */
+	xfs_btnum_t		btnum)		/* ialloc or free ino btree */
 {
 	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
 	struct xfs_btree_cur	*cur;
@@ -333,7 +335,7 @@ xfs_inobt_init_cursor(
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
-	cur->bc_btnum = XFS_BTNUM_INO;
+	cur->bc_btnum = btnum;
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 
 	cur->bc_ops = &xfs_inobt_ops;
diff --git a/fs/xfs/xfs_ialloc_btree.h b/fs/xfs/xfs_ialloc_btree.h
index f38b220..d7ebea72 100644
--- a/fs/xfs/xfs_ialloc_btree.h
+++ b/fs/xfs/xfs_ialloc_btree.h
@@ -58,7 +58,8 @@ struct xfs_mount;
 		 ((index) - 1) * sizeof(xfs_inobt_ptr_t)))
 
 extern struct xfs_btree_cur *xfs_inobt_init_cursor(struct xfs_mount *,
-		struct xfs_trans *, struct xfs_buf *, xfs_agnumber_t);
+		struct xfs_trans *, struct xfs_buf *, xfs_agnumber_t,
+		xfs_btnum_t);
 extern int xfs_inobt_maxrecs(struct xfs_mount *, int, int);
 
 #endif	/* __XFS_IALLOC_BTREE_H__ */
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index c237ad1..71a8169 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -274,7 +274,8 @@ xfs_bulkstat(
 		/*
 		 * Allocate and initialize a btree cursor for ialloc btree.
 		 */
-		cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno);
+		cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
+					    XFS_BTNUM_INO);
 		irbp = irbuf;
 		irbufend = irbuf + nirbuf;
 		end_of_ag = 0;
@@ -625,7 +626,8 @@ xfs_inumbers(
 				agino = 0;
 				continue;
 			}
-			cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno);
+			cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
+						    XFS_BTNUM_INO);
 			error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_GE,
 						 &tmp);
 			if (error) {
-- 
1.8.1.4

* [PATCH v2 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt
@ 2013-11-13 14:36 ` Brian Foster
From: Brian Foster @ 2013-11-13 14:36 UTC
  To: xfs

Reserve a v5 read-only compatibility feature bit for the finobt and
create the xfs_sb_version_hasfinobt() helper to determine whether
an fs has the feature enabled.

The finobt does not change existing on-disk structures, but must
remain consistent with the ialloc btree. Modifications from older
kernels would violate that constraint. Therefore, we restrict older
kernels to read-only mounts of finobt-enabled filesystems.

Note that this patch does not yet allow rw mounts of a finobt fs. That
requires setting the feature bit in the XFS_SB_FEAT_RO_COMPAT_ALL mask,
which is done by the last patch in the series.
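
For illustration only, a minimal sketch of the read-only compat rule
described above: a kernel that does not know a ro-compat bit may still
mount the fs, but only read-only. sketch_can_mount() and the SKETCH_*
constant are hypothetical and merely mirror the idea behind the
XFS_SB_FEAT_RO_COMPAT_UNKNOWN check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RO_COMPAT_FINOBT	(1u << 0)	/* bit reserved by this patch */

/*
 * Return true if a filesystem with on-disk ro-compat bits "ondisk" may be
 * mounted by a kernel that understands the bits in "known".
 */
static bool
sketch_can_mount(uint32_t ondisk, uint32_t known, bool readonly)
{
	uint32_t unknown = ondisk & ~known;

	if (!unknown)
		return true;	/* every feature is understood */
	return readonly;	/* unknown ro-compat bits: read-only mounts only */
}
```

With "known" left at 0, as in this patch, a finobt fs is mountable
read-only but not rw. Adding the bit to the known mask lifts that
restriction.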

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_sb.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/xfs/xfs_sb.h b/fs/xfs/xfs_sb.h
index 35061d4..070a7f6 100644
--- a/fs/xfs/xfs_sb.h
+++ b/fs/xfs/xfs_sb.h
@@ -585,6 +585,7 @@ xfs_sb_has_compat_feature(
 	return (sbp->sb_features_compat & feature) != 0;
 }
 
+#define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
 #define XFS_SB_FEAT_RO_COMPAT_ALL 0
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
@@ -639,6 +640,12 @@ static inline int xfs_sb_version_hasftype(struct xfs_sb *sbp)
 		 (sbp->sb_features2 & XFS_SB_VERSION2_FTYPE));
 }
 
+static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
+{
+	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
+		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
+}
+
 /*
  * end of superblock version macros
  */
-- 
1.8.1.4

* [PATCH v2 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
@ 2013-11-13 14:36 ` Brian Foster
From: Brian Foster @ 2013-11-13 14:36 UTC
  To: xfs

Define the AGI fields for the finobt root/level and add magic
numbers. Update the btree code to add support for the new
XFS_BTNUM_FINOBT inode btree.

The finobt root block is reserved immediately following the inobt
root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
starting AG data block based on whether finobt support is enabled.
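
For illustration only, a minimal sketch of how the xfs_ialloc_log_agi()
change in this patch splits a field mask into the two AGI logging
regions, so that the unlinked hash table between them is never logged
by accident. The SKETCH_* constants mirror XFS_AGI_NUM_BITS_R1/R2;
sketch_split_fields() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NUM_BITS_R1	11	/* fields up through agi_unlinked */
#define SKETCH_NUM_BITS_R2	13	/* plus agi_free_root, agi_free_level */
#define SKETCH_ALL_BITS_R1	((1u << SKETCH_NUM_BITS_R1) - 1)

/*
 * Split a field mask into the bits logged with the first region and the
 * bits logged with the second region. Each part is then logged with its
 * own first/last byte offsets.
 */
static void
sketch_split_fields(uint32_t fields, uint32_t *r1, uint32_t *r2)
{
	assert(fields < (1u << SKETCH_NUM_BITS_R2));
	*r1 = fields & SKETCH_ALL_BITS_R1;
	*r2 = fields & ~SKETCH_ALL_BITS_R1;
}
```

A request for XFS_AGI_ROOT | XFS_AGI_FREE_ROOT therefore becomes two
small log ranges rather than one range spanning the hash table.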

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ag.h           | 32 +++++++++++++++----------
 fs/xfs/xfs_btree.c        |  6 +++--
 fs/xfs/xfs_btree.h        |  3 +++
 fs/xfs/xfs_format.h       | 14 ++++++++++-
 fs/xfs/xfs_ialloc.c       | 37 +++++++++++++++++++++++++----
 fs/xfs/xfs_ialloc_btree.c | 60 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_log_recover.c  |  2 ++
 fs/xfs/xfs_stats.c        |  1 +
 fs/xfs/xfs_stats.h        | 18 +++++++++++++-
 fs/xfs/xfs_types.h        |  2 +-
 10 files changed, 150 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 3fc1098..5d3011f 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -164,22 +164,28 @@ typedef struct xfs_agi {
 	__be32		agi_pad32;
 	__be64		agi_lsn;	/* last write sequence */
 
+	__be32		agi_free_root; /* root of the free inode btree */
+	__be32		agi_free_level;/* levels in free inode btree */
+
 	/* structure must be padded to 64 bit alignment */
 } xfs_agi_t;
 
-#define	XFS_AGI_MAGICNUM	0x00000001
-#define	XFS_AGI_VERSIONNUM	0x00000002
-#define	XFS_AGI_SEQNO		0x00000004
-#define	XFS_AGI_LENGTH		0x00000008
-#define	XFS_AGI_COUNT		0x00000010
-#define	XFS_AGI_ROOT		0x00000020
-#define	XFS_AGI_LEVEL		0x00000040
-#define	XFS_AGI_FREECOUNT	0x00000080
-#define	XFS_AGI_NEWINO		0x00000100
-#define	XFS_AGI_DIRINO		0x00000200
-#define	XFS_AGI_UNLINKED	0x00000400
-#define	XFS_AGI_NUM_BITS	11
-#define	XFS_AGI_ALL_BITS	((1 << XFS_AGI_NUM_BITS) - 1)
+#define	XFS_AGI_MAGICNUM	(1 << 0)
+#define	XFS_AGI_VERSIONNUM	(1 << 1)
+#define	XFS_AGI_SEQNO		(1 << 2)
+#define	XFS_AGI_LENGTH		(1 << 3)
+#define	XFS_AGI_COUNT		(1 << 4)
+#define	XFS_AGI_ROOT		(1 << 5)
+#define	XFS_AGI_LEVEL		(1 << 6)
+#define	XFS_AGI_FREECOUNT	(1 << 7)
+#define	XFS_AGI_NEWINO		(1 << 8)
+#define	XFS_AGI_DIRINO		(1 << 9)
+#define	XFS_AGI_UNLINKED	(1 << 10)
+#define	XFS_AGI_NUM_BITS_R1	11	/* end of the 1st agi logging region */
+#define	XFS_AGI_ALL_BITS_R1	((1 << XFS_AGI_NUM_BITS_R1) - 1)
+#define	XFS_AGI_FREE_ROOT	(1 << 11)
+#define	XFS_AGI_FREE_LEVEL	(1 << 12)
+#define	XFS_AGI_NUM_BITS_R2	13
 
 /* disk block (xfs_daddr_t) in the AG */
 #define XFS_AGI_DADDR(mp)	((xfs_daddr_t)(2 << (mp)->m_sectbb_log))
diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index 9adaae4..ee79f1e 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -43,9 +43,10 @@ kmem_zone_t	*xfs_btree_cur_zone;
  * Btree magic numbers.
  */
 static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
-	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, XFS_BMAP_MAGIC, XFS_IBT_MAGIC },
+	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
+	  XFS_FIBT_MAGIC },
 	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC,
-	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC }
+	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC }
 };
 #define xfs_btree_magic(cur) \
 	xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
@@ -1117,6 +1118,7 @@ xfs_btree_set_refs(
 		xfs_buf_set_ref(bp, XFS_ALLOC_BTREE_REF);
 		break;
 	case XFS_BTNUM_INO:
+	case XFS_BTNUM_FINO:
 		xfs_buf_set_ref(bp, XFS_INO_BTREE_REF);
 		break;
 	case XFS_BTNUM_BMAP:
diff --git a/fs/xfs/xfs_btree.h b/fs/xfs/xfs_btree.h
index 91e34f2..d2ac586 100644
--- a/fs/xfs/xfs_btree.h
+++ b/fs/xfs/xfs_btree.h
@@ -62,6 +62,7 @@ union xfs_btree_rec {
 #define	XFS_BTNUM_CNT	((xfs_btnum_t)XFS_BTNUM_CNTi)
 #define	XFS_BTNUM_BMAP	((xfs_btnum_t)XFS_BTNUM_BMAPi)
 #define	XFS_BTNUM_INO	((xfs_btnum_t)XFS_BTNUM_INOi)
+#define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
 
 /*
  * For logging record fields.
@@ -92,6 +93,7 @@ do {    \
 	case XFS_BTNUM_CNT: __XFS_BTREE_STATS_INC(abtc, stat); break;	\
 	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_INC(bmbt, stat); break;	\
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(ibt, stat); break;	\
+	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(fibt, stat); break;	\
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break;	\
 	}       \
 } while (0)
@@ -105,6 +107,7 @@ do {    \
 	case XFS_BTNUM_CNT: __XFS_BTREE_STATS_ADD(abtc, stat, val); break; \
 	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_ADD(bmbt, stat, val); break; \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_ADD(ibt, stat, val); break; \
+	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_ADD(fibt, stat, val); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break;	\
 	}       \
 } while (0)
diff --git a/fs/xfs/xfs_format.h b/fs/xfs/xfs_format.h
index b6ab5a3..d1def17 100644
--- a/fs/xfs/xfs_format.h
+++ b/fs/xfs/xfs_format.h
@@ -200,6 +200,8 @@ typedef __be32 xfs_alloc_ptr_t;
  */
 #define	XFS_IBT_MAGIC		0x49414254	/* 'IABT' */
 #define	XFS_IBT_CRC_MAGIC	0x49414233	/* 'IAB3' */
+#define	XFS_FIBT_MAGIC		0x46494254	/* 'FIBT' */
+#define	XFS_FIBT_CRC_MAGIC	0x46494233	/* 'FIB3' */
 
 typedef	__uint64_t	xfs_inofree_t;
 #define	XFS_INODES_PER_CHUNK		(NBBY * sizeof(xfs_inofree_t))
@@ -242,7 +244,17 @@ typedef __be32 xfs_inobt_ptr_t;
  * block numbers in the AG.
  */
 #define	XFS_IBT_BLOCK(mp)		((xfs_agblock_t)(XFS_CNT_BLOCK(mp) + 1))
-#define	XFS_PREALLOC_BLOCKS(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
+#define	XFS_FIBT_BLOCK(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
+
+/*
+ * The first data block of an AG depends on whether the filesystem was formatted
+ * with the finobt feature. If so, account for the finobt reserved root btree
+ * block.
+ */
+#define XFS_PREALLOC_BLOCKS(mp) \
+	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
+	 XFS_FIBT_BLOCK(mp) + 1 : \
+	 XFS_IBT_BLOCK(mp) + 1)
 
 
 
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index e9c870f..1397fc4 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -1506,6 +1506,8 @@ xfs_ialloc_log_agi(
 		offsetof(xfs_agi_t, agi_newino),
 		offsetof(xfs_agi_t, agi_dirino),
 		offsetof(xfs_agi_t, agi_unlinked),
+		offsetof(xfs_agi_t, agi_free_root),
+		offsetof(xfs_agi_t, agi_free_level),
 		sizeof(xfs_agi_t)
 	};
 #ifdef DEBUG
@@ -1515,14 +1517,39 @@ xfs_ialloc_log_agi(
 	ASSERT(agi->agi_magicnum == cpu_to_be32(XFS_AGI_MAGIC));
 #endif
 	/*
-	 * Compute byte offsets for the first and last fields.
+	 * The growth of the agi buffer over time now requires that we interpret
+	 * the buffer as two logical regions delineated at the end of the unlinked
+	 * list. This is due to the size of the hash table and its location in the
+	 * middle of the agi.
+	 *
+	 * For example, a request to log a field before agi_unlinked and a field
+	 * after agi_unlinked could cause us to log the entire hash table and use
+	 * an excessive amount of log space. To avoid this behavior, log the
+	 * region up through agi_unlinked in one call and the region after
+	 * agi_unlinked through the end of the structure in another.
 	 */
-	xfs_btree_offsets(fields, offsets, XFS_AGI_NUM_BITS, &first, &last);
+	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_AGI_BUF);
+
 	/*
-	 * Log the allocation group inode header buffer.
+	 * Compute byte offsets for the first and last fields in the first
+	 * region and log agi buffer. This only logs up through agi_unlinked.
 	 */
-	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_AGI_BUF);
-	xfs_trans_log_buf(tp, bp, first, last);
+	if (fields & XFS_AGI_ALL_BITS_R1) {
+		xfs_btree_offsets(fields, offsets, XFS_AGI_NUM_BITS_R1,
+				  &first, &last);
+		xfs_trans_log_buf(tp, bp, first, last);
+	}
+
+	/*
+	 * Mask off the bits in the first region and calculate the first and last
+	 * field offsets for any bits in the second region.
+	 */
+	fields &= ~XFS_AGI_ALL_BITS_R1;
+	if (fields) {
+		xfs_btree_offsets(fields, offsets, XFS_AGI_NUM_BITS_R2,
+				  &first, &last);
+		xfs_trans_log_buf(tp, bp, first, last);
+	}
 }
 
 #ifdef DEBUG
diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
index 2d1a398..16212dc 100644
--- a/fs/xfs/xfs_ialloc_btree.c
+++ b/fs/xfs/xfs_ialloc_btree.c
@@ -67,6 +67,21 @@ xfs_inobt_set_root(
 	xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
 }
 
+STATIC void
+xfs_finobt_set_root(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*nptr,
+	int			inc)	/* level change */
+{
+	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
+
+	agi->agi_free_root = nptr->s;
+	be32_add_cpu(&agi->agi_free_level, inc);
+	xfs_ialloc_log_agi(cur->bc_tp, agbp,
+			   XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL);
+}
+
 STATIC int
 xfs_inobt_alloc_block(
 	struct xfs_btree_cur	*cur,
@@ -174,6 +189,17 @@ xfs_inobt_init_ptr_from_cur(
 	ptr->s = agi->agi_root;
 }
 
+STATIC void
+xfs_finobt_init_ptr_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr)
+{
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(cur->bc_private.a.agbp);
+
+	ASSERT(cur->bc_private.a.agno == be32_to_cpu(agi->agi_seqno));
+	ptr->s = agi->agi_free_root;
+}
+
 STATIC __int64_t
 xfs_inobt_key_diff(
 	struct xfs_btree_cur	*cur,
@@ -204,6 +230,7 @@ xfs_inobt_verify(
 	 */
 	switch (block->bb_magic) {
 	case cpu_to_be32(XFS_IBT_CRC_MAGIC):
+	case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
 		if (!xfs_sb_version_hascrc(&mp->m_sb))
 			return false;
 		if (!uuid_equal(&block->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid))
@@ -215,6 +242,7 @@ xfs_inobt_verify(
 			return false;
 		/* fall through */
 	case cpu_to_be32(XFS_IBT_MAGIC):
+	case cpu_to_be32(XFS_FIBT_MAGIC):
 		break;
 	default:
 		return 0;
@@ -316,6 +344,28 @@ static const struct xfs_btree_ops xfs_inobt_ops = {
 #endif
 };
 
+static const struct xfs_btree_ops xfs_finobt_ops = {
+	.rec_len		= sizeof(xfs_inobt_rec_t),
+	.key_len		= sizeof(xfs_inobt_key_t),
+
+	.dup_cursor		= xfs_inobt_dup_cursor,
+	.set_root		= xfs_finobt_set_root,
+	.alloc_block		= xfs_inobt_alloc_block,
+	.free_block		= xfs_inobt_free_block,
+	.get_minrecs		= xfs_inobt_get_minrecs,
+	.get_maxrecs		= xfs_inobt_get_maxrecs,
+	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
+	.init_rec_from_key	= xfs_inobt_init_rec_from_key,
+	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfs_finobt_init_ptr_from_cur,
+	.key_diff		= xfs_inobt_key_diff,
+	.buf_ops		= &xfs_inobt_buf_ops,
+#if defined(DEBUG) || defined(XFS_WARN)
+	.keys_inorder		= xfs_inobt_keys_inorder,
+	.recs_inorder		= xfs_inobt_recs_inorder,
+#endif
+};
+
 /*
  * Allocate a new inode btree cursor.
  */
@@ -334,11 +384,17 @@ xfs_inobt_init_cursor(
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
-	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
 	cur->bc_btnum = btnum;
+	if (btnum == XFS_BTNUM_INO) {
+		cur->bc_nlevels = be32_to_cpu(agi->agi_level);
+		cur->bc_ops = &xfs_inobt_ops;
+	} else {
+		cur->bc_nlevels = be32_to_cpu(agi->agi_free_level);
+		cur->bc_ops = &xfs_finobt_ops;
+	}
+
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 
-	cur->bc_ops = &xfs_inobt_ops;
 	if (xfs_sb_version_hascrc(&mp->m_sb))
 		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index b6b669d..f8ac8a0 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2126,7 +2126,9 @@ xlog_recover_validate_buf_type(
 			bp->b_ops = &xfs_allocbt_buf_ops;
 			break;
 		case XFS_IBT_CRC_MAGIC:
+		case XFS_FIBT_CRC_MAGIC:
 		case XFS_IBT_MAGIC:
+		case XFS_FIBT_MAGIC:
 			bp->b_ops = &xfs_inobt_buf_ops;
 			break;
 		case XFS_BMAP_CRC_MAGIC:
diff --git a/fs/xfs/xfs_stats.c b/fs/xfs/xfs_stats.c
index ce372b7..f224038 100644
--- a/fs/xfs/xfs_stats.c
+++ b/fs/xfs/xfs_stats.c
@@ -59,6 +59,7 @@ static int xfs_stat_proc_show(struct seq_file *m, void *v)
 		{ "abtc2",		XFSSTAT_END_ABTC_V2		},
 		{ "bmbt2",		XFSSTAT_END_BMBT_V2		},
 		{ "ibt2",		XFSSTAT_END_IBT_V2		},
+		{ "fibt2",		XFSSTAT_END_FIBT_V2		},
 		/* we print both series of quota information together */
 		{ "qm",			XFSSTAT_END_QM			},
 	};
diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index c03ad38..c8f238b 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -183,7 +183,23 @@ struct xfsstats {
 	__uint32_t		xs_ibt_2_alloc;
 	__uint32_t		xs_ibt_2_free;
 	__uint32_t		xs_ibt_2_moves;
-#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_IBT_V2+6)
+#define XFSSTAT_END_FIBT_V2		(XFSSTAT_END_IBT_V2+15)
+	__uint32_t		xs_fibt_2_lookup;
+	__uint32_t		xs_fibt_2_compare;
+	__uint32_t		xs_fibt_2_insrec;
+	__uint32_t		xs_fibt_2_delrec;
+	__uint32_t		xs_fibt_2_newroot;
+	__uint32_t		xs_fibt_2_killroot;
+	__uint32_t		xs_fibt_2_increment;
+	__uint32_t		xs_fibt_2_decrement;
+	__uint32_t		xs_fibt_2_lshift;
+	__uint32_t		xs_fibt_2_rshift;
+	__uint32_t		xs_fibt_2_split;
+	__uint32_t		xs_fibt_2_join;
+	__uint32_t		xs_fibt_2_alloc;
+	__uint32_t		xs_fibt_2_free;
+	__uint32_t		xs_fibt_2_moves;
+#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_FIBT_V2+6)
 	__uint32_t		xs_qm_dqreclaims;
 	__uint32_t		xs_qm_dqreclaim_misses;
 	__uint32_t		xs_qm_dquot_dups;
diff --git a/fs/xfs/xfs_types.h b/fs/xfs/xfs_types.h
index 82bbc34..65c6e66 100644
--- a/fs/xfs/xfs_types.h
+++ b/fs/xfs/xfs_types.h
@@ -134,7 +134,7 @@ typedef enum {
 
 typedef enum {
 	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_BMAPi, XFS_BTNUM_INOi,
-	XFS_BTNUM_MAX
+	XFS_BTNUM_FINOi, XFS_BTNUM_MAX
 } xfs_btnum_t;
 
 struct xfs_name {
-- 
1.8.1.4

* [PATCH v2 04/11] xfs: update inode allocation/free transaction reservations for finobt
@ 2013-11-13 14:37 ` Brian Foster
From: Brian Foster @ 2013-11-13 14:37 UTC
  To: xfs

Create the xfs_calc_finobt_res() helper to calculate the finobt log
reservation for inode allocation and free. Update
XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
reserve blocks for the potential finobt record insertion on inode
free (i.e., if an inode chunk was previously fully allocated).
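
For illustration only, a minimal sketch of the block reservations
described above, assuming the intent is one (max depth - 1) term per
inode btree on allocation and a full tree depth on free. The sketch_*
functions are hypothetical stand-ins for XFS_IALLOC_SPACE_RES() and
XFS_IFREE_SPACE_RES(), with plain integers in place of the mount
structure.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * ialloc_blocks plays the role of XFS_IALLOC_BLOCKS(mp), maxlevels the
 * role of mp->m_in_maxlevels.
 */
static unsigned int
sketch_ialloc_space_res(unsigned int ialloc_blocks, unsigned int maxlevels,
			bool has_finobt)
{
	/* Two inode btrees can split when the finobt is enabled. */
	unsigned int ntrees = has_finobt ? 2 : 1;

	assert(maxlevels >= 1);
	return ialloc_blocks + ntrees * (maxlevels - 1);
}

/* Freeing an inode may insert a finobt record, so reserve a full depth. */
static unsigned int
sketch_ifree_space_res(unsigned int maxlevels, bool has_finobt)
{
	return has_finobt ? maxlevels : 0;
}
```

Computing the tree count in its own variable sidesteps C operator
precedence: in a single expression, ?: binds more loosely than *, so the
conditional needs its own parentheses before it is multiplied.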

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c       |  4 +++-
 fs/xfs/xfs_trans_resv.c  | 47 +++++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_trans_space.h |  7 ++++++-
 3 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 001aa89..57c77ed 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
 	int			error;
 
 	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
-	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
+	tp->t_flags |= XFS_TRANS_RESERVE;
+	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
+				  XFS_IFREE_SPACE_RES(mp), 0);
 	if (error) {
 		ASSERT(XFS_FORCED_SHUTDOWN(mp));
 		xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES);
diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
index d53d9f0..d3f0095 100644
--- a/fs/xfs/xfs_trans_resv.c
+++ b/fs/xfs/xfs_trans_resv.c
@@ -98,6 +98,37 @@ xfs_calc_inode_res(
 }
 
 /*
+ * The free inode btree is a conditional feature and the log reservation
+ * requirements differ slightly from that of the traditional inode allocation
+ * btree. The finobt tracks records for inode chunks with at least one free inode.
+ * Therefore, a record can be removed from the tree for an inode allocation or
+ * free and the associated merge reservation is unconditional. This also covers
+ * the possibility of a split on record insertion.
+ *
+ * the free inode btree: max depth * block size
+ * the free inode btree entry: block size
+ *
+ * TODO: is the modify res really necessary? covered by the merge/split res?
+ * This seems to be the pattern of ifree, but not create_resv_alloc. Why?
+ */
+STATIC uint
+xfs_calc_finobt_res(
+	struct xfs_mount 	*mp,
+	int			modify)
+{
+	uint res;
+
+	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+		return 0;
+
+	res = xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1));
+	if (modify)
+		res += (uint)XFS_FSB_TO_B(mp, 1);
+
+	return res;
+}
+
+/*
  * Various log reservation values.
  *
  * These are based on the size of the file system block because that is what
@@ -267,6 +298,7 @@ xfs_calc_remove_reservation(
  *    the superblock for the nlink flag: sector size
  *    the directory btree: (max depth + v2) * dir block size
  *    the directory inode's bmap btree: (max depth + v2) * block size
+ *    the finobt
  */
 STATIC uint
 xfs_calc_create_resv_modify(
@@ -275,7 +307,8 @@ xfs_calc_create_resv_modify(
 	return xfs_calc_inode_res(mp, 2) +
 		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
 		(uint)XFS_FSB_TO_B(mp, 1) +
-		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
+		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_finobt_res(mp, 1);
 }
 
 /*
@@ -285,6 +318,7 @@ xfs_calc_create_resv_modify(
  *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
  *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the finobt
  */
 STATIC uint
 xfs_calc_create_resv_alloc(
@@ -295,7 +329,8 @@ xfs_calc_create_resv_alloc(
 		xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_finobt_res(mp, 0);
 }
 
 STATIC uint
@@ -313,6 +348,7 @@ __xfs_calc_create_reservation(
  *    the superblock for the nlink flag: sector size
  *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the finobt
  */
 STATIC uint
 xfs_calc_icreate_resv_alloc(
@@ -322,7 +358,8 @@ xfs_calc_icreate_resv_alloc(
 		mp->m_sb.sb_sectsize +
 		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_finobt_res(mp, 0);
 }
 
 STATIC uint
@@ -376,6 +413,7 @@ xfs_calc_symlink_reservation(
  *    the on disk inode before ours in the agi hash list: inode cluster size
  *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the finobt
  */
 STATIC uint
 xfs_calc_ifree_reservation(
@@ -391,7 +429,8 @@ xfs_calc_ifree_reservation(
 		xfs_calc_buf_res(2 + XFS_IALLOC_BLOCKS(mp) +
 				 mp->m_in_maxlevels, 0) +
 		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_finobt_res(mp, 1);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans_space.h b/fs/xfs/xfs_trans_space.h
index 7d2c920..a7d1721e 100644
--- a/fs/xfs/xfs_trans_space.h
+++ b/fs/xfs/xfs_trans_space.h
@@ -47,7 +47,9 @@
 #define	XFS_DIRREMOVE_SPACE_RES(mp)	\
 	XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK)
 #define	XFS_IALLOC_SPACE_RES(mp)	\
-	(XFS_IALLOC_BLOCKS(mp) + (mp)->m_in_maxlevels - 1)
+	(XFS_IALLOC_BLOCKS(mp) + \
+	 ((xfs_sb_version_hasfinobt(&(mp)->m_sb) ? 2 : 1) * \
+	  ((mp)->m_in_maxlevels - 1)))
 
 /*
  * Space reservation values for various transactions.
@@ -82,5 +84,8 @@
 	(XFS_DIRREMOVE_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl))
 #define	XFS_SYMLINK_SPACE_RES(mp,nl,b)	\
 	(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl) + (b))
+#define XFS_IFREE_SPACE_RES(mp)		\
+	(xfs_sb_version_hasfinobt(&(mp)->m_sb) ? (mp)->m_in_maxlevels : 0)
+
 
 #endif	/* __XFS_TRANS_SPACE_H__ */
-- 
1.8.1.4

* [PATCH v2 05/11] xfs: insert newly allocated inode chunks into the finobt
@ 2013-11-13 14:37 ` Brian Foster
  2013-11-13 14:37 ` [PATCH v2 06/11] xfs: use and update the finobt on inode allocation Brian Foster
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-11-13 14:37 UTC (permalink / raw)
  To: xfs

A newly allocated inode chunk, by definition, has at least one
free inode, so a record is always inserted into the finobt.

Create the xfs_inobt_insert() helper from existing code to insert
a record in an inobt based on the provided BTNUM. Update
xfs_ialloc_ag_alloc() to invoke the helper for the existing
XFS_BTNUM_INO tree and XFS_BTNUM_FINO tree, if enabled.
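
For illustration only, the control flow described above can be modelled in
userspace roughly as follows. This is a minimal sketch, not the kernel code:
the toy_* names, the sorted array standing in for a btree and MAX_RECS are
all hypothetical. Only the per-chunk loop and the "always the inobt, the
finobt if enabled" ordering are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define INODES_PER_CHUNK	64		/* mirrors XFS_INODES_PER_CHUNK */
#define ALL_FREE		UINT64_MAX	/* mirrors XFS_INOBT_ALL_FREE */
#define MAX_RECS		32		/* toy capacity (hypothetical) */

struct toy_rec {
	uint32_t	startino;	/* first inode of the chunk */
	int32_t		freecount;	/* number of free inodes */
	uint64_t	free;		/* one bit per inode, 1 = free */
};

struct toy_tree {			/* sorted array standing in for a btree */
	struct toy_rec	recs[MAX_RECS];
	int		nrecs;
};

/* Insert one all-free record per chunk; fail if a record already exists. */
static int toy_insert_chunks(struct toy_tree *t, uint32_t newino,
			     uint32_t newlen)
{
	for (uint32_t ino = newino; ino < newino + newlen;
	     ino += INODES_PER_CHUNK) {
		int pos = 0;

		/* Find the sorted position (the "lookup" step). */
		while (pos < t->nrecs && t->recs[pos].startino < ino)
			pos++;
		if (pos < t->nrecs && t->recs[pos].startino == ino)
			return -1;	/* the patch asserts this cannot happen */
		if (t->nrecs == MAX_RECS)
			return -1;

		/* Shift the tail and insert the new record. */
		memmove(&t->recs[pos + 1], &t->recs[pos],
			(size_t)(t->nrecs - pos) * sizeof(t->recs[0]));
		t->recs[pos] = (struct toy_rec){ ino, INODES_PER_CHUNK, ALL_FREE };
		t->nrecs++;
	}
	return 0;
}

/* Always update the inobt; update the finobt only when it is enabled. */
static int toy_ag_alloc(struct toy_tree *inobt, struct toy_tree *finobt,
			bool has_finobt, uint32_t newino, uint32_t newlen)
{
	int error = toy_insert_chunks(inobt, newino, newlen);

	if (error)
		return error;
	if (has_finobt)
		error = toy_insert_chunks(finobt, newino, newlen);
	return error;
}
```

A new chunk is entirely free, so both toy trees end up with identical
records, which is the invariant the later patches rely on.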

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 93 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 70 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 1397fc4..cd33ed6 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -112,6 +112,66 @@ xfs_inobt_get_rec(
 }
 
 /*
+ * Insert a single inobt record. Cursor must already point to desired location.
+ */
+STATIC int
+xfs_inobt_insert_rec(
+	struct xfs_btree_cur	*cur,
+	__int32_t		freecount,
+	xfs_inofree_t		free,
+	int			*stat)
+{
+	cur->bc_rec.i.ir_freecount = freecount;
+	cur->bc_rec.i.ir_free = free;
+	return xfs_btree_insert(cur, stat);
+}
+
+/*
+ * Insert records describing a newly allocated inode chunk into the inobt.
+ */
+STATIC int
+xfs_inobt_insert(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_agino_t		newino,
+	xfs_agino_t		newlen,
+	xfs_btnum_t		btnum)
+{
+	struct xfs_btree_cur	*cur;
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t		agno = be32_to_cpu(agi->agi_seqno);
+	xfs_agino_t		thisino;
+	int			i;
+	int			error;
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
+
+	for (thisino = newino;
+	     thisino < newino + newlen;
+	     thisino += XFS_INODES_PER_CHUNK) {
+		error = xfs_inobt_lookup(cur, thisino, XFS_LOOKUP_EQ, &i);
+		if (error) {
+			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+			return error;
+		}
+		ASSERT(i == 0);
+
+		error = xfs_inobt_insert_rec(cur, XFS_INODES_PER_CHUNK,
+					     XFS_INOBT_ALL_FREE, &i);
+		if (error) {
+			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+			return error;
+		}
+		ASSERT(i == 1);
+	}
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	return 0;
+}
+
+/*
  * Verify that the number of free inodes in the AGI is correct.
  */
 #ifdef DEBUG
@@ -310,13 +370,10 @@ xfs_ialloc_ag_alloc(
 {
 	xfs_agi_t	*agi;		/* allocation group header */
 	xfs_alloc_arg_t	args;		/* allocation argument structure */
-	xfs_btree_cur_t	*cur;		/* inode btree cursor */
 	xfs_agnumber_t	agno;
 	int		error;
-	int		i;
 	xfs_agino_t	newino;		/* new first inode's number */
 	xfs_agino_t	newlen;		/* new number of inodes */
-	xfs_agino_t	thisino;	/* current inode number, for loop */
 	int		isaligned = 0;	/* inode allocation at stripe unit */
 					/* boundary */
 	struct xfs_perag *pag;
@@ -454,29 +511,19 @@ xfs_ialloc_ag_alloc(
 	agi->agi_newino = cpu_to_be32(newino);
 
 	/*
-	 * Insert records describing the new inode chunk into the btree.
+	 * Insert records describing the new inode chunk into the btrees.
 	 */
-	cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno, XFS_BTNUM_INO);
-	for (thisino = newino;
-	     thisino < newino + newlen;
-	     thisino += XFS_INODES_PER_CHUNK) {
-		cur->bc_rec.i.ir_startino = thisino;
-		cur->bc_rec.i.ir_freecount = XFS_INODES_PER_CHUNK;
-		cur->bc_rec.i.ir_free = XFS_INOBT_ALL_FREE;
-		error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, &i);
-		if (error) {
-			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
-			return error;
-		}
-		ASSERT(i == 0);
-		error = xfs_btree_insert(cur, &i);
-		if (error) {
-			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
+				 XFS_BTNUM_INO);
+	if (error)
+		return error;
+
+	if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
+		error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
+					 XFS_BTNUM_FINO);
+		if (error)
 			return error;
-		}
-		ASSERT(i == 1);
 	}
-	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 	/*
 	 * Log allocation group header fields
 	 */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 06/11] xfs: use and update the finobt on inode allocation
  2013-11-13 14:36 [PATCH v2 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (4 preceding siblings ...)
  2013-11-13 14:37 ` [PATCH v2 05/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
@ 2013-11-13 14:37 ` Brian Foster
  2013-11-13 14:37 ` [PATCH v2 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper Brian Foster
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-11-13 14:37 UTC (permalink / raw)
  To: xfs

Replace xfs_dialloc_ag() with an implementation that looks for a
record in the finobt. The finobt only tracks records with at least
one free inode. This eliminates the need for the intra-ag scan in
the original algorithm. Once the inode is allocated, update the
finobt appropriately (possibly removing the record) as well as the
inobt.
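
As a rough userspace sketch of the "allocate, then update or remove the
record" step (the toy_* names are hypothetical, and toy_lowbit64() is only a
stand-in for xfs_lowbit64()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_irec {
	uint32_t	startino;	/* first inode of the chunk */
	int32_t		freecount;	/* number of free inodes */
	uint64_t	free;		/* one bit per inode, 1 = free */
};

/* Index of the lowest set bit, or -1 if the mask is empty. */
static int toy_lowbit64(uint64_t mask)
{
	int bit = 0;

	if (mask == 0)
		return -1;
	while (!(mask & 1)) {
		mask >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Allocate the lowest free inode of a record. Returns its AG-relative inode
 * number and sets *remove when the record has no free inodes left, i.e. when
 * a finobt record would be deleted rather than updated.
 */
static uint32_t toy_alloc_from_rec(struct toy_irec *rec, bool *remove)
{
	int offset = toy_lowbit64(rec->free);

	assert(offset >= 0 && offset < 64);
	rec->free &= ~(UINT64_C(1) << offset);
	rec->freecount--;
	*remove = (rec->freecount == 0);
	return rec->startino + (uint32_t)offset;
}
```

In the patch the same mask and count changes are then applied to the
matching inobt record, which is always updated and never removed here.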

Move the original xfs_dialloc_ag() algorithm to
xfs_dialloc_ag_slow() and fall back to it if finobt support is
not enabled.
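
For the same-AG case, the record selection in the new xfs_dialloc_ag() can
be sketched in userspace as follows. This is a simplified model under stated
assumptions: toy_near_pick() and the plain sorted array of chunk start
inodes are hypothetical, and the two linear passes stand in for the
XFS_LOOKUP_LE and XFS_LOOKUP_GE btree lookups. The distance comparison is
copied from the patch as posted.

```c
#include <assert.h>
#include <stdint.h>

#define INODES_PER_CHUNK	64	/* mirrors XFS_INODES_PER_CHUNK */

/*
 * starts[] holds the first inode of every finobt record, sorted ascending.
 * Every record has at least one free inode by definition. Returns the index
 * of the record to allocate from, or -1 if the tree is empty.
 */
static int toy_near_pick(const uint32_t *starts, int n, uint32_t pagino)
{
	int le = -1;	/* last record starting at or before pagino */
	int ge = -1;	/* first record starting at or after pagino */

	for (int k = 0; k < n; k++) {
		if (starts[k] <= pagino)
			le = k;
		if (starts[k] >= pagino && ge < 0)
			ge = k;
	}

	/* The parent's own chunk is in the finobt: use it. */
	if (le >= 0 && pagino < starts[le] + INODES_PER_CHUNK)
		return le;

	/* Both neighbours exist: same comparison as in the patch. */
	if (le >= 0 && ge >= 0) {
		if ((pagino - starts[le] + INODES_PER_CHUNK - 1) >
		    (starts[ge] - pagino))
			return ge;
		return le;
	}

	/* Only one side (or neither) exists. */
	return ge >= 0 ? ge : le;
}
```

Two lookups replace the bounded left/right walk of the original algorithm,
because no record in this tree can turn out to be full.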

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 210 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index cd33ed6..64e8d34 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -723,7 +723,7 @@ xfs_ialloc_get_rec(
  * available.
  */
 STATIC int
-xfs_dialloc_ag(
+xfs_dialloc_ag_slow(
 	struct xfs_trans	*tp,
 	struct xfs_buf		*agbp,
 	xfs_ino_t		parent,
@@ -981,6 +981,215 @@ error0:
 	return error;
 }
 
+STATIC int
+xfs_dialloc_ag(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_ino_t		parent,
+	xfs_ino_t		*inop)
+{
+	struct xfs_mount		*mp = tp->t_mountp;
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	xfs_agnumber_t			pagno = XFS_INO_TO_AGNO(mp, parent);
+	xfs_agino_t			pagino = XFS_INO_TO_AGINO(mp, parent);
+	struct xfs_perag		*pag;
+	struct xfs_btree_cur		*cur;
+	struct xfs_btree_cur		*tcur;
+	struct xfs_inobt_rec_incore	rec;
+	struct xfs_inobt_rec_incore	trec;
+	xfs_ino_t			ino;
+	int				error;
+	int				offset;
+	int				i, j;
+
+	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+		return xfs_dialloc_ag_slow(tp, agbp, parent, inop);
+
+	pag = xfs_perag_get(mp, agno);
+
+	/*
+	 * If pagino is 0 (this is the root inode allocation) use newino.
+	 * This must work because we've just allocated some.
+	 */
+	if (!pagino)
+		pagino = be32_to_cpu(agi->agi_newino);
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
+
+	error = xfs_check_agi_freecount(cur, agi);
+	if (error)
+		goto error_cur;
+
+	if (agno == pagno) {
+		/*
+		 * We're in the same AG as the parent inode so allocate the
+		 * closest inode to the parent.
+		 */
+		error = xfs_inobt_lookup(cur, pagino, XFS_LOOKUP_LE, &i);
+		if (error)
+			goto error_cur;
+		if (i == 1) {
+			error = xfs_inobt_get_rec(cur, &rec, &i);
+			if (error)
+				goto error_cur;
+			XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
+
+			/*
+			 * See if we've landed in the parent inode record. The
+			 * finobt only tracks chunks with at least one free
+			 * inode, so record existence is enough.
+			 */
+			if (pagino >= rec.ir_startino &&
+			    pagino < (rec.ir_startino + XFS_INODES_PER_CHUNK))
+				goto alloc_inode;
+		}
+
+		error = xfs_btree_dup_cursor(cur, &tcur);
+		if (error)
+			goto error_cur;
+
+		error = xfs_inobt_lookup(tcur, pagino, XFS_LOOKUP_GE, &j);
+		if (error)
+			goto error_tcur;
+		if (j == 1) {
+			error = xfs_inobt_get_rec(tcur, &trec, &j);
+			if (error)
+				goto error_tcur;
+			XFS_WANT_CORRUPTED_GOTO(j == 1, error_tcur);
+		}
+
+		if (i == 1 && j == 1) {
+			if ((pagino - rec.ir_startino + XFS_INODES_PER_CHUNK - 1) >
+			    (trec.ir_startino - pagino)) {
+				rec = trec;
+				xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+				cur = tcur;
+			} else {
+				xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
+			}
+		} else if (j == 1) {
+			rec = trec;
+			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+			cur = tcur;
+		} else {
+			xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
+		}
+	} else {
+		/*
+		 * Different AG from the parent inode. Check the record for the
+		 * most recently allocated inode.
+		 */
+		if (agi->agi_newino != cpu_to_be32(NULLAGINO)) {
+			error = xfs_inobt_lookup(cur, be32_to_cpu(agi->agi_newino),
+						 XFS_LOOKUP_EQ, &i);
+			if (error)
+				goto error_cur;
+			if (i == 1) {
+				error = xfs_inobt_get_rec(cur, &rec, &i);
+				if (error)
+					goto error_cur;
+				XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
+				goto alloc_inode;
+			}
+		}
+
+		/*
+		 * Allocate the first inode available in the AG.
+		 */
+		error = xfs_inobt_lookup(cur, 0, XFS_LOOKUP_GE, &i);
+		if (error)
+			goto error_cur;
+		XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
+
+		error = xfs_inobt_get_rec(cur, &rec, &i);
+		if (error)
+			goto error_cur;
+		XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
+	}
+
+alloc_inode:
+	offset = xfs_lowbit64(rec.ir_free);
+	ASSERT(offset >= 0);
+	ASSERT(offset < XFS_INODES_PER_CHUNK);
+	ASSERT((XFS_AGINO_TO_OFFSET(mp, rec.ir_startino) %
+				   XFS_INODES_PER_CHUNK) == 0);
+	ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino + offset);
+
+	/*
+	 * Modify or remove the finobt record.
+	 */
+	rec.ir_free &= ~XFS_INOBT_MASK(offset);
+	rec.ir_freecount--;
+	if (rec.ir_freecount)
+		error = xfs_inobt_update(cur, &rec);
+	else
+		error = xfs_btree_delete(cur, &i);
+	if (error)
+		goto error_cur;
+
+	/*
+	 * Lookup and modify the equivalent record in the inobt.
+	 */
+	tcur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
+
+	error = xfs_check_agi_freecount(tcur, agi);
+	if (error)
+		goto error_tcur;
+
+	error = xfs_inobt_lookup(tcur, rec.ir_startino, XFS_LOOKUP_EQ, &i);
+	if (error)
+		goto error_tcur;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error_tcur);
+
+	error = xfs_inobt_get_rec(tcur, &trec, &i);
+	if (error)
+		goto error_tcur;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error_tcur);
+	ASSERT((XFS_AGINO_TO_OFFSET(mp, trec.ir_startino) %
+				   XFS_INODES_PER_CHUNK) == 0);
+
+	trec.ir_free &= ~XFS_INOBT_MASK(offset);
+	trec.ir_freecount--;
+
+	XFS_WANT_CORRUPTED_GOTO((rec.ir_free == trec.ir_free) &&
+				(rec.ir_freecount == trec.ir_freecount),
+				error_tcur);
+
+	error = xfs_inobt_update(tcur, &trec);
+	if (error)
+		goto error_tcur;
+
+	/*
+	 * Update the perag and superblock.
+	 */
+	be32_add_cpu(&agi->agi_freecount, -1);
+	xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
+	pag->pagi_freecount--;
+
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
+
+	error = xfs_check_agi_freecount(tcur, agi);
+	if (error)
+		goto error_tcur;
+	error = xfs_check_agi_freecount(cur, agi);
+	if (error)
+		goto error_tcur;
+
+	xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	xfs_perag_put(pag);
+	*inop = ino;
+	return 0;
+
+error_tcur:
+	xfs_btree_del_cursor(tcur, XFS_BTREE_ERROR);
+error_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	xfs_perag_put(pag);
+	return error;
+}
+
 /*
  * Allocate an inode on disk.
  *
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
  2013-11-13 14:36 [PATCH v2 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (5 preceding siblings ...)
  2013-11-13 14:37 ` [PATCH v2 06/11] xfs: use and update the finobt on inode allocation Brian Foster
@ 2013-11-13 14:37 ` Brian Foster
  2013-11-13 14:37 ` [PATCH v2 08/11] xfs: update the finobt on inode free Brian Foster
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-11-13 14:37 UTC (permalink / raw)
  To: xfs

Refactor xfs_difree() in preparation for the finobt. xfs_difree()
performs the validity checks against the ag and reads the agi
header. The work of physically updating the inode allocation btree
is pushed down into the new xfs_difree_inobt() helper.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 160 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 96 insertions(+), 64 deletions(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 64e8d34..fd77b28 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -1349,74 +1349,31 @@ out_error:
 	return XFS_ERROR(error);
 }
 
-/*
- * Free disk inode.  Carefully avoids touching the incore inode, all
- * manipulations incore are the caller's responsibility.
- * The on-disk inode is not changed by this operation, only the
- * btree (free inode mask) is changed.
- */
-int
-xfs_difree(
-	xfs_trans_t	*tp,		/* transaction pointer */
-	xfs_ino_t	inode,		/* inode to be freed */
-	xfs_bmap_free_t	*flist,		/* extents to free */
-	int		*delete,	/* set if inode cluster was deleted */
-	xfs_ino_t	*first_ino)	/* first inode in deleted cluster */
+STATIC int
+xfs_difree_inobt(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	struct xfs_buf			*agbp,
+	xfs_agino_t			agino,
+	struct xfs_bmap_free		*flist,
+	int				*delete,
+	xfs_ino_t			*first_ino,
+	struct xfs_inobt_rec_incore	*orec)
 {
-	/* REFERENCED */
-	xfs_agblock_t	agbno;	/* block number containing inode */
-	xfs_buf_t	*agbp;	/* buffer containing allocation group header */
-	xfs_agino_t	agino;	/* inode number relative to allocation group */
-	xfs_agnumber_t	agno;	/* allocation group number */
-	xfs_agi_t	*agi;	/* allocation group header */
-	xfs_btree_cur_t	*cur;	/* inode btree cursor */
-	int		error;	/* error return value */
-	int		i;	/* result code */
-	int		ilen;	/* inodes in an inode cluster */
-	xfs_mount_t	*mp;	/* mount structure for filesystem */
-	int		off;	/* offset of inode in inode chunk */
-	xfs_inobt_rec_incore_t rec;	/* btree record */
-	struct xfs_perag *pag;
-
-	mp = tp->t_mountp;
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	xfs_agblock_t			agbno = XFS_AGINO_TO_AGBNO(mp, agino);
+	struct xfs_perag		*pag;
+	struct xfs_btree_cur		*cur;
+	struct xfs_inobt_rec_incore	rec;
+	int				ilen;
+	int				error;
+	int				i;
+	int				off;
 
-	/*
-	 * Break up inode number into its components.
-	 */
-	agno = XFS_INO_TO_AGNO(mp, inode);
-	if (agno >= mp->m_sb.sb_agcount)  {
-		xfs_warn(mp, "%s: agno >= mp->m_sb.sb_agcount (%d >= %d).",
-			__func__, agno, mp->m_sb.sb_agcount);
-		ASSERT(0);
-		return XFS_ERROR(EINVAL);
-	}
-	agino = XFS_INO_TO_AGINO(mp, inode);
-	if (inode != XFS_AGINO_TO_INO(mp, agno, agino))  {
-		xfs_warn(mp, "%s: inode != XFS_AGINO_TO_INO() (%llu != %llu).",
-			__func__, (unsigned long long)inode,
-			(unsigned long long)XFS_AGINO_TO_INO(mp, agno, agino));
-		ASSERT(0);
-		return XFS_ERROR(EINVAL);
-	}
-	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
-	if (agbno >= mp->m_sb.sb_agblocks)  {
-		xfs_warn(mp, "%s: agbno >= mp->m_sb.sb_agblocks (%d >= %d).",
-			__func__, agbno, mp->m_sb.sb_agblocks);
-		ASSERT(0);
-		return XFS_ERROR(EINVAL);
-	}
-	/*
-	 * Get the allocation group header.
-	 */
-	error = xfs_ialloc_read_agi(mp, tp, agno, &agbp);
-	if (error) {
-		xfs_warn(mp, "%s: xfs_ialloc_read_agi() returned error %d.",
-			__func__, error);
-		return error;
-	}
-	agi = XFS_BUF_TO_AGI(agbp);
 	ASSERT(agi->agi_magicnum == cpu_to_be32(XFS_AGI_MAGIC));
 	ASSERT(agbno < be32_to_cpu(agi->agi_length));
+
 	/*
 	 * Initialize the cursor.
 	 */
@@ -1512,6 +1469,7 @@ xfs_difree(
 	if (error)
 		goto error0;
 
+	*orec = rec;
 	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 	return 0;
 
@@ -1520,6 +1478,80 @@ error0:
 	return error;
 }
 
+/*
+ * Free disk inode.  Carefully avoids touching the incore inode, all
+ * manipulations incore are the caller's responsibility.
+ * The on-disk inode is not changed by this operation, only the
+ * btree (free inode mask) is changed.
+ */
+int
+xfs_difree(
+	xfs_trans_t	*tp,		/* transaction pointer */
+	xfs_ino_t	inode,		/* inode to be freed */
+	xfs_bmap_free_t	*flist,		/* extents to free */
+	int		*delete,	/* set if inode cluster was deleted */
+	xfs_ino_t	*first_ino)	/* first inode in deleted cluster */
+{
+	/* REFERENCED */
+	xfs_agblock_t	agbno;	/* block number containing inode */
+	xfs_buf_t	*agbp;	/* buffer containing allocation group header */
+	xfs_agino_t	agino;	/* inode number relative to allocation group */
+	xfs_agnumber_t	agno;	/* allocation group number */
+	int		error;	/* error return value */
+	xfs_mount_t	*mp;	/* mount structure for filesystem */
+	xfs_inobt_rec_incore_t rec;	/* btree record */
+
+	mp = tp->t_mountp;
+
+	/*
+	 * Break up inode number into its components.
+	 */
+	agno = XFS_INO_TO_AGNO(mp, inode);
+	if (agno >= mp->m_sb.sb_agcount)  {
+		xfs_warn(mp, "%s: agno >= mp->m_sb.sb_agcount (%d >= %d).",
+			__func__, agno, mp->m_sb.sb_agcount);
+		ASSERT(0);
+		return XFS_ERROR(EINVAL);
+	}
+	agino = XFS_INO_TO_AGINO(mp, inode);
+	if (inode != XFS_AGINO_TO_INO(mp, agno, agino))  {
+		xfs_warn(mp, "%s: inode != XFS_AGINO_TO_INO() (%llu != %llu).",
+			__func__, (unsigned long long)inode,
+			(unsigned long long)XFS_AGINO_TO_INO(mp, agno, agino));
+		ASSERT(0);
+		return XFS_ERROR(EINVAL);
+	}
+	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
+	if (agbno >= mp->m_sb.sb_agblocks)  {
+		xfs_warn(mp, "%s: agbno >= mp->m_sb.sb_agblocks (%d >= %d).",
+			__func__, agbno, mp->m_sb.sb_agblocks);
+		ASSERT(0);
+		return XFS_ERROR(EINVAL);
+	}
+	/*
+	 * Get the allocation group header.
+	 */
+	error = xfs_ialloc_read_agi(mp, tp, agno, &agbp);
+	if (error) {
+		xfs_warn(mp, "%s: xfs_ialloc_read_agi() returned error %d.",
+			__func__, error);
+		return error;
+	}
+
+	/*
+	 * Fix up the inode allocation btree.
+	 */
+	error = xfs_difree_inobt(mp, tp, agbp, agino, flist, delete, first_ino,
+				 &rec);
+	if (error)
+		goto error0;
+
+	return 0;
+
+error0:
+	return error;
+}
+
 STATIC int
 xfs_imap_lookup(
 	struct xfs_mount	*mp,
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 08/11] xfs: update the finobt on inode free
  2013-11-13 14:36 [PATCH v2 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (6 preceding siblings ...)
  2013-11-13 14:37 ` [PATCH v2 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper Brian Foster
@ 2013-11-13 14:37 ` Brian Foster
  2013-11-13 14:37 ` [PATCH v2 09/11] xfs: add finobt support to growfs Brian Foster
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-11-13 14:37 UTC (permalink / raw)
  To: xfs

An inode free operation can have several effects on the finobt. If
all inodes have been freed and the chunk deallocated, we remove the
finobt record. If the inode chunk was previously full, we must
insert a new record based on the existing inobt record. Otherwise,
we modify the record in place.

Create the xfs_difree_finobt() function to identify the potential
scenarios and update the finobt appropriately.
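
The three scenarios can be summarised with a small decision function. This
is only a sketch: the enum and toy_finobt_free_action() are hypothetical
names, and the inputs are assumed to describe the inobt record after the
free has already been applied, as in the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_finobt_action {
	TOY_FINOBT_INSERT,	/* chunk was full: add a new finobt record */
	TOY_FINOBT_UPDATE,	/* modify the existing record in place */
	TOY_FINOBT_DELETE,	/* chunk deallocated: drop the record */
	TOY_FINOBT_CORRUPT	/* the trees disagree */
};

/*
 * in_finobt:		a finobt record exists for this chunk
 * freecount:		free inodes in the inobt record after the free
 * inodes_per_alloc:	inodes per allocation, cf. XFS_IALLOC_INODES(mp)
 * ikeep:		empty chunks are kept (the ikeep mount option)
 */
static enum toy_finobt_action
toy_finobt_free_action(bool in_finobt, int freecount, int inodes_per_alloc,
		       bool ikeep)
{
	if (!in_finobt) {
		/* No record means the chunk was full before this free. */
		return freecount == 1 ? TOY_FINOBT_INSERT : TOY_FINOBT_CORRUPT;
	}
	if (freecount == inodes_per_alloc && !ikeep)
		return TOY_FINOBT_DELETE;
	return TOY_FINOBT_UPDATE;
}
```

With ikeep set, a completely free chunk stays allocated, so its record is
updated rather than removed.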

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index fd77b28..de46ba2 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -1479,6 +1479,106 @@ error0:
 }
 
 /*
+ * Free an inode in the free inode btree.
+ */
+STATIC int
+xfs_difree_finobt(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	struct xfs_buf			*agbp,
+	xfs_agino_t			agino,
+	struct xfs_inobt_rec_incore	*ibtrec) /* inobt record */
+{
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	struct xfs_btree_cur		*cur;
+	struct xfs_inobt_rec_incore	rec;
+	int				offset = agino - ibtrec->ir_startino;
+	int				error;
+	int				i;
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
+
+	error = xfs_inobt_lookup(cur, ibtrec->ir_startino, XFS_LOOKUP_EQ, &i);
+	if (error)
+		goto error;
+	if (i == 0) {
+		/*
+		 * If the record does not exist in the finobt, we must have just
+		 * freed an inode in a previously fully allocated chunk. If not,
+		 * something is out of sync.
+		 */
+		XFS_WANT_CORRUPTED_GOTO(ibtrec->ir_freecount == 1, error);
+
+		error = xfs_inobt_insert_rec(cur, ibtrec->ir_freecount,
+					     ibtrec->ir_free, &i);
+		if (error)
+			goto error;
+		ASSERT(i == 1);
+
+		goto out;
+	}
+
+	/*
+	 * Read and update the existing record.
+	 */
+	error = xfs_inobt_get_rec(cur, &rec, &i);
+	if (error)
+		goto error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+
+	rec.ir_free |= XFS_INOBT_MASK(offset);
+	rec.ir_freecount++;
+
+	XFS_WANT_CORRUPTED_GOTO((rec.ir_free == ibtrec->ir_free) &&
+				(rec.ir_freecount == ibtrec->ir_freecount),
+				error);
+
+	/*
+	 * The content of inobt records should always match between the inobt
+	 * and finobt. The lifecycle of records in the finobt is different from
+	 * the inobt in that the finobt only tracks records with at least one
+	 * free inode. This is to optimize lookup for inode allocation purposes.
+	 * The following checks determine whether to update the existing record or
+	 * remove it entirely.
+	 */
+
+	if (rec.ir_freecount == XFS_IALLOC_INODES(mp) &&
+	    !(mp->m_flags & XFS_MOUNT_IKEEP)) {
+		/*
+		 * If all inodes are free and we're in !ikeep mode, the entire
+		 * inode chunk has been deallocated. Remove the record from the
+		 * finobt.
+		 */
+		error = xfs_btree_delete(cur, &i);
+		if (error)
+			goto error;
+		ASSERT(i == 1);
+	} else {
+		/*
+		 * The existing finobt record was modified and has a combination
+		 * of allocated and free inodes or is completely free and ikeep
+		 * is enabled. Update the record.
+		 */
+		error = xfs_inobt_update(cur, &rec);
+		if (error)
+			goto error;
+	}
+
+out:
+	error = xfs_check_agi_freecount(cur, agi);
+	if (error)
+		goto error;
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+
+error:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/*
  * Free disk inode.  Carefully avoids touching the incore inode, all
  * manipulations incore are the caller's responsibility.
  * The on-disk inode is not changed by this operation, only the
@@ -1546,6 +1646,15 @@ xfs_difree(
 	if (error)
 		goto error0;
 
+	/*
+	 * Fix up the free inode btree.
+	 */
+	if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+		error = xfs_difree_finobt(mp, tp, agbp, agino, &rec);
+		if (error)
+			goto error0;
+	}
+
 	return 0;
 
 error0:
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 09/11] xfs: add finobt support to growfs
  2013-11-13 14:36 [PATCH v2 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (7 preceding siblings ...)
  2013-11-13 14:37 ` [PATCH v2 08/11] xfs: update the finobt on inode free Brian Foster
@ 2013-11-13 14:37 ` Brian Foster
  2013-11-13 14:37 ` [PATCH v2 10/11] xfs: report finobt status in fs geometry Brian Foster
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-11-13 14:37 UTC (permalink / raw)
  To: xfs

Add finobt support to growfs. Initialize the agi root/level fields
and the root finobt block.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_fsops.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index a6e54b3..63d9424 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -312,6 +312,10 @@ xfs_growfs_data_private(
 		agi->agi_dirino = cpu_to_be32(NULLAGINO);
 		if (xfs_sb_version_hascrc(&mp->m_sb))
 			uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_uuid);
+		if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+			agi->agi_free_root = cpu_to_be32(XFS_FIBT_BLOCK(mp));
+			agi->agi_free_level = cpu_to_be32(1);
+		}
 		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++)
 			agi->agi_unlinked[bucket] = cpu_to_be32(NULLAGINO);
 
@@ -403,6 +407,34 @@ xfs_growfs_data_private(
 		xfs_buf_relse(bp);
 		if (error)
 			goto error0;
+
+		/*
+		 * FINO btree root block
+		 */
+		if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+			bp = xfs_growfs_get_hdr_buf(mp,
+				XFS_AGB_TO_DADDR(mp, agno, XFS_FIBT_BLOCK(mp)),
+				BTOBB(mp->m_sb.sb_blocksize), 0,
+				&xfs_inobt_buf_ops);
+			if (!bp) {
+				error = ENOMEM;
+				goto error0;
+			}
+
+			if (xfs_sb_version_hascrc(&mp->m_sb))
+				xfs_btree_init_block(mp, bp, XFS_FIBT_CRC_MAGIC,
+						     0, 0, agno,
+						     XFS_BTREE_CRC_BLOCKS);
+			else
+				xfs_btree_init_block(mp, bp, XFS_FIBT_MAGIC, 0,
+						     0, agno, 0);
+
+			error = xfs_bwrite(bp);
+			xfs_buf_relse(bp);
+			if (error)
+				goto error0;
+		}
+
 	}
 	xfs_trans_agblocks_delta(tp, nfree);
 	/*
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 10/11] xfs: report finobt status in fs geometry
  2013-11-13 14:36 [PATCH v2 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (8 preceding siblings ...)
  2013-11-13 14:37 ` [PATCH v2 09/11] xfs: add finobt support to growfs Brian Foster
@ 2013-11-13 14:37 ` Brian Foster
  2013-11-13 14:37 ` [PATCH v2 11/11] xfs: enable the finobt feature on v5 superblocks Brian Foster
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-11-13 14:37 UTC (permalink / raw)
  To: xfs

Define the XFS_FSOP_GEOM_FLAGS_FINOBT fs geometry flag and set the
associated bit if the filesystem supports the free inode btree.
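
A consumer of the geometry flags would presumably test the new bit along
these lines. This is a standalone sketch: the flag values are the ones from
the xfs_fs.h hunk of this series, while the shortened macro names and the
toy_* helpers are hypothetical and do not call the real geometry ioctl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined for the XFS_FSOP_GEOM_FLAGS_* macros in xfs_fs.h. */
#define GEOM_FLAGS_V5SB		0x8000u		/* version 5 superblock */
#define GEOM_FLAGS_FTYPE	0x10000u	/* inode directory types */
#define GEOM_FLAGS_FINOBT	0x20000u	/* free inode btree */

/* Build the flags word the same way xfs_fs_geometry() ORs in each feature. */
static uint32_t toy_geom_flags(bool has_crc, bool has_ftype, bool has_finobt)
{
	return (has_crc ? GEOM_FLAGS_V5SB : 0) |
	       (has_ftype ? GEOM_FLAGS_FTYPE : 0) |
	       (has_finobt ? GEOM_FLAGS_FINOBT : 0);
}

/* What a caller would check after fetching the geometry. */
static bool toy_geom_has_finobt(uint32_t flags)
{
	return (flags & GEOM_FLAGS_FINOBT) != 0;
}
```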

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_fs.h    | 1 +
 fs/xfs/xfs_fsops.c | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
index c5fc116..d34703d 100644
--- a/fs/xfs/xfs_fs.h
+++ b/fs/xfs/xfs_fs.h
@@ -238,6 +238,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_LAZYSB	0x4000	/* lazy superblock counters */
 #define XFS_FSOP_GEOM_FLAGS_V5SB	0x8000	/* version 5 superblock */
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
+#define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 63d9424..5f78ba9 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -104,7 +104,9 @@ xfs_fs_geometry(
 			(xfs_sb_version_hascrc(&mp->m_sb) ?
 				XFS_FSOP_GEOM_FLAGS_V5SB : 0) |
 			(xfs_sb_version_hasftype(&mp->m_sb) ?
-				XFS_FSOP_GEOM_FLAGS_FTYPE : 0);
+				XFS_FSOP_GEOM_FLAGS_FTYPE : 0) |
+			(xfs_sb_version_hasfinobt(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_FINOBT : 0);
 		geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
 				mp->m_sb.sb_logsectsize : BBSIZE;
 		geo->rtsectsize = mp->m_sb.sb_blocksize;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 11/11] xfs: enable the finobt feature on v5 superblocks
  2013-11-13 14:36 [PATCH v2 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (9 preceding siblings ...)
  2013-11-13 14:37 ` [PATCH v2 10/11] xfs: report finobt status in fs geometry Brian Foster
@ 2013-11-13 14:37 ` Brian Foster
  2013-11-13 16:17 ` [PATCH v2 00/11] xfs: introduce the free inode btree Christoph Hellwig
  2013-11-17 22:43 ` Michael L. Semon
  12 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-11-13 14:37 UTC (permalink / raw)
  To: xfs

Add the finobt feature bit to the list of known features. As of
this point, the kernel code knows how to mount and manage both
finobt and non-finobt formatted filesystems.
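
The effect of adding the bit to the known mask can be shown with a small
standalone sketch. toy_ro_compat_unknown() is a hypothetical helper; it only
models the "features & ~known" test implied by
XFS_SB_FEAT_RO_COMPAT_UNKNOWN. My understanding is that a filesystem with
unknown ro-compat bits may only be mounted read-only, but that policy is not
part of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define RO_COMPAT_FINOBT	(UINT32_C(1) << 0)	/* free inode btree */

/* Bits set in 'features' that are missing from the 'known' mask. */
static uint32_t toy_ro_compat_unknown(uint32_t features, uint32_t known)
{
	return features & ~known;
}
```

Before this patch the known mask is 0, so the finobt bit counts as unknown.
Afterwards only bits other than the finobt bit are reported.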

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_sb.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_sb.h b/fs/xfs/xfs_sb.h
index 070a7f6..9919fb8 100644
--- a/fs/xfs/xfs_sb.h
+++ b/fs/xfs/xfs_sb.h
@@ -586,7 +586,8 @@ xfs_sb_has_compat_feature(
 }
 
 #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
-#define XFS_SB_FEAT_RO_COMPAT_ALL 0
+#define XFS_SB_FEAT_RO_COMPAT_ALL \
+		(XFS_SB_FEAT_RO_COMPAT_FINOBT)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 00/11] xfs: introduce the free inode btree
  2013-11-13 14:36 [PATCH v2 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (10 preceding siblings ...)
  2013-11-13 14:37 ` [PATCH v2 11/11] xfs: enable the finobt feature on v5 superblocks Brian Foster
@ 2013-11-13 16:17 ` Christoph Hellwig
  2013-11-13 17:55   ` Brian Foster
  2013-11-17 22:43 ` Michael L. Semon
  12 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2013-11-13 16:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

I have to admit that I haven't followed this series as closely as I
should, but could you summarize the performance of it?  What workloads
does it help most, what workloads does it hurt and how much?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
  2013-11-13 14:36 ` [PATCH v2 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
@ 2013-11-13 16:17   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2013-11-13 16:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt
  2013-11-13 14:36 ` [PATCH v2 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
@ 2013-11-13 16:18   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2013-11-13 16:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

> +	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&

no need for the bracing here.

Otherwise looks fine.

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 00/11] xfs: introduce the free inode btree
  2013-11-13 16:17 ` [PATCH v2 00/11] xfs: introduce the free inode btree Christoph Hellwig
@ 2013-11-13 17:55   ` Brian Foster
  2013-11-13 21:10     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2013-11-13 17:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On 11/13/2013 11:17 AM, Christoph Hellwig wrote:
> I have to admit that I haven't followed this series as closely as I
> should, but could you summarize the performance of it?  What workloads
> does it help most, what workloads does it hurt and how much?
> 

Hi Christoph,

Sure... this work is based on Dave's write up here:

http://oss.sgi.com/archives/xfs/2013-08/msg00344.html

... where he also explains the general idea, which is basically to
improve inode allocation performance on a large fs that happens to be
sparsely populated with inode chunks with free inodes. We do this by
creating a second inode btree that only tracks inode chunks with at
least one free inode.

So far I've only really ad hoc tested the focused case: create millions
of inodes on an fs, strategically remove an inode towards the end of the
ag such that there is one existing inode chunk with a single free inode,
then go and create a file.

The current implementation hits the fallback search in xfs_dialloc_ag()
(the for loop prior to 'alloc_inode:') and degrades to a couple seconds
or so (on my crappy single spindle setup). Alternatively, the finobt in
this scenario contains a single record with the chunk with the free
inode, so the record lookup and allocation time is basically constant
(e.g., we eliminate the need to ever run the full ag scan).
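
The difference between the two cases can be illustrated with a toy
userspace model. This is a sketch under simplifying assumptions, not
the kernel code: the inobt is modeled as a flat array with one record
per inode chunk (64 inodes each), the finobt as an array that holds
only records with free inodes, and all names (struct chunk_rec,
scan_inobt, lookup_finobt, demo_*) are hypothetical. The real trees are
btrees and the real search starts near the parent inode rather than at
record 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for an inode chunk record. */
struct chunk_rec {
	unsigned long long startino;	/* first inode number in the chunk */
	unsigned int freecount;		/* free inodes left in the chunk */
};

/*
 * Toy model of the fallback search: examine records in order until one
 * with a free inode turns up. Returns its index or -1, and stores the
 * number of records examined in *visited.
 */
long scan_inobt(const struct chunk_rec *recs, size_t n, size_t *visited)
{
	*visited = 0;
	for (size_t i = 0; i < n; i++) {
		(*visited)++;
		if (recs[i].freecount > 0)
			return (long)i;
	}
	return -1;
}

/*
 * Toy model of a finobt lookup: the array holds only records with
 * freecount > 0, so the first record is always usable.
 */
long lookup_finobt(const struct chunk_rec *frecs, size_t n, size_t *visited)
{
	*visited = 0;
	if (n == 0)
		return -1;
	*visited = 1;
	assert(frecs[0].freecount > 0);
	return 0;
}

/* Records examined by the scan when only chunk free_idx of n has a free inode. */
size_t demo_scan_cost(size_t n, size_t free_idx)
{
	struct chunk_rec *recs = calloc(n, sizeof(*recs));
	size_t visited = 0;

	assert(recs != NULL && free_idx < n);
	for (size_t i = 0; i < n; i++)
		recs[i].startino = (unsigned long long)i * 64;
	recs[free_idx].freecount = 1;
	scan_inobt(recs, n, &visited);
	free(recs);
	return visited;
}

/* Records examined by the finobt model for the same situation. */
size_t demo_finobt_cost(size_t free_idx)
{
	struct chunk_rec frec = { (unsigned long long)free_idx * 64, 1 };
	size_t visited = 0;

	lookup_finobt(&frec, 1, &visited);
	return visited;
}
```

With the single free inode in the last of n chunks, the scan model
examines all n records while the finobt model examines one, which is
the shape of the focused test case described above.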

Sorry I don't have more specific numbers at the moment. Most of my
testing so far has been the focused case and general reliability
testing. I'll need to find some hardware worthy of performance testing,
particularly to check for any potential negative effects of managing the
secondary tree. I suppose I wouldn't expect it to be much worse than the
overhead of managing two free space trees, but we'll see.
Thoughts/suggestions appreciated, thanks.

Brian


* Re: [PATCH v2 00/11] xfs: introduce the free inode btree
  2013-11-13 17:55   ` Brian Foster
@ 2013-11-13 21:10     ` Dave Chinner
  2013-11-19 21:29       ` Brian Foster
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2013-11-13 21:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, xfs

On Wed, Nov 13, 2013 at 12:55:38PM -0500, Brian Foster wrote:
> On 11/13/2013 11:17 AM, Christoph Hellwig wrote:
> > I have to admit that I haven't followed this series as closely as I
> > should, but could you summarize the performance of it?  What workloads
> > does it help most, what workloads does it hurt and how much?
> > 
> 
> Hi Christoph,
> 
> Sure... this work is based on Dave's write up here:
> 
> http://oss.sgi.com/archives/xfs/2013-08/msg00344.html
> 
> ... where he also explains the general idea, which is basically to
> improve inode allocation performance on a large fs that happens to be
> sparsely populated with inode chunks with free inodes. We do this by
> creating a second inode btree that only tracks inode chunks with at
> least one free inode.

This is a common problem for people who use hard-link based backup
repositories when they start removing backups. It results in random
inode removal, and so allocation never hits the "no free inodes"
fast path. As a result, allocation speed can drop a couple of orders
of magnitude due to the added CPU overhead of searching for free
inodes to allocate. It is completely unpredictable as to when it will
occur, so one backup might run at full speed, and the next might
take 3-4x as long to complete....

> Sorry I don't have more specific numbers at the moment. Most of my
> testing so far has been the focused case and general reliability
> testing. I'll need to find some hardware worthy of performance testing,
> particularly to check for any potential negative effects of managing the
> secondary tree. I suppose I wouldn't expect it to be much worse than the
> overhead of managing two free space trees, but we'll see.
> Thoughts/suggestions appreciated, thanks.

The problem can be demonstrated with a single CPU and a single
spindle. Create a single-AG filesystem of 100GB, and populate it
with 10 million inodes.

Time how long it takes to create another 10000 inodes in a new
directory. Measure CPU usage.

Randomly delete 10,000 inodes from the original population to
sparsely populate the inobt with 10000 free inodes.

Time how long it takes to create another 10000 inodes in a new
directory. Measure CPU usage.

The difference in time and CPU will be directly related to the
additional time spent searching the inobt for free inodes...
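
The create and delete steps of the recipe above could be driven by a
pair of small helpers like the following. This is a minimal sketch, not
an existing tool: create_files and remove_every are hypothetical names,
the file naming scheme is an assumption, and remove_every deletes at a
fixed stride rather than randomly, which only approximates the random
deletion step. CPU time is taken from clock(), which reports CPU time
used by the calling process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/*
 * Create `count` empty files named <dir>/<prefix><n> and return how many
 * were created. If non-NULL, *wall_s receives the elapsed wall-clock
 * seconds and *cpu_s the process CPU seconds spent in the loop.
 */
long create_files(const char *dir, const char *prefix, long count,
		  double *wall_s, double *cpu_s)
{
	struct timespec t0, t1;
	clock_t c0, c1;
	char path[4096];
	long done = 0;

	assert(dir && prefix && count >= 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	c0 = clock();
	for (long i = 0; i < count; i++) {
		int n = snprintf(path, sizeof(path), "%s/%s%ld", dir, prefix, i);
		if (n < 0 || (size_t)n >= sizeof(path))
			break;
		int fd = open(path, O_CREAT | O_WRONLY, 0644);
		if (fd < 0)
			break;
		close(fd);
		done++;
	}
	c1 = clock();
	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (wall_s)
		*wall_s = (double)(t1.tv_sec - t0.tv_sec) +
			  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	if (cpu_s)
		*cpu_s = (double)(c1 - c0) / CLOCKS_PER_SEC;
	return done;
}

/*
 * Unlink every `stride`-th file among <dir>/<prefix>0 .. <prefix>(count-1)
 * and return how many were removed. A fixed stride spreads the freed
 * inodes across the population; it is not a random selection.
 */
long remove_every(const char *dir, const char *prefix, long count, long stride)
{
	char path[4096];
	long done = 0;

	assert(dir && prefix && count >= 0 && stride > 0);
	for (long i = 0; i < count; i += stride) {
		int n = snprintf(path, sizeof(path), "%s/%s%ld", dir, prefix, i);
		if (n < 0 || (size_t)n >= sizeof(path))
			break;
		if (unlink(path) == 0)
			done++;
	}
	return done;
}
```

For the numbers in the recipe, that would mean populating with a count
of 10 million, timing a create_files() run of 10000 in a new directory,
calling remove_every() on the original population with a stride of
1000, and then timing a second create_files() run of 10000 in another
new directory.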

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v2 00/11] xfs: introduce the free inode btree
  2013-11-13 14:36 [PATCH v2 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (11 preceding siblings ...)
  2013-11-13 16:17 ` [PATCH v2 00/11] xfs: introduce the free inode btree Christoph Hellwig
@ 2013-11-17 22:43 ` Michael L. Semon
  2013-11-18 22:38   ` Michael L. Semon
  12 siblings, 1 reply; 21+ messages in thread
From: Michael L. Semon @ 2013-11-17 22:43 UTC (permalink / raw)
  To: Brian Foster, xfs

On 11/13/2013 09:36 AM, Brian Foster wrote:
> Hi all,
> 
> The free inode btree adds a new inode btree to XFS with the intent to
> track only inode chunks with at least one free inode. Patches 1-3 add
> the necessary support for the new XFS_BTNUM_FINOBT type and introduce a
> read-only v5 superblock flag. Patch 4 updates the transaction
> reservations for inode allocation operations to account for the finobt.
> Patches 5-9 add support to manage the finobt on inode chunk allocation,
> inode allocation, inode free (and chunk deletion) and growfs. Patch 10
> adds support to report finobt status in the fs geometry. Patch 11 adds
> the feature bit to the associated mask. Thoughts, reviews, flames
> appreciated.
> 
> Brian
> 
> v2:
> - Rebase to latest xfs tree (minor shifting around of some header bits).
> - Added "xfs: report finobt status in fs geometry" patch to series.

Very nice rebase!  There might have been a whitespace issue on patch #6 
for kernel and xfsprogs, but it was easy going after that.

I'm halfway through testing 4k finobt CRC filesystems on a 2.2-GB, 2-disk 
md RAID-0, x86 Pentium 4, 512 MB of RAM.  The current nasty setup is 
kernel 3.12.0+, less the 5 most recent AIO commits/merges, and me trying 
to get in the few not-merged Dave Chinner kernel/xfsprogs patches along 
with your patches.

I meant to be done with 4k by now, but generic/224 caused the kernel OOM 
killer to halt testing, much like it does in 256 MB RAM without finobt.  
No problem:  I'll thank Stan in advance for introducing me to the term 
O_PONIES.

The rest of this letter is random junk that hasn't been re-tested, to 
give a flavor of what might lie ahead.  I'm missing a stack trace to the 
effect of "Error 117: offline filesystem operation in progress" as 
something later than xfstests xfs/296 was running.  None of this letter 
needs a reply.

Good luck!

Michael

[NOISE FOLLOWS]

***** I don't know if this one is an xfstests issue or an xfsprogs 
issue.  Something like this also happened in a non-finobt 
`./check -g auto`...

xfs/033	 [failed, exit status 1] - output mismatch (see /var/lib/xfstests/results//xfs/033.out.bad)
    --- tests/xfs/033.out	2013-11-11 13:46:22.367412935 -0500
    +++ /var/lib/xfstests/results//xfs/033.out.bad	2013-11-17 12:57:28.010382465 -0500
    @@ -17,9 +17,10 @@
             - process known inodes and perform inode discovery...
     bad magic number 0x0 on inode INO
     bad version number 0x0 on inode INO
    +inode identifier 0 mismatch on inode INO
     bad magic number 0x0 on inode INO, resetting magic number
     bad version number 0x0 on inode INO, resetting version number
    -imap claims a free inode INO is in use, correcting imap and clearing inode
     ...
     (Run 'diff -u tests/xfs/033.out /var/lib/xfstests/results//xfs/033.out.bad' to see the entire diff)

***** The diff for xfs/033:

19a20
> inode identifier 0 mismatch on inode INO
22c23
< imap claims a free inode INO is in use, correcting imap and clearing inode
---
> inode identifier 0 mismatch on inode INO
33,194c34,37
<         - resetting contents of realtime bitmap and summary inodes
<         - traversing filesystem ...
<         - traversal finished ...
<         - moving disconnected inodes to lost+found ...
< Phase 7 - verify and correct link counts...
< resetting inode INO nlinks from 1 to 2
< done
< Corrupting rt bitmap inode - setting bits to 0
< Wrote X.XXKb (value 0x0)
< Phase 1 - find and verify superblock...
< Phase 2 - using <TYPEOF> log
<         - zero log...
<         - scan filesystem freespace and inode maps...
<         - found root inode chunk
< Phase 3 - for each AG...
<         - scan and clear agi unlinked lists...
<         - process known inodes and perform inode discovery...
< bad magic number 0x0 on inode INO
< bad version number 0x0 on inode INO
< bad magic number 0x0 on inode INO, resetting magic number
< bad version number 0x0 on inode INO, resetting version number
< imap claims a free inode INO is in use, correcting imap and clearing inode
< cleared realtime bitmap inode INO
<         - process newly discovered inodes...
< Phase 4 - check for duplicate blocks...
<         - setting up duplicate extent list...
<         - check for inodes claiming duplicate blocks...
< Phase 5 - rebuild AG headers and trees...
<         - reset superblock...
< Phase 6 - check inode connectivity...
< reinitializing realtime bitmap inode
<         - resetting contents of realtime bitmap and summary inodes
<         - traversing filesystem ...
<         - traversal finished ...
<         - moving disconnected inodes to lost+found ...
< Phase 7 - verify and correct link counts...
< done
< Corrupting rt summary inode - setting bits to 0
< Wrote X.XXKb (value 0x0)
< Phase 1 - find and verify superblock...
< Phase 2 - using <TYPEOF> log
<         - zero log...
<         - scan filesystem freespace and inode maps...
<         - found root inode chunk
< Phase 3 - for each AG...
<         - scan and clear agi unlinked lists...
<         - process known inodes and perform inode discovery...
< bad magic number 0x0 on inode INO
< bad version number 0x0 on inode INO
< bad magic number 0x0 on inode INO, resetting magic number
< bad version number 0x0 on inode INO, resetting version number
< imap claims a free inode INO is in use, correcting imap and clearing inode
< cleared realtime summary inode INO
<         - process newly discovered inodes...
< Phase 4 - check for duplicate blocks...
<         - setting up duplicate extent list...
<         - check for inodes claiming duplicate blocks...
< Phase 5 - rebuild AG headers and trees...
<         - reset superblock...
< Phase 6 - check inode connectivity...
< reinitializing realtime summary inode
<         - resetting contents of realtime bitmap and summary inodes
<         - traversing filesystem ...
<         - traversal finished ...
<         - moving disconnected inodes to lost+found ...
< Phase 7 - verify and correct link counts...
< done
< Corrupting root inode - setting bits to -1
< Wrote X.XXKb (value 0xffffffff)
< Phase 1 - find and verify superblock...
< Phase 2 - using <TYPEOF> log
<         - zero log...
<         - scan filesystem freespace and inode maps...
<         - found root inode chunk
< Phase 3 - for each AG...
<         - scan and clear agi unlinked lists...
<         - process known inodes and perform inode discovery...
< bad magic number 0xffff on inode INO
< bad version number 0xffffffff on inode INO
< bad (negative) size -1 on inode INO
< bad magic number 0xffff on inode INO, resetting magic number
< bad version number 0xffffffff on inode INO, resetting version number
< bad (negative) size -1 on inode INO
< cleared root inode INO
<         - process newly discovered inodes...
< Phase 4 - check for duplicate blocks...
<         - setting up duplicate extent list...
< root inode lost
<         - check for inodes claiming duplicate blocks...
< Phase 5 - rebuild AG headers and trees...
<         - reset superblock...
< Phase 6 - check inode connectivity...
< reinitializing root directory
<         - resetting contents of realtime bitmap and summary inodes
<         - traversing filesystem ...
<         - traversal finished ...
<         - moving disconnected inodes to lost+found ...
< Phase 7 - verify and correct link counts...
< resetting inode INO nlinks from 1 to 2
< done
< Corrupting rt bitmap inode - setting bits to -1
< Wrote X.XXKb (value 0xffffffff)
< Phase 1 - find and verify superblock...
< Phase 2 - using <TYPEOF> log
<         - zero log...
<         - scan filesystem freespace and inode maps...
<         - found root inode chunk
< Phase 3 - for each AG...
<         - scan and clear agi unlinked lists...
<         - process known inodes and perform inode discovery...
< bad magic number 0xffff on inode INO
< bad version number 0xffffffff on inode INO
< bad (negative) size -1 on inode INO
< bad magic number 0xffff on inode INO, resetting magic number
< bad version number 0xffffffff on inode INO, resetting version number
< bad (negative) size -1 on inode INO
< cleared realtime bitmap inode INO
<         - process newly discovered inodes...
< Phase 4 - check for duplicate blocks...
<         - setting up duplicate extent list...
<         - check for inodes claiming duplicate blocks...
< Phase 5 - rebuild AG headers and trees...
<         - reset superblock...
< Phase 6 - check inode connectivity...
< reinitializing realtime bitmap inode
<         - resetting contents of realtime bitmap and summary inodes
<         - traversing filesystem ...
<         - traversal finished ...
<         - moving disconnected inodes to lost+found ...
< Phase 7 - verify and correct link counts...
< done
< Corrupting rt summary inode - setting bits to -1
< Wrote X.XXKb (value 0xffffffff)
< Phase 1 - find and verify superblock...
< Phase 2 - using <TYPEOF> log
<         - zero log...
<         - scan filesystem freespace and inode maps...
<         - found root inode chunk
< Phase 3 - for each AG...
<         - scan and clear agi unlinked lists...
<         - process known inodes and perform inode discovery...
< bad magic number 0xffff on inode INO
< bad version number 0xffffffff on inode INO
< bad (negative) size -1 on inode INO
< bad magic number 0xffff on inode INO, resetting magic number
< bad version number 0xffffffff on inode INO, resetting version number
< bad (negative) size -1 on inode INO
< cleared realtime summary inode INO
<         - process newly discovered inodes...
< Phase 4 - check for duplicate blocks...
<         - setting up duplicate extent list...
<         - check for inodes claiming duplicate blocks...
< Phase 5 - rebuild AG headers and trees...
<         - reset superblock...
< Phase 6 - check inode connectivity...
< reinitializing realtime summary inode
<         - resetting contents of realtime bitmap and summary inodes
<         - traversing filesystem ...
<         - traversal finished ...
<         - moving disconnected inodes to lost+found ...
< Phase 7 - verify and correct link counts...
< done
---
> xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
> 
> fatal error -- could not iget root inode -- error - 117
> _check_xfs_filesystem: filesystem on /dev/md126 is inconsistent (r) (see /var/lib/xfstests/results//xfs/033.full)

***** This is the lone segfault so far:

xfs/291	[12832.846621] XFS (md126): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
[12832.846621] Use of these features in this kernel is at your own risk!
[12832.872608] XFS (md126): Mounting Filesystem
[12833.063779] XFS (md126): Ending clean mount
[13153.675046] XFS (md126): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
[13153.675046] Use of these features in this kernel is at your own risk!
[13153.694128] XFS (md126): Mounting Filesystem
[13154.105167] XFS (md126): Ending clean mount
[13201.470358] xfs_db[17902]: segfault at 9c157f8 ip 0809b6b0 sp bfe97950 error 4 in xfs_db[8048000+90000]
 [failed, exit status 1] - output mismatch (see /var/lib/xfstests/results//xfs/291.out.bad)
    --- tests/xfs/291.out	2013-11-11 13:46:26.652264785 -0500
    +++ /var/lib/xfstests/results//xfs/291.out.bad	2013-11-17 16:28:05.133832908 -0500
    @@ -1 +1,11 @@
     QA output created by 291
    +xfs_dir3_data_read_verify: XFS_CORRUPTION_ERROR
    +xfs_dir3_data_read_verify: XFS_CORRUPTION_ERROR
    +xfs_dir3_data_read_verify: XFS_CORRUPTION_ERROR
    +xfs_dir3_data_read_verify: XFS_CORRUPTION_ERROR
    +xfs_dir3_data_read_verify: XFS_CORRUPTION_ERROR
    +__read_verify: XFS_CORRUPTION_ERROR
     ...
     (Run 'diff -u tests/xfs/291.out /var/lib/xfstests/results//xfs/291.out.bad' to see the entire diff)
[13202.293470] XFS (md127): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
[13202.293470] Use of these features in this kernel is at your own risk!
[13202.309944] XFS (md127): Mounting Filesystem
[13202.587663] XFS (md127): Ending clean mount

***** I might not have seen this lockdep splat yet, but this 
is a new merge window.  This splat is repeatable and may be 
independent of finobt.

xfs/078	[87803.635893] 
======================================================
[ INFO: possible circular locking dependency detected ]
3.12.0+ #2 Not tainted
-------------------------------------------------------
xfs_repair/12944 is trying to acquire lock:
 (timekeeper_seq){------}, at: [<c104f843>] __hrtimer_start_range_ns+0xc7/0x35d

but task is already holding lock:
 (hrtimer_bases.lock){-.-.-.}, at: [<c104f7a4>] __hrtimer_start_range_ns+0x28/0x35d

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #5 (hrtimer_bases.lock){-.-.-.}:
       [<c106577c>] lock_acquire+0x7f/0x15e
       [<c162d072>] _raw_spin_lock_irqsave+0x4a/0x7a
       [<c104f7a4>] __hrtimer_start_range_ns+0x28/0x35d
       [<c1055b01>] start_bandwidth_timer+0x60/0x6f
       [<c105b1c2>] enqueue_task_rt+0xd3/0xfd
       [<c10546aa>] enqueue_task+0x45/0x60
       [<c1055813>] __sched_setscheduler+0x243/0x372
       [<c1056a21>] sched_setscheduler+0x17/0x19
       [<c108ae53>] watchdog_enable+0x69/0x7d
       [<c1053063>] smpboot_thread_fn+0x93/0x130
       [<c104c4ab>] kthread+0xb3/0xc7
       [<c162e4b7>] ret_from_kernel_thread+0x1b/0x28

-> #4 (&rt_b->rt_runtime_lock){-.-.-.}:
       [<c106577c>] lock_acquire+0x7f/0x15e
       [<c162cffb>] _raw_spin_lock+0x41/0x6e
       [<c105b1ac>] enqueue_task_rt+0xbd/0xfd
       [<c10546aa>] enqueue_task+0x45/0x60
       [<c1055813>] __sched_setscheduler+0x243/0x372
       [<c1056a21>] sched_setscheduler+0x17/0x19
       [<c108ae53>] watchdog_enable+0x69/0x7d
       [<c1053063>] smpboot_thread_fn+0x93/0x130
       [<c104c4ab>] kthread+0xb3/0xc7
       [<c162e4b7>] ret_from_kernel_thread+0x1b/0x28

-> #3 (&rq->lock){-.-.-.}:
       [<c106577c>] lock_acquire+0x7f/0x15e
       [<c162cffb>] _raw_spin_lock+0x41/0x6e
       [<c10561da>] wake_up_new_task+0x3b/0x147
       [<c102d132>] do_fork+0x116/0x305
       [<c102d34e>] kernel_thread+0x2d/0x33
       [<c161f0b2>] rest_init+0x22/0x128
       [<c19f39da>] start_kernel+0x2df/0x2e5
       [<c19f3378>] i386_start_kernel+0x12e/0x131

-> #2 (&p->pi_lock){-.-.-.}:
       [<c106577c>] lock_acquire+0x7f/0x15e
       [<c162d072>] _raw_spin_lock_irqsave+0x4a/0x7a
       [<c1055e1a>] try_to_wake_up+0x23/0x138
       [<c1055f60>] wake_up_process+0x1f/0x33
       [<c104411c>] start_worker+0x25/0x28
       [<c10451cc>] create_and_start_worker+0x37/0x5d
       [<c1a03b34>] init_workqueues+0xd4/0x2c4
       [<c19f3a99>] do_one_initcall+0xb9/0x153
       [<c19f3b7e>] kernel_init_freeable+0x4b/0x17d
       [<c161f1c8>] kernel_init+0x10/0xf2
       [<c162e4b7>] ret_from_kernel_thread+0x1b/0x28

-> #1 (&(&pool->lock)->rlock){-.-.-.}:
       [<c106577c>] lock_acquire+0x7f/0x15e
       [<c162cffb>] _raw_spin_lock+0x41/0x6e
       [<c104575a>] __queue_work+0x12b/0x393
       [<c1045c26>] queue_work_on+0x2f/0x6a
       [<c104f4b6>] clock_was_set_delayed+0x1d/0x1f
       [<c1075a67>] do_adjtimex+0xf4/0x145
       [<c1030ce0>] SYSC_adjtimex+0x30/0x62
       [<c1030f67>] SyS_adjtimex+0x10/0x12
       [<c162e53f>] sysenter_do_call+0x12/0x36

-> #0 (timekeeper_seq){------}:
       [<c10648b9>] __lock_acquire+0x13a4/0x17ac
       [<c106577c>] lock_acquire+0x7f/0x15e
       [<c1073688>] ktime_get+0x4f/0x169
       [<c104f843>] __hrtimer_start_range_ns+0xc7/0x35d
       [<c104faff>] hrtimer_start_range_ns+0x26/0x2c
       [<c104b17f>] common_timer_set+0xf5/0x164
       [<c104bd58>] SyS_timer_settime+0xbe/0x183
       [<c162dcc8>] syscall_call+0x7/0xb

other info that might help us debug this:

Chain exists of:
  timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(hrtimer_bases.lock);
                               lock(&rt_b->rt_runtime_lock);
                               lock(hrtimer_bases.lock);
  lock(timekeeper_seq);

 *** DEADLOCK ***

2 locks held by xfs_repair/12944:
 #0:  (&(&new_timer->it_lock)->rlock){......}, at: [<c104b292>] __lock_timer+0xa4/0x1af
 #1:  (hrtimer_bases.lock){-.-.-.}, at: [<c104f7a4>] __hrtimer_start_range_ns+0x28/0x35d

stack backtrace:
CPU: 0 PID: 12944 Comm: xfs_repair Not tainted 3.12.0+ #2
Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
 c1cb2e70 c1cb2e70 deb01dd8 c162748d deb01df8 c162303c c17a3306 deb01e3c
 deaad0c0 deaad550 deaad550 00000002 deb01e6c c10648b9 deaad528 0000006f
 c106269b deb01e20 c1c8bd08 00000003 00000000 0000000e 00000002 00000001
Call Trace:
 [<c162748d>] dump_stack+0x16/0x18
 [<c162303c>] print_circular_bug+0x1b8/0x1c2
 [<c10648b9>] __lock_acquire+0x13a4/0x17ac
 [<c106269b>] ? trace_hardirqs_off+0xb/0xd
 [<c106577c>] lock_acquire+0x7f/0x15e
 [<c104f843>] ? __hrtimer_start_range_ns+0xc7/0x35d
 [<c1073688>] ktime_get+0x4f/0x169
 [<c104f843>] ? __hrtimer_start_range_ns+0xc7/0x35d
 [<c162d098>] ? _raw_spin_lock_irqsave+0x70/0x7a
 [<c104f7a4>] ? __hrtimer_start_range_ns+0x28/0x35d
 [<c104f843>] __hrtimer_start_range_ns+0xc7/0x35d
 [<c104faff>] hrtimer_start_range_ns+0x26/0x2c
 [<c104b17f>] common_timer_set+0xf5/0x164
 [<c104b08a>] ? __posix_timers_find+0xa7/0xa7
 [<c104bd58>] SyS_timer_settime+0xbe/0x183
 [<c162dcc8>] syscall_call+0x7/0xb



* Re: [PATCH v2 00/11] xfs: introduce the free inode btree
  2013-11-17 22:43 ` Michael L. Semon
@ 2013-11-18 22:38   ` Michael L. Semon
  0 siblings, 0 replies; 21+ messages in thread
From: Michael L. Semon @ 2013-11-18 22:38 UTC (permalink / raw)
  To: Brian Foster, xfs

On 11/17/2013 05:43 PM, Michael L. Semon wrote:
> On 11/13/2013 09:36 AM, Brian Foster wrote:
>> Hi all,
>>
>> The free inode btree adds a new inode btree to XFS with the intent to
>> track only inode chunks with at least one free inode. Patches 1-3 add
>> the necessary support for the new XFS_BTNUM_FINOBT type and introduce a
>> read-only v5 superblock flag. Patch 4 updates the transaction
>> reservations for inode allocation operations to account for the finobt.
>> Patches 5-9 add support to manage the finobt on inode chunk allocation,
>> inode allocation, inode free (and chunk deletion) and growfs. Patch 10
>> adds support to report finobt status in the fs geometry. Patch 11 adds
>> the feature bit to the associated mask. Thoughts, reviews, flames
>> appreciated.
>>
>> Brian

This is more data, but it doesn't seem to be noise.  No reply is needed, 
though.  I'm only guessing that finobt has a role in this.  It looks 
like something quota-related got everything started...

I think I got through 4k blocksize testing OK.  However, disaster loomed 
after I switched to a 2k block size (again ~2.2-GB md RAID-0 partitions):

root@plbearer:/var/lib/xfstests# MKFS_OPTIONS='-m crc=1 -m finobt=1 -b log=11' ./check -g auto
[  203.967784] XFS (md127): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
[  203.967784] Use of these features in this kernel is at your own risk!
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/i686 plbearer 3.12.0+
MKFS_OPTIONS  -- -f -m crc=1 -m finobt=1 -b log=11 /dev/md126
MOUNT_OPTIONS -- /dev/md126 /mnt/xfstests-scratch

# This run started in sequence with generic/001.

generic/231 1074s ...[ 7434.717194] BUG: unable to handle kernel paging request at dd3fc000
[ 7434.717315] IP: [<c13dd9b4>] memcpy+0x14/0x24
[ 7434.717315] *pde = 1fbf0067 *pte = 1d3fc060 
[ 7434.717315] Oops: 0000 [#1] DEBUG_PAGEALLOC

Entering kdb (current=0xde8ae4f0, pid 27666) Oops: (null)
due to oops @ 0xc13dd9b4
CPU: 0 PID: 27666 Comm: xfs_quota Not tainted 3.12.0+ #2
Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
task: de8ae4f0 ti: dd01c000 task.ti: dd01c000
EIP: 0060:[<c13dd9b4>] EFLAGS: 00010206 CPU: 0
EIP is at memcpy+0x14/0x24
EAX: d88e1868 EBX: 0000005c ECX: 00000001 EDX: dd3fbfa8
ESI: dd3fc000 EDI: d88e18c0 EBP: dd01de38 ESP: dd01de2c
 DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
CR0: 8005003b CR2: dd3fc000 CR3: 1eb43000 CR4: 000007d0
Stack:
 d88e185c 00000000 d88e1868 dd01de50 c12aa415 d88e1840 dd3fbf60 de71b8f0
 00000000 dd01deb4 c12aad31 c0098d00 dd01de90 00000084 d9008540 dd01decc
 00253b60 de635634 de635600 da71d000 00000084 de71ba00 d88e1840 de635600
Call Trace:
 [<c12aa415>] xlog_cil_lv_item_format+0x45/0x68
 [<c12aad31>] xfs_log_commit_cil+0x452/0x4e7
 [<c1253408>] xfs_trans_commit+0xac/0x230
 [<c12b5ef5>] xfs_qm_log_quotaoff_end+0x60/0x7b
 [<c12b7206>] xfs_qm_scall_quotaoff+0x120/0x48a
 [<c12bbfde>] ? xfs_fs_get_xstatev+0x27/0x27
 [<c12bc09e>] xfs_fs_set_xstate+0xc0/0xe1
 [<c1130639>] SyS_quotactl+0x4cd/0x564
 [<c10e68cb>] ? SyS_stat64+0x34/0x3a
 [<c162dcfb>] ? restore_all+0xf/0xf
 [<c1025910>] ? vmalloc_sync_all+0x133/0x133
 [<c1062418>] ? trace_hardirqs_on_caller+0xe6/0x1aa
 [<c162e53f>] sysenter_do_call+0x12/0x36
Code: 00 74 0c 8b 43 54 2b 43 50 88 43 4e 5b 5d c3 e8 a8 fc ff ff eb ed 90 55 89 e5 57 56 53 3e 8d 74 26 00 89 cb c1 e9 02 89 c7 89 d6 <f3> a5 89 d9 83 e1 03 74 02 f3 a4 5b 5e 5f 5d c3 55 89 e5 57 53

This is not the disaster, only a test that did not complete.  After a 
successful reboot, I tried to run generic/231 again, only to have my 
non-finobt v5/CRC XFS / filesystem bark at me:

root@plbearer:/var/lib/xfstests# MKFS_OPTIONS='-m crc=1 -m finobt=1 -b log=11' ./check generic/231

[  392.914511] XFS (md127): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
[  392.914511] Use of these features in this kernel is at your own risk!
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/i686 plbearer 3.12.0+
MKFS_OPTIONS  -- -f -m crc=1 -m finobt=1 -b log=11 /dev/md126
MOUNT_OPTIONS -- /dev/md126 /mnt/xfstests-scratch

[  396.616456] XFS (md126): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
[  396.616456] Use of these features in this kernel is at your own risk!
[  398.271753] XFS (md127): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
[  398.271753] Use of these features in this kernel is at your own risk!
generic/231 1074s ...[  403.133309] XFS (md126): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
[  403.133309] Use of these features in this kernel is at your own risk!
[  620.702535] XFS (md126): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
[  620.702535] Use of these features in this kernel is at your own risk!
[  621.430480] dc8ff000: 41 42 33 42 00 00 01 45 ff ff ff ff ff ff ff ff  AB3B...E........
[  621.438646] dc8ff010: 00 00 00 00 00 00 00 08 00 00 00 61 00 01 1c 77  ...........a...w
[  621.446751] dc8ff020: dd 91 0c 6a 2e c4 49 3a ac a9 79 89 72 a4 a9 ce  ...j..I:..y.r...
[  621.454856] dc8ff030: 00 00 00 00 fb b2 9e 73 00 00 04 ff 00 00 00 01  .......s........
[  621.462964] XFS (sdb3): Internal error xfs_allocbt_read_verify at line 362 of file fs/xfs/xfs_alloc_btree.c.  Caller 0xc1237c47
[  621.474781] XFS (sdb3): Corruption detected. Unmount and run xfs_repair
[  621.482296] XFS (sdb3): metadata I/O error: block 0x8 ("xfs_trans_read_buf_map") error 117 numblks 8
[  621.522828] XFS (sdb3): Corruption of in-memory data detected.  Shutting down filesystem
[  621.531070] XFS (sdb3): Please umount the filesystem and rectify the problem(s)
./check: line 145: /usr/bin/awk: Input/output error
./check: line 145: date: command not found
./check: line 461: /tmp/965.rawout: Input/output error
./check: line 462: /usr/bin/rm: Input/output error
 [failed, exit status 1] - no qualified output
./check: line 527: expr: command not found
./check: line 533: expr: command not found
./common/rc: line 849: /usr/bin/awk: Input/output error
./common/rc: line 849: sed: command not found
./common/rc: line 774: /usr/bin/awk: Input/output error
./common/rc: line 1532: grep: command not found
./common/rc: line 1532: tee: command not found
_check_xfs_filesystem: filesystem on /dev/md127 has dirty log (see /var/lib/xfstests/results//generic/231.full)
./common/rc: line 1537: /var/lib/xfstests/results//generic/231.full: Input/output error
./common/rc: line 1538: /var/lib/xfstests/results//generic/231.full: Input/output error
./common/rc: line 1539: /var/lib/xfstests/results//generic/231.full: Input/output error
./common/rc: line 1540: /var/lib/xfstests/results//generic/231.full: Input/output error
./common/rc: line 1550: /tmp/965.fs_check: Input/output error
./common/rc: line 1564: /tmp/965.repair: Input/output error
_check_xfs_filesystem: filesystem on /dev/md127 is inconsistent (r) (see /var/lib/xfstests/results//generic/231.full)
./common/rc: line 1569: /var/lib/xfstests/results//generic/231.full: Input/output error
./common/rc: line 1570: /var/lib/xfstests/results//generic/231.full: Input/output error
./common/rc: line 1571: /var/lib/xfstests/results//generic/231.full: Input/output error
./common/rc: line 1571: /usr/bin/cat: Input/output error
./common/rc: line 1572: /var/lib/xfstests/results//generic/231.full: Input/output error
./common/rc: line 1576: /usr/bin/rm: Input/output error
./common/rc: line 1580: /var/lib/xfstests/results//generic/231.full: Input/output error
./common/rc: line 1581: /var/lib/xfstests/results//generic/231.full: Input/output error
./common/rc: line 1582: /var/lib/xfstests/results//generic/231.full: Input/output error
./check: line 320: /var/lib/xfstests/results//check.log: Input/output error
./check: line 321: /var/lib/xfstests/results//check.log: Input/output error
./check: line 322: /var/lib/xfstests/results//check.log: Input/output error
./check: line 322: fmt: command not found
./check: line 323: /var/lib/xfstests/results//check.log: Input/output error
./check: line 325: [: too many arguments
./check: line 336: [: too many arguments
Passed all  tests
./check: line 344: /var/lib/xfstests/results//check.log: Input/output error
./check: line 349: /usr/bin/rm: Input/output error
./check: line 350: /usr/bin/rm: Input/output error
root@plbearer:/var/lib/xfstests# ls
-bash: /bin/ls: Input/output error

There was a trace in my logs that was probably from the same event but had 
more detail:

Nov 18 08:14:53 plbearer kernel: [  621.127588] XFS (md126): Quotacheck needed: Please wait.
Nov 18 08:14:54 plbearer kernel: [  621.210188] XFS (md126): Quotacheck: Done.
Nov 18 08:14:54 plbearer [  621.430480] dc8ff000: 41 42 33 42 00 00 01 45 ff ff ff ff ff ff ff ff  AB3B...E........ 
Nov 18 08:14:54 plbearer [  621.438646] dc8ff010: 00 00 00 00 00 00 00 08 00 00 00 61 00 01 1c 77  ...........a...w 
Nov 18 08:14:54 plbearer [  621.446751] dc8ff020: dd 91 0c 6a 2e c4 49 3a ac a9 79 89 72 a4 a9 ce  ...j..I:..y.r... 
Nov 18 08:14:54 plbearer [  621.454856] dc8ff030: 00 00 00 00 fb b2 9e 73 00 00 04 ff 00 00 00 01  .......s........ 
Nov 18 08:14:54 plbearer [  621.462964] XFS (sdb3): Internal error xfs_allocbt_read_verify at line 362 of file fs/xfs/xfs_alloc_btree.c.  Caller 0xc1237c47 
Nov 18 08:14:54 plbearer [  621.474781] XFS (sdb3): Corruption detected. Unmount and run xfs_repair 
Nov 18 08:14:54 plbearer [  621.482296] XFS (sdb3): metadata I/O error: block 0x8 ("xfs_trans_read_buf_map") error 117 numblks 8 
Nov 18 08:14:54 plbearer kernel: [  621.491567] XFS (sdb3): xfs_do_force_shutdown(0x8) called from line 138 of file fs/xfs/xfs_bmap_util.c.  Return address = 0xc1232e5d
Nov 18 08:14:54 plbearer [  621.522828] XFS (sdb3): Corruption of in-memory data detected.  Shutting down filesystem 
Nov 18 08:14:54 plbearer [  621.531070] XFS (sdb3): Please umount the filesystem and rectify the problem(s) 
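
For anyone reading the hex dump above, here is a minimal sketch that decodes
the 64 dumped bytes. It assumes the v5 (CRC-enabled) short-form btree block
header layout as I remember it: magic, level, numrecs, left/right sibling,
block number, LSN, UUID, owner and CRC, 56 bytes in total, followed by the
first free-space record. That layout, the field names and the variable names
are assumptions for illustration, not something stated in the log. The CRC is
left as raw bytes so that no byte order is asserted for it.

```python
import struct
import uuid

# The four 16-byte rows from the kernel hex dump above.
DUMP = """
41 42 33 42 00 00 01 45 ff ff ff ff ff ff ff ff
00 00 00 00 00 00 00 08 00 00 00 61 00 01 1c 77
dd 91 0c 6a 2e c4 49 3a ac a9 79 89 72 a4 a9 ce
00 00 00 00 fb b2 9e 73 00 00 04 ff 00 00 00 01
"""
raw = bytes.fromhex("".join(DUMP.split()))

# Assumed header layout (big-endian, no padding), 56 bytes:
# magic(4) level(2) numrecs(2) leftsib(4) rightsib(4)
# blkno(8) lsn(8) uuid(16) owner(4) crc(4, kept as raw bytes)
HDR = struct.Struct(">4sHHIIQQ16sI4s")
FIELDS = ("magic", "level", "numrecs", "leftsib", "rightsib",
          "blkno", "lsn", "uuid", "owner", "crc_raw")

hdr = dict(zip(FIELDS, HDR.unpack_from(raw, 0)))
hdr["uuid"] = uuid.UUID(bytes=hdr["uuid"])
hdr["crc_raw"] = hdr["crc_raw"].hex()

# Assumed: the first record (start block, block count) follows the header.
rec = struct.unpack_from(">II", raw, HDR.size)

for name in FIELDS:
    print(f"{name:9} {hdr[name]}")
print("first record (startblock, blockcount):", rec)
```

Under that assumed layout the magic decodes to "AB3B", the level to 0 (a leaf
block) and the blkno field to 8, which lines up with the "block 0x8" in the
metadata I/O error line.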

After an unsuccessful attempt to reboot to that / partition, I rebooted to 
an alternate (JFS) / setup.  Note that write caches are off on this PC, so 
it was a surprise that the log recovery did not complete for the v5 XFS / 
partition.  xfs_repair was run, and the following mount was fine on what 
was probably a non-finobt kernel:

root@plbearer:~# xfs_repair -L /dev/sdb3
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
xfs_agf_read_verify: XFS_CORRUPTION_ERROR
xfs_allocbt_read_verify: XFS_CORRUPTION_ERROR
xfs_allocbt_read_verify: XFS_CORRUPTION_ERROR
sb_ifree 1842, counted 1817
sb_fdblocks 513939, counted 513840
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

The v5 XFS / was xfsdump'ed without issue.  No harm, no foul.  But it does 
mean I'll have to take a step back from finobt for the moment, taking the 
time to re-bisect my AIO issues so I can file another bug report about them.

Thanks for reading!

Michael

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH v2 00/11] xfs: introduce the free inode btree
  2013-11-13 21:10     ` Dave Chinner
@ 2013-11-19 21:29       ` Brian Foster
  2013-11-19 22:17         ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2013-11-19 21:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On 11/13/2013 04:10 PM, Dave Chinner wrote:
...
> 
> The problem can be demonstrated with a single CPU and a single
> spindle. Create a single AG filesystem of 100GB, and populate it
> with 10 million inodes.
> 
> Time how long it takes to create another 10000 inodes in a new
> directory. Measure CPU usage.
> 
> Randomly delete 10,000 inodes from the original population to
> sparsely populate the inobt with 10000 free inodes.
> 
> Time how long it takes to create another 10000 inodes in a new
> directory. Measure CPU usage.
> 
> The difference in time and CPU will be directly related to the
> additional time spent searching the inobt for free inodes...
> 

Thanks for the suggestion, Dave. I've run some fs_mark tests along the
lines of what is described here. I create 10m files, randomly remove
~10k from that dataset and measure the process of allocating 10k new
inodes in both finobt and non-finobt scenarios (after a clean remount).

The tests run from a 4xcpu VM with 4GB RAM and against an isolated SATA
drive I had lying around (mapped directly via virtio). The drive is
formatted with a single VG/LV and as follows with xfs:

meta-data=/dev/mapper/testvg-testlv isize=512    agcount=1, agsize=26214400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
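
As a quick sanity check on the geometry above, the block counts can be
converted to sizes. This is only arithmetic on the numbers printed by mkfs;
the constant and variable names are made up for the example.

```python
# Values copied from the mkfs output above.
BLOCK_SIZE = 4096        # bsize, in bytes
DATA_BLOCKS = 26214400   # data blocks (equal to agsize, since agcount=1)
LOG_BLOCKS = 12800       # internal log blocks

data_gib = DATA_BLOCKS * BLOCK_SIZE / 2**30
log_mib = LOG_BLOCKS * BLOCK_SIZE / 2**20

print(f"data section: {data_gib:g} GiB in a single AG")
print(f"internal log: {log_mib:g} MiB")
```

That works out to a 100 GiB data section and a 50 MiB log, i.e. the single AG
100GB filesystem suggested for the test.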

Once the fs has been prepared with a random set of free inodes, the
following command is used to measure performance:

	fs_mark -k -S 0 -D 4 -L 10 -n 1000 -s 0 -d /mnt/testdir

I've also collected some perf record data of these commands to compare
CPU usage. I can make the full/raw data available if desirable. Snippets
of the results are included below.

--- non-finobt, agi freecount = 9961 after random removal

- fs_mark

FSUse%        Count         Size    Files/sec     App Overhead
     5         1000            0       1020.1            10811
     5         2000            0        361.4            19498
     5         3000            0        230.1            12154
     5         4000            0        166.7            12816
     5         5000            0        129.7            27409
     5         6000            0        105.7            13946
     5         7000            0         87.6            31792
     5         8000            0         77.8            14921
     5         9000            0         67.3            15597
     5        10000            0         62.4            15835

- time

real    1m26.579s
user    0m0.120s
sys     1m26.113s

- perf report

     6.21%    :1994  [kernel.kallsyms]  [k] memcmp
     5.66%    :1993  [kernel.kallsyms]  [k] memcmp
     4.84%    :1992  [kernel.kallsyms]  [k] memcmp
     4.76%    :1994  [xfs]              [k] xfs_btree_check_sblock
     4.46%    :1993  [xfs]              [k] xfs_btree_check_sblock
     4.39%    :1991  [kernel.kallsyms]  [k] memcmp
     3.88%    :1992  [xfs]              [k] xfs_btree_check_sblock
     3.54%    :1990  [kernel.kallsyms]  [k] memcmp
     3.38%    :1991  [xfs]              [k] xfs_btree_check_sblock
     2.91%    :1989  [kernel.kallsyms]  [k] memcmp
     2.89%    :1990  [xfs]              [k] xfs_btree_check_sblock
     2.44%    :1988  [kernel.kallsyms]  [k] memcmp
     2.31%    :1989  [xfs]              [k] xfs_btree_check_sblock
     1.84%    :1988  [xfs]              [k] xfs_btree_check_sblock
     1.65%    :1987  [kernel.kallsyms]  [k] memcmp
     1.28%    :1987  [xfs]              [k] xfs_btree_check_sblock
     1.12%    :1994  [xfs]              [k] xfs_btree_increment
     1.08%    :1994  [xfs]              [k] xfs_btree_get_rec
     1.04%    :1993  [xfs]              [k] xfs_btree_increment
     1.00%    :1993  [xfs]              [k] xfs_btree_get_rec
     0.99%    :1986  [kernel.kallsyms]  [k] memcmp
     0.89%    :1992  [xfs]              [k] xfs_btree_increment
     0.85%    :1994  [xfs]              [k] xfs_inobt_get_rec
     0.84%    :1992  [xfs]              [k] xfs_btree_get_rec
     0.77%    :1991  [xfs]              [k] xfs_btree_increment
     0.77%    :1986  [xfs]              [k] xfs_btree_check_sblock
     0.77%    :1993  [xfs]              [k] xfs_inobt_get_rec
     0.75%    :1991  [xfs]              [k] xfs_btree_get_rec
     0.69%    :1992  [xfs]              [k] xfs_inobt_get_rec
     0.64%    :1990  [xfs]              [k] xfs_btree_increment
     0.62%    :1994  [xfs]              [k] xfs_inobt_get_maxrecs
     0.61%    :1990  [xfs]              [k] xfs_btree_get_rec
     0.58%    :1991  [xfs]              [k] xfs_inobt_get_rec
...

--- finobt, agi freecount = 10137 after random removal

- fs_mark

FSUse%        Count         Size    Files/sec     App Overhead
     5         1000            0       9210.0             8587
     5         2000            0       5592.1            14933
     5         3000            0       7095.4            11355
     5         4000            0       5371.1            13613
     5         5000            0       4919.3            14534
     5         6000            0       4375.7            15813
     5         7000            0       5011.3            15095
     5         8000            0       4629.8            17902
     5         9000            0       5622.9            12975
     5        10000            0       5761.4            12203

- time

real    0m1.831s
user    0m0.104s
sys     0m1.384s

- perf report

     1.82%    :2520  [kernel.kallsyms]  [k] lock_acquire
     1.65%    :2519  [kernel.kallsyms]  [k] lock_acquire
     1.65%    :2525  [kernel.kallsyms]  [k] lock_acquire
     1.45%    :2523  [kernel.kallsyms]  [k] lock_acquire
     1.44%    :2524  [kernel.kallsyms]  [k] lock_acquire
     1.34%    :2521  [kernel.kallsyms]  [k] lock_acquire
     1.27%    :2522  [kernel.kallsyms]  [k] lock_acquire
     1.18%    :2526  [kernel.kallsyms]  [k] lock_acquire
     1.15%    :2527  [kernel.kallsyms]  [k] lock_acquire
     1.09%    :2525  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     1.03%    :2524  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.88%    :2520  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.83%    :2523  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.81%    :2521  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.79%    :2519  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.79%    :2522  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.76%    :2519  [kernel.kallsyms]  [k] kmem_cache_free
     0.76%    :2520  [kernel.kallsyms]  [k] kmem_cache_free
     0.73%    :2526  [kernel.kallsyms]  [k] kmem_cache_free
...
     0.30%    :2525  [xfs]              [k] xfs_dir3_leaf_check_int
     0.28%    :2525  [kernel.kallsyms]  [k] memcpy
     0.27%    :2527  [kernel.kallsyms]  [k] security_compute_sid.part.14
     0.26%    :2520  [kernel.kallsyms]  [k] memcpy
     0.26%    :2523  [xfs]              [k] _xfs_buf_find
     0.26%    :2526  [xfs]              [k] _xfs_buf_find

Summarized, the results show a nice improvement for inode allocation
into a set of inode chunks with random free inode availability. The 10k
inode allocation reduces from ~90s to ~2s and CPU usage from XFS drops
way down in the perf profile.
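
As a rough cross-check (a sketch, not part of the test run itself), the
Files/sec column of each fs_mark table can be turned back into seconds per
1000-file loop and summed. It assumes each table row covers exactly one
-n 1000 loop; the names below are made up for the example.

```python
FILES_PER_LOOP = 1000  # fs_mark -n 1000, one loop per table row (assumed)

# Files/sec column, copied from the two fs_mark tables above.
NON_FINOBT = [1020.1, 361.4, 230.1, 166.7, 129.7,
              105.7, 87.6, 77.8, 67.3, 62.4]
FINOBT = [9210.0, 5592.1, 7095.4, 5371.1, 4919.3,
          4375.7, 5011.3, 4629.8, 5622.9, 5761.4]

# "real" wall-clock times from the time output above, in seconds.
REAL_NON_FINOBT = 60 + 26.579
REAL_FINOBT = 1.831


def loop_seconds(rates):
    """Seconds spent in each loop, derived from its files/sec rate."""
    return [FILES_PER_LOOP / rate for rate in rates]


non_finobt_loops = loop_seconds(NON_FINOBT)
finobt_loops = loop_seconds(FINOBT)
non_finobt_total = sum(non_finobt_loops)
finobt_total = sum(finobt_loops)
speedup = REAL_NON_FINOBT / REAL_FINOBT

print("non-finobt seconds per loop:", [round(s, 2) for s in non_finobt_loops])
print("finobt seconds per loop:    ", [round(s, 2) for s in finobt_loops])
print(f"summed loops: {non_finobt_total:.1f}s vs {finobt_total:.2f}s")
print(f"wall-clock speedup: {speedup:.1f}x")
```

The summed loop times come out near 86.4s and 1.8s, close to the measured
wall-clock times, and the wall-clock ratio is roughly 47x. In the non-finobt
case the per-loop time climbs steadily from about 1s to about 16s.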

I haven't extensively tested the following, but a quick 1 million inode
allocation test on a fresh, single AG fs shows a slight degradation with
the finobt enabled in terms of time to complete:

	fs_mark -k -S 0 -D 4 -L 10 -n 100000 -s 0 -d /mnt/bigdir

- non-finobt

real    1m35.349s
user    0m4.555s
sys     1m29.749s

- finobt

real    1m42.396s
user    0m4.326s
sys     1m37.152s
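
To put a number on "slight", here is a small sketch that converts the time
output above to seconds and compares the two runs. The helper name is
hypothetical, and the file count assumes -L 10 loops of -n 100000 files in a
single directory.

```python
import re


def parse_time(text):
    """Convert a time(1)-style string such as '1m35.349s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    return int(match.group(1)) * 60 + float(match.group(2))


TOTAL_FILES = 10 * 100000  # -L 10 loops of -n 100000 files (assumed)

non_finobt = {"real": parse_time("1m35.349s"), "sys": parse_time("1m29.749s")}
finobt = {"real": parse_time("1m42.396s"), "sys": parse_time("1m37.152s")}

real_overhead = finobt["real"] / non_finobt["real"] - 1
sys_overhead = finobt["sys"] / non_finobt["sys"] - 1

print(f"real time overhead with finobt: {real_overhead:.1%}")
print(f"sys time overhead with finobt:  {sys_overhead:.1%}")
print(f"files/sec: {TOTAL_FILES / non_finobt['real']:.0f} (non-finobt) "
      f"vs {TOTAL_FILES / finobt['real']:.0f} (finobt)")
```

That is roughly 7% more wall-clock time and 8% more system time for the
finobt run in this one quick test.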

Brian


* Re: [PATCH v2 00/11] xfs: introduce the free inode btree
  2013-11-19 21:29       ` Brian Foster
@ 2013-11-19 22:17         ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2013-11-19 22:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, xfs

On Tue, Nov 19, 2013 at 04:29:55PM -0500, Brian Foster wrote:
> On 11/13/2013 04:10 PM, Dave Chinner wrote:
> ...
> > 
> > The problem can be demonstrated with a single CPU and a single
> > spindle. Create a single AG filesystem of 100GB, and populate it
> > with 10 million inodes.
> > 
> > Time how long it takes to create another 10000 inodes in a new
> > directory. Measure CPU usage.
> > 
> > Randomly delete 10,000 inodes from the original population to
> > sparsely populate the inobt with 10000 free inodes.
> > 
> > Time how long it takes to create another 10000 inodes in a new
> > directory. Measure CPU usage.
> > 
> > The difference in time and CPU will be directly related to the
> > additional time spent searching the inobt for free inodes...
> > 
> 
> Thanks for the suggestion, Dave. I've run some fs_mark tests along the
> lines of what is described here. I create 10m files, randomly remove
> ~10k from that dataset and measure the process of allocating 10k new
> inodes in both finobt and non-finobt scenarios (after a clean remount).
> 
> The tests run from a 4xcpu VM with 4GB RAM and against an isolated SATA
> drive I had lying around (mapped directly via virtio). The drive is
> formatted with a single VG/LV and as follows with xfs:
> 
> meta-data=/dev/mapper/testvg-testlv isize=512    agcount=1, agsize=26214400 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=0
> data     =                       bsize=4096   blocks=26214400, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal               bsize=4096   blocks=12800, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> Once the fs has been prepared with a random set of free inodes, the
> following command is used to measure performance:
> 
> 	fs_mark -k -S 0 -D 4 -L 10 -n 1000 -s 0 -d /mnt/testdir
> 
> I've also collected some perf record data of these commands to compare
> CPU usage. I can make the full/raw data available if desirable. Snippets
> of the results are included below.
> 
> --- non-finobt, agi freecount = 9961 after random removal
> 
> - fs_mark
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      5         1000            0       1020.1            10811
>      5         2000            0        361.4            19498
>      5         3000            0        230.1            12154
>      5         4000            0        166.7            12816
>      5         5000            0        129.7            27409
>      5         6000            0        105.7            13946
>      5         7000            0         87.6            31792
>      5         8000            0         77.8            14921
>      5         9000            0         67.3            15597
>      5        10000            0         62.4            15835

Yes, that's pretty much as I expected - exponential degradation due
to the increasing search radius from the parent directory location...
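
A toy model of that effect, for illustration only: it is not the XFS
allocator, and every name in it is hypothetical. One search walks outward
from a fixed "parent" record until it finds a record with a free inode, so
the walk gets longer as nearby free inodes are used up. The other keeps a
sorted index containing only records that still have free inodes and looks
up the nearest one directly. The cost counted is the number of records
examined per batch of allocations.

```python
import bisect
import random


def make_free_counts(n_records, n_free, seed=0):
    """Scatter n_free free inodes at random over n_records chunk records."""
    rng = random.Random(seed)
    free = [0] * n_records
    for _ in range(n_free):
        free[rng.randrange(n_records)] += 1
    return free


def alloc_scan(free, start):
    """Walk outward from start until a record has a free inode.

    Returns the number of records examined (the toy "inobt" cost).
    """
    examined = 0
    for dist in range(len(free)):
        candidates = (start,) if dist == 0 else (start - dist, start + dist)
        for idx in candidates:
            if 0 <= idx < len(free):
                examined += 1
                if free[idx] > 0:
                    free[idx] -= 1
                    return examined
    raise RuntimeError("no free inodes left")


def alloc_indexed(free, index, start):
    """Pick the nearest record from a sorted index of records with free inodes.

    Returns the number of index entries examined (the toy "finobt" cost).
    """
    pos = bisect.bisect_left(index, start)
    candidates = index[max(pos - 1, 0):pos + 1]
    best = min(candidates, key=lambda idx: abs(idx - start))
    free[best] -= 1
    if free[best] == 0:
        index.remove(best)  # record is full again, drop it from the index
    return len(candidates)


def simulate(n_records=4000, n_free=400, batches=10):
    """Allocate every free inode in equal batches; return per-batch costs."""
    start = n_records // 2
    free_scan = make_free_counts(n_records, n_free)
    free_idx = list(free_scan)
    index = [i for i, count in enumerate(free_idx) if count > 0]
    per_batch = n_free // batches
    scan_cost, index_cost = [], []
    for _ in range(batches):
        scan_cost.append(sum(alloc_scan(free_scan, start)
                             for _ in range(per_batch)))
        index_cost.append(sum(alloc_indexed(free_idx, index, start)
                              for _ in range(per_batch)))
    return scan_cost, index_cost


scan_cost, index_cost = simulate()
print("records examined per batch, outward scan:", scan_cost)
print("records examined per batch, free index:  ", index_cost)
```

In this model the outward scan examines more records in every later batch,
while the indexed lookup never examines more than two entries per
allocation.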

> --- finobt, agi freecount = 10137 after random removal
> 
> - fs_mark
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      5         1000            0       9210.0             8587
>      5         2000            0       5592.1            14933
>      5         3000            0       7095.4            11355
>      5         4000            0       5371.1            13613
>      5         5000            0       4919.3            14534
>      5         6000            0       4375.7            15813
>      5         7000            0       5011.3            15095
>      5         8000            0       4629.8            17902
>      5         9000            0       5622.9            12975
>      5        10000            0       5761.4            12203

And that shows little, if any, degradation once we toss the first
1000 inodes from the result. Nice demonstration!

> Summarized, the results show a nice improvement for inode allocation
> into a set of inode chunks with random free inode availability. The 10k
> inode allocation reduces from ~90s to ~2s and CPU usage from XFS drops
> way down in the perf profile.
> 
> I haven't extensively tested the following, but a quick 1 million inode
> allocation test on a fresh, single AG fs shows a slight degradation with
> the finobt enabled in terms of time to complete:
> 
> 	fs_mark -k -S 0 -D 4 -L 10 -n 100000 -s 0 -d /mnt/bigdir
> 
> - non-finobt
> 
> real    1m35.349s
> user    0m4.555s
> sys     1m29.749s
> 
> - finobt
> 
> real    1m42.396s
> user    0m4.326s
> sys     1m37.152s

Given that you have multiple threads banging on the same AGI, and
the hold time for the AGI is going to be slightly longer due to
needing to update two btrees instead of one, this is to be expected.

However, if you are in a memory limited situation, there's a good
chance that the lower memory footprint of the buffer cache as a
result of the finobt based searches will make a difference to these
results. With 4GB of RAM and 1M inodes, you're not generating memory
pressure and so such effects won't be seen in performance results.

As it is, the parallel fsmark tests I did on v1 of the patchset on a
fast SSD based filesystem (sparse 100TB filesystem) showed a small
improvement in performance with finobt enabled. Those tests spend
most of their time in memory pressure situations, so perhaps we're
actually seeing the difference here. However, I haven't tested the
current version yet, so take that with a grain of salt for the
moment.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
2013-11-13 14:36 [PATCH v2 00/11] xfs: introduce the free inode btree Brian Foster
2013-11-13 14:36 ` [PATCH v2 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
2013-11-13 16:17   ` Christoph Hellwig
2013-11-13 14:36 ` [PATCH v2 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
2013-11-13 16:18   ` Christoph Hellwig
2013-11-13 14:36 ` [PATCH v2 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
2013-11-13 14:37 ` [PATCH v2 04/11] xfs: update inode allocation/free transaction reservations for finobt Brian Foster
2013-11-13 14:37 ` [PATCH v2 05/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
2013-11-13 14:37 ` [PATCH v2 06/11] xfs: use and update the finobt on inode allocation Brian Foster
2013-11-13 14:37 ` [PATCH v2 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper Brian Foster
2013-11-13 14:37 ` [PATCH v2 08/11] xfs: update the finobt on inode free Brian Foster
2013-11-13 14:37 ` [PATCH v2 09/11] xfs: add finobt support to growfs Brian Foster
2013-11-13 14:37 ` [PATCH v2 10/11] xfs: report finobt status in fs geometry Brian Foster
2013-11-13 14:37 ` [PATCH v2 11/11] xfs: enable the finobt feature on v5 superblocks Brian Foster
2013-11-13 16:17 ` [PATCH v2 00/11] xfs: introduce the free inode btree Christoph Hellwig
2013-11-13 17:55   ` Brian Foster
2013-11-13 21:10     ` Dave Chinner
2013-11-19 21:29       ` Brian Foster
2013-11-19 22:17         ` Dave Chinner
2013-11-17 22:43 ` Michael L. Semon
2013-11-18 22:38   ` Michael L. Semon