All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 00/11] xfs: introduce the free inode btree
@ 2014-02-04 17:49 Brian Foster
  2014-02-04 17:49 ` [PATCH v3 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
                   ` (10 more replies)
  0 siblings, 11 replies; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

Hi all,

Here's a small update of the finobt work. Previous version here:

http://oss.sgi.com/archives/xfs/2013-11/msg00404.html

This series is rebased on the latest master and contains a few very
minor fixes. Thoughts, reviews, flames appreciated.

Brian

v3:
- Rebased to latest master.
- Fixed up unused variable warning in xfs_difree_inobt().
- Replaced a few more typedefs.
v2:
- Rebase to latest xfs tree (minor shifting around of some header bits).
- Added "xfs: report finobt status in fs geometry" patch to series.
v1:
- Separate patch to enable rw finobt support at end of series.
- Rework xfs_ialloc_log_agi() to log the agi in two distinct regions.
- Rework xfs_ialloc_btree.c changes to use separate finobt handlers
  where appropriate.
- Fix bug to show fibt2 stats data in stat proc file.
- Move finobt log reservation calculations into separate helper, made
  conditional and merged to a single patch.
- Use reserved block pool in xfs_inactive() codepath instead of flush.
- Moved and cleaned up xfs_inobt_insert() to use inobt helpers.
- Enhanced lookup algorithm for allocation (xfs_dialloc_ag()).
- Refactored xfs_difree() to use xfs_difree_inobt() and
  xfs_difree_finobt(), cleaned up the latter.

Brian Foster (11):
  xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
  xfs: reserve v5 superblock read-only compat. feature bit for finobt
  xfs: support the XFS_BTNUM_FINOBT free inode btree type
  xfs: update inode allocation/free transaction reservations for finobt
  xfs: insert newly allocated inode chunks into the finobt
  xfs: use and update the finobt on inode allocation
  xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
  xfs: update the finobt on inode free
  xfs: add finobt support to growfs
  xfs: report finobt status in fs geometry
  xfs: enable the finobt feature on v5 superblocks

 fs/xfs/xfs_ag.h           |  32 ++-
 fs/xfs/xfs_btree.c        |   6 +-
 fs/xfs/xfs_btree.h        |   3 +
 fs/xfs/xfs_format.h       |  14 +-
 fs/xfs/xfs_fs.h           |   1 +
 fs/xfs/xfs_fsops.c        |  36 ++-
 fs/xfs/xfs_ialloc.c       | 617 ++++++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_ialloc_btree.c |  68 ++++-
 fs/xfs/xfs_ialloc_btree.h |   3 +-
 fs/xfs/xfs_inode.c        |   4 +-
 fs/xfs/xfs_itable.c       |   6 +-
 fs/xfs/xfs_log_recover.c  |   2 +
 fs/xfs/xfs_sb.h           |  10 +-
 fs/xfs/xfs_stats.c        |   1 +
 fs/xfs/xfs_stats.h        |  18 +-
 fs/xfs/xfs_trans_resv.c   |  47 +++-
 fs/xfs/xfs_trans_space.h  |   7 +-
 fs/xfs/xfs_types.h        |   2 +-
 18 files changed, 746 insertions(+), 131 deletions(-)

-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-04 17:49 ` [PATCH v3 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

The introduction of the free inode btree (finobt) requires that
xfs_ialloc_btree.c handle multiple trees. Refactor xfs_ialloc_btree.c
so the caller specifies the btree type on cursor initialization to
prepare for addition of the finobt.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_ialloc.c       | 8 ++++----
 fs/xfs/xfs_ialloc_btree.c | 8 +++++---
 fs/xfs/xfs_ialloc_btree.h | 3 ++-
 fs/xfs/xfs_itable.c       | 6 ++++--
 4 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 7a728f9f..ef3abd1 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -456,7 +456,7 @@ xfs_ialloc_ag_alloc(
 	/*
 	 * Insert records describing the new inode chunk into the btree.
 	 */
-	cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno, XFS_BTNUM_INO);
 	for (thisino = newino;
 	     thisino < newino + newlen;
 	     thisino += XFS_INODES_PER_CHUNK) {
@@ -702,7 +702,7 @@ xfs_dialloc_ag(
 	ASSERT(pag->pagi_freecount > 0);
 
  restart_pagno:
-	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
 	/*
 	 * If pagino is 0 (this is the root inode allocation) use newino.
 	 * This must work because we've just allocated some.
@@ -1164,7 +1164,7 @@ xfs_difree(
 	/*
 	 * Initialize the cursor.
 	 */
-	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
 
 	error = xfs_check_agi_freecount(cur, agi);
 	if (error)
@@ -1295,7 +1295,7 @@ xfs_imap_lookup(
 	 * we have a record, we need to ensure it contains the inode number
 	 * we are looking up.
 	 */
-	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
 	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &i);
 	if (!error) {
 		if (i)
diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
index c8fa5bb..2d1a398 100644
--- a/fs/xfs/xfs_ialloc_btree.c
+++ b/fs/xfs/xfs_ialloc_btree.c
@@ -49,7 +49,8 @@ xfs_inobt_dup_cursor(
 	struct xfs_btree_cur	*cur)
 {
 	return xfs_inobt_init_cursor(cur->bc_mp, cur->bc_tp,
-			cur->bc_private.a.agbp, cur->bc_private.a.agno);
+			cur->bc_private.a.agbp, cur->bc_private.a.agno,
+			cur->bc_btnum);
 }
 
 STATIC void
@@ -323,7 +324,8 @@ xfs_inobt_init_cursor(
 	struct xfs_mount	*mp,		/* file system mount point */
 	struct xfs_trans	*tp,		/* transaction pointer */
 	struct xfs_buf		*agbp,		/* buffer for agi structure */
-	xfs_agnumber_t		agno)		/* allocation group number */
+	xfs_agnumber_t		agno,		/* allocation group number */
+	xfs_btnum_t		btnum)		/* ialloc or free ino btree */
 {
 	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
 	struct xfs_btree_cur	*cur;
@@ -333,7 +335,7 @@ xfs_inobt_init_cursor(
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
-	cur->bc_btnum = XFS_BTNUM_INO;
+	cur->bc_btnum = btnum;
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 
 	cur->bc_ops = &xfs_inobt_ops;
diff --git a/fs/xfs/xfs_ialloc_btree.h b/fs/xfs/xfs_ialloc_btree.h
index f38b220..d7ebea72 100644
--- a/fs/xfs/xfs_ialloc_btree.h
+++ b/fs/xfs/xfs_ialloc_btree.h
@@ -58,7 +58,8 @@ struct xfs_mount;
 		 ((index) - 1) * sizeof(xfs_inobt_ptr_t)))
 
 extern struct xfs_btree_cur *xfs_inobt_init_cursor(struct xfs_mount *,
-		struct xfs_trans *, struct xfs_buf *, xfs_agnumber_t);
+		struct xfs_trans *, struct xfs_buf *, xfs_agnumber_t,
+		xfs_btnum_t);
 extern int xfs_inobt_maxrecs(struct xfs_mount *, int, int);
 
 #endif	/* __XFS_IALLOC_BTREE_H__ */
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index c237ad1..71a8169 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -274,7 +274,8 @@ xfs_bulkstat(
 		/*
 		 * Allocate and initialize a btree cursor for ialloc btree.
 		 */
-		cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno);
+		cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
+					    XFS_BTNUM_INO);
 		irbp = irbuf;
 		irbufend = irbuf + nirbuf;
 		end_of_ag = 0;
@@ -625,7 +626,8 @@ xfs_inumbers(
 				agino = 0;
 				continue;
 			}
-			cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno);
+			cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
+						    XFS_BTNUM_INO);
 			error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_GE,
 						 &tmp);
 			if (error) {
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
  2014-02-04 17:49 ` [PATCH v3 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-11  6:07   ` Dave Chinner
  2014-02-04 17:49 ` [PATCH v3 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

Reserve a v5 read-only compatibility feature bit for the finobt and
create the xfs_sb_version_hasfinobt() helper to determine whether
an fs has the feature enabled.

The finobt does not change existing on-disk structures, but must
remain consistent with the ialloc btree. Modifications from older
kernels would violate that constrant. Therefore, we restrict older
kernels to read-only mounts of finobt-enabled filesystems.

Note that this does not yet enable the ability to rw mount a finobt
fs (by setting the feature bit in the XFS_SB_FEAT_RO_COMPAT_ALL
mask).

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_sb.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/xfs/xfs_sb.h b/fs/xfs/xfs_sb.h
index 35061d4..070a7f6 100644
--- a/fs/xfs/xfs_sb.h
+++ b/fs/xfs/xfs_sb.h
@@ -585,6 +585,7 @@ xfs_sb_has_compat_feature(
 	return (sbp->sb_features_compat & feature) != 0;
 }
 
+#define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
 #define XFS_SB_FEAT_RO_COMPAT_ALL 0
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
@@ -639,6 +640,12 @@ static inline int xfs_sb_version_hasftype(struct xfs_sb *sbp)
 		 (sbp->sb_features2 & XFS_SB_VERSION2_FTYPE));
 }
 
+static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
+{
+	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
+		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
+}
+
 /*
  * end of superblock version macros
  */
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
  2014-02-04 17:49 ` [PATCH v3 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
  2014-02-04 17:49 ` [PATCH v3 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-11  6:22   ` Dave Chinner
  2014-02-04 17:49 ` [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt Brian Foster
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

Define the AGI fields for the finobt root/level and add magic
numbers. Update the btree code to add support for the new
XFS_BTNUM_FINOBT inode btree.

The finobt root block is reserved immediately following the inobt
root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
starting AG data block based on whether finobt support is enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ag.h           | 32 +++++++++++++++----------
 fs/xfs/xfs_btree.c        |  6 +++--
 fs/xfs/xfs_btree.h        |  3 +++
 fs/xfs/xfs_format.h       | 14 ++++++++++-
 fs/xfs/xfs_ialloc.c       | 37 +++++++++++++++++++++++++----
 fs/xfs/xfs_ialloc_btree.c | 60 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_log_recover.c  |  2 ++
 fs/xfs/xfs_stats.c        |  1 +
 fs/xfs/xfs_stats.h        | 18 +++++++++++++-
 fs/xfs/xfs_types.h        |  2 +-
 10 files changed, 150 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 3fc1098..5d3011f 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -164,22 +164,28 @@ typedef struct xfs_agi {
 	__be32		agi_pad32;
 	__be64		agi_lsn;	/* last write sequence */
 
+	__be32		agi_free_root; /* root of the free inode btree */
+	__be32		agi_free_level;/* levels in free inode btree */
+
 	/* structure must be padded to 64 bit alignment */
 } xfs_agi_t;
 
-#define	XFS_AGI_MAGICNUM	0x00000001
-#define	XFS_AGI_VERSIONNUM	0x00000002
-#define	XFS_AGI_SEQNO		0x00000004
-#define	XFS_AGI_LENGTH		0x00000008
-#define	XFS_AGI_COUNT		0x00000010
-#define	XFS_AGI_ROOT		0x00000020
-#define	XFS_AGI_LEVEL		0x00000040
-#define	XFS_AGI_FREECOUNT	0x00000080
-#define	XFS_AGI_NEWINO		0x00000100
-#define	XFS_AGI_DIRINO		0x00000200
-#define	XFS_AGI_UNLINKED	0x00000400
-#define	XFS_AGI_NUM_BITS	11
-#define	XFS_AGI_ALL_BITS	((1 << XFS_AGI_NUM_BITS) - 1)
+#define	XFS_AGI_MAGICNUM	(1 << 0)
+#define	XFS_AGI_VERSIONNUM	(1 << 1)
+#define	XFS_AGI_SEQNO		(1 << 2)
+#define	XFS_AGI_LENGTH		(1 << 3)
+#define	XFS_AGI_COUNT		(1 << 4)
+#define	XFS_AGI_ROOT		(1 << 5)
+#define	XFS_AGI_LEVEL		(1 << 6)
+#define	XFS_AGI_FREECOUNT	(1 << 7)
+#define	XFS_AGI_NEWINO		(1 << 8)
+#define	XFS_AGI_DIRINO		(1 << 9)
+#define	XFS_AGI_UNLINKED	(1 << 10)
+#define	XFS_AGI_NUM_BITS_R1	11	/* end of the 1st agi logging region */
+#define	XFS_AGI_ALL_BITS_R1	((1 << XFS_AGI_NUM_BITS_R1) - 1)
+#define	XFS_AGI_FREE_ROOT	(1 << 11)
+#define	XFS_AGI_FREE_LEVEL	(1 << 12)
+#define	XFS_AGI_NUM_BITS_R2	13
 
 /* disk block (xfs_daddr_t) in the AG */
 #define XFS_AGI_DADDR(mp)	((xfs_daddr_t)(2 << (mp)->m_sectbb_log))
diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index 9adaae4..ee79f1e 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -43,9 +43,10 @@ kmem_zone_t	*xfs_btree_cur_zone;
  * Btree magic numbers.
  */
 static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
-	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, XFS_BMAP_MAGIC, XFS_IBT_MAGIC },
+	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
+	  XFS_FIBT_MAGIC },
 	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC,
-	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC }
+	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC }
 };
 #define xfs_btree_magic(cur) \
 	xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
@@ -1117,6 +1118,7 @@ xfs_btree_set_refs(
 		xfs_buf_set_ref(bp, XFS_ALLOC_BTREE_REF);
 		break;
 	case XFS_BTNUM_INO:
+	case XFS_BTNUM_FINO:
 		xfs_buf_set_ref(bp, XFS_INO_BTREE_REF);
 		break;
 	case XFS_BTNUM_BMAP:
diff --git a/fs/xfs/xfs_btree.h b/fs/xfs/xfs_btree.h
index 91e34f2..d2ac586 100644
--- a/fs/xfs/xfs_btree.h
+++ b/fs/xfs/xfs_btree.h
@@ -62,6 +62,7 @@ union xfs_btree_rec {
 #define	XFS_BTNUM_CNT	((xfs_btnum_t)XFS_BTNUM_CNTi)
 #define	XFS_BTNUM_BMAP	((xfs_btnum_t)XFS_BTNUM_BMAPi)
 #define	XFS_BTNUM_INO	((xfs_btnum_t)XFS_BTNUM_INOi)
+#define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
 
 /*
  * For logging record fields.
@@ -92,6 +93,7 @@ do {    \
 	case XFS_BTNUM_CNT: __XFS_BTREE_STATS_INC(abtc, stat); break;	\
 	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_INC(bmbt, stat); break;	\
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(ibt, stat); break;	\
+	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(fibt, stat); break;	\
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break;	\
 	}       \
 } while (0)
@@ -105,6 +107,7 @@ do {    \
 	case XFS_BTNUM_CNT: __XFS_BTREE_STATS_ADD(abtc, stat, val); break; \
 	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_ADD(bmbt, stat, val); break; \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_ADD(ibt, stat, val); break; \
+	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_ADD(fibt, stat, val); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break;	\
 	}       \
 } while (0)
diff --git a/fs/xfs/xfs_format.h b/fs/xfs/xfs_format.h
index b6ab5a3..d1def17 100644
--- a/fs/xfs/xfs_format.h
+++ b/fs/xfs/xfs_format.h
@@ -200,6 +200,8 @@ typedef __be32 xfs_alloc_ptr_t;
  */
 #define	XFS_IBT_MAGIC		0x49414254	/* 'IABT' */
 #define	XFS_IBT_CRC_MAGIC	0x49414233	/* 'IAB3' */
+#define	XFS_FIBT_MAGIC		0x46494254	/* 'FIBT' */
+#define	XFS_FIBT_CRC_MAGIC	0x46494233	/* 'FIB3' */
 
 typedef	__uint64_t	xfs_inofree_t;
 #define	XFS_INODES_PER_CHUNK		(NBBY * sizeof(xfs_inofree_t))
@@ -242,7 +244,17 @@ typedef __be32 xfs_inobt_ptr_t;
  * block numbers in the AG.
  */
 #define	XFS_IBT_BLOCK(mp)		((xfs_agblock_t)(XFS_CNT_BLOCK(mp) + 1))
-#define	XFS_PREALLOC_BLOCKS(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
+#define	XFS_FIBT_BLOCK(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
+
+/*
+ * The first data block of an AG depends on whether the filesystem was formatted
+ * with the finobt feature. If so, account for the finobt reserved root btree
+ * block.
+ */
+#define XFS_PREALLOC_BLOCKS(mp) \
+	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
+	 XFS_FIBT_BLOCK(mp) + 1 : \
+	 XFS_IBT_BLOCK(mp) + 1)
 
 
 
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index ef3abd1..d99eb09 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -1506,6 +1506,8 @@ xfs_ialloc_log_agi(
 		offsetof(xfs_agi_t, agi_newino),
 		offsetof(xfs_agi_t, agi_dirino),
 		offsetof(xfs_agi_t, agi_unlinked),
+		offsetof(xfs_agi_t, agi_free_root),
+		offsetof(xfs_agi_t, agi_free_level),
 		sizeof(xfs_agi_t)
 	};
 #ifdef DEBUG
@@ -1515,14 +1517,39 @@ xfs_ialloc_log_agi(
 	ASSERT(agi->agi_magicnum == cpu_to_be32(XFS_AGI_MAGIC));
 #endif
 	/*
-	 * Compute byte offsets for the first and last fields.
+	 * The growth of the agi buffer over time now requires that we interpret
+	 * the buffer as two logical regions delineated at the end of the unlinked
+	 * list. This is due to the size of the hash table and its location in the
+	 * middle of the agi.
+	 *
+	 * For example, a request to log a field before agi_unlinked and a field
+	 * after agi_unlinked could cause us to log the entire hash table and use
+	 * an excessive amount of log space. To avoid this behavior, log the
+	 * region up through agi_unlinked in one call and the region after
+	 * agi_unlinked through the end of the structure in another.
 	 */
-	xfs_btree_offsets(fields, offsets, XFS_AGI_NUM_BITS, &first, &last);
+	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_AGI_BUF);
+
 	/*
-	 * Log the allocation group inode header buffer.
+	 * Compute byte offsets for the first and last fields in the first
+	 * region and log agi buffer. This only logs up through agi_unlinked.
 	 */
-	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_AGI_BUF);
-	xfs_trans_log_buf(tp, bp, first, last);
+	if (fields & XFS_AGI_ALL_BITS_R1) {
+		xfs_btree_offsets(fields, offsets, XFS_AGI_NUM_BITS_R1,
+				  &first, &last);
+		xfs_trans_log_buf(tp, bp, first, last);
+	}
+
+	/*
+	 * Mask off the bits in the first region and calculate the first and last
+	 * field offsets for any bits in the second region.
+	 */
+	fields &= ~XFS_AGI_ALL_BITS_R1;
+	if (fields) {
+		xfs_btree_offsets(fields, offsets, XFS_AGI_NUM_BITS_R2,
+				  &first, &last);
+		xfs_trans_log_buf(tp, bp, first, last);
+	}
 }
 
 #ifdef DEBUG
diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
index 2d1a398..16212dc 100644
--- a/fs/xfs/xfs_ialloc_btree.c
+++ b/fs/xfs/xfs_ialloc_btree.c
@@ -67,6 +67,21 @@ xfs_inobt_set_root(
 	xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
 }
 
+STATIC void
+xfs_finobt_set_root(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*nptr,
+	int			inc)	/* level change */
+{
+	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
+
+	agi->agi_free_root = nptr->s;
+	be32_add_cpu(&agi->agi_free_level, inc);
+	xfs_ialloc_log_agi(cur->bc_tp, agbp,
+			   XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL);
+}
+
 STATIC int
 xfs_inobt_alloc_block(
 	struct xfs_btree_cur	*cur,
@@ -174,6 +189,17 @@ xfs_inobt_init_ptr_from_cur(
 	ptr->s = agi->agi_root;
 }
 
+STATIC void
+xfs_finobt_init_ptr_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr)
+{
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(cur->bc_private.a.agbp);
+
+	ASSERT(cur->bc_private.a.agno == be32_to_cpu(agi->agi_seqno));
+	ptr->s = agi->agi_free_root;
+}
+
 STATIC __int64_t
 xfs_inobt_key_diff(
 	struct xfs_btree_cur	*cur,
@@ -204,6 +230,7 @@ xfs_inobt_verify(
 	 */
 	switch (block->bb_magic) {
 	case cpu_to_be32(XFS_IBT_CRC_MAGIC):
+	case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
 		if (!xfs_sb_version_hascrc(&mp->m_sb))
 			return false;
 		if (!uuid_equal(&block->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid))
@@ -215,6 +242,7 @@ xfs_inobt_verify(
 			return false;
 		/* fall through */
 	case cpu_to_be32(XFS_IBT_MAGIC):
+	case cpu_to_be32(XFS_FIBT_MAGIC):
 		break;
 	default:
 		return 0;
@@ -316,6 +344,28 @@ static const struct xfs_btree_ops xfs_inobt_ops = {
 #endif
 };
 
+static const struct xfs_btree_ops xfs_finobt_ops = {
+	.rec_len		= sizeof(xfs_inobt_rec_t),
+	.key_len		= sizeof(xfs_inobt_key_t),
+
+	.dup_cursor		= xfs_inobt_dup_cursor,
+	.set_root		= xfs_finobt_set_root,
+	.alloc_block		= xfs_inobt_alloc_block,
+	.free_block		= xfs_inobt_free_block,
+	.get_minrecs		= xfs_inobt_get_minrecs,
+	.get_maxrecs		= xfs_inobt_get_maxrecs,
+	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
+	.init_rec_from_key	= xfs_inobt_init_rec_from_key,
+	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfs_finobt_init_ptr_from_cur,
+	.key_diff		= xfs_inobt_key_diff,
+	.buf_ops		= &xfs_inobt_buf_ops,
+#if defined(DEBUG) || defined(XFS_WARN)
+	.keys_inorder		= xfs_inobt_keys_inorder,
+	.recs_inorder		= xfs_inobt_recs_inorder,
+#endif
+};
+
 /*
  * Allocate a new inode btree cursor.
  */
@@ -334,11 +384,17 @@ xfs_inobt_init_cursor(
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
-	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
 	cur->bc_btnum = btnum;
+	if (btnum == XFS_BTNUM_INO) {
+		cur->bc_nlevels = be32_to_cpu(agi->agi_level);
+		cur->bc_ops = &xfs_inobt_ops;
+	} else {
+		cur->bc_nlevels = be32_to_cpu(agi->agi_free_level);
+		cur->bc_ops = &xfs_finobt_ops;
+	}
+
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 
-	cur->bc_ops = &xfs_inobt_ops;
 	if (xfs_sb_version_hascrc(&mp->m_sb))
 		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 07ab52c..e9a013b6 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2135,7 +2135,9 @@ xlog_recover_validate_buf_type(
 			bp->b_ops = &xfs_allocbt_buf_ops;
 			break;
 		case XFS_IBT_CRC_MAGIC:
+		case XFS_FIBT_CRC_MAGIC:
 		case XFS_IBT_MAGIC:
+		case XFS_FIBT_MAGIC:
 			bp->b_ops = &xfs_inobt_buf_ops;
 			break;
 		case XFS_BMAP_CRC_MAGIC:
diff --git a/fs/xfs/xfs_stats.c b/fs/xfs/xfs_stats.c
index ce372b7..f224038 100644
--- a/fs/xfs/xfs_stats.c
+++ b/fs/xfs/xfs_stats.c
@@ -59,6 +59,7 @@ static int xfs_stat_proc_show(struct seq_file *m, void *v)
 		{ "abtc2",		XFSSTAT_END_ABTC_V2		},
 		{ "bmbt2",		XFSSTAT_END_BMBT_V2		},
 		{ "ibt2",		XFSSTAT_END_IBT_V2		},
+		{ "fibt2",		XFSSTAT_END_FIBT_V2		},
 		/* we print both series of quota information together */
 		{ "qm",			XFSSTAT_END_QM			},
 	};
diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index c03ad38..c8f238b 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -183,7 +183,23 @@ struct xfsstats {
 	__uint32_t		xs_ibt_2_alloc;
 	__uint32_t		xs_ibt_2_free;
 	__uint32_t		xs_ibt_2_moves;
-#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_IBT_V2+6)
+#define XFSSTAT_END_FIBT_V2		(XFSSTAT_END_IBT_V2+15)
+	__uint32_t		xs_fibt_2_lookup;
+	__uint32_t		xs_fibt_2_compare;
+	__uint32_t		xs_fibt_2_insrec;
+	__uint32_t		xs_fibt_2_delrec;
+	__uint32_t		xs_fibt_2_newroot;
+	__uint32_t		xs_fibt_2_killroot;
+	__uint32_t		xs_fibt_2_increment;
+	__uint32_t		xs_fibt_2_decrement;
+	__uint32_t		xs_fibt_2_lshift;
+	__uint32_t		xs_fibt_2_rshift;
+	__uint32_t		xs_fibt_2_split;
+	__uint32_t		xs_fibt_2_join;
+	__uint32_t		xs_fibt_2_alloc;
+	__uint32_t		xs_fibt_2_free;
+	__uint32_t		xs_fibt_2_moves;
+#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_FIBT_V2+6)
 	__uint32_t		xs_qm_dqreclaims;
 	__uint32_t		xs_qm_dqreclaim_misses;
 	__uint32_t		xs_qm_dquot_dups;
diff --git a/fs/xfs/xfs_types.h b/fs/xfs/xfs_types.h
index 82bbc34..65c6e66 100644
--- a/fs/xfs/xfs_types.h
+++ b/fs/xfs/xfs_types.h
@@ -134,7 +134,7 @@ typedef enum {
 
 typedef enum {
 	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_BMAPi, XFS_BTNUM_INOi,
-	XFS_BTNUM_MAX
+	XFS_BTNUM_FINOi, XFS_BTNUM_MAX
 } xfs_btnum_t;
 
 struct xfs_name {
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (2 preceding siblings ...)
  2014-02-04 17:49 ` [PATCH v3 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-11  6:46   ` Dave Chinner
  2014-02-04 17:49 ` [PATCH v3 05/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

Create the xfs_calc_finobt_res() helper to calculate the finobt log
reservation for inode allocation and free. Update
XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
reserve blocks for the potential finobt record insertion on inode
free (i.e., if an inode chunk was previously fully allocated).

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c       |  4 +++-
 fs/xfs/xfs_trans_resv.c  | 47 +++++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_trans_space.h |  7 ++++++-
 3 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 001aa89..57c77ed 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
 	int			error;
 
 	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
-	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
+	tp->t_flags |= XFS_TRANS_RESERVE;
+	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
+				  XFS_IFREE_SPACE_RES(mp), 0);
 	if (error) {
 		ASSERT(XFS_FORCED_SHUTDOWN(mp));
 		xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES);
diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
index 2fd59c0..32f35c1 100644
--- a/fs/xfs/xfs_trans_resv.c
+++ b/fs/xfs/xfs_trans_resv.c
@@ -98,6 +98,37 @@ xfs_calc_inode_res(
 }
 
 /*
+ * The free inode btree is a conditional feature and the log reservation
+ * requirements differ slightly from that of the traditional inode allocation
+ * btree. The finobt tracks records for inode chunks with at least one free inode.
+ * Therefore, a record can be removed from the tree for an inode allocation or
+ * free and the associated merge reservation is unconditional. This also covers
+ * the possibility of a split on record insertion.
+ *
+ * the free inode btree: max depth * block size
+ * the free inode btree entry: block size
+ *
+ * TODO: is the modify res really necessary? covered by the merge/split res?
+ * This seems to be the pattern of ifree, but not create_resv_alloc. Why?
+ */
+STATIC uint
+xfs_calc_finobt_res(
+	struct xfs_mount 	*mp,
+	int			modify)
+{
+	uint res;
+
+	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+		return 0;
+
+	res = xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1));
+	if (modify)
+		res += (uint)XFS_FSB_TO_B(mp, 1);
+
+	return res;
+}
+
+/*
  * Various log reservation values.
  *
  * These are based on the size of the file system block because that is what
@@ -267,6 +298,7 @@ xfs_calc_remove_reservation(
  *    the superblock for the nlink flag: sector size
  *    the directory btree: (max depth + v2) * dir block size
  *    the directory inode's bmap btree: (max depth + v2) * block size
+ *    the finobt
  */
 STATIC uint
 xfs_calc_create_resv_modify(
@@ -275,7 +307,8 @@ xfs_calc_create_resv_modify(
 	return xfs_calc_inode_res(mp, 2) +
 		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
 		(uint)XFS_FSB_TO_B(mp, 1) +
-		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
+		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_finobt_res(mp, 1);
 }
 
 /*
@@ -285,6 +318,7 @@ xfs_calc_create_resv_modify(
  *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
  *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the finobt
  */
 STATIC uint
 xfs_calc_create_resv_alloc(
@@ -295,7 +329,8 @@ xfs_calc_create_resv_alloc(
 		xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_finobt_res(mp, 0);
 }
 
 STATIC uint
@@ -313,6 +348,7 @@ __xfs_calc_create_reservation(
  *    the superblock for the nlink flag: sector size
  *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the finobt
  */
 STATIC uint
 xfs_calc_icreate_resv_alloc(
@@ -322,7 +358,8 @@ xfs_calc_icreate_resv_alloc(
 		mp->m_sb.sb_sectsize +
 		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_finobt_res(mp, 0);
 }
 
 STATIC uint
@@ -376,6 +413,7 @@ xfs_calc_symlink_reservation(
  *    the on disk inode before ours in the agi hash list: inode cluster size
  *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the finobt
  */
 STATIC uint
 xfs_calc_ifree_reservation(
@@ -390,7 +428,8 @@ xfs_calc_ifree_reservation(
 		xfs_calc_buf_res(2 + XFS_IALLOC_BLOCKS(mp) +
 				 mp->m_in_maxlevels, 0) +
 		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_finobt_res(mp, 1);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans_space.h b/fs/xfs/xfs_trans_space.h
index 7d2c920..a7d1721e 100644
--- a/fs/xfs/xfs_trans_space.h
+++ b/fs/xfs/xfs_trans_space.h
@@ -47,7 +47,9 @@
 #define	XFS_DIRREMOVE_SPACE_RES(mp)	\
 	XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK)
 #define	XFS_IALLOC_SPACE_RES(mp)	\
-	(XFS_IALLOC_BLOCKS(mp) + (mp)->m_in_maxlevels - 1)
+	(XFS_IALLOC_BLOCKS(mp) + \
+	 (xfs_sb_version_hasfinobt(&mp->m_sb) ? 2 : 1 * \
+	  ((mp)->m_in_maxlevels - 1)))
 
 /*
  * Space reservation values for various transactions.
@@ -82,5 +84,8 @@
 	(XFS_DIRREMOVE_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl))
 #define	XFS_SYMLINK_SPACE_RES(mp,nl,b)	\
 	(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl) + (b))
+#define XFS_IFREE_SPACE_RES(mp)		\
+	(xfs_sb_version_hasfinobt(&mp->m_sb) ? (mp)->m_in_maxlevels : 0)
+
 
 #endif	/* __XFS_TRANS_SPACE_H__ */
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 05/11] xfs: insert newly allocated inode chunks into the finobt
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (3 preceding siblings ...)
  2014-02-04 17:49 ` [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-11  6:48   ` Dave Chinner
  2014-02-04 17:49 ` [PATCH v3 06/11] xfs: use and update the finobt on inode allocation Brian Foster
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

A newly allocated inode chunk, by definition, has at least one
free inode, so a record is always inserted into the finobt.

Create the xfs_inobt_insert() helper from existing code to insert
a record in an inobt based on the provided BTNUM. Update
xfs_ialloc_ag_alloc() to invoke the helper for the existing
XFS_BTNUM_INO tree and XFS_BTNUM_FINO tree, if enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 93 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 70 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index d99eb09..bdaab76 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -112,6 +112,66 @@ xfs_inobt_get_rec(
 }
 
 /*
+ * Insert a single inobt record. Cursor must already point to desired location.
+ */
+STATIC int
+xfs_inobt_insert_rec(
+	struct xfs_btree_cur	*cur,
+	__int32_t		freecount,
+	xfs_inofree_t		free,
+	int			*stat)
+{
+	cur->bc_rec.i.ir_freecount = freecount;
+	cur->bc_rec.i.ir_free = free;
+	return xfs_btree_insert(cur, stat);
+}
+
+/*
+ * Insert records describing a newly allocated inode chunk into the inobt.
+ */
+STATIC int
+xfs_inobt_insert(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_agino_t		newino,
+	xfs_agino_t		newlen,
+	xfs_btnum_t		btnum)
+{
+	struct xfs_btree_cur	*cur;
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t		agno = be32_to_cpu(agi->agi_seqno);
+	xfs_agino_t		thisino;
+	int			i;
+	int			error;
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
+
+	for (thisino = newino;
+	     thisino < newino + newlen;
+	     thisino += XFS_INODES_PER_CHUNK) {
+		error = xfs_inobt_lookup(cur, thisino, XFS_LOOKUP_EQ, &i);
+		if (error) {
+			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+			return error;
+		}
+		ASSERT(i == 0);
+
+		error = xfs_inobt_insert_rec(cur, XFS_INODES_PER_CHUNK,
+					     XFS_INOBT_ALL_FREE, &i);
+		if (error) {
+			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+			return error;
+		}
+		ASSERT(i == 1);
+	}
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	return 0;
+}
+
+/*
  * Verify that the number of free inodes in the AGI is correct.
  */
 #ifdef DEBUG
@@ -310,13 +370,10 @@ xfs_ialloc_ag_alloc(
 {
 	xfs_agi_t	*agi;		/* allocation group header */
 	xfs_alloc_arg_t	args;		/* allocation argument structure */
-	xfs_btree_cur_t	*cur;		/* inode btree cursor */
 	xfs_agnumber_t	agno;
 	int		error;
-	int		i;
 	xfs_agino_t	newino;		/* new first inode's number */
 	xfs_agino_t	newlen;		/* new number of inodes */
-	xfs_agino_t	thisino;	/* current inode number, for loop */
 	int		isaligned = 0;	/* inode allocation at stripe unit */
 					/* boundary */
 	struct xfs_perag *pag;
@@ -454,29 +511,19 @@ xfs_ialloc_ag_alloc(
 	agi->agi_newino = cpu_to_be32(newino);
 
 	/*
-	 * Insert records describing the new inode chunk into the btree.
+	 * Insert records describing the new inode chunk into the btrees.
 	 */
-	cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno, XFS_BTNUM_INO);
-	for (thisino = newino;
-	     thisino < newino + newlen;
-	     thisino += XFS_INODES_PER_CHUNK) {
-		cur->bc_rec.i.ir_startino = thisino;
-		cur->bc_rec.i.ir_freecount = XFS_INODES_PER_CHUNK;
-		cur->bc_rec.i.ir_free = XFS_INOBT_ALL_FREE;
-		error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, &i);
-		if (error) {
-			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
-			return error;
-		}
-		ASSERT(i == 0);
-		error = xfs_btree_insert(cur, &i);
-		if (error) {
-			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
+				 XFS_BTNUM_INO);
+	if (error)
+		return error;
+
+	if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
+		error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
+					 XFS_BTNUM_FINO);
+		if (error)
 			return error;
-		}
-		ASSERT(i == 1);
 	}
-	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 	/*
 	 * Log allocation group header fields
 	 */
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 06/11] xfs: use and update the finobt on inode allocation
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (4 preceding siblings ...)
  2014-02-04 17:49 ` [PATCH v3 05/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-11  7:17   ` Dave Chinner
  2014-02-04 17:49 ` [PATCH v3 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper Brian Foster
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

Replace xfs_dialloc_ag() with an implementation that looks for a
record in the finobt. The finobt only tracks records with at least
one free inode. This eliminates the need for the intra-ag scan in
the original algorithm. Once the inode is allocated, update the
finobt appropriately (possibly removing the record) as well as the
inobt.

Move the original xfs_dialloc_ag() algorithm to
xfs_dialloc_ag_slow() and fall back as such if finobt support is
not enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 210 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index bdaab76..afc9840 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -723,7 +723,7 @@ xfs_ialloc_get_rec(
  * available.
  */
 STATIC int
-xfs_dialloc_ag(
+xfs_dialloc_ag_slow(
 	struct xfs_trans	*tp,
 	struct xfs_buf		*agbp,
 	xfs_ino_t		parent,
@@ -981,6 +981,215 @@ error0:
 	return error;
 }
 
+STATIC int
+xfs_dialloc_ag(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_ino_t		parent,
+	xfs_ino_t		*inop)
+{
+	struct xfs_mount		*mp = tp->t_mountp;
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	xfs_agnumber_t			pagno = XFS_INO_TO_AGNO(mp, parent);
+	xfs_agino_t			pagino = XFS_INO_TO_AGINO(mp, parent);
+	struct xfs_perag		*pag;
+	struct xfs_btree_cur		*cur;
+	struct xfs_btree_cur		*tcur;
+	struct xfs_inobt_rec_incore	rec;
+	struct xfs_inobt_rec_incore	trec;
+	xfs_ino_t			ino;
+	int				error;
+	int				offset;
+	int				i, j;
+
+	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+		return xfs_dialloc_ag_slow(tp, agbp, parent, inop);
+
+	pag = xfs_perag_get(mp, agno);
+
+	/*
+	 * If pagino is 0 (this is the root inode allocation) use newino.
+	 * This must work because we've just allocated some.
+	 */
+	if (!pagino)
+		pagino = be32_to_cpu(agi->agi_newino);
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
+
+	error = xfs_check_agi_freecount(cur, agi);
+	if (error)
+		goto error_cur;
+
+	if (agno == pagno) {
+		/*
+		 * We're in the same AG as the parent inode so allocate the
+		 * closest inode to the parent.
+		 */
+		error = xfs_inobt_lookup(cur, pagino, XFS_LOOKUP_LE, &i);
+		if (error)
+			goto error_cur;
+		if (i == 1) {
+			error = xfs_inobt_get_rec(cur, &rec, &i);
+			if (error)
+				goto error_cur;
+			XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
+
+			/*
+			 * See if we've landed in the parent inode record. The
+			 * finobt only tracks chunks with at least one free
+			 * inode, so record existence is enough.
+			 */
+			if (pagino >= rec.ir_startino &&
+			    pagino < (rec.ir_startino + XFS_INODES_PER_CHUNK))
+				goto alloc_inode;
+		}
+
+		error = xfs_btree_dup_cursor(cur, &tcur);
+		if (error) 
+			goto error_cur;
+
+		error = xfs_inobt_lookup(tcur, pagino, XFS_LOOKUP_GE, &j);
+		if (error)
+			goto error_tcur;
+		if (j == 1) {
+			error = xfs_inobt_get_rec(tcur, &trec, &j);
+			if (error)
+				goto error_tcur;
+			XFS_WANT_CORRUPTED_GOTO(j == 1, error_tcur);
+		}
+
+		if (i == 1 && j == 1) {
+			if ((pagino - rec.ir_startino + XFS_INODES_PER_CHUNK - 1) >
+			    (trec.ir_startino - pagino)) {
+				rec = trec;
+				xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+				cur = tcur;
+			} else {
+				xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
+			}
+		} else if (j == 1) {
+			rec = trec;
+			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+			cur = tcur;
+		} else {
+			xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
+		}
+	} else {
+		/*
+		 * Different AG from the parent inode. Check the record for the
+		 * most recently allocated inode.
+		 */
+		if (agi->agi_newino != cpu_to_be32(NULLAGINO)) {
+			error = xfs_inobt_lookup(cur, agi->agi_newino,
+						 XFS_LOOKUP_EQ, &i);
+			if (error)
+				goto error_cur;
+			if (i == 1) {
+				error = xfs_inobt_get_rec(cur, &rec, &i);
+				if (error)
+					goto error_cur;
+				XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
+				goto alloc_inode;
+			}
+		}
+
+		/*
+		 * Allocate the first inode available in the AG.
+		 */
+		error = xfs_inobt_lookup(cur, 0, XFS_LOOKUP_GE, &i);
+		if (error)
+			goto error_cur;
+		XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
+
+		error = xfs_inobt_get_rec(cur, &rec, &i);
+		if (error)
+			goto error_cur;
+		XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
+	}
+
+alloc_inode:
+	offset = xfs_lowbit64(rec.ir_free);
+	ASSERT(offset >= 0);
+	ASSERT(offset < XFS_INODES_PER_CHUNK);
+	ASSERT((XFS_AGINO_TO_OFFSET(mp, rec.ir_startino) %
+				   XFS_INODES_PER_CHUNK) == 0);
+	ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino + offset);
+
+	/*
+	 * Modify or remove the finobt record.
+	 */
+	rec.ir_free &= ~XFS_INOBT_MASK(offset);
+	rec.ir_freecount--;
+	if (rec.ir_freecount) 
+		error = xfs_inobt_update(cur, &rec);
+	else
+		error = xfs_btree_delete(cur, &i);
+	if (error)
+		goto error_cur;
+
+	/*
+	 * Lookup and modify the equivalent record in the inobt.
+	 */
+	tcur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
+
+	error = xfs_check_agi_freecount(tcur, agi);
+	if (error)
+		goto error_tcur;
+
+	error = xfs_inobt_lookup(tcur, rec.ir_startino, XFS_LOOKUP_EQ, &i);
+	if (error)
+		goto error_tcur;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error_tcur);
+
+	error = xfs_inobt_get_rec(tcur, &trec, &i);
+	if (error)
+		goto error_tcur;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error_tcur);
+	ASSERT((XFS_AGINO_TO_OFFSET(mp, trec.ir_startino) %
+				   XFS_INODES_PER_CHUNK) == 0);
+
+	trec.ir_free &= ~XFS_INOBT_MASK(offset);
+	trec.ir_freecount--;
+
+	XFS_WANT_CORRUPTED_GOTO((rec.ir_free == trec.ir_free) &&
+				(rec.ir_freecount == trec.ir_freecount),
+				error_tcur);
+
+	error = xfs_inobt_update(tcur, &trec);
+	if (error)
+		goto error_tcur;
+
+	/*
+	 * Update the perag and superblock.
+	 */
+	be32_add_cpu(&agi->agi_freecount, -1);
+	xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
+	pag->pagi_freecount--;
+
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
+
+	error = xfs_check_agi_freecount(tcur, agi);
+	if (error)
+		goto error_tcur;
+	error = xfs_check_agi_freecount(cur, agi);
+	if (error)
+		goto error_tcur;
+
+	xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	xfs_perag_put(pag);
+	*inop = ino;
+	return 0;
+
+error_tcur:
+	xfs_btree_del_cursor(tcur, XFS_BTREE_ERROR);
+error_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	xfs_perag_put(pag);
+	return error;
+}
+
 /*
  * Allocate an inode on disk.
  *
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (5 preceding siblings ...)
  2014-02-04 17:49 ` [PATCH v3 06/11] xfs: use and update the finobt on inode allocation Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-11  7:19   ` Dave Chinner
  2014-02-04 17:49 ` [PATCH v3 08/11] xfs: update the finobt on inode free Brian Foster
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

Refactor xfs_difree() in preparation for the finobt. xfs_difree()
performs the validity checks against the ag and reads the agi
header. The work of physically updating the inode allocation btree
is pushed down into the new xfs_difree_inobt() helper.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 161 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 96 insertions(+), 65 deletions(-)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index afc9840..06851db 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -1349,74 +1349,30 @@ out_error:
 	return XFS_ERROR(error);
 }
 
-/*
- * Free disk inode.  Carefully avoids touching the incore inode, all
- * manipulations incore are the caller's responsibility.
- * The on-disk inode is not changed by this operation, only the
- * btree (free inode mask) is changed.
- */
-int
-xfs_difree(
-	xfs_trans_t	*tp,		/* transaction pointer */
-	xfs_ino_t	inode,		/* inode to be freed */
-	xfs_bmap_free_t	*flist,		/* extents to free */
-	int		*delete,	/* set if inode cluster was deleted */
-	xfs_ino_t	*first_ino)	/* first inode in deleted cluster */
+STATIC int
+xfs_difree_inobt(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	struct xfs_buf			*agbp,
+	xfs_agino_t			agino,
+	struct xfs_bmap_free		*flist,
+	int				*delete,
+	xfs_ino_t			*first_ino,
+	struct xfs_inobt_rec_incore	*orec)
 {
-	/* REFERENCED */
-	xfs_agblock_t	agbno;	/* block number containing inode */
-	xfs_buf_t	*agbp;	/* buffer containing allocation group header */
-	xfs_agino_t	agino;	/* inode number relative to allocation group */
-	xfs_agnumber_t	agno;	/* allocation group number */
-	xfs_agi_t	*agi;	/* allocation group header */
-	xfs_btree_cur_t	*cur;	/* inode btree cursor */
-	int		error;	/* error return value */
-	int		i;	/* result code */
-	int		ilen;	/* inodes in an inode cluster */
-	xfs_mount_t	*mp;	/* mount structure for filesystem */
-	int		off;	/* offset of inode in inode chunk */
-	xfs_inobt_rec_incore_t rec;	/* btree record */
-	struct xfs_perag *pag;
-
-	mp = tp->t_mountp;
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	struct xfs_perag		*pag;
+	struct xfs_btree_cur		*cur;
+	struct xfs_inobt_rec_incore	rec;
+	int				ilen;
+	int				error;
+	int				i;
+	int				off;
 
-	/*
-	 * Break up inode number into its components.
-	 */
-	agno = XFS_INO_TO_AGNO(mp, inode);
-	if (agno >= mp->m_sb.sb_agcount)  {
-		xfs_warn(mp, "%s: agno >= mp->m_sb.sb_agcount (%d >= %d).",
-			__func__, agno, mp->m_sb.sb_agcount);
-		ASSERT(0);
-		return XFS_ERROR(EINVAL);
-	}
-	agino = XFS_INO_TO_AGINO(mp, inode);
-	if (inode != XFS_AGINO_TO_INO(mp, agno, agino))  {
-		xfs_warn(mp, "%s: inode != XFS_AGINO_TO_INO() (%llu != %llu).",
-			__func__, (unsigned long long)inode,
-			(unsigned long long)XFS_AGINO_TO_INO(mp, agno, agino));
-		ASSERT(0);
-		return XFS_ERROR(EINVAL);
-	}
-	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
-	if (agbno >= mp->m_sb.sb_agblocks)  {
-		xfs_warn(mp, "%s: agbno >= mp->m_sb.sb_agblocks (%d >= %d).",
-			__func__, agbno, mp->m_sb.sb_agblocks);
-		ASSERT(0);
-		return XFS_ERROR(EINVAL);
-	}
-	/*
-	 * Get the allocation group header.
-	 */
-	error = xfs_ialloc_read_agi(mp, tp, agno, &agbp);
-	if (error) {
-		xfs_warn(mp, "%s: xfs_ialloc_read_agi() returned error %d.",
-			__func__, error);
-		return error;
-	}
-	agi = XFS_BUF_TO_AGI(agbp);
 	ASSERT(agi->agi_magicnum == cpu_to_be32(XFS_AGI_MAGIC));
-	ASSERT(agbno < be32_to_cpu(agi->agi_length));
+	ASSERT(XFS_AGINO_TO_AGBNO(mp, agino) < be32_to_cpu(agi->agi_length));
+
 	/*
 	 * Initialize the cursor.
 	 */
@@ -1512,6 +1468,7 @@ xfs_difree(
 	if (error)
 		goto error0;
 
+	*orec = rec;
 	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 	return 0;
 
@@ -1520,6 +1477,80 @@ error0:
 	return error;
 }
 
+/*
+ * Free disk inode.  Carefully avoids touching the incore inode, all
+ * manipulations incore are the caller's responsibility.
+ * The on-disk inode is not changed by this operation, only the
+ * btree (free inode mask) is changed.
+ */
+int
+xfs_difree(
+	struct xfs_trans	*tp,		/* transaction pointer */
+	xfs_ino_t		inode,		/* inode to be freed */
+	struct xfs_bmap_free	*flist,		/* extents to free */
+	int			*delete,/* set if inode cluster was deleted */
+	xfs_ino_t		*first_ino)/* first inode in deleted cluster */
+{
+	/* REFERENCED */
+	xfs_agblock_t		agbno;	/* block number containing inode */
+	struct xfs_buf		*agbp;	/* buffer for allocation group header */
+	xfs_agino_t		agino;	/* allocation group inode number */
+	xfs_agnumber_t		agno;	/* allocation group number */
+	int			error;	/* error return value */
+	struct xfs_mount	*mp;	/* mount structure for filesystem */
+	struct xfs_inobt_rec_incore rec;/* btree record */
+
+	mp = tp->t_mountp;
+
+	/*
+	 * Break up inode number into its components.
+	 */
+	agno = XFS_INO_TO_AGNO(mp, inode);
+	if (agno >= mp->m_sb.sb_agcount)  {
+		xfs_warn(mp, "%s: agno >= mp->m_sb.sb_agcount (%d >= %d).",
+			__func__, agno, mp->m_sb.sb_agcount);
+		ASSERT(0);
+		return XFS_ERROR(EINVAL);
+	}
+	agino = XFS_INO_TO_AGINO(mp, inode);
+	if (inode != XFS_AGINO_TO_INO(mp, agno, agino))  {
+		xfs_warn(mp, "%s: inode != XFS_AGINO_TO_INO() (%llu != %llu).",
+			__func__, (unsigned long long)inode,
+			(unsigned long long)XFS_AGINO_TO_INO(mp, agno, agino));
+		ASSERT(0);
+		return XFS_ERROR(EINVAL);
+	}
+	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
+	if (agbno >= mp->m_sb.sb_agblocks)  {
+		xfs_warn(mp, "%s: agbno >= mp->m_sb.sb_agblocks (%d >= %d).",
+			__func__, agbno, mp->m_sb.sb_agblocks);
+		ASSERT(0);
+		return XFS_ERROR(EINVAL);
+	}
+	/*
+	 * Get the allocation group header.
+	 */
+	error = xfs_ialloc_read_agi(mp, tp, agno, &agbp);
+	if (error) {
+		xfs_warn(mp, "%s: xfs_ialloc_read_agi() returned error %d.",
+			__func__, error);
+		return error;
+	}
+
+	/*
+	 * Fix up the inode allocation btree.
+	 */
+	error = xfs_difree_inobt(mp, tp, agbp, agino, flist, delete, first_ino,
+				 &rec);
+	if (error)
+		goto error0;
+
+	return 0;
+
+error0:
+	return error;
+}
+
 STATIC int
 xfs_imap_lookup(
 	struct xfs_mount	*mp,
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 08/11] xfs: update the finobt on inode free
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (6 preceding siblings ...)
  2014-02-04 17:49 ` [PATCH v3 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-11  7:31   ` Dave Chinner
  2014-02-04 17:49 ` [PATCH v3 09/11] xfs: add finobt support to growfs Brian Foster
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

An inode free operation can have several effects on the finobt. If
all inodes have been freed and the chunk deallocated, we remove the
finobt record. If the inode chunk was previously full, we must
insert a new record based on the existing inobt record. Otherwise,
we modify the record in place.

Create the xfs_ifree_finobt() function to identify the potential
scenarios and update the finobt appropriately.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_ialloc.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 06851db..c5d6cdc 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -1478,6 +1478,106 @@ error0:
 }
 
 /*
+ * Free an inode in the free inode btree.
+ */
+STATIC int
+xfs_difree_finobt(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	struct xfs_buf			*agbp,
+	xfs_agino_t			agino,
+	struct xfs_inobt_rec_incore	*ibtrec) /* inobt record */
+{
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	struct xfs_btree_cur		*cur;
+	struct xfs_inobt_rec_incore	rec;
+	int				offset = agino - ibtrec->ir_startino;
+	int				error;
+	int				i;
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
+
+	error = xfs_inobt_lookup(cur, ibtrec->ir_startino, XFS_LOOKUP_EQ, &i);
+	if (error)
+		goto error;
+	if (i == 0) {
+		/*
+		 * If the record does not exist in the finobt, we must have just
+		 * freed an inode in a previously fully allocated chunk. If not,
+		 * something is out of sync.
+		 */
+		XFS_WANT_CORRUPTED_GOTO(ibtrec->ir_freecount == 1, error);
+
+		error = xfs_inobt_insert_rec(cur, ibtrec->ir_freecount,
+					     ibtrec->ir_free, &i);
+		if (error)
+			goto error;
+		ASSERT(i == 1);
+
+		goto out;
+	}
+
+	/*
+	 * Read and update the existing record.
+	 */
+	error = xfs_inobt_get_rec(cur, &rec, &i);
+	if (error)
+		goto error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+
+	rec.ir_free |= XFS_INOBT_MASK(offset);
+	rec.ir_freecount++;
+
+	XFS_WANT_CORRUPTED_GOTO((rec.ir_free == ibtrec->ir_free) &&
+				(rec.ir_freecount == ibtrec->ir_freecount),
+				error);
+
+	/*
+	 * The content of inobt records should always match between the inobt
+	 * and finobt. The lifecycle of records in the finobt is different from
+	 * the inobt in that the finobt only tracks records with at least one
+	 * free inode. This is to optimize lookup for inode allocation purposes.
+	 * The following checks determine whether to update the existing record or
+	 * remove it entirely.
+	 */
+
+	if (rec.ir_freecount == XFS_IALLOC_INODES(mp) &&
+	    !(mp->m_flags & XFS_MOUNT_IKEEP)) {
+		/*
+		 * If all inodes are free and we're in !ikeep mode, the entire
+		 * inode chunk has been deallocated. Remove the record from the
+		 * finobt.
+		 */
+		error = xfs_btree_delete(cur, &i);
+		if (error)
+			goto error;
+		ASSERT(i == 1);
+	} else {
+		/*
+		 * The existing finobt record was modified and has a combination
+		 * of allocated and free inodes or is completely free and ikeep
+		 * is enabled. Update the record.
+		 */
+		error = xfs_inobt_update(cur, &rec);
+		if (error)
+			goto error;
+	}
+
+out:
+	error = xfs_check_agi_freecount(cur, agi);
+	if (error)
+		goto error;
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+
+error:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/*
  * Free disk inode.  Carefully avoids touching the incore inode, all
  * manipulations incore are the caller's responsibility.
  * The on-disk inode is not changed by this operation, only the
@@ -1545,6 +1645,15 @@ xfs_difree(
 	if (error)
 		goto error0;
 
+	/*
+	 * Fix up the free inode btree.
+	 */
+	if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+		error = xfs_difree_finobt(mp, tp, agbp, agino, &rec);
+		if (error)
+			goto error0;
+	}
+
 	return 0;
 
 error0:
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 09/11] xfs: add finobt support to growfs
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (7 preceding siblings ...)
  2014-02-04 17:49 ` [PATCH v3 08/11] xfs: update the finobt on inode free Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-04 17:49 ` [PATCH v3 10/11] xfs: report finobt status in fs geometry Brian Foster
  2014-02-04 17:49 ` [PATCH v3 11/11] xfs: enable the finobt feature on v5 superblocks Brian Foster
  10 siblings, 0 replies; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

Add finobt support to growfs. Initialize the agi root/level fields
and the root finobt block.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_fsops.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 02fb943..96e4eb0 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -316,6 +316,10 @@ xfs_growfs_data_private(
 		agi->agi_dirino = cpu_to_be32(NULLAGINO);
 		if (xfs_sb_version_hascrc(&mp->m_sb))
 			uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_uuid);
+		if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+			agi->agi_free_root = cpu_to_be32(XFS_FIBT_BLOCK(mp));
+			agi->agi_free_level = cpu_to_be32(1);
+		}
 		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++)
 			agi->agi_unlinked[bucket] = cpu_to_be32(NULLAGINO);
 
@@ -407,6 +411,34 @@ xfs_growfs_data_private(
 		xfs_buf_relse(bp);
 		if (error)
 			goto error0;
+
+		/*
+		 * FINO btree root block
+		 */
+		if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+			bp = xfs_growfs_get_hdr_buf(mp,
+				XFS_AGB_TO_DADDR(mp, agno, XFS_FIBT_BLOCK(mp)),
+				BTOBB(mp->m_sb.sb_blocksize), 0,
+				&xfs_inobt_buf_ops);
+			if (!bp) {
+				error = ENOMEM;
+				goto error0;
+			}
+
+			if (xfs_sb_version_hascrc(&mp->m_sb))
+				xfs_btree_init_block(mp, bp, XFS_FIBT_CRC_MAGIC,
+						     0, 0, agno,
+						     XFS_BTREE_CRC_BLOCKS);
+			else
+				xfs_btree_init_block(mp, bp, XFS_FIBT_MAGIC, 0,
+						     0, agno, 0);
+
+			error = xfs_bwrite(bp);
+			xfs_buf_relse(bp);
+			if (error)
+				goto error0;
+		}
+
 	}
 	xfs_trans_agblocks_delta(tp, nfree);
 	/*
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 10/11] xfs: report finobt status in fs geometry
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (8 preceding siblings ...)
  2014-02-04 17:49 ` [PATCH v3 09/11] xfs: add finobt support to growfs Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-11  7:34   ` Dave Chinner
  2014-02-04 17:49 ` [PATCH v3 11/11] xfs: enable the finobt feature on v5 superblocks Brian Foster
  10 siblings, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

Define the XFS_FSOP_GEOM_FLAGS_FINOBT fs geometry flag and set the
associated bit if the filesystem supports the free inode btree.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_fs.h    | 1 +
 fs/xfs/xfs_fsops.c | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
index c5fc116..d34703d 100644
--- a/fs/xfs/xfs_fs.h
+++ b/fs/xfs/xfs_fs.h
@@ -238,6 +238,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_LAZYSB	0x4000	/* lazy superblock counters */
 #define XFS_FSOP_GEOM_FLAGS_V5SB	0x8000	/* version 5 superblock */
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
+#define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 96e4eb0..3445ead 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -104,7 +104,9 @@ xfs_fs_geometry(
 			(xfs_sb_version_hascrc(&mp->m_sb) ?
 				XFS_FSOP_GEOM_FLAGS_V5SB : 0) |
 			(xfs_sb_version_hasftype(&mp->m_sb) ?
-				XFS_FSOP_GEOM_FLAGS_FTYPE : 0);
+				XFS_FSOP_GEOM_FLAGS_FTYPE : 0) |
+			(xfs_sb_version_hasfinobt(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_FINOBT : 0);
 		geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
 				mp->m_sb.sb_logsectsize : BBSIZE;
 		geo->rtsectsize = mp->m_sb.sb_blocksize;
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 11/11] xfs: enable the finobt feature on v5 superblocks
  2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
                   ` (9 preceding siblings ...)
  2014-02-04 17:49 ` [PATCH v3 10/11] xfs: report finobt status in fs geometry Brian Foster
@ 2014-02-04 17:49 ` Brian Foster
  2014-02-11  7:34   ` Dave Chinner
  10 siblings, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-04 17:49 UTC (permalink / raw)
  To: xfs

Add the finobt feature bit to the list of known features. As of
this point, the kernel code knows how to mount and manage both
finobt and non-finobt formatted filesystems.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_sb.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_sb.h b/fs/xfs/xfs_sb.h
index 070a7f6..9919fb8 100644
--- a/fs/xfs/xfs_sb.h
+++ b/fs/xfs/xfs_sb.h
@@ -586,7 +586,8 @@ xfs_sb_has_compat_feature(
 }
 
 #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
-#define XFS_SB_FEAT_RO_COMPAT_ALL 0
+#define XFS_SB_FEAT_RO_COMPAT_ALL \
+		(XFS_SB_FEAT_RO_COMPAT_FINOBT)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt
  2014-02-04 17:49 ` [PATCH v3 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
@ 2014-02-11  6:07   ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-11  6:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Feb 04, 2014 at 12:49:33PM -0500, Brian Foster wrote:
> Reserve a v5 read-only compatibility feature bit for the finobt and
> create the xfs_sb_version_hasfinobt() helper to determine whether
> an fs has the feature enabled.
> 
> The finobt does not change existing on-disk structures, but must
> remain consistent with the ialloc btree. Modifications from older
> kernels would violate that constrant. Therefore, we restrict older
> kernels to read-only mounts of finobt-enabled filesystems.
> 
> Note that this does not yet enable the ability to rw mount a finobt
> fs (by setting the feature bit in the XFS_SB_FEAT_RO_COMPAT_ALL
> mask).
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
  2014-02-04 17:49 ` [PATCH v3 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
@ 2014-02-11  6:22   ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-11  6:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Feb 04, 2014 at 12:49:34PM -0500, Brian Foster wrote:
> Define the AGI fields for the finobt root/level and add magic
> numbers. Update the btree code to add support for the new
> XFS_BTNUM_FINOBT inode btree.
> 
> The finobt root block is reserved immediately following the inobt
> root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
> starting AG data block based on whether finobt support is enabled.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_ag.h           | 32 +++++++++++++++----------
>  fs/xfs/xfs_btree.c        |  6 +++--
>  fs/xfs/xfs_btree.h        |  3 +++
>  fs/xfs/xfs_format.h       | 14 ++++++++++-
>  fs/xfs/xfs_ialloc.c       | 37 +++++++++++++++++++++++++----
>  fs/xfs/xfs_ialloc_btree.c | 60 +++++++++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_log_recover.c  |  2 ++
>  fs/xfs/xfs_stats.c        |  1 +
>  fs/xfs/xfs_stats.h        | 18 +++++++++++++-
>  fs/xfs/xfs_types.h        |  2 +-
>  10 files changed, 150 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
> index 3fc1098..5d3011f 100644
> --- a/fs/xfs/xfs_ag.h
> +++ b/fs/xfs/xfs_ag.h
> @@ -164,22 +164,28 @@ typedef struct xfs_agi {
>  	__be32		agi_pad32;
>  	__be64		agi_lsn;	/* last write sequence */
>  
> +	__be32		agi_free_root; /* root of the free inode btree */
> +	__be32		agi_free_level;/* levels in free inode btree */
> +
>  	/* structure must be padded to 64 bit alignment */
>  } xfs_agi_t;

Can you add comments to the agi structure that indicate where the
contiguous logging regions start and end?

.....

>  #ifdef DEBUG
> @@ -1515,14 +1517,39 @@ xfs_ialloc_log_agi(
>  	ASSERT(agi->agi_magicnum == cpu_to_be32(XFS_AGI_MAGIC));
>  #endif
>  	/*
> -	 * Compute byte offsets for the first and last fields.
> +	 * The growth of the agi buffer over time now requires that we interpret
> +	 * the buffer as two logical regions delineated at the end of the unlinked
> +	 * list. This is due to the size of the hash table and its location in the
> +	 * middle of the agi.
> +	 *
> +	 * For example, a request to log a field before agi_unlinked and a field
> +	 * after agi_unlinked could cause us to log the entire hash table and use
> +	 * an excessive amount of log space. To avoid this behavior, log the
> +	 * region up through agi_unlinked in one call and the region after
> +	 * agi_unlinked through the end of the structure in another.
>  	 */

This comment should be at the head of the function as it describes
the constraints the function works under, rather than explaining
particular piece of the code.

....
> @@ -334,11 +384,17 @@ xfs_inobt_init_cursor(
>  
>  	cur->bc_tp = tp;
>  	cur->bc_mp = mp;
> -	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
>  	cur->bc_btnum = btnum;
> +	if (btnum == XFS_BTNUM_INO) {
> +		cur->bc_nlevels = be32_to_cpu(agi->agi_level);
> +		cur->bc_ops = &xfs_inobt_ops;
> +	} else {
> +		cur->bc_nlevels = be32_to_cpu(agi->agi_free_level);
> +		cur->bc_ops = &xfs_finobt_ops;

		ASSERT(btnum == XFS_BTNUM_FINO);
....

Otherwise looks pretty good.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-04 17:49 ` [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt Brian Foster
@ 2014-02-11  6:46   ` Dave Chinner
  2014-02-11 16:22     ` Brian Foster
  2014-02-18 17:10     ` Brian Foster
  0 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-11  6:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
> Create the xfs_calc_finobt_res() helper to calculate the finobt log
> reservation for inode allocation and free. Update
> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
> reserve blocks for the potential finobt record insertion on inode
> free (i.e., if an inode chunk was previously fully allocated).
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_inode.c       |  4 +++-
>  fs/xfs/xfs_trans_resv.c  | 47 +++++++++++++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_trans_space.h |  7 ++++++-
>  3 files changed, 52 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 001aa89..57c77ed 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
>  	int			error;
>  
>  	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
> -	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
> +	tp->t_flags |= XFS_TRANS_RESERVE;
> +	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> +				  XFS_IFREE_SPACE_RES(mp), 0);

Can you add a comment explaining why the XFS_TRANS_RESERVE flag is
used here, and why it's use won't lead to accelerated reserve pool
depletion?

>  	if (error) {
>  		ASSERT(XFS_FORCED_SHUTDOWN(mp));
>  		xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES);
> diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
> index 2fd59c0..32f35c1 100644
> --- a/fs/xfs/xfs_trans_resv.c
> +++ b/fs/xfs/xfs_trans_resv.c
> @@ -98,6 +98,37 @@ xfs_calc_inode_res(
>  }
>  
>  /*
> + * The free inode btree is a conditional feature and the log reservation
> + * requirements differ slightly from that of the traditional inode allocation
> + * btree. The finobt tracks records for inode chunks with at least one free inode.
> + * Therefore, a record can be removed from the tree for an inode allocation or
> + * free and the associated merge reservation is unconditional. This also covers
> + * the possibility of a split on record insertion.

Slightly wider than 80 columns here. FWIW, if you use vim, add this
rule to have it add a red line at the textwidth you have set:

" highlight textwidth
set cc=+1

And that will point out lines that are too long quite obviously ;)

> + *
> + * the free inode btree: max depth * block size
> + * the free inode btree entry: block size
> + *
> + * TODO: is the modify res really necessary? covered by the merge/split res?
> + * This seems to be the pattern of ifree, but not create_resv_alloc. Why?

The modify case is for an allocation that only updates an inobt
record (i.e. chunk already allocated, free inodes in it). Because
we can remove a finobt record when "modifying" the last free inode
record in a chunk, "modify" can cause a redcord removal and hence a
tree merge. In which case it's no different of any of the other
finobt reservations....

> @@ -267,6 +298,7 @@ xfs_calc_remove_reservation(
>   *    the superblock for the nlink flag: sector size
>   *    the directory btree: (max depth + v2) * dir block size
>   *    the directory inode's bmap btree: (max depth + v2) * block size
> + *    the finobt
>   */
>  STATIC uint
>  xfs_calc_create_resv_modify(
> @@ -275,7 +307,8 @@ xfs_calc_create_resv_modify(
>  	return xfs_calc_inode_res(mp, 2) +
>  		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
>  		(uint)XFS_FSB_TO_B(mp, 1) +
> -		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
> +		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_finobt_res(mp, 1);
>  }

And this is where is starts to get complex. The modify operation can
now cause a finobt merge, when means blocks will be allocated/freed.
That means we now need to take into account:

 *    the allocation btrees: 2 trees * (max depth - 1) * block size

and anything else freeing an extent requires.

>  /*
> @@ -285,6 +318,7 @@ xfs_calc_create_resv_modify(
>   *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
>   *    the inode btree: max depth * blocksize
>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> + *    the finobt
>   */
>  STATIC uint
>  xfs_calc_create_resv_alloc(
> @@ -295,7 +329,8 @@ xfs_calc_create_resv_alloc(
>  		xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
> -				 XFS_FSB_TO_B(mp, 1));
> +				 XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_finobt_res(mp, 0);
>  }

This reservation is only for v4 superblocks - the icreate
transaction reservation is used for v5 superblocks, so that's the
only one you need to modify.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 05/11] xfs: insert newly allocated inode chunks into the finobt
  2014-02-04 17:49 ` [PATCH v3 05/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
@ 2014-02-11  6:48   ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-11  6:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Feb 04, 2014 at 12:49:36PM -0500, Brian Foster wrote:
> A newly allocated inode chunk, by definition, has at least one
> free inode, so a record is always inserted into the finobt.
> 
> Create the xfs_inobt_insert() helper from existing code to insert
> a record in an inobt based on the provided BTNUM. Update
> xfs_ialloc_ag_alloc() to invoke the helper for the existing
> XFS_BTNUM_INO tree and XFS_BTNUM_FINO tree, if enabled.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 06/11] xfs: use and update the finobt on inode allocation
  2014-02-04 17:49 ` [PATCH v3 06/11] xfs: use and update the finobt on inode allocation Brian Foster
@ 2014-02-11  7:17   ` Dave Chinner
  2014-02-11 16:32     ` Brian Foster
  2014-02-14 20:01     ` Brian Foster
  0 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-11  7:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Feb 04, 2014 at 12:49:37PM -0500, Brian Foster wrote:
> Replace xfs_dialloc_ag() with an implementation that looks for a
> record in the finobt. The finobt only tracks records with at least
> one free inode. This eliminates the need for the intra-ag scan in
> the original algorithm. Once the inode is allocated, update the
> finobt appropriately (possibly removing the record) as well as the
> inobt.
> 
> Move the original xfs_dialloc_ag() algorithm to
> xfs_dialloc_ag_slow() and fall back as such if finobt support is
> not enabled.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_ialloc.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 210 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index bdaab76..afc9840 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -723,7 +723,7 @@ xfs_ialloc_get_rec(
>   * available.
>   */
>  STATIC int
> -xfs_dialloc_ag(
> +xfs_dialloc_ag_slow(
>  	struct xfs_trans	*tp,
>  	struct xfs_buf		*agbp,
>  	xfs_ino_t		parent,

OK, though I would have called it xfs_dialloc_ag_from_inobt() or
something similar to indicate what tree it is walking....

> +STATIC int
> +xfs_dialloc_ag(
> +	struct xfs_trans	*tp,
> +	struct xfs_buf		*agbp,
> +	xfs_ino_t		parent,
> +	xfs_ino_t		*inop)
> +{

Initial thought: Wow, long, long function. How can we split this up?

> +	struct xfs_mount		*mp = tp->t_mountp;
> +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> +	xfs_agnumber_t			pagno = XFS_INO_TO_AGNO(mp, parent);
> +	xfs_agino_t			pagino = XFS_INO_TO_AGINO(mp, parent);
> +	struct xfs_perag		*pag;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_btree_cur		*tcur;
> +	struct xfs_inobt_rec_incore	rec;
> +	struct xfs_inobt_rec_incore	trec;
> +	xfs_ino_t			ino;
> +	int				error;
> +	int				offset;
> +	int				i, j;
> +
> +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> +		return xfs_dialloc_ag_slow(tp, agbp, parent, inop);
> +
> +	pag = xfs_perag_get(mp, agno);
> +
> +	/*
> +	 * If pagino is 0 (this is the root inode allocation) use newino.
> +	 * This must work because we've just allocated some.
> +	 */
> +	if (!pagino)
> +		pagino = be32_to_cpu(agi->agi_newino);
> +
> +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
> +
> +	error = xfs_check_agi_freecount(cur, agi);
> +	if (error)
> +		goto error_cur;
> +
> +	if (agno == pagno) {
> +		/*
> +		 * We're in the same AG as the parent inode so allocate the
> +		 * closest inode to the parent.
> +		 */
> +		error = xfs_inobt_lookup(cur, pagino, XFS_LOOKUP_LE, &i);
> +		if (error)
> +			goto error_cur;
> +		if (i == 1) {
> +			error = xfs_inobt_get_rec(cur, &rec, &i);
> +			if (error)
> +				goto error_cur;
> +			XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
> +
> +			/*
> +			 * See if we've landed in the parent inode record. The
> +			 * finobt only tracks chunks with at least one free
> +			 * inode, so record existence is enough.
> +			 */
> +			if (pagino >= rec.ir_startino &&
> +			    pagino < (rec.ir_startino + XFS_INODES_PER_CHUNK))
> +				goto alloc_inode;
> +		}
> +
> +		error = xfs_btree_dup_cursor(cur, &tcur);
> +		if (error) 
> +			goto error_cur;
> +
> +		error = xfs_inobt_lookup(tcur, pagino, XFS_LOOKUP_GE, &j);
> +		if (error)
> +			goto error_tcur;
> +		if (j == 1) {
> +			error = xfs_inobt_get_rec(tcur, &trec, &j);
> +			if (error)
> +				goto error_tcur;
> +			XFS_WANT_CORRUPTED_GOTO(j == 1, error_tcur);
> +		}
> +
> +		if (i == 1 && j == 1) {
> +			if ((pagino - rec.ir_startino + XFS_INODES_PER_CHUNK - 1) >
> +			    (trec.ir_startino - pagino)) {
> +				rec = trec;
> +				xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +				cur = tcur;
> +			} else {
> +				xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
> +			}
> +		} else if (j == 1) {
> +			rec = trec;
> +			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +			cur = tcur;
> +		} else {
> +			xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
> +		}

That entire branch can be easily factored into:

		error = xfs_dialloc_ag_finobt_near(cur, pagino, &rec);

/*
 * Allocate as close to the target agino as possible
 */
static int
xfs_dialloc_ag_finobt_near(
	struct xfs_btree_cur	*cur,
	xfs_agino_t		agino,
	struct xfs_inobt_rec_incore *rec)
{
	struct xfs_btree_cur	*rcur;	/* cursor for right search */
	struct xfs_inobt_rec_incore rrec; /* and the record used */
	int		error;
	int		i;
	int		l;

	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &i);
	if (error)
		return error;

	if (i == 1) {
		error = xfs_inobt_get_rec(cur, rec, &i);
		if (error)
			return error;
		XFS_WANT_CORRUPTED_RETURN(i == 1);

		/*
		 * See if we've landed in the target inode record. The
		 * finobt only tracks chunks with at least one free
		 * inode, so record existence is enough.
		 */
		if (agino >= rec->ir_startino &&
		    agino < (rec->ir_startino + XFS_INODES_PER_CHUNK))
			return 0;
	}

	error = xfs_btree_dup_cursor(cur, &rcur);
	if (error)
		return error;

	error = xfs_inobt_lookup(rcur, agino, XFS_LOOKUP_GE, &j);
	if (error)
		goto error_rcur;
	if (j == 1) {
		error = xfs_inobt_get_rec(rcur, &rrec, &j);
		if (error)
			goto error_rcur;
		XFS_WANT_CORRUPTED_GOTO(j == 1, error_rcur);
	}

	if (i == 1 && j == 1) {
		/*
		 * both left and right records are valid, so choose
		 * the closer inode chunk to the target.
		 */
		if ((agino - rec.ir_startino + XFS_INODES_PER_CHUNK - 1) >
					(rrec.ir_startino - agino)) {
			*rec = rrec;
			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
			cur = rcur;
		} else {
			xfs_btree_del_cursor(rcur, XFS_BTREE_NOERROR);
		}
	} else if (j == 1) {
		/* only right record is valid */
		*rec = rrec;
		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
		cur = rcur;
	} else {
		/* Found neither left nor right.
		xfs_btree_del_cursor(rcur, XFS_BTREE_NOERROR);
	}
	return 0;

error_rcur:
	xfs_btree_del_cursor(rcur, XFS_BTREE_ERROR);
	return error;
}

> +	} else {
> +		/*
> +		 * Different AG from the parent inode. Check the record for the
> +		 * most recently allocated inode.
> +		 */
> +		if (agi->agi_newino != cpu_to_be32(NULLAGINO)) {
> +			error = xfs_inobt_lookup(cur, agi->agi_newino,
> +						 XFS_LOOKUP_EQ, &i);
> +			if (error)
> +				goto error_cur;
> +			if (i == 1) {
> +				error = xfs_inobt_get_rec(cur, &rec, &i);
> +				if (error)
> +					goto error_cur;
> +				XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
> +				goto alloc_inode;
> +			}
> +		}
> +
> +		/*
> +		 * Allocate the first inode available in the AG.
> +		 */
> +		error = xfs_inobt_lookup(cur, 0, XFS_LOOKUP_GE, &i);
> +		if (error)
> +			goto error_cur;
> +		XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
> +
> +		error = xfs_inobt_get_rec(cur, &rec, &i);
> +		if (error)
> +			goto error_cur;
> +		XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);

And that can be factored in a similar manner in
xfs_dialloc_ag_newino()....

> +	}
> +
> +alloc_inode:
> +	offset = xfs_lowbit64(rec.ir_free);
> +	ASSERT(offset >= 0);
> +	ASSERT(offset < XFS_INODES_PER_CHUNK);
> +	ASSERT((XFS_AGINO_TO_OFFSET(mp, rec.ir_startino) %
> +				   XFS_INODES_PER_CHUNK) == 0);
> +	ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino + offset);
> +
> +	/*
> +	 * Modify or remove the finobt record.
> +	 */
> +	rec.ir_free &= ~XFS_INOBT_MASK(offset);
> +	rec.ir_freecount--;
> +	if (rec.ir_freecount) 
> +		error = xfs_inobt_update(cur, &rec);
> +	else
> +		error = xfs_btree_delete(cur, &i);
> +	if (error)
> +		goto error_cur;
> +
> +	/*
> +	 * Lookup and modify the equivalent record in the inobt.
> +	 */
> +	tcur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);

In case ou hadn't guessed, I don't like the "tcur/trec" variables
because they make me thing "temporary" which they aren't. In this
case it is the inobt cursor and record....

In fact, this whole segment could be factored into a function like
xfs_dialloc_ag_inobt_update(), hence removing the second cursor from
xfs_dialloc_ag() function altogether and that would clean a lot of
the logic up....

> +
> +	error = xfs_check_agi_freecount(tcur, agi);
> +	if (error)
> +		goto error_tcur;
> +
> +	error = xfs_inobt_lookup(tcur, rec.ir_startino, XFS_LOOKUP_EQ, &i);
> +	if (error)
> +		goto error_tcur;
> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error_tcur);
> +
> +	error = xfs_inobt_get_rec(tcur, &trec, &i);
> +	if (error)
> +		goto error_tcur;
> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error_tcur);
> +	ASSERT((XFS_AGINO_TO_OFFSET(mp, trec.ir_startino) %
> +				   XFS_INODES_PER_CHUNK) == 0);
> +
> +	trec.ir_free &= ~XFS_INOBT_MASK(offset);
> +	trec.ir_freecount--;
> +
> +	XFS_WANT_CORRUPTED_GOTO((rec.ir_free == trec.ir_free) &&
> +				(rec.ir_freecount == trec.ir_freecount),
> +				error_tcur);
> +
> +	error = xfs_inobt_update(tcur, &trec);
> +	if (error)
> +		goto error_tcur;
> +
> +	/*
> +	 * Update the perag and superblock.
> +	 */
> +	be32_add_cpu(&agi->agi_freecount, -1);
> +	xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
> +	pag->pagi_freecount--;
> +
> +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);

This will need to be done before you update the inobt, though, so
you can run the xfs_check_agi_freecount() count in it and it will
come out correct....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
  2014-02-04 17:49 ` [PATCH v3 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper Brian Foster
@ 2014-02-11  7:19   ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-11  7:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Feb 04, 2014 at 12:49:38PM -0500, Brian Foster wrote:
> Refactor xfs_difree() in preparation for the finobt. xfs_difree()
> performs the validity checks against the ag and reads the agi
> header. The work of physically updating the inode allocation btree
> is pushed down into the new xfs_difree_inobt() helper.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 08/11] xfs: update the finobt on inode free
  2014-02-04 17:49 ` [PATCH v3 08/11] xfs: update the finobt on inode free Brian Foster
@ 2014-02-11  7:31   ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-11  7:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Feb 04, 2014 at 12:49:39PM -0500, Brian Foster wrote:
> An inode free operation can have several effects on the finobt. If
> all inodes have been freed and the chunk deallocated, we remove the
> finobt record. If the inode chunk was previously full, we must
> insert a new record based on the existing inobt record. Otherwise,
> we modify the record in place.
> 
> Create the xfs_ifree_finobt() function to identify the potential
> scenarios and update the finobt appropriately.

Couple of minor things....

> +	 * Read and update the existing record.
> +	 */
> +	error = xfs_inobt_get_rec(cur, &rec, &i);
> +	if (error)
> +		goto error;
> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> +
> +	rec.ir_free |= XFS_INOBT_MASK(offset);
> +	rec.ir_freecount++;
> +
> +	XFS_WANT_CORRUPTED_GOTO((rec.ir_free == ibtrec->ir_free) &&
> +				(rec.ir_freecount == ibtrec->ir_freecount),
> +				error);

I'd add a small comment here saying:

	/*
	 * We could just copy the ibtrec across here, but that
	 * defeats the purpose of having redundant metadata. By
	 * doing the modifications independently, we can catch
	 * corruptions that we wouldn't see if we just copied from
	 * one record to another.
	 */

> +	/*
> +	 * The content of inobt records should always match between the inobt
> +	 * and finobt. The lifecycle of records in the finobt is different from
> +	 * the inobt in that the finobt only tracks records with at least one
> +	 * free inode. This is to optimize lookup for inode allocation purposes.
> +	 * The following checks determine whether to update the existing record or
> +	 * remove it entirely.
> +	 */
> +

extra whitespace line

> +	if (rec.ir_freecount == XFS_IALLOC_INODES(mp) &&
> +	    !(mp->m_flags & XFS_MOUNT_IKEEP)) {
> +		/*
> +		 * If all inodes are free and we're in !ikeep mode, the entire
> +		 * inode chunk has been deallocated. Remove the record from the
> +		 * finobt.
> +		 */
> +		error = xfs_btree_delete(cur, &i);
> +		if (error)
> +			goto error;
> +		ASSERT(i == 1);
> +	} else {
> +		/*
> +		 * The existing finobt record was modified and has a combination
> +		 * of allocated and free inodes or is completely free and ikeep
> +		 * is enabled. Update the record.
> +		 */
> +		error = xfs_inobt_update(cur, &rec);
> +		if (error)
> +			goto error;
> +	}

All th comments say pretty much the same thing, and repeat what the
code does. Hence I think this is sufficient:

	/*
	 * The content of inobt records should always match between the inobt
	 * and finobt. The lifecycle of records in the finobt is different from
	 * the inobt in that the finobt only tracks records with at least one
	 * free inode. Hence if all the inodes are free and we aren't keeping
	 * inode chunks permanently on disk, remove them. Otherwise just
	 * update the record with the new information.
	 */
	if (rec.ir_freecount == XFS_IALLOC_INODES(mp) &&
	    !(mp->m_flags & XFS_MOUNT_IKEEP)) {
		error = xfs_btree_delete(cur, &i);
		if (error)
			goto error;
		ASSERT(i == 1);
	} else {
		error = xfs_inobt_update(cur, &rec);
		if (error)
			goto error;
	}

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/11] xfs: report finobt status in fs geometry
  2014-02-04 17:49 ` [PATCH v3 10/11] xfs: report finobt status in fs geometry Brian Foster
@ 2014-02-11  7:34   ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-11  7:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Feb 04, 2014 at 12:49:41PM -0500, Brian Foster wrote:
> Define the XFS_FSOP_GEOM_FLAGS_FINOBT fs geometry flag and set the
> associated bit if the filesystem supports the free inode btree.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 11/11] xfs: enable the finobt feature on v5 superblocks
  2014-02-04 17:49 ` [PATCH v3 11/11] xfs: enable the finobt feature on v5 superblocks Brian Foster
@ 2014-02-11  7:34   ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-11  7:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Feb 04, 2014 at 12:49:42PM -0500, Brian Foster wrote:
> Add the finobt feature bit to the list of known features. As of
> this point, the kernel code knows how to mount and manage both
> finobt and non-finobt formatted filesystems.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks good,

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-11  6:46   ` Dave Chinner
@ 2014-02-11 16:22     ` Brian Foster
  2014-02-20  1:00       ` Dave Chinner
  2014-02-18 17:10     ` Brian Foster
  1 sibling, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-11 16:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 02/11/2014 01:46 AM, Dave Chinner wrote:
> On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
>> Create the xfs_calc_finobt_res() helper to calculate the finobt log
>> reservation for inode allocation and free. Update
>> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
>> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
>> reserve blocks for the potential finobt record insertion on inode
>> free (i.e., if an inode chunk was previously fully allocated).
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>>  fs/xfs/xfs_inode.c       |  4 +++-
>>  fs/xfs/xfs_trans_resv.c  | 47 +++++++++++++++++++++++++++++++++++++++++++----
>>  fs/xfs/xfs_trans_space.h |  7 ++++++-
>>  3 files changed, 52 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
...
>> diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
>> index 2fd59c0..32f35c1 100644
>> --- a/fs/xfs/xfs_trans_resv.c
>> +++ b/fs/xfs/xfs_trans_resv.c
>> @@ -98,6 +98,37 @@ xfs_calc_inode_res(
>>  }
>>  
>>  /*
>> + * The free inode btree is a conditional feature and the log reservation
>> + * requirements differ slightly from that of the traditional inode allocation
>> + * btree. The finobt tracks records for inode chunks with at least one free inode.
>> + * Therefore, a record can be removed from the tree for an inode allocation or
>> + * free and the associated merge reservation is unconditional. This also covers
>> + * the possibility of a split on record insertion.
> 
> Slightly wider than 80 columns here. FWIW, if you use vim, add this
> rule to have it add a red line at the textwidth you have set:
> 
> " highlight textwidth
> set cc=+1
> 
> And that will point out lines that are too long quite obviously ;)
> 

Interesting, thanks. :)

>> + *
>> + * the free inode btree: max depth * block size
>> + * the free inode btree entry: block size
>> + *
>> + * TODO: is the modify res really necessary? covered by the merge/split res?
>> + * This seems to be the pattern of ifree, but not create_resv_alloc. Why?
> 
> The modify case is for an allocation that only updates an inobt
> record (i.e. chunk already allocated, free inodes in it). Because
> we can remove a finobt record when "modifying" the last free inode
> record in a chunk, "modify" can cause a redcord removal and hence a
> tree merge. In which case it's no different of any of the other
> finobt reservations....
> 

When I made this note, I think I was curious why there was a need to add
to the reservation in the modify case. If we always made a reservation
for a btree merge, wouldn't that be enough to cover the basic modify
case? In the code, it looks like we either delete the record or modify
it. Is there something in the delete case that I'm missing, or are we
just being conservative here? Nonetheless, I modelled this after the
ifree requirements, assuming the reservation there was correct.

>> @@ -267,6 +298,7 @@ xfs_calc_remove_reservation(
>>   *    the superblock for the nlink flag: sector size
>>   *    the directory btree: (max depth + v2) * dir block size
>>   *    the directory inode's bmap btree: (max depth + v2) * block size
>> + *    the finobt
>>   */
>>  STATIC uint
>>  xfs_calc_create_resv_modify(
>> @@ -275,7 +307,8 @@ xfs_calc_create_resv_modify(
>>  	return xfs_calc_inode_res(mp, 2) +
>>  		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
>>  		(uint)XFS_FSB_TO_B(mp, 1) +
>> -		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
>> +		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
>> +		xfs_calc_finobt_res(mp, 1);
>>  }
> 
> And this is where is starts to get complex. The modify operation can
> now cause a finobt merge, when means blocks will be allocated/freed.
> That means we now need to take into account:
> 
>  *    the allocation btrees: 2 trees * (max depth - 1) * block size
> 
> and anything else freeing an extent requires.
> 

So is the "allocation btrees: 2 trees ..." portion of
xfs_calc_create_resv_alloc() associated with the allocation of the
actual inode blocks or potential allocations due to inode btree
management? I had thought the former, thus didn't include reservations
for this in the finobt calculation.

>>  /*
>> @@ -285,6 +318,7 @@ xfs_calc_create_resv_modify(
>>   *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
>>   *    the inode btree: max depth * blocksize
>>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
>> + *    the finobt
>>   */
>>  STATIC uint
>>  xfs_calc_create_resv_alloc(
>> @@ -295,7 +329,8 @@ xfs_calc_create_resv_alloc(
>>  		xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
>>  		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
>>  		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
>> -				 XFS_FSB_TO_B(mp, 1));
>> +				 XFS_FSB_TO_B(mp, 1)) +
>> +		xfs_calc_finobt_res(mp, 0);
>>  }
> 
> This reservation is only for v4 superblocks - the icreate
> transaction reservation is used for v5 superblocks, so that's the
> only one you need to modify.
> 

Ok, thanks.

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 06/11] xfs: use and update the finobt on inode allocation
  2014-02-11  7:17   ` Dave Chinner
@ 2014-02-11 16:32     ` Brian Foster
  2014-02-14 20:01     ` Brian Foster
  1 sibling, 0 replies; 34+ messages in thread
From: Brian Foster @ 2014-02-11 16:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 02/11/2014 02:17 AM, Dave Chinner wrote:
> On Tue, Feb 04, 2014 at 12:49:37PM -0500, Brian Foster wrote:
>> Replace xfs_dialloc_ag() with an implementation that looks for a
>> record in the finobt. The finobt only tracks records with at least
>> one free inode. This eliminates the need for the intra-ag scan in
>> the original algorithm. Once the inode is allocated, update the
>> finobt appropriately (possibly removing the record) as well as the
>> inobt.
>>
>> Move the original xfs_dialloc_ag() algorithm to
>> xfs_dialloc_ag_slow() and fall back as such if finobt support is
>> not enabled.
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>>  fs/xfs/xfs_ialloc.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 210 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
>> index bdaab76..afc9840 100644
>> --- a/fs/xfs/xfs_ialloc.c
>> +++ b/fs/xfs/xfs_ialloc.c
>> @@ -723,7 +723,7 @@ xfs_ialloc_get_rec(
>>   * available.
>>   */
>>  STATIC int
>> -xfs_dialloc_ag(
>> +xfs_dialloc_ag_slow(
>>  	struct xfs_trans	*tp,
>>  	struct xfs_buf		*agbp,
>>  	xfs_ino_t		parent,
> 
> OK, though I would have called it xfs_dialloc_ag_from_inobt() or
> something similar to indicate what tree it is walking....
> 

Fair enough, that's more specific at least.

>> +STATIC int
>> +xfs_dialloc_ag(
>> +	struct xfs_trans	*tp,
>> +	struct xfs_buf		*agbp,
>> +	xfs_ino_t		parent,
>> +	xfs_ino_t		*inop)
>> +{
> 
> Initial thought: Wow, long, long function. How can we split this up?
> 
>> +	struct xfs_mount		*mp = tp->t_mountp;
>> +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
>> +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
>> +	xfs_agnumber_t			pagno = XFS_INO_TO_AGNO(mp, parent);
>> +	xfs_agino_t			pagino = XFS_INO_TO_AGINO(mp, parent);
>> +	struct xfs_perag		*pag;
>> +	struct xfs_btree_cur		*cur;
>> +	struct xfs_btree_cur		*tcur;
>> +	struct xfs_inobt_rec_incore	rec;
>> +	struct xfs_inobt_rec_incore	trec;
>> +	xfs_ino_t			ino;
>> +	int				error;
>> +	int				offset;
>> +	int				i, j;
>> +
>> +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
>> +		return xfs_dialloc_ag_slow(tp, agbp, parent, inop);
>> +
>> +	pag = xfs_perag_get(mp, agno);
>> +
>> +	/*
>> +	 * If pagino is 0 (this is the root inode allocation) use newino.
>> +	 * This must work because we've just allocated some.
>> +	 */
>> +	if (!pagino)
>> +		pagino = be32_to_cpu(agi->agi_newino);
>> +
>> +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
>> +
>> +	error = xfs_check_agi_freecount(cur, agi);
>> +	if (error)
>> +		goto error_cur;
>> +
>> +	if (agno == pagno) {
>> +		/*
>> +		 * We're in the same AG as the parent inode so allocate the
>> +		 * closest inode to the parent.
>> +		 */
>> +		error = xfs_inobt_lookup(cur, pagino, XFS_LOOKUP_LE, &i);
>> +		if (error)
>> +			goto error_cur;
>> +		if (i == 1) {
>> +			error = xfs_inobt_get_rec(cur, &rec, &i);
>> +			if (error)
>> +				goto error_cur;
>> +			XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
>> +
>> +			/*
>> +			 * See if we've landed in the parent inode record. The
>> +			 * finobt only tracks chunks with at least one free
>> +			 * inode, so record existence is enough.
>> +			 */
>> +			if (pagino >= rec.ir_startino &&
>> +			    pagino < (rec.ir_startino + XFS_INODES_PER_CHUNK))
>> +				goto alloc_inode;
>> +		}
>> +
>> +		error = xfs_btree_dup_cursor(cur, &tcur);
>> +		if (error) 
>> +			goto error_cur;
>> +
>> +		error = xfs_inobt_lookup(tcur, pagino, XFS_LOOKUP_GE, &j);
>> +		if (error)
>> +			goto error_tcur;
>> +		if (j == 1) {
>> +			error = xfs_inobt_get_rec(tcur, &trec, &j);
>> +			if (error)
>> +				goto error_tcur;
>> +			XFS_WANT_CORRUPTED_GOTO(j == 1, error_tcur);
>> +		}
>> +
>> +		if (i == 1 && j == 1) {
>> +			if ((pagino - rec.ir_startino + XFS_INODES_PER_CHUNK - 1) >
>> +			    (trec.ir_startino - pagino)) {
>> +				rec = trec;
>> +				xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +				cur = tcur;
>> +			} else {
>> +				xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
>> +			}
>> +		} else if (j == 1) {
>> +			rec = trec;
>> +			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +			cur = tcur;
>> +		} else {
>> +			xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
>> +		}
> 
> That entire branch can be easily factored into:
> 
> 		error = xfs_dialloc_ag_finobt_near(cur, pagino, &rec);
> 

Indeed. The function is already organized to facilitate this via the use
of the cursor/record pointers. A slight difference is the callee may
need to duplicate and replace the caller's cursor.

> /*
>  * Allocate as close to the target agino as possible
>  */
> static int
> xfs_dialloc_ag_finobt_near(
> 	struct xfs_btree_cur	*cur,
> 	xfs_agino_t		agino,
> 	struct xfs_inobt_rec_incore *rec)
> {
> 	struct xfs_btree_cur	*rcur;	/* cursor for right search */
> 	struct xfs_inobt_rec_incore rrec; /* and the record used */
> 	int		error;
> 	int		i;
> 	int		l;
> 
> 	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &i);
> 	if (error)
> 		return error;
> 
> 	if (i == 1) {
> 		error = xfs_inobt_get_rec(cur, rec, &i);
> 		if (error)
> 			return error;
> 		XFS_WANT_CORRUPTED_RETURN(i == 1);
> 
> 		/*
> 		 * See if we've landed in the target inode record. The
> 		 * finobt only tracks chunks with at least one free
> 		 * inode, so record existence is enough.
> 		 */
> 		if (agino >= rec->ir_startino &&
> 		    agino < (rec->ir_startino + XFS_INODES_PER_CHUNK))
> 			return 0;
> 	}
> 
> 	error = xfs_btree_dup_cursor(cur, &rcur);
> 	if (error)
> 		return error;
> 
> 	error = xfs_inobt_lookup(rcur, agino, XFS_LOOKUP_GE, &j);
> 	if (error)
> 		goto error_rcur;
> 	if (j == 1) {
> 		error = xfs_inobt_get_rec(rcur, &rrec, &j);
> 		if (error)
> 			goto error_rcur;
> 		XFS_WANT_CORRUPTED_GOTO(j == 1, error_rcur);
> 	}
> 
> 	if (i == 1 && j == 1) {
> 		/*
> 		 * both left and right records are valid, so choose
> 		 * the closer inode chunk to the target.
> 		 */
> 		if ((agino - rec.ir_startino + XFS_INODES_PER_CHUNK - 1) >
> 					(rrec.ir_startino - agino)) {
> 			*rec = rrec;
> 			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> 			cur = rcur;
> 		} else {
> 			xfs_btree_del_cursor(rcur, XFS_BTREE_NOERROR);
> 		}
> 	} else if (j == 1) {
> 		/* only right record is valid */
> 		*rec = rrec;
> 		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> 		cur = rcur;
> 	} else {
> 		/* Found neither left nor right.
> 		xfs_btree_del_cursor(rcur, XFS_BTREE_NOERROR);
> 	}
> 	return 0;
> 
> error_rcur:
> 	xfs_btree_del_cursor(rcur, XFS_BTREE_ERROR);
> 	return error;
> }
> 
>> +	} else {
>> +		/*
>> +		 * Different AG from the parent inode. Check the record for the
>> +		 * most recently allocated inode.
>> +		 */
>> +		if (agi->agi_newino != cpu_to_be32(NULLAGINO)) {
>> +			error = xfs_inobt_lookup(cur, agi->agi_newino,
>> +						 XFS_LOOKUP_EQ, &i);
>> +			if (error)
>> +				goto error_cur;
>> +			if (i == 1) {
>> +				error = xfs_inobt_get_rec(cur, &rec, &i);
>> +				if (error)
>> +					goto error_cur;
>> +				XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
>> +				goto alloc_inode;
>> +			}
>> +		}
>> +
>> +		/*
>> +		 * Allocate the first inode available in the AG.
>> +		 */
>> +		error = xfs_inobt_lookup(cur, 0, XFS_LOOKUP_GE, &i);
>> +		if (error)
>> +			goto error_cur;
>> +		XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
>> +
>> +		error = xfs_inobt_get_rec(cur, &rec, &i);
>> +		if (error)
>> +			goto error_cur;
>> +		XFS_WANT_CORRUPTED_GOTO(i == 1, error_cur);
> 
> And that can be factored in a similar manner in
> xfs_dialloc_ag_newino()....
> 

Ok.

>> +	}
>> +
>> +alloc_inode:
>> +	offset = xfs_lowbit64(rec.ir_free);
>> +	ASSERT(offset >= 0);
>> +	ASSERT(offset < XFS_INODES_PER_CHUNK);
>> +	ASSERT((XFS_AGINO_TO_OFFSET(mp, rec.ir_startino) %
>> +				   XFS_INODES_PER_CHUNK) == 0);
>> +	ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino + offset);
>> +
>> +	/*
>> +	 * Modify or remove the finobt record.
>> +	 */
>> +	rec.ir_free &= ~XFS_INOBT_MASK(offset);
>> +	rec.ir_freecount--;
>> +	if (rec.ir_freecount) 
>> +		error = xfs_inobt_update(cur, &rec);
>> +	else
>> +		error = xfs_btree_delete(cur, &i);
>> +	if (error)
>> +		goto error_cur;
>> +
>> +	/*
>> +	 * Lookup and modify the equivalent record in the inobt.
>> +	 */
>> +	tcur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
> 
> In case ou hadn't guessed, I don't like the "tcur/trec" variables
> because they make me thing "temporary" which they aren't. In this
> case it is the inobt cursor and record....
> 
> In fact, this whole segment could be factored into a function like
> xfs_dialloc_ag_inobt_update(), hence removing the second cursor from
> xfs_dialloc_ag() function altogether and that would clean a lot of
> the logic up....
> 

Sounds reasonable.

>> +
>> +	error = xfs_check_agi_freecount(tcur, agi);
>> +	if (error)
>> +		goto error_tcur;
>> +
>> +	error = xfs_inobt_lookup(tcur, rec.ir_startino, XFS_LOOKUP_EQ, &i);
>> +	if (error)
>> +		goto error_tcur;
>> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error_tcur);
>> +
>> +	error = xfs_inobt_get_rec(tcur, &trec, &i);
>> +	if (error)
>> +		goto error_tcur;
>> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error_tcur);
>> +	ASSERT((XFS_AGINO_TO_OFFSET(mp, trec.ir_startino) %
>> +				   XFS_INODES_PER_CHUNK) == 0);
>> +
>> +	trec.ir_free &= ~XFS_INOBT_MASK(offset);
>> +	trec.ir_freecount--;
>> +
>> +	XFS_WANT_CORRUPTED_GOTO((rec.ir_free == trec.ir_free) &&
>> +				(rec.ir_freecount == trec.ir_freecount),
>> +				error_tcur);
>> +
>> +	error = xfs_inobt_update(tcur, &trec);
>> +	if (error)
>> +		goto error_tcur;
>> +
>> +	/*
>> +	 * Update the perag and superblock.
>> +	 */
>> +	be32_add_cpu(&agi->agi_freecount, -1);
>> +	xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
>> +	pag->pagi_freecount--;
>> +
>> +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
> 
> This will need to be done before you update the inobt, though, so
> you can run the xfs_check_agi_freecount() count in it and it will
> come out correct....
> 

Right. I suppose this could update the finobt, perag & super, check the
agi against the finobt, then invoke the new helper (which will update
the inobt and check its cursor against the agi).

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 06/11] xfs: use and update the finobt on inode allocation
  2014-02-11  7:17   ` Dave Chinner
  2014-02-11 16:32     ` Brian Foster
@ 2014-02-14 20:01     ` Brian Foster
  2014-02-20  0:38       ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Brian Foster @ 2014-02-14 20:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 02/11/2014 02:17 AM, Dave Chinner wrote:
> On Tue, Feb 04, 2014 at 12:49:37PM -0500, Brian Foster wrote:
...
>> +	/*
>> +	 * Lookup and modify the equivalent record in the inobt.
>> +	 */
>> +	tcur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
> 
> In case ou hadn't guessed, I don't like the "tcur/trec" variables
> because they make me thing "temporary" which they aren't. In this
> case it is the inobt cursor and record....
> 
> In fact, this whole segment could be factored into a function like
> xfs_dialloc_ag_inobt_update(), hence removing the second cursor from
> xfs_dialloc_ag() function altogether and that would clean a lot of
> the logic up....
> 
...
>> +	/*
>> +	 * Update the perag and superblock.
>> +	 */
>> +	be32_add_cpu(&agi->agi_freecount, -1);
>> +	xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
>> +	pag->pagi_freecount--;
>> +
>> +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
> 
> This will need to be done before you update the inobt, though, so
> you can run the xfs_check_agi_freecount() count in it and it will
> come out correct....
> 

After cleaning up some of this code and taking a closer look, I end up
with something just short of complete removal of the inobt cursor in
this function. Reason being... the point above with regard to checking
the btrees against the agi freecount pre and post modification.

The only way I see to push the inobt cursor into a separate function is
to also push the agi/sb update to the middle of that function, which
just seems ugly to me. The helper function is now non-intuitively
sensitive to the context in which it is called (and conversely buries
context that makes xfs_dialloc_ag() harder to follow). E.g., misplaced
functionality for an "update the inobt" helper.

That said, xfs_dialloc_ag() is much smaller from all of the other
refactoring, trec is gone and tcur is renamed to icur. So I think
clarity is still improved. Let me know if there's any major objections
to that pattern or alternate suggestions. Otherwise, I'll run with this
for the next version...

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-11  6:46   ` Dave Chinner
  2014-02-11 16:22     ` Brian Foster
@ 2014-02-18 17:10     ` Brian Foster
  2014-02-18 20:34       ` Brian Foster
  2014-02-20  2:01       ` Dave Chinner
  1 sibling, 2 replies; 34+ messages in thread
From: Brian Foster @ 2014-02-18 17:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 02/11/2014 01:46 AM, Dave Chinner wrote:
> On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
>> Create the xfs_calc_finobt_res() helper to calculate the finobt log
>> reservation for inode allocation and free. Update
>> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
>> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
>> reserve blocks for the potential finobt record insertion on inode
>> free (i.e., if an inode chunk was previously fully allocated).
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>>  fs/xfs/xfs_inode.c       |  4 +++-
>>  fs/xfs/xfs_trans_resv.c  | 47 +++++++++++++++++++++++++++++++++++++++++++----
>>  fs/xfs/xfs_trans_space.h |  7 ++++++-
>>  3 files changed, 52 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 001aa89..57c77ed 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
>>  	int			error;
>>  
>>  	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
>> -	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
>> +	tp->t_flags |= XFS_TRANS_RESERVE;
>> +	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
>> +				  XFS_IFREE_SPACE_RES(mp), 0);
> 
> Can you add a comment explaining why the XFS_TRANS_RESERVE flag is
> used here, and why it's use won't lead to accelerated reserve pool
> depletion?
> 

So this aspect of things appears to be a bit more interesting than I
originally anticipated. I "reserve enabled" this transaction to
facilitate the ability to free up inodes under ENOSPC conditions.
Without this, the problem of failing out of xfs_inactive_ifree() (and
leaving an inode chained on the unlinked list) is easily reproducible
with generic/083.

The basic argument for why this is reasonable is that releasing an inode
releases used space (i.e., file blocks and potentially directory blocks
and inode chunks over time). That said, I can manufacture situations
where this is not the case. E.g., allocate a bunch of 0-sized files,
consume remaining free space in some separate file, start removing
inodes in a manner that removes a single inode per chunk or so. This
creates a scenario where the inobt can be very large and the finobt very
small (likely a single record). Removing the inodes in this manner
reduces the likelihood of freeing up any space and thus rapidly grows
the finobt towards the size of the inobt without any free space
available. This might or might not qualify as sane use of the fs, but I
don't think the failure scenario is acceptable as things currently stand.

I think there are several ways this can go from here. A couple ideas
that have crossed my mind:

- Find a way to variably reserve the number of blocks that would be
required to grow the finobt to the finobt, based on current state. This
would require the total number of blocks (not just enough for a split),
so this could get complex and somewhat overbearing (i.e., a lot of space
could be quietly reserved, current tracking might not be sufficient and
the allocation paths could get hairy).

- Work to push the ifree transaction allocation and reservation to the
unlink codepath rather than the eviction codepath. Under normal
circumstances, chain the tp to the xfs_inode such that the eviction code
path can grab it and run. This prevents us going into the state where an
inode is unlinked without having enough space to free up. On the flip
side, ENOSPC on unlink isn't very forgiving behavior to the user.

I think the former approach is probably overkill for something that
might be a pathological situation. The latter approach is more simple,
but it feels like a bit of a hack. I've experimented with it a bit, but
I'm not quite sure yet if it introduces any transaction issues by
allocating the unlink and ifree transactions at the same time.

Perhaps another argument could be made that it's rather unlikely we run
into an fs with as many 0-sized (or sub-inode chunk sized) files as
required to deplete the reserve pool without freeing any space, and we
should just touch up the failure handling. E.g.,

1.) Continue to reserve enable the ifree transaction. Consider expanding
the reserve pool on finobt-enabled fs' if appropriate. Note that this is
not guaranteed to provide enough resources to populate the finobt to the
level of the inobt without freeing up more space.
2.) Attempt a !XFS_TRANS_RESERVE tp reservation in xfs_inactive_ifree().
If fails, xfs_warn()/notice() and enable XFS_TRANS_RESERVE.
3.) Attempt XFS_TRANS_RESERVE reservation. If fails, xfs_notice() and
shutdown.

And this could probably be made more intelligent to bail out sooner if
we repeat XFS_TRANS_RESERVE reservations without freeing up any space,
etc. Before going too far in one direction... thoughts?

Brian

>>  	if (error) {
>>  		ASSERT(XFS_FORCED_SHUTDOWN(mp));
>>  		xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES);
>> diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
>> index 2fd59c0..32f35c1 100644
>> --- a/fs/xfs/xfs_trans_resv.c
>> +++ b/fs/xfs/xfs_trans_resv.c
>> @@ -98,6 +98,37 @@ xfs_calc_inode_res(
>>  }
>>  
>>  /*
>> + * The free inode btree is a conditional feature and the log reservation
>> + * requirements differ slightly from that of the traditional inode allocation
>> + * btree. The finobt tracks records for inode chunks with at least one free inode.
>> + * Therefore, a record can be removed from the tree for an inode allocation or
>> + * free and the associated merge reservation is unconditional. This also covers
>> + * the possibility of a split on record insertion.
> 
> Slightly wider than 80 columns here. FWIW, if you use vim, add this
> rule to have it add a red line at the textwidth you have set:
> 
> " highlight textwidth
> set cc=+1
> 
> And that will point out lines that are too long quite obviously ;)
> 
>> + *
>> + * the free inode btree: max depth * block size
>> + * the free inode btree entry: block size
>> + *
>> + * TODO: is the modify res really necessary? covered by the merge/split res?
>> + * This seems to be the pattern of ifree, but not create_resv_alloc. Why?
> 
> The modify case is for an allocation that only updates an inobt
> record (i.e. chunk already allocated, free inodes in it). Because
> we can remove a finobt record when "modifying" the last free inode
> record in a chunk, "modify" can cause a redcord removal and hence a
> tree merge. In which case it's no different of any of the other
> finobt reservations....
> 
>> @@ -267,6 +298,7 @@ xfs_calc_remove_reservation(
>>   *    the superblock for the nlink flag: sector size
>>   *    the directory btree: (max depth + v2) * dir block size
>>   *    the directory inode's bmap btree: (max depth + v2) * block size
>> + *    the finobt
>>   */
>>  STATIC uint
>>  xfs_calc_create_resv_modify(
>> @@ -275,7 +307,8 @@ xfs_calc_create_resv_modify(
>>  	return xfs_calc_inode_res(mp, 2) +
>>  		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
>>  		(uint)XFS_FSB_TO_B(mp, 1) +
>> -		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
>> +		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
>> +		xfs_calc_finobt_res(mp, 1);
>>  }
> 
> And this is where is starts to get complex. The modify operation can
> now cause a finobt merge, when means blocks will be allocated/freed.
> That means we now need to take into account:
> 
>  *    the allocation btrees: 2 trees * (max depth - 1) * block size
> 
> and anything else freeing an extent requires.
> 
>>  /*
>> @@ -285,6 +318,7 @@ xfs_calc_create_resv_modify(
>>   *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
>>   *    the inode btree: max depth * blocksize
>>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
>> + *    the finobt
>>   */
>>  STATIC uint
>>  xfs_calc_create_resv_alloc(
>> @@ -295,7 +329,8 @@ xfs_calc_create_resv_alloc(
>>  		xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
>>  		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
>>  		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
>> -				 XFS_FSB_TO_B(mp, 1));
>> +				 XFS_FSB_TO_B(mp, 1)) +
>> +		xfs_calc_finobt_res(mp, 0);
>>  }
> 
> This reservation is only for v4 superblocks - the icreate
> transaction reservation is used for v5 superblocks, so that's the
> only one you need to modify.
> 
> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-18 17:10     ` Brian Foster
@ 2014-02-18 20:34       ` Brian Foster
  2014-02-20  2:01       ` Dave Chinner
  1 sibling, 0 replies; 34+ messages in thread
From: Brian Foster @ 2014-02-18 20:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 02/18/2014 12:10 PM, Brian Foster wrote:
> On 02/11/2014 01:46 AM, Dave Chinner wrote:
>> On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
>>> Create the xfs_calc_finobt_res() helper to calculate the finobt log
>>> reservation for inode allocation and free. Update
>>> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
>>> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
>>> reserve blocks for the potential finobt record insertion on inode
>>> free (i.e., if an inode chunk was previously fully allocated).
>>>
>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>> ---
>>>  fs/xfs/xfs_inode.c       |  4 +++-
>>>  fs/xfs/xfs_trans_resv.c  | 47 +++++++++++++++++++++++++++++++++++++++++++----
>>>  fs/xfs/xfs_trans_space.h |  7 ++++++-
>>>  3 files changed, 52 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>>> index 001aa89..57c77ed 100644
>>> --- a/fs/xfs/xfs_inode.c
>>> +++ b/fs/xfs/xfs_inode.c
>>> @@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
>>>  	int			error;
>>>  
>>>  	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
>>> -	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
>>> +	tp->t_flags |= XFS_TRANS_RESERVE;
>>> +	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
>>> +				  XFS_IFREE_SPACE_RES(mp), 0);
>>
>> Can you add a comment explaining why the XFS_TRANS_RESERVE flag is
>> used here, and why it's use won't lead to accelerated reserve pool
>> depletion?
>>
> 
> So this aspect of things appears to be a bit more interesting than I
> originally anticipated. I "reserve enabled" this transaction to
> facilitate the ability to free up inodes under ENOSPC conditions.
> Without this, the problem of failing out of xfs_inactive_ifree() (and
> leaving an inode chained on the unlinked list) is easily reproducible
> with generic/083.
> 
> The basic argument for why this is reasonable is that releasing an inode
> releases used space (i.e., file blocks and potentially directory blocks
> and inode chunks over time). That said, I can manufacture situations
> where this is not the case. E.g., allocate a bunch of 0-sized files,
> consume remaining free space in some separate file, start removing
> inodes in a manner that removes a single inode per chunk or so. This
> creates a scenario where the inobt can be very large and the finobt very
> small (likely a single record). Removing the inodes in this manner
> reduces the likelihood of freeing up any space and thus rapidly grows
> the finobt towards the size of the inobt without any free space
> available. This might or might not qualify as sane use of the fs, but I
> don't think the failure scenario is acceptable as things currently stand.
> 
> I think there are several ways this can go from here. A couple ideas
> that have crossed my mind:
> 
> - Find a way to variably reserve the number of blocks that would be
> required to grow the finobt to the finobt, based on current state. This
> would require the total number of blocks (not just enough for a split),
> so this could get complex and somewhat overbearing (i.e., a lot of space
> could be quietly reserved, current tracking might not be sufficient and
> the allocation paths could get hairy).
> 
> - Work to push the ifree transaction allocation and reservation to the
> unlink codepath rather than the eviction codepath. Under normal
> circumstances, chain the tp to the xfs_inode such that the eviction code
> path can grab it and run. This prevents us going into the state where an
> inode is unlinked without having enough space to free up. On the flip
> side, ENOSPC on unlink isn't very forgiving behavior to the user.
> 

- Add some state or flags bits to the finobt and the associated ability
to kill/invalidate it at runtime. Print a warning with regard to the
situation that indicates performance might be affected and a repair is
required to re-enable.

Brian

> I think the former approach is probably overkill for something that
> might be a pathological situation. The latter approach is more simple,
> but it feels like a bit of a hack. I've experimented with it a bit, but
> I'm not quite sure yet if it introduces any transaction issues by
> allocating the unlink and ifree transactions at the same time.
> 
> Perhaps another argument could be made that it's rather unlikely we run
> into an fs with as many 0-sized (or sub-inode chunk sized) files as
> required to deplete the reserve pool without freeing any space, and we
> should just touch up the failure handling. E.g.,
> 
> 1.) Continue to reserve enable the ifree transaction. Consider expanding
> the reserve pool on finobt-enabled fs' if appropriate. Note that this is
> not guaranteed to provide enough resources to populate the finobt to the
> level of the inobt without freeing up more space.
> 2.) Attempt a !XFS_TRANS_RESERVE tp reservation in xfs_inactive_ifree().
> If fails, xfs_warn()/notice() and enable XFS_TRANS_RESERVE.
> 3.) Attempt XFS_TRANS_RESERVE reservation. If fails, xfs_notice() and
> shutdown.
> 
> And this could probably be made more intelligent to bail out sooner if
> we repeat XFS_TRANS_RESERVE reservations without freeing up any space,
> etc. Before going too far in one direction... thoughts?
> 
> Brian
> 
>>>  	if (error) {
>>>  		ASSERT(XFS_FORCED_SHUTDOWN(mp));
>>>  		xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES);
>>> diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
>>> index 2fd59c0..32f35c1 100644
>>> --- a/fs/xfs/xfs_trans_resv.c
>>> +++ b/fs/xfs/xfs_trans_resv.c
>>> @@ -98,6 +98,37 @@ xfs_calc_inode_res(
>>>  }
>>>  
>>>  /*
>>> + * The free inode btree is a conditional feature and the log reservation
>>> + * requirements differ slightly from that of the traditional inode allocation
>>> + * btree. The finobt tracks records for inode chunks with at least one free inode.
>>> + * Therefore, a record can be removed from the tree for an inode allocation or
>>> + * free and the associated merge reservation is unconditional. This also covers
>>> + * the possibility of a split on record insertion.
>>
>> Slightly wider than 80 columns here. FWIW, if you use vim, add this
>> rule to have it add a red line at the textwidth you have set:
>>
>> " highlight textwidth
>> set cc=+1
>>
>> And that will point out lines that are too long quite obviously ;)
>>
>>> + *
>>> + * the free inode btree: max depth * block size
>>> + * the free inode btree entry: block size
>>> + *
>>> + * TODO: is the modify res really necessary? covered by the merge/split res?
>>> + * This seems to be the pattern of ifree, but not create_resv_alloc. Why?
>>
>> The modify case is for an allocation that only updates an inobt
>> record (i.e. chunk already allocated, free inodes in it). Because
>> we can remove a finobt record when "modifying" the last free inode
>> record in a chunk, "modify" can cause a redcord removal and hence a
>> tree merge. In which case it's no different of any of the other
>> finobt reservations....
>>
>>> @@ -267,6 +298,7 @@ xfs_calc_remove_reservation(
>>>   *    the superblock for the nlink flag: sector size
>>>   *    the directory btree: (max depth + v2) * dir block size
>>>   *    the directory inode's bmap btree: (max depth + v2) * block size
>>> + *    the finobt
>>>   */
>>>  STATIC uint
>>>  xfs_calc_create_resv_modify(
>>> @@ -275,7 +307,8 @@ xfs_calc_create_resv_modify(
>>>  	return xfs_calc_inode_res(mp, 2) +
>>>  		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
>>>  		(uint)XFS_FSB_TO_B(mp, 1) +
>>> -		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
>>> +		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
>>> +		xfs_calc_finobt_res(mp, 1);
>>>  }
>>
>> And this is where is starts to get complex. The modify operation can
>> now cause a finobt merge, when means blocks will be allocated/freed.
>> That means we now need to take into account:
>>
>>  *    the allocation btrees: 2 trees * (max depth - 1) * block size
>>
>> and anything else freeing an extent requires.
>>
>>>  /*
>>> @@ -285,6 +318,7 @@ xfs_calc_create_resv_modify(
>>>   *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
>>>   *    the inode btree: max depth * blocksize
>>>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
>>> + *    the finobt
>>>   */
>>>  STATIC uint
>>>  xfs_calc_create_resv_alloc(
>>> @@ -295,7 +329,8 @@ xfs_calc_create_resv_alloc(
>>>  		xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
>>>  		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
>>>  		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
>>> -				 XFS_FSB_TO_B(mp, 1));
>>> +				 XFS_FSB_TO_B(mp, 1)) +
>>> +		xfs_calc_finobt_res(mp, 0);
>>>  }
>>
>> This reservation is only for v4 superblocks - the icreate
>> transaction reservation is used for v5 superblocks, so that's the
>> only one you need to modify.
>>
>> Cheers,
>>
>> Dave.
>>
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 06/11] xfs: use and update the finobt on inode allocation
  2014-02-14 20:01     ` Brian Foster
@ 2014-02-20  0:38       ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-20  0:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 14, 2014 at 03:01:45PM -0500, Brian Foster wrote:
> On 02/11/2014 02:17 AM, Dave Chinner wrote:
> > On Tue, Feb 04, 2014 at 12:49:37PM -0500, Brian Foster wrote:
> ...
> >> +	/*
> >> +	 * Lookup and modify the equivalent record in the inobt.
> >> +	 */
> >> +	tcur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
> > 
> > In case ou hadn't guessed, I don't like the "tcur/trec" variables
> > because they make me thing "temporary" which they aren't. In this
> > case it is the inobt cursor and record....
> > 
> > In fact, this whole segment could be factored into a function like
> > xfs_dialloc_ag_inobt_update(), hence removing the second cursor from
> > xfs_dialloc_ag() function altogether and that would clean a lot of
> > the logic up....
> > 
> ...
> >> +	/*
> >> +	 * Update the perag and superblock.
> >> +	 */
> >> +	be32_add_cpu(&agi->agi_freecount, -1);
> >> +	xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
> >> +	pag->pagi_freecount--;
> >> +
> >> +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
> > 
> > This will need to be done before you update the inobt, though, so
> > you can run the xfs_check_agi_freecount() count in it and it will
> > come out correct....
> > 
> 
> After cleaning up some of this code and taking a closer look, I end up
> with something just short of complete removal of the inobt cursor in
> this function. Reason being... the point above with regard to checking
> the btrees against the agi freecount pre and post modification.

I can't really comment all that well without having seen the
factored code you've written....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-11 16:22     ` Brian Foster
@ 2014-02-20  1:00       ` Dave Chinner
  2014-02-20 16:04         ` Brian Foster
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2014-02-20  1:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Feb 11, 2014 at 11:22:57AM -0500, Brian Foster wrote:
> On 02/11/2014 01:46 AM, Dave Chinner wrote:
> > On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
> >> Create the xfs_calc_finobt_res() helper to calculate the finobt log
> >> reservation for inode allocation and free. Update
> >> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
> >> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
> >> reserve blocks for the potential finobt record insertion on inode
> >> free (i.e., if an inode chunk was previously fully allocated).
....
> >> + *
> >> + * the free inode btree: max depth * block size
> >> + * the free inode btree entry: block size
> >> + *
> >> + * TODO: is the modify res really necessary? covered by the merge/split res?
> >> + * This seems to be the pattern of ifree, but not create_resv_alloc. Why?
> > 
> > The modify case is for an allocation that only updates an inobt
> > record (i.e. chunk already allocated, free inodes in it). Because
> > we can remove a finobt record when "modifying" the last free inode
> > record in a chunk, "modify" can cause a redcord removal and hence a
> > tree merge. In which case it's no different of any of the other
> > finobt reservations....
> 
> When I made this note, I think I was curious why there was a need to add
> to the reservation in the modify case. If we always made a reservation
> for a btree merge, wouldn't that be enough to cover the basic modify
> case?

The two cases are kept separate because if we ever separate inode
chunk allocation from individual inode allocation from an allocated
chunk, then we need the different reservations. if you go back to
the original RFD series that documented the finobt, it also
discusses this separation and the optimisations in inode management
that such a separation brings.

Hence we should not assume that an inode chunk allocation
reservation covers the reservation of an inode allocation from an
allocated chunk.

> In the code, it looks like we either delete the record or modify
> it. Is there something in the delete case that I'm missing, or are we
> just being conservative here? Nonetheless, I modelled this after the
> ifree requirements, assuming the reservation there was correct.

But  the ifree requirement changed, didn't they? i.e. Because the
first inode freed in a chunk can now trigger allocation.

Basically, we have these cases:

Allocation:

	- allocate chunk, insert inobt rec, insert finobt rec
	- allocate inode, modify inobt rec, modify finobt rec
	- allocate inode, modify inobt rec, remove finobt rec

The first case is obvious in that it needs to allocate blocks
for the inode chunk, but also possibly the inbot and finobt need
allocations too (split).

The second case obvious doesn't allocate any blocks at all,
and tha tis the existing "modify reservation"

The third case is the new "modify reservation" case, where the
removal of a finobt record can cause a merge and hence freeing of a
block. That's the case we need to make sure we have the corerct
reservation for.

In the case of ifree, it is similar:

	- free inode, modify inobt rec, insert finobt rec
	- free inode, modify inobt rec, modify finobt rec
	- free chunk, remove inobt rec, remove finobt rec

And so it is clear that the same new case occurs here. however, the
ifree transaction reservation is not split up like the create
transaction, so it already takes into account allocation/freeing of
blocks and hence didn't need updating.

> >> @@ -267,6 +298,7 @@ xfs_calc_remove_reservation(
> >>   *    the superblock for the nlink flag: sector size
> >>   *    the directory btree: (max depth + v2) * dir block size
> >>   *    the directory inode's bmap btree: (max depth + v2) * block size
> >> + *    the finobt
> >>   */
> >>  STATIC uint
> >>  xfs_calc_create_resv_modify(
> >> @@ -275,7 +307,8 @@ xfs_calc_create_resv_modify(
> >>  	return xfs_calc_inode_res(mp, 2) +
> >>  		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> >>  		(uint)XFS_FSB_TO_B(mp, 1) +
> >> -		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
> >> +		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
> >> +		xfs_calc_finobt_res(mp, 1);
> >>  }
> > 
> > And this is where is starts to get complex. The modify operation can
> > now cause a finobt merge, when means blocks will be allocated/freed.
> > That means we now need to take into account:
> > 
> >  *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > 
> > and anything else freeing an extent requires.
> > 
> 
> So is the "allocation btrees: 2 trees ..." portion of
> xfs_calc_create_resv_alloc() associated with the allocation of the
> actual inode blocks or potential allocations due to inode btree
> management? I had thought the former, thus didn't include reservations
> for this in the finobt calculation.

This is the log space reservation - so it's a reservation for the
number of blocks in the btree that can be *modified* by the
transaction. So if we have to allocate a block, we can potentially
modify leaf-to-root in both free space trees if they both split or
merge during the allocation. Hence any transaction that can allocate
or free a block needs to reservae this space in the log for the
freespace tree modifications.

We have a separate space reservation - as defined in
xfs_trans_space.h - which is used to define the worst case metadata
block allocation in a given transaction. It is assumed that the log
space reservation covers all those blocks being modified, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-18 17:10     ` Brian Foster
  2014-02-18 20:34       ` Brian Foster
@ 2014-02-20  2:01       ` Dave Chinner
  2014-02-20 18:49         ` Brian Foster
  1 sibling, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2014-02-20  2:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

[patched in the extra case from your subsequent reply]

On Tue, Feb 18, 2014 at 12:10:16PM -0500, Brian Foster wrote:
> On 02/11/2014 01:46 AM, Dave Chinner wrote:
> > On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
> >> Create the xfs_calc_finobt_res() helper to calculate the finobt log
> >> reservation for inode allocation and free. Update
> >> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
> >> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
> >> reserve blocks for the potential finobt record insertion on inode
> >> free (i.e., if an inode chunk was previously fully allocated).
> >>
> >> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >> ---
> >>  fs/xfs/xfs_inode.c       |  4 +++-
> >>  fs/xfs/xfs_trans_resv.c  | 47 +++++++++++++++++++++++++++++++++++++++++++----
> >>  fs/xfs/xfs_trans_space.h |  7 ++++++-
> >>  3 files changed, 52 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> >> index 001aa89..57c77ed 100644
> >> --- a/fs/xfs/xfs_inode.c
> >> +++ b/fs/xfs/xfs_inode.c
> >> @@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
> >>  	int			error;
> >>  
> >>  	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
> >> -	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
> >> +	tp->t_flags |= XFS_TRANS_RESERVE;
> >> +	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> >> +				  XFS_IFREE_SPACE_RES(mp), 0);
> > 
> > Can you add a comment explaining why the XFS_TRANS_RESERVE flag is
> > used here, and why it's use won't lead to accelerated reserve pool
> > depletion?
> > 
> 
> So this aspect of things appears to be a bit more interesting than I
> originally anticipated. I "reserve enabled" this transaction to
> facilitate the ability to free up inodes under ENOSPC conditions.
> Without this, the problem of failing out of xfs_inactive_ifree() (and
> leaving an inode chained on the unlinked list) is easily reproducible
> with generic/083.

*nod*

> The basic argument for why this is reasonable is that releasing an inode
> releases used space (i.e., file blocks and potentially directory blocks
> and inode chunks over time). That said, I can manufacture situations
> where this is not the case. E.g., allocate a bunch of 0-sized files,
> consume remaining free space in some separate file, start removing
> inodes in a manner that removes a single inode per chunk or so. This
> creates a scenario where the inobt can be very large and the finobt very
> small (likely a single record). Removing the inodes in this manner
> reduces the likelihood of freeing up any space and thus rapidly grows
> the finobt towards the size of the inobt without any free space
> available. This might or might not qualify as sane use of the fs, but I
> don't think the failure scenario is acceptable as things currently stand.

Right, that can happen. But my question is this: how realistic is it
that we have someone who has ENOSPC because of enough zero length
files to trigger this? I've never seen an application or user try to
store any significant number of zero length files, so I suspect this
is a theoretical problem, not a practical one.

Indeed, the finobt only needs to grow a block for every 250-odd
records for a 4k block size filesystem. Hence, IMO, the default
reserve pool size of 8192 filesystem blocks is going be sufficient
for most users. i.e. the case you are talking about requires
(ignoring node block usage for simplicity) 250 * 8192 * 64 = 131
million zero length inodes to be present in the filesystem to have
this "1 inode per chunk" freeing pattern exhaust the default reserve
pool with finobt tree allocations....

> I think there are several ways this can go from here. A couple ideas
> that have crossed my mind:
> 
> - Find a way to variably reserve the number of blocks that would be
> required to grow the finobt to the finobt, based on current state. This

Not sure what "grow the finobt to the finobt" means. There's a typo
or a key word missing there ;)

> would require the total number of blocks (not just enough for a split),
> so this could get complex and somewhat overbearing (i.e., a lot of space
> could be quietly reserved, current tracking might not be sufficient and
> the allocation paths could get hairy).

Doesn't seem worth the complexity to me.

> - Work to push the ifree transaction allocation and reservation to the
> unlink codepath rather than the eviction codepath. Under normal
> circumstances, chain the tp to the xfs_inode such that the eviction code
> path can grab it and run. This prevents us going into the state where an
> inode is unlinked without having enough space to free up. On the flip
> side, ENOSPC on unlink isn't very forgiving behavior to the user.

That's the long term plan anyway - to move to background freeing of
the inodes once they are on the unlinked list and unreferenced by
the VFS. But, really, once the inode is on the unlinked list we can
probably ignore the ENOSPC problem because we know that it is
unlinked.

Indeed, the long term plan (along with background freeing) is to
allow inode allocation direct from the unlinked lists, and that
means we could leave the inodes on the unlinked lists and not
care about the ENOSPC problem at all ;)

> - Add some state or flags bits to the finobt and the associated
>   ability to kill/invalidate it at runtime. Print a warning with
>   regard to the situation that indicates performance might be
>   affected and a repair is required to re-enable.

We've already got that state through the unlinked lists. Again,
go back to the RFD series and look through the followup work....

> 
> I think the former approach is probably overkill for something that
> might be a pathological situation. The latter approach is more simple,
> but it feels like a bit of a hack. I've experimented with it a bit, but
> I'm not quite sure yet if it introduces any transaction issues by
> allocating the unlink and ifree transactions at the same time.
> 
> Perhaps another argument could be made that it's rather unlikely we run
> into an fs with as many 0-sized (or sub-inode chunk sized) files as
> required to deplete the reserve pool without freeing any space, and we
> should just touch up the failure handling. E.g.,
> 
> 1.) Continue to reserve enable the ifree transaction. Consider expanding
> the reserve pool on finobt-enabled fs' if appropriate. Note that this is
> not guaranteed to provide enough resources to populate the finobt to the
> level of the inobt without freeing up more space.
> 2.) Attempt a !XFS_TRANS_RESERVE tp reservation in xfs_inactive_ifree().
> If fails, xfs_warn()/notice() and enable XFS_TRANS_RESERVE.
> 3.) Attempt XFS_TRANS_RESERVE reservation. If fails, xfs_notice() and
> shutdown.

I don't think we ned to shut down. Indeed, there's no point in doing
an !XFS_TRANS_RESERVE in the first place because a warning will just
generate unnecessary noise in the logs.

Realistically, we can leave inodes on the unlinked list
indefinitely without causing any significant problems except for
there being used space that users can't account for from the
namespace. Log recovery cleans them up when it runs, or blows away
the unlinked list when it fails, and that results in leaked inodes.
If we get to that point, xfs-repair will clean it up just fine
unless there's still not enough space. At that point, it's not a
problem we can solve with tools - the user has to free up some space
in the filesystem....

> And this could probably be made more intelligent to bail out sooner if
> we repeat XFS_TRANS_RESERVE reservations without freeing up any space,
> etc. Before going too far in one direction... thoughts?

Right now, I just don't think it is a case we need to be
pariticularly concerned with. There are plenty of theoretical issues
that can occur (including data loss) when the reserve pool is
depleted because of prolonged ENOSPC issues, but the reality is
that the only place we see this code being exercised is by the tests
in xfstests that intentionally trigger reserve pool depletion....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-20  1:00       ` Dave Chinner
@ 2014-02-20 16:04         ` Brian Foster
  0 siblings, 0 replies; 34+ messages in thread
From: Brian Foster @ 2014-02-20 16:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 02/19/2014 08:00 PM, Dave Chinner wrote:
> On Tue, Feb 11, 2014 at 11:22:57AM -0500, Brian Foster wrote:
>> On 02/11/2014 01:46 AM, Dave Chinner wrote:
>>> On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
>>>> Create the xfs_calc_finobt_res() helper to calculate the finobt log
>>>> reservation for inode allocation and free. Update
>>>> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
>>>> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
>>>> reserve blocks for the potential finobt record insertion on inode
>>>> free (i.e., if an inode chunk was previously fully allocated).
> ....
>>>> + *
>>>> + * the free inode btree: max depth * block size
>>>> + * the free inode btree entry: block size
>>>> + *
>>>> + * TODO: is the modify res really necessary? covered by the merge/split res?
>>>> + * This seems to be the pattern of ifree, but not create_resv_alloc. Why?
>>>
>>> The modify case is for an allocation that only updates an inobt
>>> record (i.e. chunk already allocated, free inodes in it). Because
>>> we can remove a finobt record when "modifying" the last free inode
>>> record in a chunk, "modify" can cause a redcord removal and hence a
>>> tree merge. In which case it's no different of any of the other
>>> finobt reservations....
>>
>> When I made this note, I think I was curious why there was a need to add
>> to the reservation in the modify case. If we always made a reservation
>> for a btree merge, wouldn't that be enough to cover the basic modify
>> case?
> 
> The two cases are kept separate because if we ever separate inode
> chunk allocation from individual inode allocation from an allocated
> chunk, then we need the different reservations. if you go back to
> the original RFD series that documented the finobt, it also
> discusses this separation and the optimisations in inode management
> that such a separation brings.
> 
> Hence we should not assume that an inode chunk allocation
> reservation covers the reservation of an inode allocation from an
> allocated chunk.
> 

Ok.

>> In the code, it looks like we either delete the record or modify
>> it. Is there something in the delete case that I'm missing, or are we
>> just being conservative here? Nonetheless, I modelled this after the
>> ifree requirements, assuming the reservation there was correct.
> 
> But  the ifree requirement changed, didn't they? i.e. Because the
> first inode freed in a chunk can now trigger allocation.
> 
> Basically, we have these cases:
> 
> Allocation:
> 
> 	- allocate chunk, insert inobt rec, insert finobt rec
> 	- allocate inode, modify inobt rec, modify finobt rec
> 	- allocate inode, modify inobt rec, remove finobt rec
> 
> The first case is obvious in that it needs to allocate blocks
> for the inode chunk, but also possibly the inbot and finobt need
> allocations too (split).
> 
> The second case obvious doesn't allocate any blocks at all,
> and tha tis the existing "modify reservation"
> 
> The third case is the new "modify reservation" case, where the
> removal of a finobt record can cause a merge and hence freeing of a
> block. That's the case we need to make sure we have the corerct
> reservation for.
> 
> In the case of ifree, it is similar:
> 
> 	- free inode, modify inobt rec, insert finobt rec
> 	- free inode, modify inobt rec, modify finobt rec
> 	- free chunk, remove inobt rec, remove finobt rec
> 
> And so it is clear that the same new case occurs here. however, the
> ifree transaction reservation is not split up like the create
> transaction, so it already takes into account allocation/freeing of
> blocks and hence didn't need updating.
> 

Ok, I think I see what you're saying here based on this and our previous
irc conversation. I previously looked at the allocation btrees portion
of the icreate and ifree reservations as being associated with the
allocation and free of the inode chunk. Not associated in any way with
the inobt, because of the reasoning that would have to account for the
allocation btrees _again_.

Per our irc conversation, it is not technically required to account for
the allocation btrees multiple times in the situations where we do
multiple allocations, i.e.: allocate inode chunks (one alloc. btree
modification) and then allocate again if we need blocks for the inobt to
insert the record (a second alloc. btree modification).

Conversely, it is a requirement to account for the allocation btrees at
least once in any situation where we insert or remove a record from the
inobt/finobt. The icreate_resv_alloc() calculation already covers any
subsequent allocation btree modifications on behalf of the finobt by
virtue of the explicit allocation of the inode chunk. The
icreate_resv_modify() calculation does not cover allocation btree
modifications on behalf of the finobt because it made no explicit
allocations in the first place. It only modified the inobt record, which
will not allocate/free on behalf of the inobt.

>>>> @@ -267,6 +298,7 @@ xfs_calc_remove_reservation(
>>>>   *    the superblock for the nlink flag: sector size
>>>>   *    the directory btree: (max depth + v2) * dir block size
>>>>   *    the directory inode's bmap btree: (max depth + v2) * block size
>>>> + *    the finobt
>>>>   */
>>>>  STATIC uint
>>>>  xfs_calc_create_resv_modify(
>>>> @@ -275,7 +307,8 @@ xfs_calc_create_resv_modify(
>>>>  	return xfs_calc_inode_res(mp, 2) +
>>>>  		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
>>>>  		(uint)XFS_FSB_TO_B(mp, 1) +
>>>> -		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
>>>> +		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
>>>> +		xfs_calc_finobt_res(mp, 1);
>>>>  }
>>>
>>> And this is where is starts to get complex. The modify operation can
>>> now cause a finobt merge, when means blocks will be allocated/freed.
>>> That means we now need to take into account:
>>>
>>>  *    the allocation btrees: 2 trees * (max depth - 1) * block size
>>>
>>> and anything else freeing an extent requires.
>>>
>>
>> So is the "allocation btrees: 2 trees ..." portion of
>> xfs_calc_create_resv_alloc() associated with the allocation of the
>> actual inode blocks or potential allocations due to inode btree
>> management? I had thought the former, thus didn't include reservations
>> for this in the finobt calculation.
> 
> This is the log space reservation - so it's a reservation for the
> number of blocks in the btree that can be *modified* by the
> transaction. So if we have to allocate a block, we can potentially
> modify leaf-to-root in both free space trees if they both split or
> merge during the allocation. Hence any transaction that can allocate
> or free a block needs to reservae this space in the log for the
> freespace tree modifications.
> 
> We have a separate space reservation - as defined in
> xfs_trans_space.h - which is used to define the worst case metadata
> block allocation in a given transaction. It is assumed that the log
> space reservation covers all those blocks being modified, too.
> 

Right, I get that part with regard to the reservation of blocks vs.
reservation of log space. My question was what we discussed over irc.
Specifically, the semantics of what the allocation btrees portion of the
res. calc covers - potentially multiple freespace tree modifications vs.
one instance per freespace tree modification. Thanks for the
explanation. I understand now why we need to include the allocation
btrees due to the finobt under the create_resv_modify() scenario and not
the others. ;)

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-20  2:01       ` Dave Chinner
@ 2014-02-20 18:49         ` Brian Foster
  2014-02-20 20:50           ` Dave Chinner
  2014-02-20 21:14           ` Christoph Hellwig
  0 siblings, 2 replies; 34+ messages in thread
From: Brian Foster @ 2014-02-20 18:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 02/19/2014 09:01 PM, Dave Chinner wrote:
> [patched in the extra case from your subsequent reply]
> 
> On Tue, Feb 18, 2014 at 12:10:16PM -0500, Brian Foster wrote:
>> On 02/11/2014 01:46 AM, Dave Chinner wrote:
>>> On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
>>>> Create the xfs_calc_finobt_res() helper to calculate the finobt log
>>>> reservation for inode allocation and free. Update
>>>> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
>>>> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
>>>> reserve blocks for the potential finobt record insertion on inode
>>>> free (i.e., if an inode chunk was previously fully allocated).
>>>>
>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>> ---
>>>>  fs/xfs/xfs_inode.c       |  4 +++-
>>>>  fs/xfs/xfs_trans_resv.c  | 47 +++++++++++++++++++++++++++++++++++++++++++----
>>>>  fs/xfs/xfs_trans_space.h |  7 ++++++-
>>>>  3 files changed, 52 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>>>> index 001aa89..57c77ed 100644
>>>> --- a/fs/xfs/xfs_inode.c
>>>> +++ b/fs/xfs/xfs_inode.c
>>>> @@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
>>>>  	int			error;
>>>>  
>>>>  	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
>>>> -	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
>>>> +	tp->t_flags |= XFS_TRANS_RESERVE;
>>>> +	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
>>>> +				  XFS_IFREE_SPACE_RES(mp), 0);
>>>
>>> Can you add a comment explaining why the XFS_TRANS_RESERVE flag is
>>> used here, and why it's use won't lead to accelerated reserve pool
>>> depletion?
>>>
>>
>> So this aspect of things appears to be a bit more interesting than I
>> originally anticipated. I "reserve enabled" this transaction to
>> facilitate the ability to free up inodes under ENOSPC conditions.
>> Without this, the problem of failing out of xfs_inactive_ifree() (and
>> leaving an inode chained on the unlinked list) is easily reproducible
>> with generic/083.
> 
> *nod*
> 
>> The basic argument for why this is reasonable is that releasing an inode
>> releases used space (i.e., file blocks and potentially directory blocks
>> and inode chunks over time). That said, I can manufacture situations
>> where this is not the case. E.g., allocate a bunch of 0-sized files,
>> consume remaining free space in some separate file, start removing
>> inodes in a manner that removes a single inode per chunk or so. This
>> creates a scenario where the inobt can be very large and the finobt very
>> small (likely a single record). Removing the inodes in this manner
>> reduces the likelihood of freeing up any space and thus rapidly grows
>> the finobt towards the size of the inobt without any free space
>> available. This might or might not qualify as sane use of the fs, but I
>> don't think the failure scenario is acceptable as things currently stand.
> 
> Right, that can happen. But my question is this: how realistic is it
> that we have someone who has ENOSPC because of enough zero length
> files to trigger this? I've never seen an application or user try to
> store any significant number of zero length files, so I suspect this
> is a theoretical problem, not a practical one.
> 

Probably not very realistic. ;) The only thing I know that does rely on
some zero-length files is gluster distribution to represent "link files"
when one a file that hashes to one server ends up stored on another.
Even then, I don't see how we would ever have a situation where those
link files exist in such massive numbers and are removed in bulk. So
it's likely a pathological scenario.

> Indeed, the finobt only needs to grow a block for every 250-odd
> records for a 4k block size filesystem. Hence, IMO, the default
> reserve pool size of 8192 filesystem blocks is going be sufficient
> for most users. i.e. the case you are talking about requires
> (ignoring node block usage for simplicity) 250 * 8192 * 64 = 131
> million zero length inodes to be present in the filesystem to have
> this "1 inode per chunk" freeing pattern exhaust the default reserve
> pool with finobt tree allocations....
> 

Well, only ~2 million of the ~131 million you happen to free would have
to be 0-sized and per-chunk, right? ;) I guess sub aligned inode chunk
sized would be more accurate than 0-sized as well. But regardless, this
doesn't seem probable.

>> I think there are several ways this can go from here. A couple ideas
>> that have crossed my mind:
>>
>> - Find a way to variably reserve the number of blocks that would be
>> required to grow the finobt to the finobt, based on current state. This
> 
> Not sure what "grow the finobt to the finobt" means. There's a typo
> or a key word missing there ;)
> 

That should read "grow the finobt to the inobt," referring to the size
in terms of number of records in the tree.

>> would require the total number of blocks (not just enough for a split),
>> so this could get complex and somewhat overbearing (i.e., a lot of space
>> could be quietly reserved, current tracking might not be sufficient and
>> the allocation paths could get hairy).
> 
> Doesn't seem worth the complexity to me.
> 

Agreed.

>> - Work to push the ifree transaction allocation and reservation to the
>> unlink codepath rather than the eviction codepath. Under normal
>> circumstances, chain the tp to the xfs_inode such that the eviction code
>> path can grab it and run. This prevents us going into the state where an
>> inode is unlinked without having enough space to free up. On the flip
>> side, ENOSPC on unlink isn't very forgiving behavior to the user.
> 
> That's the long term plan anyway - to move to background freeing of
> the inodes once they are on the unlinked list and unreferenced by
> the VFS. But, really, once the inode is on the unlinked list we can
> probably ignore the ENOSPC problem because we know that it is
> unlinked.
> 
> Indeed, the long term plan (along with background freeing) is to
> allow inode allocation direct from the unlinked lists, and that
> means we could leave the inodes on the unlinked lists and not
> care about the ENOSPC problem at all ;)
> 

Ah, ok. I had a similar thought as well about having some kind of
background scanner for unlinked list inodes that kicks in when we know
we have evicted inodes on the list...

>> - Add some state or flags bits to the finobt and the associated
>>   ability to kill/invalidate it at runtime. Print a warning with
>>   regard to the situation that indicates performance might be
>>   affected and a repair is required to re-enable.
> 
> We've already got that state through the unlinked lists. Again,
> go back to the RFD series and look through the followup work....
> 

... and taking a quick look back through that, I see 14/17 refers to
precisely such a mechanism. I thought that sounded familiar. ;)

>>
>> I think the former approach is probably overkill for something that
>> might be a pathological situation. The latter approach is more simple,
>> but it feels like a bit of a hack. I've experimented with it a bit, but
>> I'm not quite sure yet if it introduces any transaction issues by
>> allocating the unlink and ifree transactions at the same time.
>>
>> Perhaps another argument could be made that it's rather unlikely we run
>> into an fs with as many 0-sized (or sub-inode chunk sized) files as
>> required to deplete the reserve pool without freeing any space, and we
>> should just touch up the failure handling. E.g.,
>>
>> 1.) Continue to reserve enable the ifree transaction. Consider expanding
>> the reserve pool on finobt-enabled fs' if appropriate. Note that this is
>> not guaranteed to provide enough resources to populate the finobt to the
>> level of the inobt without freeing up more space.
>> 2.) Attempt a !XFS_TRANS_RESERVE tp reservation in xfs_inactive_ifree().
>> If fails, xfs_warn()/notice() and enable XFS_TRANS_RESERVE.
>> 3.) Attempt XFS_TRANS_RESERVE reservation. If fails, xfs_notice() and
>> shutdown.
> 
> I don't think we ned to shut down. Indeed, there's no point in doing
> an !XFS_TRANS_RESERVE in the first place because a warning will just
> generate unnecessary noise in the logs.
> 
> Realistically, we can leave inodes on the unlinked list
> indefinitely without causing any significant problems except for
> there being used space that users can't account for from the
> namespace. Log recovery cleans them up when it runs, or blows away
> the unlinked list when it fails, and that results in leaked inodes.
> If we get to that point, xfs-repair will clean it up just fine
> unless there's still not enough space. At that point, it's not a
> problem we can solve with tools - the user has to free up some space
> in the filesystem....
> 

Ok, the current failure behavior (as unlikely as it seems to hit) seems
less hasty given the roadmap for improved unlinked list management.

I'm not sure how log recovery plays into things unless there is a crash.
In my experiments, the inodes are simply never freed and linger on the
unlinked list until repair. Repair moves everything to lost+found as
opposed to freeing (I presume since the inodes are still "allocated"
after all), but repairs the fs nonetheless.

>> And this could probably be made more intelligent to bail out sooner if
>> we repeat XFS_TRANS_RESERVE reservations without freeing up any space,
>> etc. Before going too far in one direction... thoughts?
> 
> Right now, I just don't think it is a case we need to be
> pariticularly concerned with. There are plenty of theoretical issues
> that can occur (including data loss) when the reserve pool is
> depleted because of prolonged ENOSPC issues, but the reality is
> that the only place we see this code being exercised is by the tests
> in xfstests that intentionally trigger reserve pool depletion....
> 

Fair enough, I'll leave things generally as is apart from the review
feedback. I might add a warning or ratelimited printk in the inactive
ENOSPC failure path, just so it's not completely silent if it does
trigger in the wild. Thanks for the feedback.

Brian

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-20 18:49         ` Brian Foster
@ 2014-02-20 20:50           ` Dave Chinner
  2014-02-20 21:14           ` Christoph Hellwig
  1 sibling, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-20 20:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Feb 20, 2014 at 01:49:10PM -0500, Brian Foster wrote:
> On 02/19/2014 09:01 PM, Dave Chinner wrote:
> > [patched in the extra case from your subsequent reply]
> > 
> > On Tue, Feb 18, 2014 at 12:10:16PM -0500, Brian Foster wrote:
> >> On 02/11/2014 01:46 AM, Dave Chinner wrote:
> >>> On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
> >>>> Create the xfs_calc_finobt_res() helper to calculate the finobt log
> >>>> reservation for inode allocation and free. Update
> >>>> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
> >>>> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
> >>>> reserve blocks for the potential finobt record insertion on inode
> >>>> free (i.e., if an inode chunk was previously fully allocated).
> >>>>
> >>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >>>> ---
> >>>>  fs/xfs/xfs_inode.c       |  4 +++-
> >>>>  fs/xfs/xfs_trans_resv.c  | 47 +++++++++++++++++++++++++++++++++++++++++++----
> >>>>  fs/xfs/xfs_trans_space.h |  7 ++++++-
> >>>>  3 files changed, 52 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> >>>> index 001aa89..57c77ed 100644
> >>>> --- a/fs/xfs/xfs_inode.c
> >>>> +++ b/fs/xfs/xfs_inode.c
> >>>> @@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
> >>>>  	int			error;
> >>>>  
> >>>>  	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
> >>>> -	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
> >>>> +	tp->t_flags |= XFS_TRANS_RESERVE;
> >>>> +	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> >>>> +				  XFS_IFREE_SPACE_RES(mp), 0);
> >>>
> >>> Can you add a comment explaining why the XFS_TRANS_RESERVE flag is
> >>> used here, and why it's use won't lead to accelerated reserve pool
> >>> depletion?
.....
> >> Perhaps another argument could be made that it's rather unlikely we run
> >> into an fs with as many 0-sized (or sub-inode chunk sized) files as
> >> required to deplete the reserve pool without freeing any space, and we
> >> should just touch up the failure handling. E.g.,
> >>
> >> 1.) Continue to reserve enable the ifree transaction. Consider expanding
> >> the reserve pool on finobt-enabled fs' if appropriate. Note that this is
> >> not guaranteed to provide enough resources to populate the finobt to the
> >> level of the inobt without freeing up more space.
> >> 2.) Attempt a !XFS_TRANS_RESERVE tp reservation in xfs_inactive_ifree().
> >> If fails, xfs_warn()/notice() and enable XFS_TRANS_RESERVE.
> >> 3.) Attempt XFS_TRANS_RESERVE reservation. If fails, xfs_notice() and
> >> shutdown.
> > 
> > I don't think we ned to shut down. Indeed, there's no point in doing
> > an !XFS_TRANS_RESERVE in the first place because a warning will just
> > generate unnecessary noise in the logs.
> > 
> > Realistically, we can leave inodes on the unlinked list
> > indefinitely without causing any significant problems except for
> > there being used space that users can't account for from the
> > namespace. Log recovery cleans them up when it runs, or blows away
> > the unlinked list when it fails, and that results in leaked inodes.
> > If we get to that point, xfs-repair will clean it up just fine
> > unless there's still not enough space. At that point, it's not a
> > problem we can solve with tools - the user has to free up some space
> > in the filesystem....
> 
> Ok, the current failure behavior (as unlikely as it seems to hit) seems
> less hasty given the roadmap for improved unlinked list management.
> 
> I'm not sure how log recovery plays into things unless there is a crash.
> In my experiments, the inodes are simply never freed and linger on the
> unlinked list until repair. Repair moves everything to lost+found as
> opposed to freeing (I presume since the inodes are still "allocated"
> after all), but repairs the fs nonetheless.

With the recovery-without-a-crash case, the recovery of unlinked
inodes happens in the second phase of recovery (i.e via
xlog_recover_process_iunlinks() in xlog_recover_finish()). This is
after we've done the actual log recovery (so that unlinked list
changes have been recovered), and so effectively in not so much a
part of "log recovery" but a part of "unclean shutdown cleanup".
That currently is associates with log recovery, but there's not
reason why we couldn't just do it unconditionally on a mount. If we
are going to leave inodes on the unlinked lists and allocate
directly from there, then it kind of makes sense to have the
unlinked lists cleaned up at mount time, even if it is only by
kicking a background cleaner thread...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-20 18:49         ` Brian Foster
  2014-02-20 20:50           ` Dave Chinner
@ 2014-02-20 21:14           ` Christoph Hellwig
  2014-02-20 23:13             ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2014-02-20 21:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Feb 20, 2014 at 01:49:10PM -0500, Brian Foster wrote:
> > Right, that can happen. But my question is this: how realistic is it
> > that we have someone who has ENOSPC because of enough zero length
> > files to trigger this? I've never seen an application or user try to
> > store any significant number of zero length files, so I suspect this
> > is a theoretical problem, not a practical one.
> > 
> 
> Probably not very realistic. ;) The only thing I know that does rely on
> some zero-length files is gluster distribution to represent "link files"
> when one a file that hashes to one server ends up stored on another.
> Even then, I don't see how we would ever have a situation where those
> link files exist in such massive numbers and are removed in bulk. So
> it's likely a pathological scenario.

Zero data blocks are the only case for device nodes or fifos, very
common for symlinks that can be stored inline, and not unusual for
directories.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
  2014-02-20 21:14           ` Christoph Hellwig
@ 2014-02-20 23:13             ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-02-20 23:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, xfs

On Thu, Feb 20, 2014 at 01:14:57PM -0800, Christoph Hellwig wrote:
> On Thu, Feb 20, 2014 at 01:49:10PM -0500, Brian Foster wrote:
> > > Right, that can happen. But my question is this: how realistic is it
> > > that we have someone who has ENOSPC because of enough zero length
> > > files to trigger this? I've never seen an application or user try to
> > > store any significant number of zero length files, so I suspect this
> > > is a theoretical problem, not a practical one.
> > > 
> > 
> > Probably not very realistic. ;) The only thing I know that does rely on
> > some zero-length files is gluster distribution to represent "link files"
> > when one a file that hashes to one server ends up stored on another.
> > Even then, I don't see how we would ever have a situation where those
> > link files exist in such massive numbers and are removed in bulk. So
> > it's likely a pathological scenario.
> 
> Zero data blocks are the only case for device nodes or fifos, very
> common for symlinks that can be stored inline, and not unusual for
> directories.

Sure, symlinks and directories are a common case, but they don't
generally make up millions of inodes in a filesystem. In the case of
directories, you've got to remove other filesystem before that
directory inode can be freed, and as such most cases are going to
end up freeing blocks from files before we get to freeing the
shortform directory inode.

Symlink farms could be a problem, but again if you have a large
symlink farm you're going to have blocks in directories and other
files that the symlinks point to that get removed. As it is, these
are the sorts of workloads that are going to gain massively from the
finobt modifications (e.g. backup programs that use link farms and
sparsify the inode btree when a backup is removed), so we'll find
out pretty quickly if the default reserve pool is not large enough
for these workloads.

As it is, I see this as a similar issue to the way we changed
speculative preallocation - we knew that there were significant
benefits to changing the behaviour, but we also knew that there
would be corner cases where problems would arise. However, we had no
real idea of exactly which workloads ior users would be affected by
ENOSPC issues. Users reported problems, we had mitigation
strategiesthey could apply (e.g. allocsize) while we fixed the
problems they reported. And now we get very few complaints about
that functionality - it just works for almost everyone out of the
box.

I mention this because I think that we are in exactly the same
situation of trying to work out just how widespread this corner
case issue will have an affect. Similar to specualtive prealloc,
we have:

	a) mitigation strategies already in place (increase
	reserve pool size); and

	b) a short- to medium-term plan of development that solves
	the issue completely.

As such, I don't think we should spend too much more time worrying
about this issue and just move onwards. Like the specualtive
preallocation, there may be some short term pain for corner case
workloads, but we end up in a much better place in the medium to
long term because we don't stall development trying to solve all the
problems at once and/or solving problems we don't need to solve....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2014-02-20 23:14 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
2014-02-04 17:49 ` [PATCH v3 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
2014-02-04 17:49 ` [PATCH v3 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
2014-02-11  6:07   ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
2014-02-11  6:22   ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt Brian Foster
2014-02-11  6:46   ` Dave Chinner
2014-02-11 16:22     ` Brian Foster
2014-02-20  1:00       ` Dave Chinner
2014-02-20 16:04         ` Brian Foster
2014-02-18 17:10     ` Brian Foster
2014-02-18 20:34       ` Brian Foster
2014-02-20  2:01       ` Dave Chinner
2014-02-20 18:49         ` Brian Foster
2014-02-20 20:50           ` Dave Chinner
2014-02-20 21:14           ` Christoph Hellwig
2014-02-20 23:13             ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 05/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
2014-02-11  6:48   ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 06/11] xfs: use and update the finobt on inode allocation Brian Foster
2014-02-11  7:17   ` Dave Chinner
2014-02-11 16:32     ` Brian Foster
2014-02-14 20:01     ` Brian Foster
2014-02-20  0:38       ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper Brian Foster
2014-02-11  7:19   ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 08/11] xfs: update the finobt on inode free Brian Foster
2014-02-11  7:31   ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 09/11] xfs: add finobt support to growfs Brian Foster
2014-02-04 17:49 ` [PATCH v3 10/11] xfs: report finobt status in fs geometry Brian Foster
2014-02-11  7:34   ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 11/11] xfs: enable the finobt feature on v5 superblocks Brian Foster
2014-02-11  7:34   ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.