* [RFC PATCH 00/11] xfs: introduce the free inode btree
@ 2013-09-03 18:24 Brian Foster
2013-09-03 18:24 ` [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
` (11 more replies)
0 siblings, 12 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:24 UTC (permalink / raw)
To: xfs
Hi all,
This is an RFC for the kernel work to support a free inode btree. The free inode
btree (finobt) is equivalent to the existing inode allocation btree with the
exception that the finobt only tracks inode chunks with at least one free inode.
The purpose is to improve lookups for free inode clusters for inode allocation.
Newly allocated inode chunks by definition contain free inodes and are thus
inserted into the finobt immediately. The record for a previously full inode
chunk is inserted to the finobt when the first inode is freed. A record is
removed from the finobt when the last free inode has been allocated or the
entire chunk is completely deallocated.
Patches 1-3 refactor some ialloc btree code to introduce the new finobt type and
feature bit. Patches 4-7 fix up the transaction handling for inode allocation
and deallocation to account for the new tree. Patches 8-10 add the finobt
management code to insert, remove and modify records as appropriate. Patch 11
fixes growfs to support the finobt.
Thoughts, reviews, flames appreciated.
NOTES:
- This is RFC because it is lightly tested and I don't have much userspace code
at the moment (though most of this code should apply).
- I'm not totally sure about the scope of change in patch 6 (use correct
transaction reservations in xfs_inactive()). I started off minimally altering
the reserved blocks value, but the existing transaction management seemed
strange enough that I ended up with the current patch. I figured I'd start here
and pare it back as necessary if the changes are bogus or lead to other
problems.
- I've yet to do any performance testing to measure the benefit of patch 9.
Brian
Brian Foster (11):
xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
xfs: reserve v5 superblock read-only compat. feature bit for finobt
xfs: support the XFS_BTNUM_FINOBT free inode btree type
xfs: update inode allocation transaction reservations for finobt
xfs: update ifree transaction reservations for finobt
xfs: use correct transaction reservations in xfs_inactive()
xfs: retry trans reservation on ENOSPC in xfs_inactive()
xfs: insert newly allocated inode chunks into the finobt
xfs: use and update the finobt on inode allocation
xfs: update the finobt on inode free
xfs: add finobt support to growfs
fs/xfs/xfs_ag.h | 7 +-
fs/xfs/xfs_btree.c | 6 +-
fs/xfs/xfs_btree.h | 3 +
fs/xfs/xfs_fsops.c | 32 +++++
fs/xfs/xfs_ialloc.c | 343 ++++++++++++++++++++++++++++++++++++++++++----
fs/xfs/xfs_ialloc_btree.c | 37 +++--
fs/xfs/xfs_ialloc_btree.h | 17 ++-
fs/xfs/xfs_inode.c | 82 ++++++-----
fs/xfs/xfs_itable.c | 6 +-
fs/xfs/xfs_log_recover.c | 2 +
fs/xfs/xfs_sb.h | 10 +-
fs/xfs/xfs_stats.h | 18 ++-
fs/xfs/xfs_trans_resv.c | 14 +-
fs/xfs/xfs_trans_space.h | 4 +-
fs/xfs/xfs_types.h | 2 +-
15 files changed, 500 insertions(+), 83 deletions(-)
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
@ 2013-09-03 18:24 ` Brian Foster
2013-09-05 0:36 ` Dave Chinner
2013-09-03 18:24 ` [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
` (10 subsequent siblings)
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:24 UTC (permalink / raw)
To: xfs
The introduction of the free inode btree (finobt) requires that
xfs_ialloc_btree.c handle multiple trees. Refactor xfs_ialloc_btree.c
so the caller specifies the btree type on cursor initialization to
prepare for addition of the finobt.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_ialloc.c | 8 ++++----
fs/xfs/xfs_ialloc_btree.c | 8 +++++---
fs/xfs/xfs_ialloc_btree.h | 3 ++-
fs/xfs/xfs_itable.c | 6 ++++--
4 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index ccf2fb1..524aa88 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -455,7 +455,7 @@ xfs_ialloc_ag_alloc(
/*
* Insert records describing the new inode chunk into the btree.
*/
- cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno);
+ cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno, XFS_BTNUM_INO);
for (thisino = newino;
thisino < newino + newlen;
thisino += XFS_INODES_PER_CHUNK) {
@@ -701,7 +701,7 @@ xfs_dialloc_ag(
ASSERT(pag->pagi_freecount > 0);
restart_pagno:
- cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+ cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
/*
* If pagino is 0 (this is the root inode allocation) use newino.
* This must work because we've just allocated some.
@@ -1163,7 +1163,7 @@ xfs_difree(
/*
* Initialize the cursor.
*/
- cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+ cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
error = xfs_check_agi_freecount(cur, agi);
if (error)
@@ -1294,7 +1294,7 @@ xfs_imap_lookup(
* we have a record, we need to ensure it contains the inode number
* we are looking up.
*/
- cur = xfs_inobt_init_cursor(mp, tp, agbp, agno);
+ cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &i);
if (!error) {
if (i)
diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
index 5448eb6..0cdb88b 100644
--- a/fs/xfs/xfs_ialloc_btree.c
+++ b/fs/xfs/xfs_ialloc_btree.c
@@ -50,7 +50,8 @@ xfs_inobt_dup_cursor(
struct xfs_btree_cur *cur)
{
return xfs_inobt_init_cursor(cur->bc_mp, cur->bc_tp,
- cur->bc_private.a.agbp, cur->bc_private.a.agno);
+ cur->bc_private.a.agbp, cur->bc_private.a.agno,
+ cur->bc_btnum);
}
STATIC void
@@ -324,7 +325,8 @@ xfs_inobt_init_cursor(
struct xfs_mount *mp, /* file system mount point */
struct xfs_trans *tp, /* transaction pointer */
struct xfs_buf *agbp, /* buffer for agi structure */
- xfs_agnumber_t agno) /* allocation group number */
+ xfs_agnumber_t agno, /* allocation group number */
+ xfs_btnum_t btnum) /* ialloc or free ino btree */
{
struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
struct xfs_btree_cur *cur;
@@ -334,7 +336,7 @@ xfs_inobt_init_cursor(
cur->bc_tp = tp;
cur->bc_mp = mp;
cur->bc_nlevels = be32_to_cpu(agi->agi_level);
- cur->bc_btnum = XFS_BTNUM_INO;
+ cur->bc_btnum = btnum;
cur->bc_blocklog = mp->m_sb.sb_blocklog;
cur->bc_ops = &xfs_inobt_ops;
diff --git a/fs/xfs/xfs_ialloc_btree.h b/fs/xfs/xfs_ialloc_btree.h
index 3ac36b76..ce7a62b 100644
--- a/fs/xfs/xfs_ialloc_btree.h
+++ b/fs/xfs/xfs_ialloc_btree.h
@@ -107,7 +107,8 @@ typedef __be32 xfs_inobt_ptr_t;
((index) - 1) * sizeof(xfs_inobt_ptr_t)))
extern struct xfs_btree_cur *xfs_inobt_init_cursor(struct xfs_mount *,
- struct xfs_trans *, struct xfs_buf *, xfs_agnumber_t);
+ struct xfs_trans *, struct xfs_buf *, xfs_agnumber_t,
+ xfs_btnum_t);
extern int xfs_inobt_maxrecs(struct xfs_mount *, int, int);
extern const struct xfs_buf_ops xfs_inobt_buf_ops;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index b93e14b..e720f8b 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -275,7 +275,8 @@ xfs_bulkstat(
/*
* Allocate and initialize a btree cursor for ialloc btree.
*/
- cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno);
+ cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
+ XFS_BTNUM_INO);
irbp = irbuf;
irbufend = irbuf + nirbuf;
end_of_ag = 0;
@@ -625,7 +626,8 @@ xfs_inumbers(
agino = 0;
continue;
}
- cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno);
+ cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
+ XFS_BTNUM_INO);
error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_GE,
&tmp);
if (error) {
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
2013-09-03 18:24 ` [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
@ 2013-09-03 18:24 ` Brian Foster
2013-09-05 0:39 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
` (9 subsequent siblings)
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:24 UTC (permalink / raw)
To: xfs
Reserve a v5 read-only compatibility feature bit for the finobt and
create the xfs_sb_version_hasfinobt() helper to determine whether
an fs has the feature enabled.
The finobt does not change existing on-disk structures, but must
remain consistent with the ialloc btree. Modifications from older
kernels would violate that constrant. Therefore, we restrict older
kernels to read-only mounts of finobt-enabled filesystems.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_sb.h | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_sb.h b/fs/xfs/xfs_sb.h
index 6835b44..c48d95d 100644
--- a/fs/xfs/xfs_sb.h
+++ b/fs/xfs/xfs_sb.h
@@ -585,7 +585,9 @@ xfs_sb_has_compat_feature(
return (sbp->sb_features_compat & feature) != 0;
}
-#define XFS_SB_FEAT_RO_COMPAT_ALL 0
+#define XFS_SB_FEAT_RO_COMPAT_FINOBT (1 << 0) /* free inode btree */
+#define XFS_SB_FEAT_RO_COMPAT_ALL \
+ (XFS_SB_FEAT_RO_COMPAT_FINOBT)
#define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
static inline bool
xfs_sb_has_ro_compat_feature(
@@ -639,6 +641,12 @@ static inline int xfs_sb_version_hasftype(struct xfs_sb *sbp)
(sbp->sb_features2 & XFS_SB_VERSION2_FTYPE));
}
+static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
+{
+ return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
+ (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
+}
+
/*
* end of superblock version macros
*/
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
2013-09-03 18:24 ` [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
2013-09-03 18:24 ` [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
2013-09-05 0:54 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt Brian Foster
` (8 subsequent siblings)
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
To: xfs
Define the AGI fields for the finobt root/level and add magic
numbers. Update the btree code to add support for the new
XFS_BTNUM_FINOBT inode btree.
The finobt root block is reserved immediately following the inobt
root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
starting AG data block based on whether finobt support is enabled.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_ag.h | 7 ++++++-
fs/xfs/xfs_btree.c | 6 ++++--
fs/xfs/xfs_btree.h | 3 +++
fs/xfs/xfs_ialloc.c | 2 ++
fs/xfs/xfs_ialloc_btree.c | 29 +++++++++++++++++++++++------
fs/xfs/xfs_ialloc_btree.h | 14 +++++++++++++-
fs/xfs/xfs_log_recover.c | 2 ++
fs/xfs/xfs_stats.h | 18 +++++++++++++++++-
fs/xfs/xfs_types.h | 2 +-
9 files changed, 71 insertions(+), 12 deletions(-)
diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 1cb740a..b85585d 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -166,6 +166,9 @@ typedef struct xfs_agi {
__be32 agi_pad32;
__be64 agi_lsn; /* last write sequence */
+ __be32 agi_free_root; /* root of the free inode btree */
+ __be32 agi_free_level;/* levels in free inode btree */
+
/* structure must be padded to 64 bit alignment */
} xfs_agi_t;
@@ -180,7 +183,9 @@ typedef struct xfs_agi {
#define XFS_AGI_NEWINO 0x00000100
#define XFS_AGI_DIRINO 0x00000200
#define XFS_AGI_UNLINKED 0x00000400
-#define XFS_AGI_NUM_BITS 11
+#define XFS_AGI_FREE_ROOT 0x00000800
+#define XFS_AGI_FREE_LEVEL 0x00001000
+#define XFS_AGI_NUM_BITS 13
#define XFS_AGI_ALL_BITS ((1 << XFS_AGI_NUM_BITS) - 1)
/* disk block (xfs_daddr_t) in the AG */
diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index 7a2b4da..0090e3f 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -45,9 +45,10 @@ kmem_zone_t *xfs_btree_cur_zone;
* Btree magic numbers.
*/
static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
- { XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, XFS_BMAP_MAGIC, XFS_IBT_MAGIC },
+ { XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
+ XFS_FIBT_MAGIC },
{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC,
- XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC }
+ XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC }
};
#define xfs_btree_magic(cur) \
xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
@@ -1102,6 +1103,7 @@ xfs_btree_set_refs(
xfs_buf_set_ref(bp, XFS_ALLOC_BTREE_REF);
break;
case XFS_BTNUM_INO:
+ case XFS_BTNUM_FINO:
xfs_buf_set_ref(bp, XFS_INO_BTREE_REF);
break;
case XFS_BTNUM_BMAP:
diff --git a/fs/xfs/xfs_btree.h b/fs/xfs/xfs_btree.h
index c8473c7..939002d 100644
--- a/fs/xfs/xfs_btree.h
+++ b/fs/xfs/xfs_btree.h
@@ -37,6 +37,7 @@ extern kmem_zone_t *xfs_btree_cur_zone;
#define XFS_BTNUM_CNT ((xfs_btnum_t)XFS_BTNUM_CNTi)
#define XFS_BTNUM_BMAP ((xfs_btnum_t)XFS_BTNUM_BMAPi)
#define XFS_BTNUM_INO ((xfs_btnum_t)XFS_BTNUM_INOi)
+#define XFS_BTNUM_FINO ((xfs_btnum_t)XFS_BTNUM_FINOi)
/*
* Generic btree header.
@@ -144,6 +145,7 @@ do { \
case XFS_BTNUM_CNT: __XFS_BTREE_STATS_INC(abtc, stat); break; \
case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_INC(bmbt, stat); break; \
case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(ibt, stat); break; \
+ case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(fibt, stat); break; \
case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break; \
} \
} while (0)
@@ -157,6 +159,7 @@ do { \
case XFS_BTNUM_CNT: __XFS_BTREE_STATS_ADD(abtc, stat, val); break; \
case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_ADD(bmbt, stat, val); break; \
case XFS_BTNUM_INO: __XFS_BTREE_STATS_ADD(ibt, stat, val); break; \
+ case XFS_BTNUM_FINO: __XFS_BTREE_STATS_ADD(fibt, stat, val); break; \
case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break; \
} \
} while (0)
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 524aa88..5ced506 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -1505,6 +1505,8 @@ xfs_ialloc_log_agi(
offsetof(xfs_agi_t, agi_newino),
offsetof(xfs_agi_t, agi_dirino),
offsetof(xfs_agi_t, agi_unlinked),
+ offsetof(xfs_agi_t, agi_free_root),
+ offsetof(xfs_agi_t, agi_free_level),
sizeof(xfs_agi_t)
};
#ifdef DEBUG
diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
index 0cdb88b..7923292 100644
--- a/fs/xfs/xfs_ialloc_btree.c
+++ b/fs/xfs/xfs_ialloc_btree.c
@@ -62,10 +62,18 @@ xfs_inobt_set_root(
{
struct xfs_buf *agbp = cur->bc_private.a.agbp;
struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
-
- agi->agi_root = nptr->s;
- be32_add_cpu(&agi->agi_level, inc);
- xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
+ int fields;
+
+ if (cur->bc_btnum == XFS_BTNUM_INO) {
+ agi->agi_root = nptr->s;
+ be32_add_cpu(&agi->agi_level, inc);
+ fields = XFS_AGI_ROOT | XFS_AGI_LEVEL;
+ } else {
+ agi->agi_free_root = nptr->s;
+ be32_add_cpu(&agi->agi_free_level, inc);
+ fields = XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
+ }
+ xfs_ialloc_log_agi(cur->bc_tp, agbp, fields);
}
STATIC int
@@ -172,7 +180,10 @@ xfs_inobt_init_ptr_from_cur(
ASSERT(cur->bc_private.a.agno == be32_to_cpu(agi->agi_seqno));
- ptr->s = agi->agi_root;
+ if (cur->bc_btnum == XFS_BTNUM_INO)
+ ptr->s = agi->agi_root;
+ else
+ ptr->s = agi->agi_free_root;
}
STATIC __int64_t
@@ -205,6 +216,7 @@ xfs_inobt_verify(
*/
switch (block->bb_magic) {
case cpu_to_be32(XFS_IBT_CRC_MAGIC):
+ case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
if (!xfs_sb_version_hascrc(&mp->m_sb))
return false;
if (!uuid_equal(&block->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid))
@@ -216,6 +228,7 @@ xfs_inobt_verify(
return false;
/* fall through */
case cpu_to_be32(XFS_IBT_MAGIC):
+ case cpu_to_be32(XFS_FIBT_MAGIC):
break;
default:
return 0;
@@ -335,8 +348,12 @@ xfs_inobt_init_cursor(
cur->bc_tp = tp;
cur->bc_mp = mp;
- cur->bc_nlevels = be32_to_cpu(agi->agi_level);
cur->bc_btnum = btnum;
+ if (btnum == XFS_BTNUM_INO)
+ cur->bc_nlevels = be32_to_cpu(agi->agi_level);
+ else
+ cur->bc_nlevels = be32_to_cpu(agi->agi_free_level);
+
cur->bc_blocklog = mp->m_sb.sb_blocklog;
cur->bc_ops = &xfs_inobt_ops;
diff --git a/fs/xfs/xfs_ialloc_btree.h b/fs/xfs/xfs_ialloc_btree.h
index ce7a62b..33d6dd4 100644
--- a/fs/xfs/xfs_ialloc_btree.h
+++ b/fs/xfs/xfs_ialloc_btree.h
@@ -31,6 +31,8 @@ struct xfs_mount;
*/
#define XFS_IBT_MAGIC 0x49414254 /* 'IABT' */
#define XFS_IBT_CRC_MAGIC 0x49414233 /* 'IAB3' */
+#define XFS_FIBT_MAGIC 0x46494254 /* 'FIBT' */
+#define XFS_FIBT_CRC_MAGIC 0x46494233 /* 'FIB3' */
typedef __uint64_t xfs_inofree_t;
#define XFS_INODES_PER_CHUNK (NBBY * sizeof(xfs_inofree_t))
@@ -73,7 +75,17 @@ typedef __be32 xfs_inobt_ptr_t;
* block numbers in the AG.
*/
#define XFS_IBT_BLOCK(mp) ((xfs_agblock_t)(XFS_CNT_BLOCK(mp) + 1))
-#define XFS_PREALLOC_BLOCKS(mp) ((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
+#define XFS_FIBT_BLOCK(mp) ((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
+
+/*
+ * The first data block of an AG depends on whether the filesystem was formatted
+ * with the finobt feature. If so, account for the finobt reserved root btree
+ * block.
+ */
+#define XFS_PREALLOC_BLOCKS(mp) \
+ (xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
+ XFS_FIBT_BLOCK(mp) + 1 : \
+ XFS_IBT_BLOCK(mp) + 1)
/*
* Btree block header size depends on a superblock flag.
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 7c0c1fd..b8f16d2 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2088,7 +2088,9 @@ xlog_recover_validate_buf_type(
bp->b_ops = &xfs_allocbt_buf_ops;
break;
case XFS_IBT_CRC_MAGIC:
+ case XFS_FIBT_CRC_MAGIC:
case XFS_IBT_MAGIC:
+ case XFS_FIBT_MAGIC:
bp->b_ops = &xfs_inobt_buf_ops;
break;
case XFS_BMAP_CRC_MAGIC:
diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index c03ad38..c8f238b 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -183,7 +183,23 @@ struct xfsstats {
__uint32_t xs_ibt_2_alloc;
__uint32_t xs_ibt_2_free;
__uint32_t xs_ibt_2_moves;
-#define XFSSTAT_END_XQMSTAT (XFSSTAT_END_IBT_V2+6)
+#define XFSSTAT_END_FIBT_V2 (XFSSTAT_END_IBT_V2+15)
+ __uint32_t xs_fibt_2_lookup;
+ __uint32_t xs_fibt_2_compare;
+ __uint32_t xs_fibt_2_insrec;
+ __uint32_t xs_fibt_2_delrec;
+ __uint32_t xs_fibt_2_newroot;
+ __uint32_t xs_fibt_2_killroot;
+ __uint32_t xs_fibt_2_increment;
+ __uint32_t xs_fibt_2_decrement;
+ __uint32_t xs_fibt_2_lshift;
+ __uint32_t xs_fibt_2_rshift;
+ __uint32_t xs_fibt_2_split;
+ __uint32_t xs_fibt_2_join;
+ __uint32_t xs_fibt_2_alloc;
+ __uint32_t xs_fibt_2_free;
+ __uint32_t xs_fibt_2_moves;
+#define XFSSTAT_END_XQMSTAT (XFSSTAT_END_FIBT_V2+6)
__uint32_t xs_qm_dqreclaims;
__uint32_t xs_qm_dqreclaim_misses;
__uint32_t xs_qm_dquot_dups;
diff --git a/fs/xfs/xfs_types.h b/fs/xfs/xfs_types.h
index 82bbc34..65c6e66 100644
--- a/fs/xfs/xfs_types.h
+++ b/fs/xfs/xfs_types.h
@@ -134,7 +134,7 @@ typedef enum {
typedef enum {
XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_BMAPi, XFS_BTNUM_INOi,
- XFS_BTNUM_MAX
+ XFS_BTNUM_FINOi, XFS_BTNUM_MAX
} xfs_btnum_t;
struct xfs_name {
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
` (2 preceding siblings ...)
2013-09-03 18:25 ` [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
2013-09-05 0:59 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 05/11] xfs: update ifree " Brian Foster
` (7 subsequent siblings)
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
To: xfs
Update inode allocation transaction reservations for the finobt. A
create via record modification requires a log reservation for the
additional finobt record. Any such allocation could result in an
finobt removal if the inode chunk has become fully allocated, thus
we include a reservation for a finobt btree merge as well.
Allocation of a new inode chunk must account for splits in the
finobt as well as the existing ialloc tree.
Also update XFS_IALLOC_SPACE_RES() to reserve data blocks for
finobt split/merge scenarios.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_trans_resv.c | 9 ++++++++-
fs/xfs/xfs_trans_space.h | 2 +-
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
index a65a3cc..3040dad 100644
--- a/fs/xfs/xfs_trans_resv.c
+++ b/fs/xfs/xfs_trans_resv.c
@@ -272,6 +272,8 @@ xfs_calc_remove_reservation(
* the parent directory inode: inode size
* the new inode: inode size
* the inode btree entry: block size
+ * the free inode btree entry: block size
+ * the free inode btree: max depth * block size
* the superblock for the nlink flag: sector size
* the directory btree: (max depth + v2) * dir block size
* the directory inode's bmap btree: (max depth + v2) * block size
@@ -282,7 +284,8 @@ xfs_calc_create_resv_modify(
{
return xfs_calc_inode_res(mp, 2) +
xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
- (uint)XFS_FSB_TO_B(mp, 1) +
+ (uint)XFS_FSB_TO_B(mp, 2) +
+ xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
}
@@ -292,6 +295,7 @@ xfs_calc_create_resv_modify(
* the superblock for the nlink flag: sector size
* the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
* the inode btree: max depth * blocksize
+ * the free inode btree: max depth * blocksize
* the allocation btrees: 2 trees * (max depth - 1) * block size
*/
STATIC uint
@@ -302,6 +306,7 @@ xfs_calc_create_resv_alloc(
mp->m_sb.sb_sectsize +
xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
+ xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
XFS_FSB_TO_B(mp, 1));
}
@@ -320,6 +325,7 @@ __xfs_calc_create_reservation(
* the agi and agf of the ag getting the new inodes: 2 * sectorsize
* the superblock for the nlink flag: sector size
* the inode btree: max depth * blocksize
+ * the free inode btree: max depth * blocksize
* the allocation btrees: 2 trees * (max depth - 1) * block size
*/
STATIC uint
@@ -329,6 +335,7 @@ xfs_calc_icreate_resv_alloc(
return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
mp->m_sb.sb_sectsize +
xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
+ xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
XFS_FSB_TO_B(mp, 1));
}
diff --git a/fs/xfs/xfs_trans_space.h b/fs/xfs/xfs_trans_space.h
index 7d2c920..5dca732 100644
--- a/fs/xfs/xfs_trans_space.h
+++ b/fs/xfs/xfs_trans_space.h
@@ -47,7 +47,7 @@
#define XFS_DIRREMOVE_SPACE_RES(mp) \
XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK)
#define XFS_IALLOC_SPACE_RES(mp) \
- (XFS_IALLOC_BLOCKS(mp) + (mp)->m_in_maxlevels - 1)
+ (XFS_IALLOC_BLOCKS(mp) + 2 * ((mp)->m_in_maxlevels - 1))
/*
* Space reservation values for various transactions.
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH 05/11] xfs: update ifree transaction reservations for finobt
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
` (3 preceding siblings ...)
2013-09-03 18:25 ` [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
2013-09-05 1:00 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive() Brian Foster
` (6 subsequent siblings)
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
To: xfs
Update the ifree transaction log reservations to support the
finobt. An inode free can now mean an extra record modification,
record removal (i.e., inode chunk being freed) or a record
insertion (i.e., a previously full inode chunk).
Define XFS_IFREE_SPACE_RES() for the inactive transaction to
reserve data blocks for possible finobt merge/split situations.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_trans_resv.c | 5 ++++-
fs/xfs/xfs_trans_space.h | 2 ++
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
index 3040dad..bbd393a 100644
--- a/fs/xfs/xfs_trans_resv.c
+++ b/fs/xfs/xfs_trans_resv.c
@@ -388,8 +388,10 @@ xfs_calc_symlink_reservation(
* the super block free inode counter: sector size
* the agi hash list and counters: sector size
* the inode btree entry: block size
+ * the free inode btree entry: block size
* the on disk inode before ours in the agi hash list: inode cluster size
* the inode btree: max depth * blocksize
+ * the free inode btree: max depth * block size
* the allocation btrees: 2 trees * (max depth - 1) * block size
*/
STATIC uint
@@ -399,12 +401,13 @@ xfs_calc_ifree_reservation(
return XFS_DQUOT_LOGRES(mp) +
xfs_calc_inode_res(mp, 1) +
xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
- xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
+ xfs_calc_buf_res(2, XFS_FSB_TO_B(mp, 1)) +
MAX((__uint16_t)XFS_FSB_TO_B(mp, 1),
XFS_INODE_CLUSTER_SIZE(mp)) +
xfs_calc_buf_res(1, 0) +
xfs_calc_buf_res(2 + XFS_IALLOC_BLOCKS(mp) +
mp->m_in_maxlevels, 0) +
+ xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
XFS_FSB_TO_B(mp, 1));
}
diff --git a/fs/xfs/xfs_trans_space.h b/fs/xfs/xfs_trans_space.h
index 5dca732..4fc8c10 100644
--- a/fs/xfs/xfs_trans_space.h
+++ b/fs/xfs/xfs_trans_space.h
@@ -82,5 +82,7 @@
(XFS_DIRREMOVE_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl))
#define XFS_SYMLINK_SPACE_RES(mp,nl,b) \
(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl) + (b))
+#define XFS_IFREE_SPACE_RES(mp) \
+ (mp)->m_in_maxlevels
#endif /* __XFS_TRANS_SPACE_H__ */
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive()
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
` (4 preceding siblings ...)
2013-09-03 18:25 ` [RFC PATCH 05/11] xfs: update ifree " Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
2013-09-05 1:35 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC " Brian Foster
` (5 subsequent siblings)
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
To: xfs
The transaction allocated in xfs_inactive() can be passed down into
xfs_inactive_symlink() or xfs_itruncate_extents(), both of which
can commit and reallocate a new transaction. This leads to
reservation issues if the transaction is subsequently passed into
xfs_ifree(), which requires a larger reservation to manage the
finobt.
Reorganize xfs_inactive() to commit any transaction handed back
from symlink or truncate processing and unconditionally allocate
a new transaction for xfs_ifree() with the appropriate reservation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_inode.c | 71 ++++++++++++++++++++++++++++--------------------------
1 file changed, 37 insertions(+), 34 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e3d7538..56cbf63 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1679,7 +1679,6 @@ xfs_inactive(
int committed;
struct xfs_trans *tp;
struct xfs_mount *mp;
- struct xfs_trans_res *resp;
int error;
int truncate = 0;
@@ -1724,33 +1723,39 @@ xfs_inactive(
if (error)
return VN_INACTIVE_CACHE;
- tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
- resp = (truncate || S_ISLNK(ip->i_d.di_mode)) ?
- &M_RES(mp)->tr_itruncate : &M_RES(mp)->tr_ifree;
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
- error = xfs_trans_reserve(tp, resp, 0, 0);
- if (error) {
- ASSERT(XFS_FORCED_SHUTDOWN(mp));
- xfs_trans_cancel(tp, 0);
- return VN_INACTIVE_CACHE;
- }
+ if (truncate || S_ISLNK(ip->i_d.di_mode)) {
+ tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
- xfs_ilock(ip, XFS_ILOCK_EXCL);
- xfs_trans_ijoin(tp, ip, 0);
+ error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
+ if (error) {
+ ASSERT(XFS_FORCED_SHUTDOWN(mp));
+ xfs_trans_cancel(tp, 0);
+ return VN_INACTIVE_CACHE;
+ }
- if (S_ISLNK(ip->i_d.di_mode)) {
- error = xfs_inactive_symlink(ip, &tp);
- if (error)
- goto out_cancel;
- } else if (truncate) {
- ip->i_d.di_size = 0;
- xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ xfs_trans_ijoin(tp, ip, 0);
+
+ if (S_ISLNK(ip->i_d.di_mode)) {
+ error = xfs_inactive_symlink(ip, &tp);
+ if (error)
+ goto out_cancel;
+ } else if (truncate) {
+ ip->i_d.di_size = 0;
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+ error = xfs_itruncate_extents(&tp, ip, XFS_DATA_FORK,
+ 0);
+ if (error)
+ goto out_cancel;
- error = xfs_itruncate_extents(&tp, ip, XFS_DATA_FORK, 0);
+ ASSERT(ip->i_d.di_nextents == 0);
+ }
+
+ error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES);
if (error)
goto out_cancel;
-
- ASSERT(ip->i_d.di_nextents == 0);
}
/*
@@ -1762,27 +1767,25 @@ xfs_inactive(
if (ip->i_d.di_anextents > 0) {
ASSERT(ip->i_d.di_forkoff != 0);
- error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES);
- if (error)
- goto out_unlock;
-
xfs_iunlock(ip, XFS_ILOCK_EXCL);
error = xfs_attr_inactive(ip);
if (error)
goto out;
- tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
- error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
- if (error) {
- xfs_trans_cancel(tp, 0);
- goto out;
- }
-
xfs_ilock(ip, XFS_ILOCK_EXCL);
- xfs_trans_ijoin(tp, ip, 0);
}
+ tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
+ error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
+ XFS_IFREE_SPACE_RES(mp), 0);
+ if (error) {
+ xfs_trans_cancel(tp, 0);
+ goto out_unlock;
+ }
+
+ xfs_trans_ijoin(tp, ip, 0);
+
if (ip->i_afp)
xfs_idestroy_fork(ip, XFS_ATTR_FORK);
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC in xfs_inactive()
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
` (5 preceding siblings ...)
2013-09-03 18:25 ` [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive() Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
2013-09-05 1:40 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
` (4 subsequent siblings)
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
To: xfs
An ifree data block reservation can fail with ENOSPC. Flush inodes
to try and free up space or attempt without a data block
reservation to avoid failing out of xfs_inactive().
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_inode.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 56cbf63..92de4b7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1779,7 +1779,18 @@ xfs_inactive(
tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
XFS_IFREE_SPACE_RES(mp), 0);
+ if (error == ENOSPC) {
+ /* flush outstanding delalloc blocks and retry */
+ xfs_flush_inodes(mp);
+ error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
+ XFS_IFREE_SPACE_RES(mp), 0);
+ }
+ if (error == ENOSPC) {
+ /* resort to a no alloc. reservation */
+ error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
+ }
if (error) {
+ ASSERT(XFS_FORCED_SHUTDOWN(mp));
xfs_trans_cancel(tp, 0);
goto out_unlock;
}
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
` (6 preceding siblings ...)
2013-09-03 18:25 ` [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC " Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
2013-09-05 2:10 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation Brian Foster
` (3 subsequent siblings)
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
To: xfs
A newly allocated inode chunk, by definition, has at least one
free inode, so a record is always inserted into the finobt.
Create the xfs_inobt_insert() helper from existing code to insert
a record in an inobt based on the provided BTNUM. Update
xfs_ialloc_ag_alloc() to invoke the helper for the existing
XFS_BTNUM_INO tree and XFS_BTNUM_FINO tree, if enabled.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_ialloc.c | 79 +++++++++++++++++++++++++++++++++++++----------------
1 file changed, 56 insertions(+), 23 deletions(-)
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 5ced506..e64a728 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -152,6 +152,52 @@ xfs_check_agi_freecount(
#endif
/*
+ * Insert records describing a newly allocated inode chunk into the inobt.
+ */
+STATIC int
+xfs_inobt_insert(
+ struct xfs_mount *mp,
+ struct xfs_trans *tp,
+ struct xfs_buf *agbp,
+ xfs_agino_t newino,
+ xfs_agino_t newlen,
+ xfs_btnum_t btnum)
+{
+ struct xfs_btree_cur *cur;
+ struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
+ xfs_agnumber_t agno = be32_to_cpu(agi->agi_seqno);
+ xfs_agino_t thisino;
+ int i;
+ int error;
+
+ cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
+
+ for (thisino = newino;
+ thisino < newino + newlen;
+ thisino += XFS_INODES_PER_CHUNK) {
+ cur->bc_rec.i.ir_startino = thisino;
+ cur->bc_rec.i.ir_freecount = XFS_INODES_PER_CHUNK;
+ cur->bc_rec.i.ir_free = XFS_INOBT_ALL_FREE;
+ error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, &i);
+ if (error) {
+ xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+ return error;
+ }
+ ASSERT(i == 0);
+ error = xfs_btree_insert(cur, &i);
+ if (error) {
+ xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+ return error;
+ }
+ ASSERT(i == 1);
+ }
+
+ xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+ return 0;
+}
+
+/*
* Initialise a new set of inodes. When called without a transaction context
* (e.g. from recovery) we initiate a delayed write of the inode buffers rather
* than logging them (which in a transaction context puts them into the AIL
@@ -309,13 +355,10 @@ xfs_ialloc_ag_alloc(
{
xfs_agi_t *agi; /* allocation group header */
xfs_alloc_arg_t args; /* allocation argument structure */
- xfs_btree_cur_t *cur; /* inode btree cursor */
xfs_agnumber_t agno;
int error;
- int i;
xfs_agino_t newino; /* new first inode's number */
xfs_agino_t newlen; /* new number of inodes */
- xfs_agino_t thisino; /* current inode number, for loop */
int isaligned = 0; /* inode allocation at stripe unit */
/* boundary */
struct xfs_perag *pag;
@@ -453,29 +496,19 @@ xfs_ialloc_ag_alloc(
agi->agi_newino = cpu_to_be32(newino);
/*
- * Insert records describing the new inode chunk into the btree.
+ * Insert records describing the new inode chunk into the btrees.
*/
- cur = xfs_inobt_init_cursor(args.mp, tp, agbp, agno, XFS_BTNUM_INO);
- for (thisino = newino;
- thisino < newino + newlen;
- thisino += XFS_INODES_PER_CHUNK) {
- cur->bc_rec.i.ir_startino = thisino;
- cur->bc_rec.i.ir_freecount = XFS_INODES_PER_CHUNK;
- cur->bc_rec.i.ir_free = XFS_INOBT_ALL_FREE;
- error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, &i);
- if (error) {
- xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
- return error;
- }
- ASSERT(i == 0);
- error = xfs_btree_insert(cur, &i);
- if (error) {
- xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+ error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
+ XFS_BTNUM_INO);
+ if (error)
+ return error;
+
+ if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
+ error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
+ XFS_BTNUM_FINO);
+ if (error)
return error;
- }
- ASSERT(i == 1);
}
- xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
/*
* Log allocation group header fields
*/
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
` (7 preceding siblings ...)
2013-09-03 18:25 ` [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
2013-09-05 2:27 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 10/11] xfs: update the finobt on inode free Brian Foster
` (2 subsequent siblings)
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
To: xfs
Replace xfs_dialloc_ag() with an implementation that looks for a
record in the finobt. The finobt only tracks records with at least
one free inode. This eliminates the need for the intra-ag scan in
the original algorithm. Once the inode is allocated, update the
finobt appropriately (possibly removing the record) as well as the
inobt.
Move the original xfs_dialloc_ag() algorithm to
xfs_dialloc_ag_slow() and fall back as such if finobt support is
not enabled.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_ialloc.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 135 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index e64a728..516f4af 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -708,7 +708,7 @@ xfs_ialloc_get_rec(
* available.
*/
STATIC int
-xfs_dialloc_ag(
+xfs_dialloc_ag_slow(
struct xfs_trans *tp,
struct xfs_buf *agbp,
xfs_ino_t parent,
@@ -966,6 +966,140 @@ error0:
return error;
}
+STATIC int
+xfs_dialloc_ag(
+ struct xfs_trans *tp,
+ struct xfs_buf *agbp,
+ xfs_ino_t parent,
+ xfs_ino_t *inop)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+ struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
+ xfs_agnumber_t agno = be32_to_cpu(agi->agi_seqno);
+ xfs_agino_t pagino = XFS_INO_TO_AGINO(mp, parent);
+ struct xfs_perag *pag;
+ struct xfs_btree_cur *fcur;
+ struct xfs_btree_cur *icur;
+ struct xfs_inobt_rec_incore frec;
+ struct xfs_inobt_rec_incore irec;
+ xfs_ino_t ino;
+ int error;
+ int offset;
+ int i;
+
+ if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+ return xfs_dialloc_ag_slow(tp, agbp, parent, inop);
+
+ pag = xfs_perag_get(mp, agno);
+
+ /*
+ * If pagino is 0 (this is the root inode allocation) use newino.
+ * This must work because we've just allocated some.
+ */
+ if (!pagino)
+ pagino = be32_to_cpu(agi->agi_newino);
+
+ fcur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
+ icur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
+
+ error = xfs_check_agi_freecount(fcur, agi);
+ if (error)
+ goto error;
+ error = xfs_check_agi_freecount(icur, agi);
+ if (error)
+ goto error;
+
+ /*
+ * Search the finobt.
+ */
+ error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_LE, &i);
+ if (error)
+ goto error;
+ if (i == 0) {
+ error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_GE, &i);
+ if (error)
+ goto error;
+ XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+ }
+
+ error = xfs_inobt_get_rec(fcur, &frec, &i);
+ if (error)
+ goto error;
+ XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+
+ offset = xfs_lowbit64(frec.ir_free);
+ ASSERT(offset >= 0);
+ ASSERT(offset < XFS_INODES_PER_CHUNK);
+ ASSERT((XFS_AGINO_TO_OFFSET(mp, frec.ir_startino) %
+ XFS_INODES_PER_CHUNK) == 0);
+ ino = XFS_AGINO_TO_INO(mp, agno, frec.ir_startino + offset);
+
+ /*
+ * Modify or remove the finobt record.
+ */
+ frec.ir_free &= ~XFS_INOBT_MASK(offset);
+ frec.ir_freecount--;
+ if (frec.ir_freecount)
+ error = xfs_inobt_update(fcur, &frec);
+ else
+ error = xfs_btree_delete(fcur, &i);
+ if (error)
+ goto error;
+
+ /*
+ * Lookup and modify the equivalent record in the inobt.
+ */
+ error = xfs_inobt_lookup(icur, frec.ir_startino, XFS_LOOKUP_EQ, &i);
+ if (error)
+ goto error;
+ XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+
+ error = xfs_inobt_get_rec(icur, &irec, &i);
+ if (error)
+ goto error;
+ XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+ ASSERT((XFS_AGINO_TO_OFFSET(mp, irec.ir_startino) %
+ XFS_INODES_PER_CHUNK) == 0);
+
+ irec.ir_free &= ~XFS_INOBT_MASK(offset);
+ irec.ir_freecount--;
+
+ XFS_WANT_CORRUPTED_GOTO((frec.ir_free == irec.ir_free) &&
+ (frec.ir_freecount == irec.ir_freecount),
+ error);
+
+ error = xfs_inobt_update(icur, &irec);
+ if (error)
+ goto error;
+
+ /*
+ * Update the perag and superblock.
+ */
+ be32_add_cpu(&agi->agi_freecount, -1);
+ xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
+ pag->pagi_freecount--;
+
+ xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
+ xfs_perag_put(pag);
+
+ error = xfs_check_agi_freecount(fcur, agi);
+ if (error)
+ goto error;
+ error = xfs_check_agi_freecount(icur, agi);
+ if (error)
+ goto error;
+
+ xfs_btree_del_cursor(icur, XFS_BTREE_NOERROR);
+ xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
+ *inop = ino;
+ return 0;
+error:
+ xfs_perag_put(pag);
+ xfs_btree_del_cursor(icur, XFS_BTREE_ERROR);
+ xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
+ return error;
+}
+
/*
* Allocate an inode on disk.
*
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH 10/11] xfs: update the finobt on inode free
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
` (8 preceding siblings ...)
2013-09-03 18:25 ` [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
2013-09-05 2:54 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 11/11] xfs: add finobt support to growfs Brian Foster
2013-09-05 21:17 ` [RFC PATCH 00/11] xfs: introduce the free inode btree Michael L. Semon
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
To: xfs
An inode free operation can have several effects on the finobt. If
all inodes have been freed and the chunk deallocated, we remove the
finobt record. If the inode chunk was previously full, we must
insert a new record based on the existing inobt record. Otherwise,
we modify the record in place.
Create the xfs_ifree_finobt() function to identify the potential
scenarios and update the finobt appropriately.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_ialloc.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 120 insertions(+)
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 516f4af..96f71b5 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -198,6 +198,117 @@ xfs_inobt_insert(
}
/*
+ * Free an inode in the free inode btree.
+ */
+STATIC int
+xfs_ifree_finobt(
+ struct xfs_mount *mp,
+ struct xfs_trans *tp,
+ struct xfs_buf *agbp,
+ struct xfs_inobt_rec_incore *ibtrec,/* inobt record */
+ int offset) /* offset of inode */
+{
+ struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
+ xfs_agnumber_t agno = be32_to_cpu(agi->agi_seqno);
+ struct xfs_btree_cur *cur;
+ struct xfs_inobt_rec_incore rec;
+ int error;
+ int i;
+
+ if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+ return 0;
+
+ cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
+
+ error = xfs_inobt_lookup(cur, ibtrec->ir_startino, XFS_LOOKUP_EQ, &i);
+ if (error)
+ goto error;
+
+ if (i == 1) {
+ int j;
+ /*
+ * Read and update the existing record.
+ */
+ error = xfs_inobt_get_rec(cur, &rec, &j);
+ if (error)
+ goto error;
+ XFS_WANT_CORRUPTED_GOTO(j == 1, error);
+
+ rec.ir_free |= XFS_INOBT_MASK(offset);
+ rec.ir_freecount++;
+
+ XFS_WANT_CORRUPTED_GOTO((rec.ir_free == ibtrec->ir_free) &&
+ (rec.ir_freecount == ibtrec->ir_freecount),
+ error);
+ }
+
+ /*
+ * The content of inobt records should always match between the inobt
+ * and finobt. The lifecycle of records in the finobt is different from
+ * the inobt in that the finobt only tracks records with at least one
+ * free inode. This is to optimize lookup for inode allocation purposes.
+ * The following checks fix up the finobt appropriately based on the
+ * state of the record subsequent to the current operation.
+ */
+
+ if ((i == 1) &&
+ (rec.ir_freecount == XFS_IALLOC_INODES(mp) &&
+ !(mp->m_flags & XFS_MOUNT_IKEEP))) {
+ /*
+ * We have an existing finobt record. If all inodes are free
+ * and we're in !ikeep mode, the entire inode chunk has been
+ * deallocated. Remove the record from the finobt.
+ */
+ error = xfs_btree_delete(cur, &i);
+ if (error)
+ goto error;
+ ASSERT(i == 1);
+ } else if ((i == 0) && (ibtrec->ir_freecount == 1)) {
+ /*
+ * No existing finobt record and the inobt record has a single
+ * free inode. This means we've freed an inode in a previously
+ * fully allocated chunk. Insert a new record into the finobt
+ * based on the current inobt record.
+ */
+ cur->bc_rec.i.ir_startino = ibtrec->ir_startino;
+ cur->bc_rec.i.ir_free = ibtrec->ir_free;
+ cur->bc_rec.i.ir_freecount = ibtrec->ir_freecount;
+ error = xfs_btree_insert(cur, &i);
+ if (error)
+ goto error;
+ ASSERT(i == 1);
+ } else if (i == 1) {
+ /*
+ * The existing finobt record was modified and has a combination
+ * of allocated and free inodes or is completely free and ikeep
+ * is enabled. Update the record.
+ */
+ error = xfs_inobt_update(cur, &rec);
+ if (error)
+ goto error;
+ } else {
+ /* somehow out of sync */
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
+ agbp->b_addr);
+ ASSERT(0);
+
+ error = XFS_ERROR(EFSCORRUPTED);
+ goto error;
+ }
+
+ error = xfs_check_agi_freecount(cur, agi);
+ if (error)
+ goto error;
+
+ xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+ return 0;
+
+error:
+ xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+ return error;
+}
+
+/*
* Initialise a new set of inodes. When called without a transaction context
* (e.g. from recovery) we initiate a delayed write of the inode buffers rather
* than logging them (which in a transaction context puts them into the AIL
@@ -1422,6 +1533,15 @@ xfs_difree(
if (error)
goto error0;
+ /*
+ * Fix up the free inode btree.
+ */
+ if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+ error = xfs_ifree_finobt(mp, tp, agbp, &rec, off);
+ if (error)
+ goto error0;
+ }
+
xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
return 0;
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH 11/11] xfs: add finobt support to growfs
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
` (9 preceding siblings ...)
2013-09-03 18:25 ` [RFC PATCH 10/11] xfs: update the finobt on inode free Brian Foster
@ 2013-09-03 18:25 ` Brian Foster
2013-09-05 2:55 ` Dave Chinner
2013-09-05 21:17 ` [RFC PATCH 00/11] xfs: introduce the free inode btree Michael L. Semon
11 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-03 18:25 UTC (permalink / raw)
To: xfs
Add finobt support to growfs. Initialize the agi root/level fields
and the root finobt block.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_fsops.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index e64ee52..2e6ef6d 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -309,6 +309,10 @@ xfs_growfs_data_private(
agi->agi_dirino = cpu_to_be32(NULLAGINO);
if (xfs_sb_version_hascrc(&mp->m_sb))
uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_uuid);
+ if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+ agi->agi_free_root = cpu_to_be32(XFS_FIBT_BLOCK(mp));
+ agi->agi_free_level = cpu_to_be32(1);
+ }
for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++)
agi->agi_unlinked[bucket] = cpu_to_be32(NULLAGINO);
@@ -400,6 +404,34 @@ xfs_growfs_data_private(
xfs_buf_relse(bp);
if (error)
goto error0;
+
+ /*
+ * FINO btree root block
+ */
+ if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+ bp = xfs_growfs_get_hdr_buf(mp,
+ XFS_AGB_TO_DADDR(mp, agno, XFS_FIBT_BLOCK(mp)),
+ BTOBB(mp->m_sb.sb_blocksize), 0,
+ &xfs_inobt_buf_ops);
+ if (!bp) {
+ error = ENOMEM;
+ goto error0;
+ }
+
+ if (xfs_sb_version_hascrc(&mp->m_sb))
+ xfs_btree_init_block(mp, bp, XFS_FIBT_CRC_MAGIC,
+ 0, 0, agno,
+ XFS_BTREE_CRC_BLOCKS);
+ else
+ xfs_btree_init_block(mp, bp, XFS_FIBT_MAGIC, 0,
+ 0, agno, 0);
+
+ error = xfs_bwrite(bp);
+ xfs_buf_relse(bp);
+ if (error)
+ goto error0;
+ }
+
}
xfs_trans_agblocks_delta(tp, nfree);
/*
--
1.8.1.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
2013-09-03 18:24 ` [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
@ 2013-09-05 0:36 ` Dave Chinner
0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 0:36 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:24:58PM -0400, Brian Foster wrote:
> The introduction of the free inode btree (finobt) requires that
> xfs_ialloc_btree.c handle multiple trees. Refactor xfs_ialloc_btree.c
> so the caller specifies the btree type on cursor initialization to
> prepare for addition of the finobt.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
Simple enough, obvious place to start. :)
Reviewed-by: Dave Chinner <david@fromorbit.com>
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt
2013-09-03 18:24 ` [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
@ 2013-09-05 0:39 ` Dave Chinner
0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 0:39 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:24:59PM -0400, Brian Foster wrote:
> Reserve a v5 read-only compatibility feature bit for the finobt and
> create the xfs_sb_version_hasfinobt() helper to determine whether
> an fs has the feature enabled.
>
> The finobt does not change existing on-disk structures, but must
> remain consistent with the ialloc btree. Modifications from older
> kernels would violate that constrant. Therefore, we restrict older
> kernels to read-only mounts of finobt-enabled filesystems.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/xfs/xfs_sb.h | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_sb.h b/fs/xfs/xfs_sb.h
> index 6835b44..c48d95d 100644
> --- a/fs/xfs/xfs_sb.h
> +++ b/fs/xfs/xfs_sb.h
> @@ -585,7 +585,9 @@ xfs_sb_has_compat_feature(
> return (sbp->sb_features_compat & feature) != 0;
> }
>
> -#define XFS_SB_FEAT_RO_COMPAT_ALL 0
> +#define XFS_SB_FEAT_RO_COMPAT_FINOBT (1 << 0) /* free inode btree */
> +#define XFS_SB_FEAT_RO_COMPAT_ALL \
> + (XFS_SB_FEAT_RO_COMPAT_FINOBT)
> #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
The only thing I'd suggest here is that the last patch in the series
should add the XFS_SB_FEAT_RO_COMPAT_FINOBT bit to the
XFS_SB_FEAT_RO_COMPAT_ALL mask.
Otherwise we can have the problem of bisects landing in the middle
of the series and thinking that the feature is fully supported when
it isn't. So it's fine to add the xfs_sb_version_hasfinobt() helper
here and define the bit but don't add it to the supported mask
until all the changes for the feature are complete.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
2013-09-03 18:25 ` [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
@ 2013-09-05 0:54 ` Dave Chinner
2013-09-05 16:17 ` Brian Foster
0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 0:54 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:25:00PM -0400, Brian Foster wrote:
> Define the AGI fields for the finobt root/level and add magic
> numbers. Update the btree code to add support for the new
> XFS_BTNUM_FINOBT inode btree.
>
> The finobt root block is reserved immediately following the inobt
> root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
> starting AG data block based on whether finobt support is enabled.
A few minor things...
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/xfs/xfs_ag.h | 7 ++++++-
> fs/xfs/xfs_btree.c | 6 ++++--
> fs/xfs/xfs_btree.h | 3 +++
> fs/xfs/xfs_ialloc.c | 2 ++
> fs/xfs/xfs_ialloc_btree.c | 29 +++++++++++++++++++++++------
> fs/xfs/xfs_ialloc_btree.h | 14 +++++++++++++-
> fs/xfs/xfs_log_recover.c | 2 ++
> fs/xfs/xfs_stats.h | 18 +++++++++++++++++-
> fs/xfs/xfs_types.h | 2 +-
> 9 files changed, 71 insertions(+), 12 deletions(-)
>
> diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
> index 1cb740a..b85585d 100644
> --- a/fs/xfs/xfs_ag.h
> +++ b/fs/xfs/xfs_ag.h
> @@ -166,6 +166,9 @@ typedef struct xfs_agi {
> __be32 agi_pad32;
> __be64 agi_lsn; /* last write sequence */
>
> + __be32 agi_free_root; /* root of the free inode btree */
> + __be32 agi_free_level;/* levels in free inode btree */
> +
> /* structure must be padded to 64 bit alignment */
> } xfs_agi_t;
That's fine, but...
>
> @@ -180,7 +183,9 @@ typedef struct xfs_agi {
> #define XFS_AGI_NEWINO 0x00000100
> #define XFS_AGI_DIRINO 0x00000200
> #define XFS_AGI_UNLINKED 0x00000400
> -#define XFS_AGI_NUM_BITS 11
> +#define XFS_AGI_FREE_ROOT 0x00000800
> +#define XFS_AGI_FREE_LEVEL 0x00001000
> +#define XFS_AGI_NUM_BITS 13
> #define XFS_AGI_ALL_BITS ((1 << XFS_AGI_NUM_BITS) - 1)
This is a bit of a problem, because the range logging bits will now
cause the entire AGI to be logged (including all the unlinked list
hash) because these only define the first/last offsets to be
logged...
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index 524aa88..5ced506 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -1505,6 +1505,8 @@ xfs_ialloc_log_agi(
> offsetof(xfs_agi_t, agi_newino),
> offsetof(xfs_agi_t, agi_dirino),
> offsetof(xfs_agi_t, agi_unlinked),
> + offsetof(xfs_agi_t, agi_free_root),
> + offsetof(xfs_agi_t, agi_free_level),
> sizeof(xfs_agi_t)
> };
Because of how this table works.
What we really need here is for xfs_ialloc_log_agi to consider that
there are two distinct regions for range logging - the first spaces
from offset 0 to offset of agi_unlinked, and the second is from the
the offset of agi_free_root to the end of the xfs_agi_t....
It's abit messy, I know, but we couldn't easily add new padding to
the AGI in the existing range logging area like was done for the AGF
because of the unlinked list hash table already defining the end of
the range logging region....
> #ifdef DEBUG
> diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
> index 0cdb88b..7923292 100644
> --- a/fs/xfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/xfs_ialloc_btree.c
> @@ -62,10 +62,18 @@ xfs_inobt_set_root(
> {
> struct xfs_buf *agbp = cur->bc_private.a.agbp;
> struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
> -
> - agi->agi_root = nptr->s;
> - be32_add_cpu(&agi->agi_level, inc);
> - xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
> + int fields;
> +
> + if (cur->bc_btnum == XFS_BTNUM_INO) {
> + agi->agi_root = nptr->s;
> + be32_add_cpu(&agi->agi_level, inc);
> + fields = XFS_AGI_ROOT | XFS_AGI_LEVEL;
> + } else {
> + agi->agi_free_root = nptr->s;
> + be32_add_cpu(&agi->agi_free_level, inc);
> + fields = XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
> + }
> + xfs_ialloc_log_agi(cur->bc_tp, agbp, fields);
> }
I suspect that it would be better to have separate functions for
these differences i.e. xfs_inobt_set_root() and
xfs_finobt_set_root(), and set up separate btree ops structure
forthe two different trees. Most of the code is still identical,
but the differences in root structures can easily be handled without
putting switches in the code everywhere.
>
> STATIC int
> @@ -172,7 +180,10 @@ xfs_inobt_init_ptr_from_cur(
>
> ASSERT(cur->bc_private.a.agno == be32_to_cpu(agi->agi_seqno));
>
> - ptr->s = agi->agi_root;
> + if (cur->bc_btnum == XFS_BTNUM_INO)
> + ptr->s = agi->agi_root;
> + else
> + ptr->s = agi->agi_free_root;
> }
Like this...
>
> STATIC __int64_t
> @@ -205,6 +216,7 @@ xfs_inobt_verify(
> */
> switch (block->bb_magic) {
> case cpu_to_be32(XFS_IBT_CRC_MAGIC):
> + case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
> if (!xfs_sb_version_hascrc(&mp->m_sb))
> return false;
> if (!uuid_equal(&block->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid))
> @@ -216,6 +228,7 @@ xfs_inobt_verify(
> return false;
> /* fall through */
> case cpu_to_be32(XFS_IBT_MAGIC):
> + case cpu_to_be32(XFS_FIBT_MAGIC):
> break;
> default:
> return 0;
> @@ -335,8 +348,12 @@ xfs_inobt_init_cursor(
>
> cur->bc_tp = tp;
> cur->bc_mp = mp;
> - cur->bc_nlevels = be32_to_cpu(agi->agi_level);
> cur->bc_btnum = btnum;
> + if (btnum == XFS_BTNUM_INO)
> + cur->bc_nlevels = be32_to_cpu(agi->agi_level);
> + else
> + cur->bc_nlevels = be32_to_cpu(agi->agi_free_level);
> +
> cur->bc_blocklog = mp->m_sb.sb_blocklog;
>
> cur->bc_ops = &xfs_inobt_ops;
And this is where you do the check on the btnum and set the
appropriate ops structure....
> #define XFS_INODES_PER_CHUNK (NBBY * sizeof(xfs_inofree_t))
> @@ -73,7 +75,17 @@ typedef __be32 xfs_inobt_ptr_t;
> * block numbers in the AG.
> */
> #define XFS_IBT_BLOCK(mp) ((xfs_agblock_t)(XFS_CNT_BLOCK(mp) + 1))
> -#define XFS_PREALLOC_BLOCKS(mp) ((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
> +#define XFS_FIBT_BLOCK(mp) ((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
> +
> +/*
> + * The first data block of an AG depends on whether the filesystem was formatted
> + * with the finobt feature. If so, account for the finobt reserved root btree
> + * block.
> + */
> +#define XFS_PREALLOC_BLOCKS(mp) \
> + (xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
> + XFS_FIBT_BLOCK(mp) + 1 : \
> + XFS_IBT_BLOCK(mp) + 1)
Yup, that looks sensible, with a nice comment to explain it :)
> diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
> index c03ad38..c8f238b 100644
> --- a/fs/xfs/xfs_stats.h
> +++ b/fs/xfs/xfs_stats.h
> @@ -183,7 +183,23 @@ struct xfsstats {
> __uint32_t xs_ibt_2_alloc;
> __uint32_t xs_ibt_2_free;
> __uint32_t xs_ibt_2_moves;
> -#define XFSSTAT_END_XQMSTAT (XFSSTAT_END_IBT_V2+6)
> +#define XFSSTAT_END_FIBT_V2 (XFSSTAT_END_IBT_V2+15)
> + __uint32_t xs_fibt_2_lookup;
> + __uint32_t xs_fibt_2_compare;
> + __uint32_t xs_fibt_2_insrec;
> + __uint32_t xs_fibt_2_delrec;
> + __uint32_t xs_fibt_2_newroot;
> + __uint32_t xs_fibt_2_killroot;
> + __uint32_t xs_fibt_2_increment;
> + __uint32_t xs_fibt_2_decrement;
> + __uint32_t xs_fibt_2_lshift;
> + __uint32_t xs_fibt_2_rshift;
> + __uint32_t xs_fibt_2_split;
> + __uint32_t xs_fibt_2_join;
> + __uint32_t xs_fibt_2_alloc;
> + __uint32_t xs_fibt_2_free;
> + __uint32_t xs_fibt_2_moves;
> +#define XFSSTAT_END_XQMSTAT (XFSSTAT_END_FIBT_V2+6)
> __uint32_t xs_qm_dqreclaims;
> __uint32_t xs_qm_dqreclaim_misses;
> __uint32_t xs_qm_dquot_dups;
I didn't see an equivalent change to add these new stats to the proc
file output (ie. in xfs_stat_proc_show()). maybe I just missed it,
but if I didn't, can you add it?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt
2013-09-03 18:25 ` [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt Brian Foster
@ 2013-09-05 0:59 ` Dave Chinner
2013-09-05 16:17 ` Brian Foster
0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 0:59 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:25:01PM -0400, Brian Foster wrote:
> Update inode allocation transaction reservations for the finobt. A
> create via record modification requires a log reservation for the
> additional finobt record. Any such allocation could result in an
> finobt removal if the inode chunk has become fully allocated, thus
> we include a reservation for a finobt btree merge as well.
> Allocation of a new inode chunk must account for splits in the
> finobt as well as the existing ialloc tree.
These transaction reservation changes are only necessary for
filesystems with free inode btrees, otherwise they just use more log
space than is necessary.
Can you add helper functions for the free inode btree reservations,
and have them return 0 when the feature is not enabled? That way the
code stays pretty clean, is self documenting and doesn't take
unnecessary space when the feature is not enabled....
> Also update XFS_IALLOC_SPACE_RES() to reserve data blocks for
> finobt split/merge scenarios.
Needs to handle the enabled/disabled case, too.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 05/11] xfs: update ifree transaction reservations for finobt
2013-09-03 18:25 ` [RFC PATCH 05/11] xfs: update ifree " Brian Foster
@ 2013-09-05 1:00 ` Dave Chinner
0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 1:00 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:25:02PM -0400, Brian Foster wrote:
> Update the ifree transaction log reservations to support the
> finobt. An inode free can now mean an extra record modification,
> record removal (i.e., inode chunk being freed) or a record
> insertion (i.e., a previously full inode chunk).
>
> Define XFS_IFREE_SPACE_RES() for the inactive transaction to
> reserve data blocks for possible finobt merge/split situations.
Same comment about doing this with helper functions...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive()
2013-09-03 18:25 ` [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive() Brian Foster
@ 2013-09-05 1:35 ` Dave Chinner
2013-09-05 16:18 ` Brian Foster
0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 1:35 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:25:03PM -0400, Brian Foster wrote:
> The transaction allocated in xfs_inactive() can be passed down into
> xfs_inactive_symlink() or xfs_itruncate_extents(), both of which
> can commit and reallocate a new transaction. This leads to
> reservation issues if the transaction is subsequently passed into
> xfs_ifree(), which requires a larger reservation to manage the
> finobt.
>
> Reorganize xfs_inactive() to commit any transaction handed back
> from symlink or truncate processing and unconditionally allocate
> a new transaction for xfs_ifree() with the appropriate reservation.
Ok, I've had a bit of a look at this now, and I like how the code
turns out. However, I don't think it goes far enough, or fix the
problem that causes all the transaction nastiness in xfs_inactive().
Firstly, we are not doing rolling transactions here - there is no
need for all the changes to be atomic because the inode is on the
unlinked list if it is going to be freed. Hence we don't need to
pass transaction pointers around.
xfs_inactive_symlink() can do a transaction completely internally,
and, well, it doesn't even log the inode if the symlink is in-line
and so may not even need a transaction. Hence really only
xfs_inactive_symlink_rmt() needs to run a transaction, and it can do
that internally just fine.
For the xfs_itruncate_extents() data fork transaction, just add a
new wrapper called xfs_inactive_truncate() that holds the
transaction context internally - that moves the only other
transaction context that you need to commit out of xfs_inactive()
altogether, as the attr fork already uses a private transaction
context.
And, finally, you can then factor the xfs_ifree() and it's
transaction context into a helper function as well, so there aren't
any transaction contexts left in xfs_inactive() at all.
That would leave us with:
if (ISLNK) {
error = xfs_inactive_symlink(ip);
} else if (truncate)
error = xfs_inactive_truncate(ip);
}
if (error)
goto out;
if (ip->i_d.di_anextents > 0)
error = xfs_attr_inactive(ip);
if (error)
goto out;
error = xfs_inactive_ifree(ip);
xfs_qm_dqdetach(ip);
out:
return;
This gives us a natural separation of the different transaction
reservations and contexts needed to perform the operations, and does
result in any extraneous work being done because we don't know what
the transaction context passed to us contains at all...
FWIW, there are other reasons for suggesting this structure - have a
read of "[RFD 14/17] xfs: separate inode freeing from inactivation"
and you'll see that what I've suggested above sets the code up for
implementing the optimisations documented in the RFD.
http://oss.sgi.com/archives/xfs/2013-08/msg00345.html
It might be best to put this as 3-4 patches at the start of the
series, rather than in the middle of it as it's really a separate
piece of cleanup work....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC in xfs_inactive()
2013-09-03 18:25 ` [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC " Brian Foster
@ 2013-09-05 1:40 ` Dave Chinner
2013-09-05 16:18 ` Brian Foster
0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 1:40 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:25:04PM -0400, Brian Foster wrote:
> An ifree data block reservation can fail with ENOSPC. Flush inodes
> to try and free up space or attempt without a data block
> reservation to avoid failing out of xfs_inactive().
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/xfs/xfs_inode.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 56cbf63..92de4b7 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1779,7 +1779,18 @@ xfs_inactive(
> tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
> error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> XFS_IFREE_SPACE_RES(mp), 0);
> + if (error == ENOSPC) {
> + /* flush outstanding delalloc blocks and retry */
> + xfs_flush_inodes(mp);
> + error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> + XFS_IFREE_SPACE_RES(mp), 0);
> + }
We don't want to be blocking for inode flushes here. We might be in
a shrinker context, for example, and blocking those for a filesystem
sync is going to be unfriendly.
If this really is a problem, then the right thing to do is to allow
this transaction to dip into the reserve block pool so the
transaction can complete and make progress - other write operations
will trigger the flushing of the filesystem, and freeing of whole
inode chunks should return more free space than we need for the
finobt modifications in the removing lots of zero length inodes
at ENOSPC case....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt
2013-09-03 18:25 ` [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
@ 2013-09-05 2:10 ` Dave Chinner
0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 2:10 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:25:05PM -0400, Brian Foster wrote:
> A newly allocated inode chunk, by definition, has at least one
> free inode, so a record is always inserted into the finobt.
>
> Create the xfs_inobt_insert() helper from existing code to insert
> a record in an inobt based on the provided BTNUM. Update
> xfs_ialloc_ag_alloc() to invoke the helper for the existing
> XFS_BTNUM_INO tree and XFS_BTNUM_FINO tree, if enabled.
Factoring is good...
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/xfs/xfs_ialloc.c | 79 +++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 56 insertions(+), 23 deletions(-)
>
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index 5ced506..e64a728 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -152,6 +152,52 @@ xfs_check_agi_freecount(
> #endif
>
> /*
> + * Insert records describing a newly allocated inode chunk into the inobt.
> + */
> +STATIC int
> +xfs_inobt_insert(
> + struct xfs_mount *mp,
> + struct xfs_trans *tp,
> + struct xfs_buf *agbp,
> + xfs_agino_t newino,
> + xfs_agino_t newlen,
> + xfs_btnum_t btnum)
> +{
> + struct xfs_btree_cur *cur;
> + struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
> + xfs_agnumber_t agno = be32_to_cpu(agi->agi_seqno);
> + xfs_agino_t thisino;
> + int i;
> + int error;
> +
> + cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
> +
> + for (thisino = newino;
> + thisino < newino + newlen;
> + thisino += XFS_INODES_PER_CHUNK) {
> + cur->bc_rec.i.ir_startino = thisino;
> + cur->bc_rec.i.ir_freecount = XFS_INODES_PER_CHUNK;
> + cur->bc_rec.i.ir_free = XFS_INOBT_ALL_FREE;
> + error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, &i);
> + if (error) {
> + xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> + return error;
> + }
I'm wondering if we'd do better to pass freecount/free to
xfs_inobt_lookup() and call that instead. i.e. we don't need to
expose the btree cursor internals here...
Maybe, also, move this function up to the top of the file with the
other xfs_inobt_* functions.
Otherwise it all looks OK.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation
2013-09-03 18:25 ` [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation Brian Foster
@ 2013-09-05 2:27 ` Dave Chinner
2013-09-05 16:18 ` Brian Foster
0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 2:27 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:25:06PM -0400, Brian Foster wrote:
> Replace xfs_dialloc_ag() with an implementation that looks for a
> record in the finobt. The finobt only tracks records with at least
> one free inode. This eliminates the need for the intra-ag scan in
> the original algorithm. Once the inode is allocated, update the
> finobt appropriately (possibly removing the record) as well as the
> inobt.
>
> Move the original xfs_dialloc_ag() algorithm to
> xfs_dialloc_ag_slow() and fall back as such if finobt support is
> not enabled.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/xfs/xfs_ialloc.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 135 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index e64a728..516f4af 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -708,7 +708,7 @@ xfs_ialloc_get_rec(
> * available.
> */
> STATIC int
> -xfs_dialloc_ag(
> +xfs_dialloc_ag_slow(
> struct xfs_trans *tp,
> struct xfs_buf *agbp,
> xfs_ino_t parent,
> @@ -966,6 +966,140 @@ error0:
> return error;
> }
>
> +STATIC int
> +xfs_dialloc_ag(
> + struct xfs_trans *tp,
> + struct xfs_buf *agbp,
> + xfs_ino_t parent,
> + xfs_ino_t *inop)
> +{
> + struct xfs_mount *mp = tp->t_mountp;
> + struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
> + xfs_agnumber_t agno = be32_to_cpu(agi->agi_seqno);
> + xfs_agino_t pagino = XFS_INO_TO_AGINO(mp, parent);
> + struct xfs_perag *pag;
> + struct xfs_btree_cur *fcur;
> + struct xfs_btree_cur *icur;
> + struct xfs_inobt_rec_incore frec;
> + struct xfs_inobt_rec_incore irec;
> + xfs_ino_t ino;
> + int error;
> + int offset;
> + int i;
> +
> + if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> + return xfs_dialloc_ag_slow(tp, agbp, parent, inop);
I'm starting to think that we really, really need the iops vector
mentioned in "[RFD 15/17] xfs: introduce a method vector for unlinked
list operations" so we don't need to have these sorts of switches in
the code...
> +
> + pag = xfs_perag_get(mp, agno);
> +
> + /*
> + * If pagino is 0 (this is the root inode allocation) use newino.
> + * This must work because we've just allocated some.
> + */
> + if (!pagino)
> + pagino = be32_to_cpu(agi->agi_newino);
> +
> + fcur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
> + icur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_INO);
> +
> + error = xfs_check_agi_freecount(fcur, agi);
> + if (error)
> + goto error;
> + error = xfs_check_agi_freecount(icur, agi);
> + if (error)
> + goto error;
Why do we need to initialise both cursors at once? We only do the
operations one at a time, and you should actually use 2 cursors
for the finobt lookup.....
> +
> + /*
> + * Search the finobt.
> + */
> + error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_LE, &i);
> + if (error)
> + goto error;
> + if (i == 0) {
> + error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_GE, &i);
> + if (error)
> + goto error;
> + XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> + }
.... because this biases allocation to lower inode numbers than the
target. i.e we only ever search for higher numbers if here are none
lower. That's quite different to the current algorithm which first
searches for the *closest* free inode.
That is, we should be using two cursors for the free inode search,
one for LE, the other for GT. If they both return records then, like
the "slow" algorithm, calculate the diff between them and the target
inode, and select the closer one (smallest diff). Destroy the cursor
you don't need.
> + error = xfs_inobt_get_rec(fcur, &frec, &i);
> + if (error)
> + goto error;
> + XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> +
> + offset = xfs_lowbit64(frec.ir_free);
> + ASSERT(offset >= 0);
> + ASSERT(offset < XFS_INODES_PER_CHUNK);
> + ASSERT((XFS_AGINO_TO_OFFSET(mp, frec.ir_startino) %
> + XFS_INODES_PER_CHUNK) == 0);
> + ino = XFS_AGINO_TO_INO(mp, agno, frec.ir_startino + offset);
> +
> + /*
> + * Modify or remove the finobt record.
> + */
> + frec.ir_free &= ~XFS_INOBT_MASK(offset);
> + frec.ir_freecount--;
> + if (frec.ir_freecount)
> + error = xfs_inobt_update(fcur, &frec);
> + else
> + error = xfs_btree_delete(fcur, &i);
> + if (error)
> + goto error;
Yup, good.
Now you can re-initialise the second cursor to point at the inobt
and:
> +
> + /*
> + * Lookup and modify the equivalent record in the inobt.
> + */
> + error = xfs_inobt_lookup(icur, frec.ir_startino, XFS_LOOKUP_EQ, &i);
> + if (error)
> + goto error;
> + XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> +
> + error = xfs_inobt_get_rec(icur, &irec, &i);
> + if (error)
> + goto error;
> + XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> + ASSERT((XFS_AGINO_TO_OFFSET(mp, irec.ir_startino) %
> + XFS_INODES_PER_CHUNK) == 0);
> +
> + irec.ir_free &= ~XFS_INOBT_MASK(offset);
> + irec.ir_freecount--;
> +
> + XFS_WANT_CORRUPTED_GOTO((frec.ir_free == irec.ir_free) &&
> + (frec.ir_freecount == irec.ir_freecount),
> + error);
Good, I like that check - they should be the same!
> +
> + error = xfs_inobt_update(icur, &irec);
> + if (error)
> + goto error;
> +
> + /*
> + * Update the perag and superblock.
> + */
> + be32_add_cpu(&agi->agi_freecount, -1);
> + xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
> + pag->pagi_freecount--;
> +
> + xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
> + xfs_perag_put(pag);
> +
> + error = xfs_check_agi_freecount(fcur, agi);
> + if (error)
> + goto error;
> + error = xfs_check_agi_freecount(icur, agi);
> + if (error)
> + goto error;
Failures here will result in 2 calls to xfs_perag_put(pag);
> +
> + xfs_btree_del_cursor(icur, XFS_BTREE_NOERROR);
> + xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
> + *inop = ino;
> + return 0;
> +error:
> + xfs_perag_put(pag);
> + xfs_btree_del_cursor(icur, XFS_BTREE_ERROR);
> + xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
> + return error;
> +}
Otherwise it looks good.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
2013-09-03 18:25 ` [RFC PATCH 10/11] xfs: update the finobt on inode free Brian Foster
@ 2013-09-05 2:54 ` Dave Chinner
2013-09-05 16:19 ` Brian Foster
0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 2:54 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
> An inode free operation can have several effects on the finobt. If
> all inodes have been freed and the chunk deallocated, we remove the
> finobt record. If the inode chunk was previously full, we must
> insert a new record based on the existing inobt record. Otherwise,
> we modify the record in place.
>
> Create the xfs_ifree_finobt() function to identify the potential
> scenarios and update the finobt appropriately.
The first thing I'd do is factor all the inobt manipulation
code xfs_difree() into a xfs_difree_inobt() helper function. have it
return the record and offset that is then passed to your new helper
xfs_difree_finobt(). That way xfs_difree() really becomes the setup
function for the two btree operations rather than containing one set
of modifications and calling a function to do the other...
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/xfs/xfs_ialloc.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 120 insertions(+)
>
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index 516f4af..96f71b5 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -198,6 +198,117 @@ xfs_inobt_insert(
> }
>
> /*
> + * Free an inode in the free inode btree.
> + */
> +STATIC int
> +xfs_ifree_finobt(
> + struct xfs_mount *mp,
> + struct xfs_trans *tp,
> + struct xfs_buf *agbp,
> + struct xfs_inobt_rec_incore *ibtrec,/* inobt record */
> + int offset) /* offset of inode */
> +{
> + struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
> + xfs_agnumber_t agno = be32_to_cpu(agi->agi_seqno);
> + struct xfs_btree_cur *cur;
> + struct xfs_inobt_rec_incore rec;
> + int error;
> + int i;
> +
> + if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> + return 0;
There's that vector thing again...
> +
> + cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_FINO);
> +
> + error = xfs_inobt_lookup(cur, ibtrec->ir_startino, XFS_LOOKUP_EQ, &i);
> + if (error)
> + goto error;
> +
> + if (i == 1) {
> + int j;
> + /*
> + * Read and update the existing record.
> + */
> + error = xfs_inobt_get_rec(cur, &rec, &j);
> + if (error)
> + goto error;
> + XFS_WANT_CORRUPTED_GOTO(j == 1, error);
> +
> + rec.ir_free |= XFS_INOBT_MASK(offset);
> + rec.ir_freecount++;
> +
> + XFS_WANT_CORRUPTED_GOTO((rec.ir_free == ibtrec->ir_free) &&
> + (rec.ir_freecount == ibtrec->ir_freecount),
> + error);
> + }
I can't say I'm a great fan of the layout of the logic. Yes, there's
lots of cases to handle. It looks like:
lookup()
if (found)
modify in place
if (found && full && deleting chunks)
delete record
else if (!found && no record)
insert record
else if (found)
update record
else
corruption!
I think it woul dbe better to get then "!found" case out of the way
at the start. ie
if (i == 0) {
if (ibtrec->ir_freecount == 1)
insert record
else
CORRUPTION
goto out;
}
/* found a record, no need to check i == 1 anymore */
ASSERT(i == 1);
/* read and update */
if (full && deleting chunks)
delete record
else
update record
> +
> + /*
> + * The content of inobt records should always match between the inobt
> + * and finobt. The lifecycle of records in the finobt is different from
> + * the inobt in that the finobt only tracks records with at least one
> + * free inode. This is to optimize lookup for inode allocation purposes.
> + * The following checks fix up the finobt appropriately based on the
> + * state of the record subsequent to the current operation.
> + */
> +
> + if ((i == 1) &&
> + (rec.ir_freecount == XFS_IALLOC_INODES(mp) &&
> + !(mp->m_flags & XFS_MOUNT_IKEEP))) {
> + /*
> + * We have an existing finobt record. If all inodes are free
> + * and we're in !ikeep mode, the entire inode chunk has been
> + * deallocated. Remove the record from the finobt.
> + */
> + error = xfs_btree_delete(cur, &i);
> + if (error)
> + goto error;
> + ASSERT(i == 1);
> + } else if ((i == 0) && (ibtrec->ir_freecount == 1)) {
> + /*
> + * No existing finobt record and the inobt record has a single
> + * free inode. This means we've freed an inode in a previously
> + * fully allocated chunk. Insert a new record into the finobt
> + * based on the current inobt record.
> + */
> + cur->bc_rec.i.ir_startino = ibtrec->ir_startino;
> + cur->bc_rec.i.ir_free = ibtrec->ir_free;
> + cur->bc_rec.i.ir_freecount = ibtrec->ir_freecount;
> + error = xfs_btree_insert(cur, &i);
> + if (error)
> + goto error;
> + ASSERT(i == 1);
That's rather similar to the code in xfs_inobt_insert(). Indeed,
is you write a helper - xfs_inobt_insert_rec() - for this, then rather than modifying
xfs_inobt_lookup() to take extra parameters like I wondered for the
previous patch, leave it alonge and pass the parameters to
xfs_inobt_insert_rec() instead.
Then this code is functionally identical to xfs_inobt_insert() done
during allocation....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 11/11] xfs: add finobt support to growfs
2013-09-03 18:25 ` [RFC PATCH 11/11] xfs: add finobt support to growfs Brian Foster
@ 2013-09-05 2:55 ` Dave Chinner
0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-05 2:55 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Sep 03, 2013 at 02:25:08PM -0400, Brian Foster wrote:
> Add finobt support to growfs. Initialize the agi root/level fields
> and the root finobt block.
Looks OK.
Reviewed-by: Dave Chinner <david@fromorbit.com>
growfs is crying out for a table driven initialisation loop,
though...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
2013-09-05 0:54 ` Dave Chinner
@ 2013-09-05 16:17 ` Brian Foster
2013-09-06 0:07 ` Dave Chinner
0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:17 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 09/04/2013 08:54 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:00PM -0400, Brian Foster wrote:
>> Define the AGI fields for the finobt root/level and add magic
>> numbers. Update the btree code to add support for the new
>> XFS_BTNUM_FINOBT inode btree.
>>
>> The finobt root block is reserved immediately following the inobt
>> root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
>> starting AG data block based on whether finobt support is enabled.
>
> A few minor things...
>
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>> fs/xfs/xfs_ag.h | 7 ++++++-
>> fs/xfs/xfs_btree.c | 6 ++++--
>> fs/xfs/xfs_btree.h | 3 +++
>> fs/xfs/xfs_ialloc.c | 2 ++
>> fs/xfs/xfs_ialloc_btree.c | 29 +++++++++++++++++++++++------
>> fs/xfs/xfs_ialloc_btree.h | 14 +++++++++++++-
>> fs/xfs/xfs_log_recover.c | 2 ++
>> fs/xfs/xfs_stats.h | 18 +++++++++++++++++-
>> fs/xfs/xfs_types.h | 2 +-
>> 9 files changed, 71 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
>> index 1cb740a..b85585d 100644
>> --- a/fs/xfs/xfs_ag.h
>> +++ b/fs/xfs/xfs_ag.h
>> @@ -166,6 +166,9 @@ typedef struct xfs_agi {
>> __be32 agi_pad32;
>> __be64 agi_lsn; /* last write sequence */
>>
>> + __be32 agi_free_root; /* root of the free inode btree */
>> + __be32 agi_free_level;/* levels in free inode btree */
>> +
>> /* structure must be padded to 64 bit alignment */
>> } xfs_agi_t;
>
> That's fine, but...
>>
>> @@ -180,7 +183,9 @@ typedef struct xfs_agi {
>> #define XFS_AGI_NEWINO 0x00000100
>> #define XFS_AGI_DIRINO 0x00000200
>> #define XFS_AGI_UNLINKED 0x00000400
>> -#define XFS_AGI_NUM_BITS 11
>> +#define XFS_AGI_FREE_ROOT 0x00000800
>> +#define XFS_AGI_FREE_LEVEL 0x00001000
>> +#define XFS_AGI_NUM_BITS 13
>> #define XFS_AGI_ALL_BITS ((1 << XFS_AGI_NUM_BITS) - 1)
>
> This is a bit of a problem, because the range logging bits will now
> cause the entire AGI to be logged (including all the unlinked list
> hash) because these only define the first/last offsets to be
> logged...
>
Ok, I see what you mean here...
>> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
>> index 524aa88..5ced506 100644
>> --- a/fs/xfs/xfs_ialloc.c
>> +++ b/fs/xfs/xfs_ialloc.c
>> @@ -1505,6 +1505,8 @@ xfs_ialloc_log_agi(
>> offsetof(xfs_agi_t, agi_newino),
>> offsetof(xfs_agi_t, agi_dirino),
>> offsetof(xfs_agi_t, agi_unlinked),
>> + offsetof(xfs_agi_t, agi_free_root),
>> + offsetof(xfs_agi_t, agi_free_level),
>> sizeof(xfs_agi_t)
>> };
>
> Because of how this table works.
>
> What we really need here is for xfs_ialloc_log_agi to consider that
> there are two distinct regions for range logging - the first spaces
> from offset 0 to offset of agi_unlinked, and the second is from the
> the offset of agi_free_root to the end of the xfs_agi_t....
>
> It's abit messy, I know, but we couldn't easily add new padding to
> the AGI in the existing range logging area like was done for the AGF
> because of the unlinked list hash table already defining the end of
> the range logging region....
>
... but where would that ever happen? The existing invocations of
xfs_ialloc_log_agi() seem to log either the agi inode count values or
the btree root/level values (i.e., never the range across both). I think
I've introduced at least a couple new invocations throughout this set,
but I've not changed that model (i.e., an XFS_AGI_FREECOUNT instance in
the new lookup code and an XFS_AGI_FREE_ROOT|*_LEVEL instance in the new
btree code).
My understanding of this code is that the range to log is defined at
invocation time to xfs_iallog_log_agi(), so if the callers never specify
a range that includes the unlinked bits in a single call, we won't set
that range in the buffer log item code. In other words, even if we
ultimately happen to log both ranges in the agi, the lower level code
will never expand the logged region. Therefore, this shouldn't happen
unless a single invocation that specifies one of the
XFS_AGI_FREE_ROOT/LEVEL bits also specifies one of the existing agi bits.
I could see breaking this up as a matter of preparation for future
fields or calls that would introduce logging that kind of range, but at
the same time (and assuming my interpretation of above is correct),
that's a bit of code that serves no purpose for the foreseeable future.
Perhaps a comment in xfs_ialloc_log_agi() and the one caller that uses
the AGI_FREE_* bits to document this restriction is sufficient?
>> #ifdef DEBUG
>> diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
>> index 0cdb88b..7923292 100644
>> --- a/fs/xfs/xfs_ialloc_btree.c
>> +++ b/fs/xfs/xfs_ialloc_btree.c
>> @@ -62,10 +62,18 @@ xfs_inobt_set_root(
>> {
>> struct xfs_buf *agbp = cur->bc_private.a.agbp;
>> struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
>> -
>> - agi->agi_root = nptr->s;
>> - be32_add_cpu(&agi->agi_level, inc);
>> - xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
>> + int fields;
>> +
>> + if (cur->bc_btnum == XFS_BTNUM_INO) {
>> + agi->agi_root = nptr->s;
>> + be32_add_cpu(&agi->agi_level, inc);
>> + fields = XFS_AGI_ROOT | XFS_AGI_LEVEL;
>> + } else {
>> + agi->agi_free_root = nptr->s;
>> + be32_add_cpu(&agi->agi_free_level, inc);
>> + fields = XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
>> + }
>> + xfs_ialloc_log_agi(cur->bc_tp, agbp, fields);
>> }
>
> I suspect that it would be better to have separate functions for
> these differences i.e. xfs_inobt_set_root() and
> xfs_finobt_set_root(), and set up separate btree ops structure
> forthe two different trees. Most of the code is still identical,
> but the differences in root structures can easily be handled without
> putting switches in the code everywhere.
>
Ok, I'm assuming the suggestion is to only create new functions for the
implementations that differ.
>>
>> STATIC int
>> @@ -172,7 +180,10 @@ xfs_inobt_init_ptr_from_cur(
>>
>> ASSERT(cur->bc_private.a.agno == be32_to_cpu(agi->agi_seqno));
>>
>> - ptr->s = agi->agi_root;
>> + if (cur->bc_btnum == XFS_BTNUM_INO)
>> + ptr->s = agi->agi_root;
>> + else
>> + ptr->s = agi->agi_free_root;
>> }
>
> Like this...
>
>>
>> STATIC __int64_t
>> @@ -205,6 +216,7 @@ xfs_inobt_verify(
>> */
>> switch (block->bb_magic) {
>> case cpu_to_be32(XFS_IBT_CRC_MAGIC):
>> + case cpu_to_be32(XFS_FIBT_CRC_MAGIC):
>> if (!xfs_sb_version_hascrc(&mp->m_sb))
>> return false;
>> if (!uuid_equal(&block->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid))
>> @@ -216,6 +228,7 @@ xfs_inobt_verify(
>> return false;
>> /* fall through */
>> case cpu_to_be32(XFS_IBT_MAGIC):
>> + case cpu_to_be32(XFS_FIBT_MAGIC):
>> break;
>> default:
>> return 0;
>> @@ -335,8 +348,12 @@ xfs_inobt_init_cursor(
>>
>> cur->bc_tp = tp;
>> cur->bc_mp = mp;
>> - cur->bc_nlevels = be32_to_cpu(agi->agi_level);
>> cur->bc_btnum = btnum;
>> + if (btnum == XFS_BTNUM_INO)
>> + cur->bc_nlevels = be32_to_cpu(agi->agi_level);
>> + else
>> + cur->bc_nlevels = be32_to_cpu(agi->agi_free_level);
>> +
>> cur->bc_blocklog = mp->m_sb.sb_blocklog;
>>
>> cur->bc_ops = &xfs_inobt_ops;
>
> And this is where you do the check on the btnum and set the
> appropriate ops structure....
>
Ok.
>> #define XFS_INODES_PER_CHUNK (NBBY * sizeof(xfs_inofree_t))
>> @@ -73,7 +75,17 @@ typedef __be32 xfs_inobt_ptr_t;
>> * block numbers in the AG.
>> */
>> #define XFS_IBT_BLOCK(mp) ((xfs_agblock_t)(XFS_CNT_BLOCK(mp) + 1))
>> -#define XFS_PREALLOC_BLOCKS(mp) ((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
>> +#define XFS_FIBT_BLOCK(mp) ((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
>> +
>> +/*
>> + * The first data block of an AG depends on whether the filesystem was formatted
>> + * with the finobt feature. If so, account for the finobt reserved root btree
>> + * block.
>> + */
>> +#define XFS_PREALLOC_BLOCKS(mp) \
>> + (xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
>> + XFS_FIBT_BLOCK(mp) + 1 : \
>> + XFS_IBT_BLOCK(mp) + 1)
>
> Yup, that looks sensible, with a nice comment to explain it :)
>
>> diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
>> index c03ad38..c8f238b 100644
>> --- a/fs/xfs/xfs_stats.h
>> +++ b/fs/xfs/xfs_stats.h
>> @@ -183,7 +183,23 @@ struct xfsstats {
>> __uint32_t xs_ibt_2_alloc;
>> __uint32_t xs_ibt_2_free;
>> __uint32_t xs_ibt_2_moves;
>> -#define XFSSTAT_END_XQMSTAT (XFSSTAT_END_IBT_V2+6)
>> +#define XFSSTAT_END_FIBT_V2 (XFSSTAT_END_IBT_V2+15)
>> + __uint32_t xs_fibt_2_lookup;
>> + __uint32_t xs_fibt_2_compare;
>> + __uint32_t xs_fibt_2_insrec;
>> + __uint32_t xs_fibt_2_delrec;
>> + __uint32_t xs_fibt_2_newroot;
>> + __uint32_t xs_fibt_2_killroot;
>> + __uint32_t xs_fibt_2_increment;
>> + __uint32_t xs_fibt_2_decrement;
>> + __uint32_t xs_fibt_2_lshift;
>> + __uint32_t xs_fibt_2_rshift;
>> + __uint32_t xs_fibt_2_split;
>> + __uint32_t xs_fibt_2_join;
>> + __uint32_t xs_fibt_2_alloc;
>> + __uint32_t xs_fibt_2_free;
>> + __uint32_t xs_fibt_2_moves;
>> +#define XFSSTAT_END_XQMSTAT (XFSSTAT_END_FIBT_V2+6)
>> __uint32_t xs_qm_dqreclaims;
>> __uint32_t xs_qm_dqreclaim_misses;
>> __uint32_t xs_qm_dquot_dups;
>
> I didn't see an equivalent change to add these new stats to the proc
> file output (ie. in xfs_stat_proc_show()). maybe I just missed it,
> but if I didn't, can you add it?
>
Yeah, that's missing. Good catch. I think I added these bits on some
kind of compilation failure or warning and I hadn't gone back to check
that they were used anywhere. :P
Brian
> Cheers,
>
> Dave.
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt
2013-09-05 0:59 ` Dave Chinner
@ 2013-09-05 16:17 ` Brian Foster
2013-09-06 0:11 ` Dave Chinner
0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:17 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 09/04/2013 08:59 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:01PM -0400, Brian Foster wrote:
>> Update inode allocation transaction reservations for the finobt. A
>> create via record modification requires a log reservation for the
>> additional finobt record. Any such allocation could result in an
>> finobt removal if the inode chunk has become fully allocated, thus
>> we include a reservation for a finobt btree merge as well.
>> Allocation of a new inode chunk must account for splits in the
>> finobt as well as the existing ialloc tree.
>
> These transaction reservation changes are only necessary for
> filesystems with free inode btrees, otherwise they just use more log
> space than is necessary.
>
Yeah, good point.
> Can you add helper functions for the free inode btree reservations,
> and have them return 0 when the feature is not enabled? That way the
> code stays pretty clean, is self documenting and doesn't take
> unnecessary space when the feature is not enabled....
>
So not new functions that duplicate the entire calculations for the
finobt enabled cases, but new functions that define the additional
requirements for the finobt on top of the existing reservation
calculations for particular operations (i.e., similar to the recent
patch to fix the inode log size, if I recall). Sounds good.
Brian
>> Also update XFS_IALLOC_SPACE_RES() to reserve data blocks for
>> finobt split/merge scenarios.
>
> Needs to handle the enabled/disabled case, too.
>
> Cheers,
>
> Dave.
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive()
2013-09-05 1:35 ` Dave Chinner
@ 2013-09-05 16:18 ` Brian Foster
0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 09/04/2013 09:35 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:03PM -0400, Brian Foster wrote:
>> The transaction allocated in xfs_inactive() can be passed down into
>> xfs_inactive_symlink() or xfs_itruncate_extents(), both of which
>> can commit and reallocate a new transaction. This leads to
>> reservation issues if the transaction is subsequently passed into
>> xfs_ifree(), which requires a larger reservation to manage the
>> finobt.
>>
>> Reorganize xfs_inactive() to commit any transaction handed back
>> from symlink or truncate processing and unconditionally allocate
>> a new transaction for xfs_ifree() with the appropriate reservation.
>
> Ok, I've had a bit of a look at this now, and I like how the code
> turns out. However, I don't think it goes far enough, or fix the
> problem that causes all the transaction nastiness in xfs_inactive().
>
> Firstly, we are not doing rolling transactions here - there is no
> need for all the changes to be atomic because the inode is on the
> unlinked list if it is going to be freed. Hence we don't need to
> pass transaction pointers around.
>
I was wondering about this when I was passing through the code. If I
recall, I ended up just trying to work around the lower level calls as
opposed to messing with them because xfs_truncate_extents() had other
callers (but the discussion below addresses that).
> xfs_inactive_symlink() can do a transaction completely internally,
> and, well, it doesn't even log the inode if the symlink is in-line
> and so may not even need a transaction. Hence really only
> xfs_inactive_symlink_rmt() needs to run a transaction, and it can do
> that internally just fine.
>
Ok.
> For the xfs_itruncate_extents() data fork transaction, just add a
> new wrapper called xfs_inactive_truncate() that holds the
> transaction context internally - that moves the only other
> transaction context that you need to commit out of xfs_inactive()
> altogether, as the attr fork already uses a private transaction
> context.
>
Sounds good.
> And, finally, you can then factor the xfs_ifree() and it's
> transaction context into a helper function as well, so there aren't
> any transaction contexts left in xfs_inactive() at all.
>
> That would leave us with:
>
> if (ISLNK) {
> error = xfs_inactive_symlink(ip);
> } else if (truncate)
> error = xfs_inactive_truncate(ip);
> }
> if (error)
> goto out;
> if (ip->i_d.di_anextents > 0)
> error = xfs_attr_inactive(ip);
> if (error)
> goto out;
>
> error = xfs_inactive_ifree(ip);
>
> xfs_qm_dqdetach(ip);
> out:
> return;
>
> This gives us a natural separation of the different transaction
> reservations and contexts needed to perform the operations, and does
> result in any extraneous work being done because we don't know what
> the transaction context passed to us contains at all...
>
Yeah, I agree. That cleans things up nicely.
> FWIW, there are other reasons for suggesting this structure - have a
> read of "[RFD 14/17] xfs: separate inode freeing from inactivation"
> and you'll see that what I've suggested above sets the code up for
> implementing the optimisations documented in the RFD.
>
> http://oss.sgi.com/archives/xfs/2013-08/msg00345.html
>
> It might be best to put this as 3-4 patches at the start of the
> series, rather than in the middle of it as it's really a separate
> piece of cleanup work....
>
Ok, I've skimmed through most of the writeups in that set. If it turns
into a handful of patches, I might just test and post this as an
independent, prerequisite set for personal ease of management reasons if
nothing else. I'm more confident in the changes with the review feedback
and that I obviously now know what the finobt requirements are. Thanks.
Brian
> Cheers,
>
> Dave.
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC in xfs_inactive()
2013-09-05 1:40 ` Dave Chinner
@ 2013-09-05 16:18 ` Brian Foster
2013-09-06 0:17 ` Dave Chinner
0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 09/04/2013 09:40 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:04PM -0400, Brian Foster wrote:
>> An ifree data block reservation can fail with ENOSPC. Flush inodes
>> to try and free up space or attempt without a data block
>> reservation to avoid failing out of xfs_inactive().
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>> fs/xfs/xfs_inode.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 56cbf63..92de4b7 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -1779,7 +1779,18 @@ xfs_inactive(
>> tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
>> error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
>> XFS_IFREE_SPACE_RES(mp), 0);
>> + if (error == ENOSPC) {
>> + /* flush outstanding delalloc blocks and retry */
>> + xfs_flush_inodes(mp);
>> + error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
>> + XFS_IFREE_SPACE_RES(mp), 0);
>> + }
>
> We don't want to be blocking for inode flushes here. We might be in
> a shrinker context, for example, and blocking those for a filesystem
> sync is going to be unfriendly.
>
Ok.
> If this really is a problem, then the right thing to do is to allow
> this transaction to dip into the reserve block pool so the
> transaction can complete and make progress - other write operations
> will trigger the flushing of the filesystem, and freeing of whole
> inode chunks should return more free space than we need for the
> finobt modifications in the removing lots of zero length inodes
> at ENOSPC case....
>
I did have one of the enospc xfstests lead to this situation, though I
don't have the particular test in my notes. It initially manifested as
an assert failure due to the fs not being shutdown after an
xfs_trans_reserve() ENOSPC failure. Subsequent to avoiding that, I
believe there were inconsistent fs issues called out due to the unlinked
lists being populated after umount.
Taking a further look, I missed the XFS_TRANS_RESERVE flag and whole
m_resblks mechanism. I'll take a closer look at that and see if that
works to resolve the problem instead of the flush.
Brian
> Cheers,
>
> Dave.
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation
2013-09-05 2:27 ` Dave Chinner
@ 2013-09-05 16:18 ` Brian Foster
0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 09/04/2013 10:27 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:06PM -0400, Brian Foster wrote:
>> Replace xfs_dialloc_ag() with an implementation that looks for a
>> record in the finobt. The finobt only tracks records with at least
>> one free inode. This eliminates the need for the intra-ag scan in
>> the original algorithm. Once the inode is allocated, update the
>> finobt appropriately (possibly removing the record) as well as the
>> inobt.
>>
>> Move the original xfs_dialloc_ag() algorithm to
>> xfs_dialloc_ag_slow() and fall back as such if finobt support is
>> not enabled.
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>> fs/xfs/xfs_ialloc.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 135 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
>> index e64a728..516f4af 100644
>> --- a/fs/xfs/xfs_ialloc.c
>> +++ b/fs/xfs/xfs_ialloc.c
>> @@ -708,7 +708,7 @@ xfs_ialloc_get_rec(
>> * available.
>> */
...
>
> Why do we need to initialise both cursors at once? We only do the
> operations one at a time, and you should actually use 2 cursors
> for the finobt lookup.....
>
No good reason. I probably just did that to simplify error handling.
>> +
>> + /*
>> + * Search the finobt.
>> + */
>> + error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_LE, &i);
>> + if (error)
>> + goto error;
>> + if (i == 0) {
>> + error = xfs_inobt_lookup(fcur, pagino, XFS_LOOKUP_GE, &i);
>> + if (error)
>> + goto error;
>> + XFS_WANT_CORRUPTED_GOTO(i == 1, error);
>> + }
>
> .... because this biases allocation to lower inode numbers than the
> target. i.e we only ever search for higher numbers if here are none
> lower. That's quite different to the current algorithm which first
> searches for the *closest* free inode.
>
> That is, we should be using two cursors for the free inode search,
> one for LE, the other for GT. If they both return records then, like
> the "slow" algorithm, calculate the diff between them and the target
> inode, and select the closer one (smallest diff). Destroy the cursor
> you don't need.
>
Ah, Ok. I hadn't taken a close enough look at the existing algorithm
yet, to be honest. I'll do so and incorporate the closest free inode
heuristic.
...
>> + xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
>> + xfs_perag_put(pag);
>> +
>> + error = xfs_check_agi_freecount(fcur, agi);
>> + if (error)
>> + goto error;
>> + error = xfs_check_agi_freecount(icur, agi);
>> + if (error)
>> + goto error;
>
> Failures here will result in 2 calls to xfs_perag_put(pag);
>
Yeah, thanks.
Brian
>> +
>> + xfs_btree_del_cursor(icur, XFS_BTREE_NOERROR);
>> + xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
>> + *inop = ino;
>> + return 0;
>> +error:
>> + xfs_perag_put(pag);
>> + xfs_btree_del_cursor(icur, XFS_BTREE_ERROR);
>> + xfs_btree_del_cursor(fcur, XFS_BTREE_ERROR);
>> + return error;
>> +}
>
> Otherwise it looks good.
>
> Cheers,
>
> Dave.
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
2013-09-05 2:54 ` Dave Chinner
@ 2013-09-05 16:19 ` Brian Foster
2013-09-06 0:28 ` Dave Chinner
0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-05 16:19 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 09/04/2013 10:54 PM, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
>> An inode free operation can have several effects on the finobt. If
>> all inodes have been freed and the chunk deallocated, we remove the
>> finobt record. If the inode chunk was previously full, we must
>> insert a new record based on the existing inobt record. Otherwise,
>> we modify the record in place.
>>
>> Create the xfs_ifree_finobt() function to identify the potential
>> scenarios and update the finobt appropriately.
>
> The first thing I'd do is factor all the inobt manipulation
> code xfs_difree() into a xfs_difree_inobt() helper function. have it
> return the record and offset that is then passed to your new helper
> xfs_difree_finobt(). That way xfs_difree() really becomes the setup
> function for the two btree operations rather than containing one set
> of modifications and calling a function to do the other...
>
Sounds logical.
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>> fs/xfs/xfs_ialloc.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 120 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
>> index 516f4af..96f71b5 100644
>> --- a/fs/xfs/xfs_ialloc.c
>> +++ b/fs/xfs/xfs_ialloc.c
>> @@ -198,6 +198,117 @@ xfs_inobt_insert(
>> }
>>
>> /*
>> + * Free an inode in the free inode btree.
>> + */
>> +STATIC int
>> +xfs_ifree_finobt(
...
>
> I can't say I'm a great fan of the layout of the logic. Yes, there's
> lots of cases to handle. It looks like:
>
Yeah, I've shuffled this code around quite a bit myself.
> lookup()
> if (found)
> modify in place
> if (found && full && deleting chunks)
> delete record
> else if (!found && no record)
> insert record
> else if (found)
> update record
> else
> corruption!
>
> I think it woul dbe better to get then "!found" case out of the way
> at the start. ie
>
> if (i == 0) {
> if (ibtrec->ir_freecount == 1)
> insert record
> else
> CORRUPTION
> goto out;
> }
>
> /* found a record, no need to check i == 1 anymore */
> ASSERT(i == 1);
>
> /* read and update */
>
> if (full && deleting chunks)
> delete record
> else
> update record
>
Ok, I'll try to pull that logic up and see what falls out.
...
>> + } else if ((i == 0) && (ibtrec->ir_freecount == 1)) {
>> + /*
>> + * No existing finobt record and the inobt record has a single
>> + * free inode. This means we've freed an inode in a previously
>> + * fully allocated chunk. Insert a new record into the finobt
>> + * based on the current inobt record.
>> + */
>> + cur->bc_rec.i.ir_startino = ibtrec->ir_startino;
>> + cur->bc_rec.i.ir_free = ibtrec->ir_free;
>> + cur->bc_rec.i.ir_freecount = ibtrec->ir_freecount;
>> + error = xfs_btree_insert(cur, &i);
>> + if (error)
>> + goto error;
>> + ASSERT(i == 1);
>
> That's rather similar to the code in xfs_inobt_insert(). Indeed,
> is you write a helper - xfs_inobt_insert_rec() - for this, then rather than modifying
> xfs_inobt_lookup() to take extra parameters like I wondered for the
> previous patch, leave it alonge and pass the parameters to
> xfs_inobt_insert_rec() instead.
>
> Then this code is functionally identical to xfs_inobt_insert() done
> during allocation....
>
I think I'm parsing you after having another look at the code.
xfs_inobt_lookup() remains as is and is potentially used from
xfs_inobt_insert(). xfs_inobt_insert_rec() is introduced to set the
cursor fields and do the insert and is used here and from
xfs_inobt_insert().
At that point, this looks close to xfs_inobt_insert(), but I think using
that here would introduce a duplicate lookup. Regardless, we'll see what
the whole thing looks like at that point. Thanks for the reviews. :)
Brian
> Cheers,
>
> Dave.
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 00/11] xfs: introduce the free inode btree
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
` (10 preceding siblings ...)
2013-09-03 18:25 ` [RFC PATCH 11/11] xfs: add finobt support to growfs Brian Foster
@ 2013-09-05 21:17 ` Michael L. Semon
2013-09-06 11:17 ` Brian Foster
2013-09-06 21:35 ` Dave Chinner
11 siblings, 2 replies; 46+ messages in thread
From: Michael L. Semon @ 2013-09-05 21:17 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On 09/03/2013 02:24 PM, Brian Foster wrote:
> Hi all,
>
> This is an RFC for the kernel work to support a free inode btree. The free inode
> btree (finobt) is equivalent to the existing inode allocation btree with the
> exception that the finobt only tracks inode chunks with at least one free inode.
> The purpose is to improve lookups for free inode clusters for inode allocation.
>
> Newly allocated inode chunks by definition contain free inodes and are thus
> inserted into the finobt immediately. The record for a previously full inode
> chunk is inserted to the finobt when the first inode is freed. A record is
> removed from the finobt when the last free inode has been allocated or the
> entire chunk is completely deallocated.
>
> Patches 1-3 refactor some ialloc btree code to introduce the new finobt type and
> feature bit. Patches 4-7 fix up the transaction handling for inode allocation
> and deallocation to account for the new tree. Patches 8-10 add the finobt
> management code to insert, remove and modify records as appropriate. Patch 11
> fixes growfs to support the finobt.
>
> Thoughts, reviews, flames appreciated.
I'm looking for Dave's judgement call on whether I should run this code
full-time. The patchset applied well on top of Dave's latest work--only
a "trailing whitespace" warning on Patch #9 (I think)--and the code
compiled without error. There was a lockdep while running xfstests,
before generic/013 (I think), so I switched back to my normal git branch
and have your patchset in a separate branch.
My setup here is slow--x86, old IDE hardware, write cache off, debug
kernel--but the patchset made things seem a little slower. At the
right time--not necessarily now--performance numbers might be nice.
I didn't time it but did a copy of 3 kernel gits to v5 1k-block-size
XFS and just felt something was off. The copy did complete, though.
Will try timing this on another day.
Anyway, good work so far! No additional stack traces were caused by
your code in limited testing, and the filesystems were are still
intact.
Thanks!
Michael
[lockdep from xfstests generic/0-ten-something follows:]
[ 763.993429] XFS (sdb4): Mounting Filesystem
[ 764.258701] XFS (sdb4): Ending clean mount
[ 768.798390] XFS (sdb5): Mounting Filesystem
[ 769.061280] XFS (sdb5): Ending clean mount
[ 770.030277] XFS (sdb4): Mounting Filesystem
[ 770.313502] XFS (sdb4): Ending clean mount
[ 788.932588] XFS (sdb4): Mounting Filesystem
[ 789.256815] XFS (sdb4): Ending clean mount
[ 792.639933] XFS (sdb4): Mounting Filesystem
[ 792.965477] XFS (sdb4): Ending clean mount
[ 795.166220] XFS (sdb4): Mounting Filesystem
[ 795.507372] XFS (sdb4): Ending clean mount
[ 802.870263] XFS (sdb4): Mounting Filesystem
[ 803.516422] XFS (sdb4): Ending clean mount
[ 814.376620] XFS (sdb4): Mounting Filesystem
[ 815.050778] XFS (sdb4): Ending clean mount
[ 823.169368]
[ 823.170932] ======================================================
[ 823.172146] [ INFO: possible circular locking dependency detected ]
[ 823.172146] 3.11.0+ #5 Not tainted
[ 823.172146] -------------------------------------------------------
[ 823.172146] dirstress/5276 is trying to acquire lock:
[ 823.172146] (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
[ 823.172146]
[ 823.172146] but task is already holding lock:
[ 823.172146] (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
[ 823.172146]
[ 823.172146] which lock already depends on the new lock.
[ 823.172146]
[ 823.172146]
[ 823.172146] the existing dependency chain (in reverse order) is:
[ 823.172146]
[ 823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}:
[ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
[ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
[ 823.172146] [<c14bff98>] _raw_spin_lock+0x47/0x74
[ 823.172146] [<c1116247>] __mark_inode_dirty+0x171/0x38c
[ 823.172146] [<c111acab>] __set_page_dirty+0x5f/0x95
[ 823.172146] [<c111b93e>] mark_buffer_dirty+0x58/0x12b
[ 823.172146] [<c111baff>] __block_commit_write.isra.17+0x64/0x82
[ 823.172146] [<c111c197>] block_write_end+0x2b/0x53
[ 823.172146] [<c111c201>] generic_write_end+0x42/0x9a
[ 823.172146] [<c11a42d5>] xfs_vm_write_end+0x60/0xbe
[ 823.172146] [<c10bd47a>] generic_file_buffered_write+0x140/0x20f
[ 823.172146] [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205
[ 823.172146] [<c11b24ee>] xfs_file_aio_write+0xad/0xec
[ 823.172146] [<c10f0c5f>] do_sync_write+0x60/0x87
[ 823.172146] [<c10f0e0f>] vfs_write+0x9c/0x189
[ 823.172146] [<c10f0fc6>] SyS_write+0x49/0x81
[ 823.172146] [<c14c14bb>] sysenter_do_call+0x12/0x32
[ 823.172146]
[ 823.172146] -> #0 (sb_internal){.+.+.+}:
[ 823.172146] [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
[ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
[ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
[ 823.172146] [<c10f36eb>] __sb_start_write+0xad/0x177
[ 823.172146] [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
[ 823.172146] [<c120a823>] xfs_inactive+0x129/0x4a3
[ 823.172146] [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
[ 823.172146] [<c1106678>] evict+0x8e/0x15d
[ 823.172146] [<c1107126>] iput+0xc4/0x138
[ 823.172146] [<c1103504>] dput+0x1b2/0x257
[ 823.172146] [<c10f1a30>] __fput+0x140/0x1eb
[ 823.172146] [<c10f1b0f>] ____fput+0xd/0xf
[ 823.172146] [<c1048477>] task_work_run+0x67/0x90
[ 823.172146] [<c1001ea5>] do_notify_resume+0x61/0x63
[ 823.172146] [<c14c0cfa>] work_notifysig+0x1f/0x25
[ 823.172146]
[ 823.172146] other info that might help us debug this:
[ 823.172146]
[ 823.172146] Possible unsafe locking scenario:
[ 823.172146]
[ 823.172146] CPU0 CPU1
[ 823.172146] ---- ----
[ 823.172146] lock(&(&ip->i_lock)->mr_lock);
[ 823.172146] lock(sb_internal);
[ 823.172146] lock(&(&ip->i_lock)->mr_lock);
[ 823.172146] lock(sb_internal);
[ 823.172146]
[ 823.172146] *** DEADLOCK ***
[ 823.172146]
[ 823.172146] 1 lock held by dirstress/5276:
[ 823.172146] #0: (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
[ 823.172146]
[ 823.172146] stack backtrace:
[ 823.172146] CPU: 0 PID: 5276 Comm: dirstress Not tainted 3.11.0+ #5
[ 823.172146] Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
[ 823.172146] c1c26060 c1c26060 da34fd58 c14ba216 da34fd78 c14b7317 c15f171b da34fdb4
[ 823.172146] dcaa1440 00000001 dcaa18b0 00000000 da34fde4 c106e972 dcaa1888 00000001
[ 823.172146] da34fdb4 c1057e0f 00000000 00003f61 c1c28660 00000000 dcaa1888 dcaa18b0
[ 823.172146] Call Trace:
[ 823.172146] [<c14ba216>] dump_stack+0x16/0x18
[ 823.172146] [<c14b7317>] print_circular_bug+0x1b8/0x1c2
[ 823.172146] [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
[ 823.172146] [<c1057e0f>] ? sched_clock_local.constprop.3+0x39/0x131
[ 823.172146] [<c1057fd4>] ? sched_clock_cpu+0x8f/0xe2
[ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
[ 823.172146] [<c1070a36>] ? __lock_acquire+0x36a/0xa11
[ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
[ 823.172146] [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
[ 823.172146] [<c10f36eb>] __sb_start_write+0xad/0x177
[ 823.172146] [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
[ 823.172146] [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
[ 823.172146] [<c1206cfb>] ? xfs_ilock+0x100/0x1f1
[ 823.172146] [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
[ 823.172146] [<c120a823>] xfs_inactive+0x129/0x4a3
[ 823.172146] [<c106f21f>] ? trace_hardirqs_on+0xb/0xd
[ 823.172146] [<c14c01e5>] ? _raw_spin_unlock_irq+0x27/0x36
[ 823.172146] [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
[ 823.172146] [<c1106678>] evict+0x8e/0x15d
[ 823.172146] [<c1107126>] iput+0xc4/0x138
[ 823.172146] [<c1103504>] dput+0x1b2/0x257
[ 823.172146] [<c10f1a30>] __fput+0x140/0x1eb
[ 823.172146] [<c10f1b0f>] ____fput+0xd/0xf
[ 823.172146] [<c1048477>] task_work_run+0x67/0x90
[ 823.172146] [<c1001ea5>] do_notify_resume+0x61/0x63
[ 823.172146] [<c14c0cfa>] work_notifysig+0x1f/0x25
[ 824.015366] Clocksource tsc unstable (delta = 486645129 ns)
[ 825.324019] XFS (sdb4): Mounting Filesystem
[ 825.743317] XFS (sdb4): Ending clean mount
[ 827.223193] XFS (sdb4): Mounting Filesystem
[ 827.668493] XFS (sdb4): Ending clean mount
[ 837.524673] XFS (sdb4): Mounting Filesystem
[ 837.986097] XFS (sdb4): Ending clean mount
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
2013-09-05 16:17 ` Brian Foster
@ 2013-09-06 0:07 ` Dave Chinner
2013-09-06 11:25 ` Brian Foster
0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-06 0:07 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Thu, Sep 05, 2013 at 12:17:04PM -0400, Brian Foster wrote:
> On 09/04/2013 08:54 PM, Dave Chinner wrote:
> > On Tue, Sep 03, 2013 at 02:25:00PM -0400, Brian Foster wrote:
> >> Define the AGI fields for the finobt root/level and add magic
> >> numbers. Update the btree code to add support for the new
> >> XFS_BTNUM_FINOBT inode btree.
.....
> >> @@ -180,7 +183,9 @@ typedef struct xfs_agi {
> >> #define XFS_AGI_NEWINO 0x00000100
> >> #define XFS_AGI_DIRINO 0x00000200
> >> #define XFS_AGI_UNLINKED 0x00000400
> >> -#define XFS_AGI_NUM_BITS 11
> >> +#define XFS_AGI_FREE_ROOT 0x00000800
> >> +#define XFS_AGI_FREE_LEVEL 0x00001000
> >> +#define XFS_AGI_NUM_BITS 13
> >> #define XFS_AGI_ALL_BITS ((1 << XFS_AGI_NUM_BITS) - 1)
> >
> > This is a bit of a problem, because the range logging bits will now
> > cause the entire AGI to be logged (including all the unlinked list
> > hash) because these only define the first/last offsets to be
> > logged...
> >
>
> Ok, I see what you mean here...
>
> >> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> >> index 524aa88..5ced506 100644
> >> --- a/fs/xfs/xfs_ialloc.c
> >> +++ b/fs/xfs/xfs_ialloc.c
> >> @@ -1505,6 +1505,8 @@ xfs_ialloc_log_agi(
> >> offsetof(xfs_agi_t, agi_newino),
> >> offsetof(xfs_agi_t, agi_dirino),
> >> offsetof(xfs_agi_t, agi_unlinked),
> >> + offsetof(xfs_agi_t, agi_free_root),
> >> + offsetof(xfs_agi_t, agi_free_level),
> >> sizeof(xfs_agi_t)
> >> };
> >
> > Because of how this table works.
> >
> > What we really need here is for xfs_ialloc_log_agi to consider that
> > there are two distinct regions for range logging - the first spaces
> > from offset 0 to offset of agi_unlinked, and the second is from the
> > the offset of agi_free_root to the end of the xfs_agi_t....
> >
> > It's abit messy, I know, but we couldn't easily add new padding to
> > the AGI in the existing range logging area like was done for the AGF
> > because of the unlinked list hash table already defining the end of
> > the range logging region....
> >
>
> ... but where would that ever happen? The existing invocations of
> xfs_ialloc_log_agi() seem to log either the agi inode count values or
> the btree root/level values (i.e., never the range across both). I think
> I've introduced at least a couple new invocations throughout this set,
> but I've not changed that model (i.e., an XFS_AGI_FREECOUNT instance in
> the new lookup code and an XFS_AGI_FREE_ROOT|*_LEVEL instance in the new
> btree code).
Right, we don't current log across the range because of the way the
code is currently written, but there's no rule that says that
logging fields must be done this way.
I can see that there may be reason for logging
XFS_AGI_FREE_ROOT|*_LEVEL and XFS_AGI_NEW_INODE all in one go -
pointing new inode allocation at recently freed inodes is not
unreasonable, and if we split the finobt and update agi_newino in
the one update, we will log across this gap.
> My understanding of this code is that the range to log is defined at
> invocation time to xfs_iallog_log_agi(), so if the callers never specify
> a range that includes the unlinked bits in a single call, we won't set
> that range in the buffer log item code. In other words, even if we
> ultimately happen to log both ranges in the agi, the lower level code
> will never expand the logged region. Therefore, this shouldn't happen
> unless a single invocation that specifies one of the
> XFS_AGI_FREE_ROOT/LEVEL bits also specifies one of the existing agi bits.
Yes, we can avoid that by logging te regions separately, but that
then puts the onus on the future developers and reviewers to
remember this landmine and avoid it.
> I could see breaking this up as a matter of preparation for future
> fields or calls that would introduce logging that kind of range, but at
> the same time (and assuming my interpretation of above is correct),
> that's a bit of code that serves no purpose for the foreseeable future.
> Perhaps a comment in xfs_ialloc_log_agi() and the one caller that uses
> the AGI_FREE_* bits to document this restriction is sufficient?
Given it is only a few lines of extra code in xfs_ialloc_log_agi(),
I'd prefer that the code documents and deals with the different
regions internally. That way we can forget about it completely for the
future and never have to worry about it again.
Keep in mind that I'm looking at this from a code maintenance
timescale of at least 5-10 years into the future here - a few
minutes extra fixing this right now could save us a lot of hassle
years down the track. ;)
> >> #ifdef DEBUG
> >> diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
> >> index 0cdb88b..7923292 100644
> >> --- a/fs/xfs/xfs_ialloc_btree.c
> >> +++ b/fs/xfs/xfs_ialloc_btree.c
> >> @@ -62,10 +62,18 @@ xfs_inobt_set_root(
> >> {
> >> struct xfs_buf *agbp = cur->bc_private.a.agbp;
> >> struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
> >> -
> >> - agi->agi_root = nptr->s;
> >> - be32_add_cpu(&agi->agi_level, inc);
> >> - xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
> >> + int fields;
> >> +
> >> + if (cur->bc_btnum == XFS_BTNUM_INO) {
> >> + agi->agi_root = nptr->s;
> >> + be32_add_cpu(&agi->agi_level, inc);
> >> + fields = XFS_AGI_ROOT | XFS_AGI_LEVEL;
> >> + } else {
> >> + agi->agi_free_root = nptr->s;
> >> + be32_add_cpu(&agi->agi_free_level, inc);
> >> + fields = XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
> >> + }
> >> + xfs_ialloc_log_agi(cur->bc_tp, agbp, fields);
> >> }
> >
> > I suspect that it would be better to have separate functions for
> > these differences i.e. xfs_inobt_set_root() and
> > xfs_finobt_set_root(), and set up separate btree ops structure
> > forthe two different trees. Most of the code is still identical,
> > but the differences in root structures can easily be handled without
> > putting switches in the code everywhere.
> >
>
> Ok, I'm assuming the suggestion is to only create new functions for the
> implementations that differ.
Exactly.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt
2013-09-05 16:17 ` Brian Foster
@ 2013-09-06 0:11 ` Dave Chinner
0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-06 0:11 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Thu, Sep 05, 2013 at 12:17:23PM -0400, Brian Foster wrote:
> On 09/04/2013 08:59 PM, Dave Chinner wrote:
> > On Tue, Sep 03, 2013 at 02:25:01PM -0400, Brian Foster wrote:
> >> Update inode allocation transaction reservations for the finobt. A
> >> create via record modification requires a log reservation for the
> >> additional finobt record. Any such allocation could result in an
> >> finobt removal if the inode chunk has become fully allocated, thus
> >> we include a reservation for a finobt btree merge as well.
> >> Allocation of a new inode chunk must account for splits in the
> >> finobt as well as the existing ialloc tree.
> >
> > These transaction reservation changes are only necessary for
> > filesystems with free inode btrees, otherwise they just use more log
> > space than is necessary.
> >
>
> Yeah, good point.
>
> > Can you add helper functions for the free inode btree reservations,
> > and have them return 0 when the feature is not enabled? That way the
> > code stays pretty clean, is self documenting and doesn't take
> > unnecessary space when the feature is not enabled....
> >
>
> So not new functions that duplicate the entire calculations for the
> finobt enabled cases, but new functions that define the additional
> requirements for the finobt on top of the existing reservation
> calculations for particular operations (i.e., similar to the recent
> patch to fix the inode log size, if I recall). Sounds good.
That's exactly what I'm thinking - just like the
xfs_calc_inode_res() helper. This really helps document the
transaction reservation calculations - the comments help, but they
are no substitute for being able to say "*that line* is what
reserves space for the inodes modified in the transaction"...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC in xfs_inactive()
2013-09-05 16:18 ` Brian Foster
@ 2013-09-06 0:17 ` Dave Chinner
2013-09-06 11:30 ` Brian Foster
0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-06 0:17 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Thu, Sep 05, 2013 at 12:18:31PM -0400, Brian Foster wrote:
> On 09/04/2013 09:40 PM, Dave Chinner wrote:
> > On Tue, Sep 03, 2013 at 02:25:04PM -0400, Brian Foster wrote:
> >> An ifree data block reservation can fail with ENOSPC. Flush inodes
> >> to try and free up space or attempt without a data block
> >> reservation to avoid failing out of xfs_inactive().
> >>
> >> Signed-off-by: Brian Foster <bfoster@redhat.com>
> >> ---
> >> fs/xfs/xfs_inode.c | 11 +++++++++++
> >> 1 file changed, 11 insertions(+)
> >>
> >> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> >> index 56cbf63..92de4b7 100644
> >> --- a/fs/xfs/xfs_inode.c
> >> +++ b/fs/xfs/xfs_inode.c
> >> @@ -1779,7 +1779,18 @@ xfs_inactive(
> >> tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
> >> error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> >> XFS_IFREE_SPACE_RES(mp), 0);
> >> + if (error == ENOSPC) {
> >> + /* flush outstanding delalloc blocks and retry */
> >> + xfs_flush_inodes(mp);
> >> + error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> >> + XFS_IFREE_SPACE_RES(mp), 0);
> >> + }
> >
> > We don't want to be blocking for inode flushes here. We might be in
> > a shrinker context, for example, and blocking those for a filesystem
> > sync is going to be unfriendly.
> >
>
> Ok.
>
> > If this really is a problem, then the right thing to do is to allow
> > this transaction to dip into the reserve block pool so the
> > transaction can complete and make progress - other write operations
> > will trigger the flushing of the filesystem, and freeing of whole
> > inode chunks should return more free space than we need for the
> > finobt modifications in the removing lots of zero length inodes
> > at ENOSPC case....
> >
>
> I did have one of the enospc xfstests lead to this situation, though I
> don't have the particular test in my notes. It initially manifested as
> an assert failure due to the fs not being shutdown after an
> xfs_trans_reserve() ENOSPC failure.
Ok. I can see how ENOSPC might occur here :)
> Subsequent to avoiding that, I
> believe there were inconsistent fs issues called out due to the unlinked
> lists being populated after umount.
That sounds like a recovery failure, not so much an ENOSPC failure.
i.e. that recovery only looks at the log to see if it's clean, and
only recovers unlinked lists if it's dirty. There is the
*possibility* of having a clean log with inodes on the unlinked
list, and log recovery doesn't run the unlinked list processing in
that case.
This is one of the issues we'll need to fix for O_TMPFILE support
as it will actively use inodes on unlinked list for potentially long
periods of time.
> Taking a further look, I missed the XFS_TRANS_RESERVE flag and whole
> m_resblks mechanism. I'll take a closer look at that and see if that
> works to resolve the problem instead of the flush.
It should - the only time it won't is if we exhaust the pool, but
that doesn't happen in normal ENOSPC situations and any blocks we do
end up freeing will immediately refill the reserve pool...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
2013-09-05 16:19 ` Brian Foster
@ 2013-09-06 0:28 ` Dave Chinner
2013-09-06 11:39 ` Brian Foster
0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-06 0:28 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
> On 09/04/2013 10:54 PM, Dave Chinner wrote:
> > On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
> >> An inode free operation can have several effects on the finobt. If
> >> all inodes have been freed and the chunk deallocated, we remove the
> >> finobt record. If the inode chunk was previously full, we must
> >> insert a new record based on the existing inobt record. Otherwise,
> >> we modify the record in place.
.....
> >> + } else if ((i == 0) && (ibtrec->ir_freecount == 1)) {
> >> + /*
> >> + * No existing finobt record and the inobt record has a single
> >> + * free inode. This means we've freed an inode in a previously
> >> + * fully allocated chunk. Insert a new record into the finobt
> >> + * based on the current inobt record.
> >> + */
> >> + cur->bc_rec.i.ir_startino = ibtrec->ir_startino;
> >> + cur->bc_rec.i.ir_free = ibtrec->ir_free;
> >> + cur->bc_rec.i.ir_freecount = ibtrec->ir_freecount;
> >> + error = xfs_btree_insert(cur, &i);
> >> + if (error)
> >> + goto error;
> >> + ASSERT(i == 1);
> >
> > That's rather similar to the code in xfs_inobt_insert(). Indeed,
> > is you write a helper - xfs_inobt_insert_rec() - for this, then rather than modifying
> > xfs_inobt_lookup() to take extra parameters like I wondered for the
> > previous patch, leave it alonge and pass the parameters to
> > xfs_inobt_insert_rec() instead.
> >
> > Then this code is functionally identical to xfs_inobt_insert() done
> > during allocation....
> >
>
> I think I'm parsing you after having another look at the code.
> xfs_inobt_lookup() remains as is and is potentially used from
> xfs_inobt_insert(). xfs_inobt_insert_rec() is introduced to set the
> cursor fields and do the insert and is used here and from
> xfs_inobt_insert().
Effectively. xfs_inobt_insert() becomes:
for (each inode chunk) {
xfs_inobt_lookup(cur, startino)
xfs_inobt_insert_rec(cur, startino, free, free_count)
}
And this code becomes:
xfs_inobt_lookup(cur, startino);
if (!found) {
if (free_count == 1)
xfs_inobt_insert_rec(cur, startino, free, free_count)
else
CORRUPTION
goto out;
}
> At that point, this looks close to xfs_inobt_insert(), but I think using
> that here would introduce a duplicate lookup.
Yes, it would. I think just using helpers like this is sufficient
for the two different cases, especially as xfs_inobt_insert() needs
to be able to handle multiple chunk insertion and we don't have that
here...
> Regardless, we'll see what
> the whole thing looks like at that point. Thanks for the reviews. :)
No worries. BTW, can you post your rudimentary userspace support so
we can run tests that use this code, too?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 00/11] xfs: introduce the free inode btree
2013-09-05 21:17 ` [RFC PATCH 00/11] xfs: introduce the free inode btree Michael L. Semon
@ 2013-09-06 11:17 ` Brian Foster
2013-09-06 21:35 ` Dave Chinner
1 sibling, 0 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-06 11:17 UTC (permalink / raw)
To: Michael L. Semon; +Cc: xfs
On 09/05/2013 05:17 PM, Michael L. Semon wrote:
> On 09/03/2013 02:24 PM, Brian Foster wrote:
>> Hi all,
>>
>> This is an RFC for the kernel work to support a free inode btree. The free inode
>> btree (finobt) is equivalent to the existing inode allocation btree with the
>> exception that the finobt only tracks inode chunks with at least one free inode.
>> The purpose is to improve lookups for free inode clusters for inode allocation.
>>
>> Newly allocated inode chunks by definition contain free inodes and are thus
>> inserted into the finobt immediately. The record for a previously full inode
>> chunk is inserted to the finobt when the first inode is freed. A record is
>> removed from the finobt when the last free inode has been allocated or the
>> entire chunk is completely deallocated.
>>
>> Patches 1-3 refactor some ialloc btree code to introduce the new finobt type and
>> feature bit. Patches 4-7 fix up the transaction handling for inode allocation
>> and deallocation to account for the new tree. Patches 8-10 add the finobt
>> management code to insert, remove and modify records as appropriate. Patch 11
>> fixes growfs to support the finobt.
>>
>> Thoughts, reviews, flames appreciated.
>
> I'm looking for Dave's judgement call on whether I should run this code
> full-time. The patchset applied well on top of Dave's latest work--only
> a "trailing whitespace" warning on Patch #9 (I think)--and the code
> compiled without error. There was a lockdep while running xfstests,
> before generic/013 (I think), so I switched back to my normal git branch
> and have your patchset in a separate branch.
>
Hi Michael,
Thanks for giving this a burn. I actually haven't tested with lockdep
yet, so I've made a note to do so and see if I can reproduce that or any
other problems.
With regard to running "full-time," I'd suggest to hold off at least
until I post a non-RFC version (the next version will probably be a real
v1). The review feedback has shown that a few areas need some
non-trivial rework that change the behavior enough (i.e., the flush on
inactive thing, transaction reservation management, etc.) that any
issues at this point might be worth retesting for on the later version
anyways.
> My setup here is slow--x86, old IDE hardware, write cache off, debug
> kernel--but the patchset made things seem a little slower. At the
> right time--not necessarily now--performance numbers might be nice.
> I didn't time it but did a copy of 3 kernel gits to v5 1k-block-size
> XFS and just felt something was off. The copy did complete, though.
> Will try timing this on another day.
>
That's interesting. I had a chat with Dave with regard to the existing
lookup allocation algorithm and the best way to test the improved
lookup, and I am able to reproduce a nice inode allocation improvement
under certain conditions. I could understand seeing some general effect
in performance via the addition of the new tree (there's more work to do
now to manage it, after all), but my expectation is that for finobt=0
filesystems, there should be no effect whatsoever.
> Anyway, good work so far! No additional stack traces were caused by
> your code in limited testing, and the filesystems were are still
> intact.
>
Thanks again for testing. Most of my testing so far has been with finobt
enabled, so that's a good bit of news! :)
Brian
> Thanks!
>
> Michael
>
> [lockdep from xfstests generic/0-ten-something follows:]
>
> [ 763.993429] XFS (sdb4): Mounting Filesystem
> [ 764.258701] XFS (sdb4): Ending clean mount
> [ 768.798390] XFS (sdb5): Mounting Filesystem
> [ 769.061280] XFS (sdb5): Ending clean mount
> [ 770.030277] XFS (sdb4): Mounting Filesystem
> [ 770.313502] XFS (sdb4): Ending clean mount
> [ 788.932588] XFS (sdb4): Mounting Filesystem
> [ 789.256815] XFS (sdb4): Ending clean mount
> [ 792.639933] XFS (sdb4): Mounting Filesystem
> [ 792.965477] XFS (sdb4): Ending clean mount
> [ 795.166220] XFS (sdb4): Mounting Filesystem
> [ 795.507372] XFS (sdb4): Ending clean mount
> [ 802.870263] XFS (sdb4): Mounting Filesystem
> [ 803.516422] XFS (sdb4): Ending clean mount
> [ 814.376620] XFS (sdb4): Mounting Filesystem
> [ 815.050778] XFS (sdb4): Ending clean mount
> [ 823.169368]
> [ 823.170932] ======================================================
> [ 823.172146] [ INFO: possible circular locking dependency detected ]
> [ 823.172146] 3.11.0+ #5 Not tainted
> [ 823.172146] -------------------------------------------------------
> [ 823.172146] dirstress/5276 is trying to acquire lock:
> [ 823.172146] (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
> [ 823.172146]
> [ 823.172146] but task is already holding lock:
> [ 823.172146] (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
> [ 823.172146]
> [ 823.172146] which lock already depends on the new lock.
> [ 823.172146]
> [ 823.172146]
> [ 823.172146] the existing dependency chain (in reverse order) is:
> [ 823.172146]
> [ 823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}:
> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
> [ 823.172146] [<c14bff98>] _raw_spin_lock+0x47/0x74
> [ 823.172146] [<c1116247>] __mark_inode_dirty+0x171/0x38c
> [ 823.172146] [<c111acab>] __set_page_dirty+0x5f/0x95
> [ 823.172146] [<c111b93e>] mark_buffer_dirty+0x58/0x12b
> [ 823.172146] [<c111baff>] __block_commit_write.isra.17+0x64/0x82
> [ 823.172146] [<c111c197>] block_write_end+0x2b/0x53
> [ 823.172146] [<c111c201>] generic_write_end+0x42/0x9a
> [ 823.172146] [<c11a42d5>] xfs_vm_write_end+0x60/0xbe
> [ 823.172146] [<c10bd47a>] generic_file_buffered_write+0x140/0x20f
> [ 823.172146] [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205
> [ 823.172146] [<c11b24ee>] xfs_file_aio_write+0xad/0xec
> [ 823.172146] [<c10f0c5f>] do_sync_write+0x60/0x87
> [ 823.172146] [<c10f0e0f>] vfs_write+0x9c/0x189
> [ 823.172146] [<c10f0fc6>] SyS_write+0x49/0x81
> [ 823.172146] [<c14c14bb>] sysenter_do_call+0x12/0x32
> [ 823.172146]
> [ 823.172146] -> #0 (sb_internal){.+.+.+}:
> [ 823.172146] [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
> [ 823.172146] [<c10f36eb>] __sb_start_write+0xad/0x177
> [ 823.172146] [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
> [ 823.172146] [<c120a823>] xfs_inactive+0x129/0x4a3
> [ 823.172146] [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
> [ 823.172146] [<c1106678>] evict+0x8e/0x15d
> [ 823.172146] [<c1107126>] iput+0xc4/0x138
> [ 823.172146] [<c1103504>] dput+0x1b2/0x257
> [ 823.172146] [<c10f1a30>] __fput+0x140/0x1eb
> [ 823.172146] [<c10f1b0f>] ____fput+0xd/0xf
> [ 823.172146] [<c1048477>] task_work_run+0x67/0x90
> [ 823.172146] [<c1001ea5>] do_notify_resume+0x61/0x63
> [ 823.172146] [<c14c0cfa>] work_notifysig+0x1f/0x25
> [ 823.172146]
> [ 823.172146] other info that might help us debug this:
> [ 823.172146]
> [ 823.172146] Possible unsafe locking scenario:
> [ 823.172146]
> [ 823.172146] CPU0 CPU1
> [ 823.172146] ---- ----
> [ 823.172146] lock(&(&ip->i_lock)->mr_lock);
> [ 823.172146] lock(sb_internal);
> [ 823.172146] lock(&(&ip->i_lock)->mr_lock);
> [ 823.172146] lock(sb_internal);
> [ 823.172146]
> [ 823.172146] *** DEADLOCK ***
> [ 823.172146]
> [ 823.172146] 1 lock held by dirstress/5276:
> [ 823.172146] #0: (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
> [ 823.172146]
> [ 823.172146] stack backtrace:
> [ 823.172146] CPU: 0 PID: 5276 Comm: dirstress Not tainted 3.11.0+ #5
> [ 823.172146] Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
> [ 823.172146] c1c26060 c1c26060 da34fd58 c14ba216 da34fd78 c14b7317 c15f171b da34fdb4
> [ 823.172146] dcaa1440 00000001 dcaa18b0 00000000 da34fde4 c106e972 dcaa1888 00000001
> [ 823.172146] da34fdb4 c1057e0f 00000000 00003f61 c1c28660 00000000 dcaa1888 dcaa18b0
> [ 823.172146] Call Trace:
> [ 823.172146] [<c14ba216>] dump_stack+0x16/0x18
> [ 823.172146] [<c14b7317>] print_circular_bug+0x1b8/0x1c2
> [ 823.172146] [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
> [ 823.172146] [<c1057e0f>] ? sched_clock_local.constprop.3+0x39/0x131
> [ 823.172146] [<c1057fd4>] ? sched_clock_cpu+0x8f/0xe2
> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
> [ 823.172146] [<c1070a36>] ? __lock_acquire+0x36a/0xa11
> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
> [ 823.172146] [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
> [ 823.172146] [<c10f36eb>] __sb_start_write+0xad/0x177
> [ 823.172146] [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
> [ 823.172146] [<c11c5e60>] ? xfs_trans_alloc+0x1f/0x35
> [ 823.172146] [<c1206cfb>] ? xfs_ilock+0x100/0x1f1
> [ 823.172146] [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
> [ 823.172146] [<c120a823>] xfs_inactive+0x129/0x4a3
> [ 823.172146] [<c106f21f>] ? trace_hardirqs_on+0xb/0xd
> [ 823.172146] [<c14c01e5>] ? _raw_spin_unlock_irq+0x27/0x36
> [ 823.172146] [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
> [ 823.172146] [<c1106678>] evict+0x8e/0x15d
> [ 823.172146] [<c1107126>] iput+0xc4/0x138
> [ 823.172146] [<c1103504>] dput+0x1b2/0x257
> [ 823.172146] [<c10f1a30>] __fput+0x140/0x1eb
> [ 823.172146] [<c10f1b0f>] ____fput+0xd/0xf
> [ 823.172146] [<c1048477>] task_work_run+0x67/0x90
> [ 823.172146] [<c1001ea5>] do_notify_resume+0x61/0x63
> [ 823.172146] [<c14c0cfa>] work_notifysig+0x1f/0x25
> [ 824.015366] Clocksource tsc unstable (delta = 486645129 ns)
> [ 825.324019] XFS (sdb4): Mounting Filesystem
> [ 825.743317] XFS (sdb4): Ending clean mount
> [ 827.223193] XFS (sdb4): Mounting Filesystem
> [ 827.668493] XFS (sdb4): Ending clean mount
> [ 837.524673] XFS (sdb4): Mounting Filesystem
> [ 837.986097] XFS (sdb4): Ending clean mount
>
>
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
2013-09-06 0:07 ` Dave Chinner
@ 2013-09-06 11:25 ` Brian Foster
2013-09-06 21:22 ` Dave Chinner
0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-06 11:25 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 09/05/2013 08:07 PM, Dave Chinner wrote:
> On Thu, Sep 05, 2013 at 12:17:04PM -0400, Brian Foster wrote:
>> On 09/04/2013 08:54 PM, Dave Chinner wrote:
>>> On Tue, Sep 03, 2013 at 02:25:00PM -0400, Brian Foster wrote:
...
>>>
>>> What we really need here is for xfs_ialloc_log_agi to consider that
>>> there are two distinct regions for range logging - the first spaces
>>> from offset 0 to offset of agi_unlinked, and the second is from the
>>> the offset of agi_free_root to the end of the xfs_agi_t....
>>>
>>> It's abit messy, I know, but we couldn't easily add new padding to
>>> the AGI in the existing range logging area like was done for the AGF
>>> because of the unlinked list hash table already defining the end of
>>> the range logging region....
>>>
>>
>> ... but where would that ever happen? The existing invocations of
>> xfs_ialloc_log_agi() seem to log either the agi inode count values or
>> the btree root/level values (i.e., never the range across both). I think
>> I've introduced at least a couple new invocations throughout this set,
>> but I've not changed that model (i.e., an XFS_AGI_FREECOUNT instance in
>> the new lookup code and an XFS_AGI_FREE_ROOT|*_LEVEL instance in the new
>> btree code).
>
> Right, we don't current log across the range because of the way the
> code is currently written, but there's no rule that says that
> logging fields must be done this way.
>
> I can see that there may be reason for logging
> XFS_AGI_FREE_ROOT|*_LEVEL and XFS_AGI_NEW_INODE all in one go -
> pointing new inode allocation at recently freed inodes is not
> unreasonable, and if we split the finobt and update agi_newino in
> the one update, we will log across this gap.
>
For the sake of argument, it seems a little strange to me to set an
inode level value in the agi in the context of a btree operation, such
as a split...
>> My understanding of this code is that the range to log is defined at
>> invocation time to xfs_iallog_log_agi(), so if the callers never specify
>> a range that includes the unlinked bits in a single call, we won't set
>> that range in the buffer log item code. In other words, even if we
>> ultimately happen to log both ranges in the agi, the lower level code
>> will never expand the logged region. Therefore, this shouldn't happen
>> unless a single invocation that specifies one of the
>> XFS_AGI_FREE_ROOT/LEVEL bits also specifies one of the existing agi bits.
>
> Yes, we can avoid that by logging te regions separately, but that
> then puts the onus on the future developers and reviewers to
> remember this landmine and avoid it.
>
... but I agree with the general premise. It's certainly a landmine.
>> I could see breaking this up as a matter of preparation for future
>> fields or calls that would introduce logging that kind of range, but at
>> the same time (and assuming my interpretation of above is correct),
>> that's a bit of code that serves no purpose for the foreseeable future.
>> Perhaps a comment in xfs_ialloc_log_agi() and the one caller that uses
>> the AGI_FREE_* bits to document this restriction is sufficient?
>
> Given it is only a few lines of extra code in xfs_ialloc_log_agi(),
> I'd prefer that the code documents and deals with the different
> regions internally. That way we can forget about it completely for the
> future and never have to worry about it again.
>
> Keep in mind that I'm looking at this from a code maintenance
> timescale of at least 5-10 years into the future here - a few
> minutes extra fixing this right now could save us a lot of hassle
> years down the track. ;)
>
Fair enough. If it's at least a reasonably likely scenario, then I'm on
board with the better safe than sorry approach. ;)
Brian
>>>> #ifdef DEBUG
>>>> diff --git a/fs/xfs/xfs_ialloc_btree.c b/fs/xfs/xfs_ialloc_btree.c
>>>> index 0cdb88b..7923292 100644
>>>> --- a/fs/xfs/xfs_ialloc_btree.c
>>>> +++ b/fs/xfs/xfs_ialloc_btree.c
>>>> @@ -62,10 +62,18 @@ xfs_inobt_set_root(
>>>> {
>>>> struct xfs_buf *agbp = cur->bc_private.a.agbp;
>>>> struct xfs_agi *agi = XFS_BUF_TO_AGI(agbp);
>>>> -
>>>> - agi->agi_root = nptr->s;
>>>> - be32_add_cpu(&agi->agi_level, inc);
>>>> - xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);
>>>> + int fields;
>>>> +
>>>> + if (cur->bc_btnum == XFS_BTNUM_INO) {
>>>> + agi->agi_root = nptr->s;
>>>> + be32_add_cpu(&agi->agi_level, inc);
>>>> + fields = XFS_AGI_ROOT | XFS_AGI_LEVEL;
>>>> + } else {
>>>> + agi->agi_free_root = nptr->s;
>>>> + be32_add_cpu(&agi->agi_free_level, inc);
>>>> + fields = XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
>>>> + }
>>>> + xfs_ialloc_log_agi(cur->bc_tp, agbp, fields);
>>>> }
>>>
>>> I suspect that it would be better to have separate functions for
>>> these differences i.e. xfs_inobt_set_root() and
>>> xfs_finobt_set_root(), and set up separate btree ops structure
>>> forthe two different trees. Most of the code is still identical,
>>> but the differences in root structures can easily be handled without
>>> putting switches in the code everywhere.
>>>
>>
>> Ok, I'm assuming the suggestion is to only create new functions for the
>> implementations that differ.
>
> Exactly.
>
> Cheers,
>
> Dave.
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC in xfs_inactive()
2013-09-06 0:17 ` Dave Chinner
@ 2013-09-06 11:30 ` Brian Foster
0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-06 11:30 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 09/05/2013 08:17 PM, Dave Chinner wrote:
> On Thu, Sep 05, 2013 at 12:18:31PM -0400, Brian Foster wrote:
>> On 09/04/2013 09:40 PM, Dave Chinner wrote:
>>> On Tue, Sep 03, 2013 at 02:25:04PM -0400, Brian Foster wrote:
>>>> An ifree data block reservation can fail with ENOSPC. Flush inodes
>>>> to try and free up space or attempt without a data block
>>>> reservation to avoid failing out of xfs_inactive().
>>>>
>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>> ---
>>>> fs/xfs/xfs_inode.c | 11 +++++++++++
>>>> 1 file changed, 11 insertions(+)
>>>>
...
>
>> Subsequent to avoiding that, I
>> believe there were inconsistent fs issues called out due to the unlinked
>> lists being populated after umount.
>
> That sounds like a recovery failure, not so much an ENOSPC failure.
> i.e. that recovery only looks at the log to see if it's clean, and
> only recovers unlinked lists if it's dirty. There is the
> *possibility* of having a clean log with inodes on the unlinked
> list, and log recovery doesn't run the unlinked list processing in
> that case.
>
Interesting, I'll have a closer look when I rework the inactive
transaction reservation bits. Thanks.
Brian
> This is one of the issues we'll need to fix for O_TMPFILE support
> as it will actively use inodes on unlinked list for potentially long
> periods of time.
>
>> Taking a further look, I missed the XFS_TRANS_RESERVE flag and whole
>> m_resblks mechanism. I'll take a closer look at that and see if that
>> works to resolve the problem instead of the flush.
>
> It should - the only time it won't is if we exhaust the pool, but
> that doesn't happen in normal ENOSPC situations and any blocks we do
> end up freeing will immediately refill the reserve pool...
>
> Cheers,
>
> Dave.
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
2013-09-06 0:28 ` Dave Chinner
@ 2013-09-06 11:39 ` Brian Foster
2013-09-06 21:24 ` Dave Chinner
0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-06 11:39 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 09/05/2013 08:28 PM, Dave Chinner wrote:
> On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
>> On 09/04/2013 10:54 PM, Dave Chinner wrote:
>>> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
...
>>
>> I think I'm parsing you after having another look at the code.
>> xfs_inobt_lookup() remains as is and is potentially used from
>> xfs_inobt_insert(). xfs_inobt_insert_rec() is introduced to set the
>> cursor fields and do the insert and is used here and from
>> xfs_inobt_insert().
>
> Effectively. xfs_inobt_insert() becomes:
>
> for (each inode chunk) {
> xfs_inobt_lookup(cur, startino)
> xfs_inobt_insert_rec(cur, startino, free, free_count)
> }
>
> And this code becomes:
>
> xfs_inobt_lookup(cur, startino);
> if (!found) {
> if (free_count == 1)
> xfs_inobt_insert_rec(cur, startino, free, free_count)
> else
> CORRUPTION
> goto out;
> }
>
>> At that point, this looks close to xfs_inobt_insert(), but I think using
>> that here would introduce a duplicate lookup.
>
> Yes, it would. I think just using helpers like this is sufficient
> for the two different cases, especially as xfs_inobt_insert() needs
> to be able to handle multiple chunk insertion and we don't have that
> here...
>
Ok, that was my thinking as well.
>> Regardless, we'll see what
>> the whole thing looks like at that point. Thanks for the reviews. :)
>
> No worries. BTW, can you post your rudimentary userspace support so
> we can run tests that use this code, too?
>
Sure. My xfsprogs branch currently is the application of a slightly
older version of this set (pre-cleanups I made to make this post-worthy)
with some hacks to make it apply/compile and a few other patches on top
of that for mkfs, xfs_db and xfs_repair to work through some basic
things I ran into when running xfstests.
Would you prefer I drop the whole thing on the list?
Brian
> Cheers,
>
> Dave.
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type
2013-09-06 11:25 ` Brian Foster
@ 2013-09-06 21:22 ` Dave Chinner
0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-09-06 21:22 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Fri, Sep 06, 2013 at 07:25:58AM -0400, Brian Foster wrote:
> On 09/05/2013 08:07 PM, Dave Chinner wrote:
> > On Thu, Sep 05, 2013 at 12:17:04PM -0400, Brian Foster wrote:
> >> On 09/04/2013 08:54 PM, Dave Chinner wrote:
> >>> On Tue, Sep 03, 2013 at 02:25:00PM -0400, Brian Foster wrote:
> ...
> >>>
> >>> What we really need here is for xfs_ialloc_log_agi to consider that
> >>> there are two distinct regions for range logging - the first spaces
> >>> from offset 0 to offset of agi_unlinked, and the second is from the
> >>> the offset of agi_free_root to the end of the xfs_agi_t....
> >>>
> >>> It's abit messy, I know, but we couldn't easily add new padding to
> >>> the AGI in the existing range logging area like was done for the AGF
> >>> because of the unlinked list hash table already defining the end of
> >>> the range logging region....
> >>>
> >>
> >> ... but where would that ever happen? The existing invocations of
> >> xfs_ialloc_log_agi() seem to log either the agi inode count values or
> >> the btree root/level values (i.e., never the range across both). I think
> >> I've introduced at least a couple new invocations throughout this set,
> >> but I've not changed that model (i.e., an XFS_AGI_FREECOUNT instance in
> >> the new lookup code and an XFS_AGI_FREE_ROOT|*_LEVEL instance in the new
> >> btree code).
> >
> > Right, we don't current log across the range because of the way the
> > code is currently written, but there's no rule that says that
> > logging fields must be done this way.
> >
> > I can see that there may be reason for logging
> > XFS_AGI_FREE_ROOT|*_LEVEL and XFS_AGI_NEW_INODE all in one go -
> > pointing new inode allocation at recently freed inodes is not
> > unreasonable, and if we split the finobt and update agi_newino in
> > the one update, we will log across this gap.
> >
>
> For the sake of argument, it seems a little strange to me to set an
> inode level value in the agi in the context of a btree operation, such
> as a split...
Like we do with the AGF to record changes to the longest
extent in the btree? It's not a stretch to think we might update
the "allocation from here" target in the AGI when we make a specific
type of btree record change.... ;)
True, that is still an isolated logging event, but my point is that
specific btree operations may drive other logging events in the
header than just root/level...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
2013-09-06 11:39 ` Brian Foster
@ 2013-09-06 21:24 ` Dave Chinner
2013-09-07 12:30 ` Brian Foster
0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-06 21:24 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Fri, Sep 06, 2013 at 07:39:18AM -0400, Brian Foster wrote:
> On 09/05/2013 08:28 PM, Dave Chinner wrote:
> > On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
> >> On 09/04/2013 10:54 PM, Dave Chinner wrote:
> >>> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
> ...
> >>
> >> I think I'm parsing you after having another look at the code.
> >> xfs_inobt_lookup() remains as is and is potentially used from
> >> xfs_inobt_insert(). xfs_inobt_insert_rec() is introduced to set the
> >> cursor fields and do the insert and is used here and from
> >> xfs_inobt_insert().
> >
> > Effectively. xfs_inobt_insert() becomes:
> >
> > for (each inode chunk) {
> > xfs_inobt_lookup(cur, startino)
> > xfs_inobt_insert_rec(cur, startino, free, free_count)
> > }
> >
> > And this code becomes:
> >
> > xfs_inobt_lookup(cur, startino);
> > if (!found) {
> > if (free_count == 1)
> > xfs_inobt_insert_rec(cur, startino, free, free_count)
> > else
> > CORRUPTION
> > goto out;
> > }
> >
> >> At that point, this looks close to xfs_inobt_insert(), but I think using
> >> that here would introduce a duplicate lookup.
> >
> > Yes, it would. I think just using helpers like this is sufficient
> > for the two different cases, especially as xfs_inobt_insert() needs
> > to be able to handle multiple chunk insertion and we don't have that
> > here...
> >
>
> Ok, that was my thinking as well.
>
> >> Regardless, we'll see what
> >> the whole thing looks like at that point. Thanks for the reviews. :)
> >
> > No worries. BTW, can you post your rudimentary userspace support so
> > we can run tests that use this code, too?
> >
>
> Sure. My xfsprogs branch currently is the application of a slightly
> older version of this set (pre-cleanups I made to make this post-worthy)
> with some hacks to make it apply/compile and a few other patches on top
> of that for mkfs, xfs_db and xfs_repair to work through some basic
> things I ran into when running xfstests.
>
> Would you prefer I drop the whole thing on the list?
Drop it on the list, maybe just a as tarball rather than a patchset
if it's not ready for review yet.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 00/11] xfs: introduce the free inode btree
2013-09-05 21:17 ` [RFC PATCH 00/11] xfs: introduce the free inode btree Michael L. Semon
2013-09-06 11:17 ` Brian Foster
@ 2013-09-06 21:35 ` Dave Chinner
2013-09-07 12:31 ` Brian Foster
1 sibling, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2013-09-06 21:35 UTC (permalink / raw)
To: Michael L. Semon; +Cc: Brian Foster, xfs
On Thu, Sep 05, 2013 at 05:17:10PM -0400, Michael L. Semon wrote:
....
> [ 814.376620] XFS (sdb4): Mounting Filesystem
> [ 815.050778] XFS (sdb4): Ending clean mount
> [ 823.169368]
> [ 823.170932] ======================================================
> [ 823.172146] [ INFO: possible circular locking dependency detected ]
> [ 823.172146] 3.11.0+ #5 Not tainted
> [ 823.172146] -------------------------------------------------------
> [ 823.172146] dirstress/5276 is trying to acquire lock:
> [ 823.172146] (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
> [ 823.172146]
> [ 823.172146] but task is already holding lock:
> [ 823.172146] (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
> [ 823.172146]
> [ 823.172146] which lock already depends on the new lock.
> [ 823.172146]
> [ 823.172146]
> [ 823.172146] the existing dependency chain (in reverse order) is:
> [ 823.172146]
> [ 823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}:
> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
> [ 823.172146] [<c14bff98>] _raw_spin_lock+0x47/0x74
> [ 823.172146] [<c1116247>] __mark_inode_dirty+0x171/0x38c
> [ 823.172146] [<c111acab>] __set_page_dirty+0x5f/0x95
> [ 823.172146] [<c111b93e>] mark_buffer_dirty+0x58/0x12b
> [ 823.172146] [<c111baff>] __block_commit_write.isra.17+0x64/0x82
> [ 823.172146] [<c111c197>] block_write_end+0x2b/0x53
> [ 823.172146] [<c111c201>] generic_write_end+0x42/0x9a
> [ 823.172146] [<c11a42d5>] xfs_vm_write_end+0x60/0xbe
> [ 823.172146] [<c10bd47a>] generic_file_buffered_write+0x140/0x20f
> [ 823.172146] [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205
> [ 823.172146] [<c11b24ee>] xfs_file_aio_write+0xad/0xec
> [ 823.172146] [<c10f0c5f>] do_sync_write+0x60/0x87
> [ 823.172146] [<c10f0e0f>] vfs_write+0x9c/0x189
> [ 823.172146] [<c10f0fc6>] SyS_write+0x49/0x81
> [ 823.172146] [<c14c14bb>] sysenter_do_call+0x12/0x32
> [ 823.172146]
> [ 823.172146] -> #0 (sb_internal){.+.+.+}:
> [ 823.172146] [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
> [ 823.172146] [<c10f36eb>] __sb_start_write+0xad/0x177
> [ 823.172146] [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
> [ 823.172146] [<c120a823>] xfs_inactive+0x129/0x4a3
> [ 823.172146] [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
> [ 823.172146] [<c1106678>] evict+0x8e/0x15d
> [ 823.172146] [<c1107126>] iput+0xc4/0x138
> [ 823.172146] [<c1103504>] dput+0x1b2/0x257
> [ 823.172146] [<c10f1a30>] __fput+0x140/0x1eb
> [ 823.172146] [<c10f1b0f>] ____fput+0xd/0xf
> [ 823.172146] [<c1048477>] task_work_run+0x67/0x90
> [ 823.172146] [<c1001ea5>] do_notify_resume+0x61/0x63
> [ 823.172146] [<c14c0cfa>] work_notifysig+0x1f/0x25
> [ 823.172146]
> [ 823.172146] other info that might help us debug this:
> [ 823.172146]
> [ 823.172146] Possible unsafe locking scenario:
> [ 823.172146]
> [ 823.172146] CPU0 CPU1
> [ 823.172146] ---- ----
> [ 823.172146] lock(&(&ip->i_lock)->mr_lock);
> [ 823.172146] lock(sb_internal);
> [ 823.172146] lock(&(&ip->i_lock)->mr_lock);
> [ 823.172146] lock(sb_internal);
Ah, now there's something I missed in all the xfs_inactive
transaction rework - you can't call
xfs_trans_alloc()/xfs-trans_reserve with the XFS_ILOCK_??? held.
It's not the freeze locks you really have to worry about deadlocking
if you do, it's deadlocking against log space that is much more
likely.
i.e. if you hold the ILOCK, the AIL can't get it to flush the inode
to disk. That means if the inode you hold locked is pinning the tail
of the log and there is no logspace for the transaction you are
about to run, xfs_trans_reserve() will block forever waiting for the
inode to be flushed and the tail of the log to move forward. This
will end up blocking all further reservations and hence deadlock the
filesystem...
Brian, if you rewrite xfs_inactive in the way that I suggested, this
problem goes away ;)
Thanks for reporting this, Michael.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
2013-09-06 21:24 ` Dave Chinner
@ 2013-09-07 12:30 ` Brian Foster
2013-09-08 20:08 ` Michael L. Semon
2013-09-09 2:34 ` Better numbers " Michael L. Semon
0 siblings, 2 replies; 46+ messages in thread
From: Brian Foster @ 2013-09-07 12:30 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
[-- Attachment #1: Type: text/plain, Size: 1321 bytes --]
On 09/06/2013 05:24 PM, Dave Chinner wrote:
> On Fri, Sep 06, 2013 at 07:39:18AM -0400, Brian Foster wrote:
>> On 09/05/2013 08:28 PM, Dave Chinner wrote:
>>> On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
>>>> On 09/04/2013 10:54 PM, Dave Chinner wrote:
>>>>> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
>> ...
>>>>
...
>>> No worries. BTW, can you post your rudimentary userspace support so
>>> we can run tests that use this code, too?
>>>
>>
>> Sure. My xfsprogs branch currently is the application of a slightly
>> older version of this set (pre-cleanups I made to make this post-worthy)
>> with some hacks to make it apply/compile and a few other patches on top
>> of that for mkfs, xfs_db and xfs_repair to work through some basic
>> things I ran into when running xfstests.
>>
>> Would you prefer I drop the whole thing on the list?
>
> Drop it on the list, maybe just a as tarball rather than a patchset
> if it's not ready for review yet.
>
Ok, attached. This applies on top of the following commit in the
xfsprogs tree:
982e5c7e xfs_db: add header to freesp -d output
Use the following command to format a finobt enabled fs:
mkfs.xfs -m crc=1,finobt=1 <dev>
... and otherwise there are a couple random fixes for xfs_db and xfs_repair.
Brian
> Cheers,
>
> Dave.
>
[-- Attachment #2: xfsprogs_finobt_rfc.tar.bz2 --]
[-- Type: application/x-bzip, Size: 8958 bytes --]
[-- Attachment #3: Type: text/plain, Size: 121 bytes --]
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 00/11] xfs: introduce the free inode btree
2013-09-06 21:35 ` Dave Chinner
@ 2013-09-07 12:31 ` Brian Foster
2013-09-08 1:04 ` Michael L. Semon
0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2013-09-07 12:31 UTC (permalink / raw)
To: Dave Chinner; +Cc: Michael L. Semon, xfs
On 09/06/2013 05:35 PM, Dave Chinner wrote:
> On Thu, Sep 05, 2013 at 05:17:10PM -0400, Michael L. Semon wrote:
> ....
>> [ 814.376620] XFS (sdb4): Mounting Filesystem
>> [ 815.050778] XFS (sdb4): Ending clean mount
>> [ 823.169368]
>> [ 823.170932] ======================================================
>> [ 823.172146] [ INFO: possible circular locking dependency detected ]
>> [ 823.172146] 3.11.0+ #5 Not tainted
>> [ 823.172146] -------------------------------------------------------
>> [ 823.172146] dirstress/5276 is trying to acquire lock:
>> [ 823.172146] (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
>> [ 823.172146]
>> [ 823.172146] but task is already holding lock:
>> [ 823.172146] (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
>> [ 823.172146]
>> [ 823.172146] which lock already depends on the new lock.
>> [ 823.172146]
>> [ 823.172146]
>> [ 823.172146] the existing dependency chain (in reverse order) is:
>> [ 823.172146]
>> [ 823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}:
>> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
>> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
>> [ 823.172146] [<c14bff98>] _raw_spin_lock+0x47/0x74
>> [ 823.172146] [<c1116247>] __mark_inode_dirty+0x171/0x38c
>> [ 823.172146] [<c111acab>] __set_page_dirty+0x5f/0x95
>> [ 823.172146] [<c111b93e>] mark_buffer_dirty+0x58/0x12b
>> [ 823.172146] [<c111baff>] __block_commit_write.isra.17+0x64/0x82
>> [ 823.172146] [<c111c197>] block_write_end+0x2b/0x53
>> [ 823.172146] [<c111c201>] generic_write_end+0x42/0x9a
>> [ 823.172146] [<c11a42d5>] xfs_vm_write_end+0x60/0xbe
>> [ 823.172146] [<c10bd47a>] generic_file_buffered_write+0x140/0x20f
>> [ 823.172146] [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205
>> [ 823.172146] [<c11b24ee>] xfs_file_aio_write+0xad/0xec
>> [ 823.172146] [<c10f0c5f>] do_sync_write+0x60/0x87
>> [ 823.172146] [<c10f0e0f>] vfs_write+0x9c/0x189
>> [ 823.172146] [<c10f0fc6>] SyS_write+0x49/0x81
>> [ 823.172146] [<c14c14bb>] sysenter_do_call+0x12/0x32
>> [ 823.172146]
>> [ 823.172146] -> #0 (sb_internal){.+.+.+}:
>> [ 823.172146] [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
>> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
>> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
>> [ 823.172146] [<c10f36eb>] __sb_start_write+0xad/0x177
>> [ 823.172146] [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
>> [ 823.172146] [<c120a823>] xfs_inactive+0x129/0x4a3
>> [ 823.172146] [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
>> [ 823.172146] [<c1106678>] evict+0x8e/0x15d
>> [ 823.172146] [<c1107126>] iput+0xc4/0x138
>> [ 823.172146] [<c1103504>] dput+0x1b2/0x257
>> [ 823.172146] [<c10f1a30>] __fput+0x140/0x1eb
>> [ 823.172146] [<c10f1b0f>] ____fput+0xd/0xf
>> [ 823.172146] [<c1048477>] task_work_run+0x67/0x90
>> [ 823.172146] [<c1001ea5>] do_notify_resume+0x61/0x63
>> [ 823.172146] [<c14c0cfa>] work_notifysig+0x1f/0x25
>> [ 823.172146]
>> [ 823.172146] other info that might help us debug this:
>> [ 823.172146]
>> [ 823.172146] Possible unsafe locking scenario:
>> [ 823.172146]
>> [ 823.172146] CPU0 CPU1
>> [ 823.172146] ---- ----
>> [ 823.172146] lock(&(&ip->i_lock)->mr_lock);
>> [ 823.172146] lock(sb_internal);
>> [ 823.172146] lock(&(&ip->i_lock)->mr_lock);
>> [ 823.172146] lock(sb_internal);
>
> Ah, now there's something I missed in all the xfs_inactive
> transaction rework - you can't call
> xfs_trans_alloc()/xfs-trans_reserve with the XFS_ILOCK_??? held.
> It's not the freeze locks you really have to worry about deadlocking
> if you do, it's deadlocking against log space that is much more
> likely.
>
> i.e. if you hold the ILOCK, the AIL can't get it to flush the inode
> to disk. That means if the inode you hold locked is pinning the tail
> of the log and there is no logspace for the transaction you are
> about to run, xfs_trans_reserve() will block forever waiting for the
> inode to be flushed and the tail of the log to move forward. This
> will end up blocking all further reservations and hence deadlock the
> filesystem...
>
> Brian, if you rewrite xfs_inactive in the way that I suggested, this
> problem goes away ;)
>
> Thanks for reporting this, Michael.
>
Oh, very interesting scenario. Thanks again for catching this, Michael,
and for the analysis, Dave. I'll get this cleaned up in the next
revision as well.
Brian
> Cheers,
>
> Dave.
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 00/11] xfs: introduce the free inode btree
2013-09-07 12:31 ` Brian Foster
@ 2013-09-08 1:04 ` Michael L. Semon
0 siblings, 0 replies; 46+ messages in thread
From: Michael L. Semon @ 2013-09-08 1:04 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On 09/07/2013 08:31 AM, Brian Foster wrote:
> On 09/06/2013 05:35 PM, Dave Chinner wrote:
>> On Thu, Sep 05, 2013 at 05:17:10PM -0400, Michael L. Semon wrote:
>> ....
>>> [ 814.376620] XFS (sdb4): Mounting Filesystem
>>> [ 815.050778] XFS (sdb4): Ending clean mount
>>> [ 823.169368]
>>> [ 823.170932] ======================================================
>>> [ 823.172146] [ INFO: possible circular locking dependency detected ]
>>> [ 823.172146] 3.11.0+ #5 Not tainted
>>> [ 823.172146] -------------------------------------------------------
>>> [ 823.172146] dirstress/5276 is trying to acquire lock:
>>> [ 823.172146] (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
>>> [ 823.172146]
>>> [ 823.172146] but task is already holding lock:
>>> [ 823.172146] (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1
>>> [ 823.172146]
>>> [ 823.172146] which lock already depends on the new lock.
>>> [ 823.172146]
>>> [ 823.172146]
>>> [ 823.172146] the existing dependency chain (in reverse order) is:
>>> [ 823.172146]
>>> [ 823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}:
>>> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
>>> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
>>> [ 823.172146] [<c14bff98>] _raw_spin_lock+0x47/0x74
>>> [ 823.172146] [<c1116247>] __mark_inode_dirty+0x171/0x38c
>>> [ 823.172146] [<c111acab>] __set_page_dirty+0x5f/0x95
>>> [ 823.172146] [<c111b93e>] mark_buffer_dirty+0x58/0x12b
>>> [ 823.172146] [<c111baff>] __block_commit_write.isra.17+0x64/0x82
>>> [ 823.172146] [<c111c197>] block_write_end+0x2b/0x53
>>> [ 823.172146] [<c111c201>] generic_write_end+0x42/0x9a
>>> [ 823.172146] [<c11a42d5>] xfs_vm_write_end+0x60/0xbe
>>> [ 823.172146] [<c10bd47a>] generic_file_buffered_write+0x140/0x20f
>>> [ 823.172146] [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205
>>> [ 823.172146] [<c11b24ee>] xfs_file_aio_write+0xad/0xec
>>> [ 823.172146] [<c10f0c5f>] do_sync_write+0x60/0x87
>>> [ 823.172146] [<c10f0e0f>] vfs_write+0x9c/0x189
>>> [ 823.172146] [<c10f0fc6>] SyS_write+0x49/0x81
>>> [ 823.172146] [<c14c14bb>] sysenter_do_call+0x12/0x32
>>> [ 823.172146]
>>> [ 823.172146] -> #0 (sb_internal){.+.+.+}:
>>> [ 823.172146] [<c106e972>] validate_chain.isra.35+0xfc7/0xff4
>>> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11
>>> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e
>>> [ 823.172146] [<c10f36eb>] __sb_start_write+0xad/0x177
>>> [ 823.172146] [<c11c5e60>] xfs_trans_alloc+0x1f/0x35
>>> [ 823.172146] [<c120a823>] xfs_inactive+0x129/0x4a3
>>> [ 823.172146] [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114
>>> [ 823.172146] [<c1106678>] evict+0x8e/0x15d
>>> [ 823.172146] [<c1107126>] iput+0xc4/0x138
>>> [ 823.172146] [<c1103504>] dput+0x1b2/0x257
>>> [ 823.172146] [<c10f1a30>] __fput+0x140/0x1eb
>>> [ 823.172146] [<c10f1b0f>] ____fput+0xd/0xf
>>> [ 823.172146] [<c1048477>] task_work_run+0x67/0x90
>>> [ 823.172146] [<c1001ea5>] do_notify_resume+0x61/0x63
>>> [ 823.172146] [<c14c0cfa>] work_notifysig+0x1f/0x25
>>> [ 823.172146]
>>> [ 823.172146] other info that might help us debug this:
>>> [ 823.172146]
>>> [ 823.172146] Possible unsafe locking scenario:
>>> [ 823.172146]
>>> [ 823.172146] CPU0 CPU1
>>> [ 823.172146] ---- ----
>>> [ 823.172146] lock(&(&ip->i_lock)->mr_lock);
>>> [ 823.172146] lock(sb_internal);
>>> [ 823.172146] lock(&(&ip->i_lock)->mr_lock);
>>> [ 823.172146] lock(sb_internal);
>>
>> Ah, now there's something I missed in all the xfs_inactive
>> transaction rework - you can't call
>> xfs_trans_alloc()/xfs-trans_reserve with the XFS_ILOCK_??? held.
>> It's not the freeze locks you really have to worry about deadlocking
>> if you do, it's deadlocking against log space that is much more
>> likely.
>>
>> i.e. if you hold the ILOCK, the AIL can't get it to flush the inode
>> to disk. That means if the inode you hold locked is pinning the tail
>> of the log and there is no logspace for the transaction you are
>> about to run, xfs_trans_reserve() will block forever waiting for the
>> inode to be flushed and the tail of the log to move forward. This
>> will end up blocking all further reservations and hence deadlock the
>> filesystem...
>>
>> Brian, if you rewrite xfs_inactive in the way that I suggested, this
>> problem goes away ;)
>>
>> Thanks for reporting this, Michael.
>>
>
> Oh, very interesting scenario. Thanks again for catching this, Michael,
> and for the analysis, Dave. I'll get this cleaned up in the next
> revision as well.
>
> Brian
>
>> Cheers,
>>
>> Dave.
>>
>
No problem, Brian. I'll try out your userspace as well. I had worked a
bit on getting some sane numbers that are better than "results for copying
3 kernel gits to a 1k-blocksize FS with write cache turned off." Here's
my attempt, as a more formal report. Thanks!
Michael
[REPORT FOLLOWS]
Lockdep threw off the debug numbers for your patchset--a new lockdep
is at the very end--so these tests were done with a fairly non-debug
setup. The write cache is on for these tests as well.
Casual "user command" benchmark using built 3.11.0+ kernel git
tarball. The idea behind it:
1) Unpack a tarball, and
2) do something with its contents.
The total files are among these:
$TEST_DIR/a/linux/ ...
$TEST_DIR/b/linux/ ...
$TEST_DIR/kernel-git-built-2013-08-05.tar.gz
The file systems are as follows:
v4: mkfs.xfs -l logdev=$TEST_LOGDEV $TEST_DEV
v4dirent: mkfs.xfs -n ftype=1 -l logdev=$TEST_LOGDEV $TEST_DEV
v4d512bi: mkfs.xfs -n ftype=1 -i log=9 -l logdev=$TEST_LOGDEV $TEST_DEV
v5: mkfs.xfs -m crc=1 -l logdev=$TEST_LOGDEV $TEST_DEV
Dave had a benchmark set to break down v5 performance changes into
a 512-byte-inode component and a CRC component. This is my edition
of the benchmark, done with old spinning rust on x86 hardware, and
updated to reflect your patchset.
Patchset notation: "normal" is the normal xfs-oss/master with Dave's
latest patches on top; "itree" adds the inode btree code.
This is non-debug XFS. Tracers, lockdep, and almost all other "Kernel
Hacking" kernel config items are not enabled. It's still a
CONFIG_KERNEL_DEBUG=y kernel, though.
======================= REAL ========================
command patch v4 v4dirent v4d512bi v5
==========|======|========|========|========|========
tar -xf normal 103.202 104.951 101.771 104.486
tar -xf itree 104.610 101.705 98.784 101.919
----------+------+--------+--------+--------+--------
sha256sum normal 227.456 228.321 231.947 234.127
sha256sum itree 230.233 229.375 231.509 233.253
----------+------+--------+--------+--------+--------
cp -r a b normal 239.714 242.754 248.994 249.584
cp -r a b itree 241.273 243.216 248.531 254.501
----------+------+--------+--------+--------+--------
find . normal 11.894 12.370 12.324 12.397
find . itree 12.043 12.310 12.736 13.216
----------+------+--------+--------+--------+--------
rm -r b normal 8.556 8.744 11.298 11.774
rm -r b itree 8.904 8.981 10.590 12.057
----------+------+--------+--------+--------+--------
cp -r b a normal 262.065 256.448 272.290 272.221
cp -r b a itree 264.116 265.875 267.346 270.811
----------+------+--------+--------+--------+--------
rm -r b normal 8.585 9.258 8.791 10.058
rm -r b itree 9.061 8.345 9.909 9.273
----------+------+--------+--------+--------+--------
stat normal 161.853 162.772 163.555 165.046
stat itree 162.641 163.148 163.698 164.015
----------+------+--------+--------+--------+--------
sha check normal 133.938 133.016 141.352 142.921
sha check itree 133.885 133.399 138.013 143.315
----------+------+--------+--------+--------+--------
cp tarball normal 44.102 42.812 43.603 43.722
cp tarball itree 43.724 44.187 44.339 42.761
==========|======|========|========|========|========
TOTAL normal 1201.365 1201.446 1235.925 1246.336
TOTAL itree 1210.490 1210.541 1225.455 1245.121
======================= USER ========================
command patch v4 v4dirent v4d512bi v5
==========|======|========|========|========|========
tar -xf normal 59.223 59.473 58.817 59.640
tar -xf itree 59.420 59.473 58.953 59.893
----------+------+--------+--------+--------+--------
sha256sum normal 49.877 49.877 49.787 49.730
sha256sum itree 49.437 49.863 49.583 49.673
----------+------+--------+--------+--------+--------
cp -r a b normal 0.697 0.707 0.743 0.800
cp -r a b itree 0.657 0.710 0.677 0.703
----------+------+--------+--------+--------+--------
find . normal 0.257 0.237 0.233 0.223
find . itree 0.283 0.223 0.223 0.203
----------+------+--------+--------+--------+--------
rm -r b normal 0.170 0.120 0.147 0.160
rm -r b itree 0.160 0.163 0.130 0.137
----------+------+--------+--------+--------+--------
cp -r b a normal 0.817 0.763 0.817 0.763
cp -r b a itree 0.737 0.740 0.787 0.670
----------+------+--------+--------+--------+--------
rm -r b normal 0.170 0.153 0.140 0.133
rm -r b itree 0.140 0.157 0.143 0.163
----------+------+--------+--------+--------+--------
stat normal 1.660 1.653 1.570 1.720
stat itree 1.737 1.727 1.700 1.630
----------+------+--------+--------+--------+--------
sha check normal 58.467 58.603 58.550 58.370
sha check itree 58.157 58.183 58.620 58.343
----------+------+--------+--------+--------+--------
cp tarball normal 0.023 0.027 0.033 0.037
cp tarball itree 0.017 0.020 0.020 0.020
==========|======|========|========|========|========
TOTAL normal 171.361 171.613 170.837 171.576
TOTAL itree 170.745 171.259 170.836 171.435
======================== SYS ========================
command patch v4 v4dirent v4d512bi v5
==========|======|========|========|========|========
tar -xf normal 19.770 19.800 19.960 20.770
tar -xf itree 19.550 19.930 20.067 20.963
----------+------+--------+--------+--------+--------
sha256sum normal 17.157 14.607 14.393 16.053
sha256sum itree 17.277 14.813 14.550 15.007
----------+------+--------+--------+--------+--------
cp -r a b normal 18.697 18.973 18.687 19.253
cp -r a b itree 19.033 18.993 18.783 19.703
----------+------+--------+--------+--------+--------
find . normal 0.820 0.573 0.537 0.597
find . itree 0.793 0.593 0.547 0.610
----------+------+--------+--------+--------+--------
rm -r b normal 3.883 3.827 3.800 3.967
rm -r b itree 4.053 3.937 4.003 4.143
----------+------+--------+--------+--------+--------
cp -r b a normal 19.043 19.083 18.753 19.503
cp -r b a itree 19.203 19.100 19.040 19.680
----------+------+--------+--------+--------+--------
rm -r b normal 4.097 3.947 3.900 4.123
rm -r b itree 4.287 4.067 4.093 4.227
----------+------+--------+--------+--------+--------
stat normal 11.337 10.730 10.727 10.680
stat itree 11.080 10.827 10.800 10.457
----------+------+--------+--------+--------+--------
sha check normal 8.970 8.920 8.980 9.507
sha check itree 9.053 9.143 8.540 9.420
----------+------+--------+--------+--------+--------
cp tarball normal 5.537 5.397 5.470 5.373
cp tarball itree 5.390 5.313 5.460 5.343
==========|======|========|========|========|========
TOTAL normal 109.311 105.857 105.207 109.826
TOTAL itree 109.719 106.716 105.883 109.553
The rest of this report is supplementary noise and the lockdep.
# XFS kernel configuration:
CONFIG_XFS_FS=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
# CONFIG_XFS_WARN is not set
# CONFIG_XFS_DEBUG is not set
# `uname -a` output:
Linux plbearer 3.11.0+ #4 Sat Sep 7 13:04:53 EDT 2013
i686 Intel(R) Pentium(R) 4 CPU 1.80GHz GenuineIntel GNU/Linux
RAM: 512 MB, less 9 MB for capture kernel
# Hard drive used for $TEST_DEV:
Model Family: Western Digital Caviar
Device Model: WDC WD600BB-75CAA0
User Capacity: 60,022,480,896 bytes [60.0 GB]
# Hard drive used for $TEST_LOGDEV and kernel-git tarball:
Model Family: Seagate Barracuda 7200.9
Device Model: ST3120814A
User Capacity: 120,034,123,776 bytes [120 GB]
root@plbearer:~/results# uname -a
# Sample xfs_info output for $TEST_DEV, to show how XFS is using the partition:
meta-data=/dev/sdb4 isize=512 agcount=4, agsize=720896 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0
data = bsize=4096 blocks=2883584, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =external bsize=4096 blocks=32768, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# cat /proc/partitions
major minor #blocks name
8 0 117220824 sda
8 1 98304 sda1 # shared /boot
8 2 1 sda2
8 5 65536 sda5
8 6 131072 sda6 # $TEST_LOGDEV
8 7 131072 sda7
8 8 616448 sda8
8 9 11275264 sda9 # inactive root (v5 XFS), holds tarball
8 10 104895960 sda10
11 0 1048575 sr0
8 16 58615704 sdb
8 17 3406848 sdb1 # active root partition (JFS)
8 18 786432 sdb2
8 19 20971520 sdb3
8 20 11534336 sdb4 # $TEST_DEV
8 21 4128768 sdb5
8 22 786432 sdb6
8 23 524288 sdb7
8 24 524288 sdb8
8 25 1048576 sdb9
8 26 10708871 sdb10
8 27 4194304 sdb11
Lockdep that kept tar jobs from completing. It was found during several
other tests before I gave up on the debug benchmark idea.
=================================
[ INFO: inconsistent lock state ]
3.11.0+ #2 Not tainted
---------------------------------
inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
tar/287 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&(&ip->i_lock)->mr_lock){++++?-}, at: [<c120cc1b>] xfs_ilock+0x100/0x1f1
{IN-RECLAIM_FS-W} state was registered at:
[<c1070697>] __lock_acquire+0x63b/0x1953
[<c1072515>] lock_acquire+0x88/0x17e
[<c104fc0f>] down_write_nested+0x4f/0x9d
[<c120cc1b>] xfs_ilock+0x100/0x1f1
[<c11bd00d>] xfs_reclaim_inode+0xf4/0x30a
[<c11bd4d4>] xfs_reclaim_inodes_ag+0x2b1/0x488
[<c11bd729>] xfs_reclaim_inodes_nr+0x2d/0x33
[<c11c7e1e>] xfs_fs_free_cached_objects+0x13/0x15
[<c10f3778>] prune_super+0xd1/0x15c
[<c10c99da>] shrink_slab+0x143/0x3d8
[<c10cc768>] kswapd+0x45d/0x835
[<c104b617>] kthread+0xa7/0xa9
[<c14e26b7>] ret_from_kernel_thread+0x1b/0x28
irq event stamp: 23333689
hardirqs last enabled at (23333689): [<c14e1375>] _raw_spin_unlock_irq+0x27/0x36
hardirqs last disabled at (23333688): [<c14e11ea>] _raw_spin_lock_irq+0x15/0x7a
softirqs last enabled at (23333678): [<c1031d2d>] __do_softirq+0x142/0x2ce
softirqs last disabled at (23333649): [<c1031fec>] irq_exit+0x6d/0x73
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&ip->i_lock)->mr_lock);
<Interrupt>
lock(&(&ip->i_lock)->mr_lock);
*** DEADLOCK ***
3 locks held by tar/287:
#0: (sb_writers#8){.+.+.+}, at: [<c110d858>] mnt_want_write+0x1e/0x3e
#1: (&(&ip->i_lock)->mr_lock){++++?-}, at: [<c120cc1b>] xfs_ilock+0x100/0x1f1
#2: (sb_internal){.+.+.+}, at: [<c11cb6d0>] xfs_trans_alloc+0x1f/0x35
stack backtrace:
CPU: 0 PID: 287 Comm: tar Not tainted 3.11.0+ #2
Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
de296540 de296540 de3e7d5c c14db319 de3e7d98 c14d8708 c1618f4e c1619343
0000011f 00000000 00000000 00000000 00000000 00000001 00000001 c1619343
0000000a de2969b0 00000400 de3e7dcc c106e6c4 0000000a de3e7e1c c10703d1
Call Trace:
[<c14db319>] dump_stack+0x16/0x18
[<c14d8708>] print_usage_bug+0x1dc/0x1e6
[<c106e6c4>] mark_lock+0x28c/0x2c1
[<c10703d1>] ? __lock_acquire+0x375/0x1953
[<c106dc88>] ? print_shortest_lock_dependencies+0x18f/0x18f
[<c106e77a>] mark_held_locks+0x81/0xe7
[<c106f136>] lockdep_trace_alloc+0xa1/0xe3
[<c10ec3ed>] kmem_cache_alloc+0x28/0x1f2
[<c11cd53f>] ? kmem_zone_alloc+0x55/0xd0
[<c11cd53f>] kmem_zone_alloc+0x55/0xd0
[<c11cb6d0>] ? xfs_trans_alloc+0x1f/0x35
[<c11cd5cb>] kmem_zone_zalloc+0x11/0x36
[<c11cb663>] _xfs_trans_alloc+0x2e/0x7c
[<c11cb6de>] xfs_trans_alloc+0x2d/0x35
[<c1210743>] xfs_inactive+0x129/0x4a3
[<c106ebaf>] ? trace_hardirqs_on+0xb/0xd
[<c14e1375>] ? _raw_spin_unlock_irq+0x27/0x36
[<c11c807d>] xfs_fs_evict_inode+0x6c/0x114
[<c1108268>] evict+0x8e/0x15d
[<c1108d16>] iput+0xc4/0x138
[<c10fec3c>] do_unlinkat+0x127/0x17f
[<c102547e>] ? vmalloc_sync_all+0x133/0x133
[<c10fecb7>] SyS_unlinkat+0x23/0x3a
[<c14e273b>] sysenter_do_call+0x12/0x32
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
2013-09-07 12:30 ` Brian Foster
@ 2013-09-08 20:08 ` Michael L. Semon
2013-09-09 2:34 ` Better numbers " Michael L. Semon
1 sibling, 0 replies; 46+ messages in thread
From: Michael L. Semon @ 2013-09-08 20:08 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Sat, Sep 7, 2013 at 8:30 AM, Brian Foster <bfoster@redhat.com> wrote:
> On 09/06/2013 05:24 PM, Dave Chinner wrote:
>> On Fri, Sep 06, 2013 at 07:39:18AM -0400, Brian Foster wrote:
>>> On 09/05/2013 08:28 PM, Dave Chinner wrote:
>>>> On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
>>>>> On 09/04/2013 10:54 PM, Dave Chinner wrote:
>>>>>> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
>>> ...
>>>>>
> ...
>>>> No worries. BTW, can you post your rudimentary userspace support so
>>>> we can run tests that use this code, too?
>>>>
>>>
>>> Sure. My xfsprogs branch currently is the application of a slightly
>>> older version of this set (pre-cleanups I made to make this post-worthy)
>>> with some hacks to make it apply/compile and a few other patches on top
>>> of that for mkfs, xfs_db and xfs_repair to work through some basic
>>> things I ran into when running xfstests.
>>>
>>> Would you prefer I drop the whole thing on the list?
>>
>> Drop it on the list, maybe just a as tarball rather than a patchset
>> if it's not ready for review yet.
>>
>
> Ok, attached. This applies on top of the following commit in the
> xfsprogs tree:
>
> 982e5c7e xfs_db: add header to freesp -d output
>
> Use the following command to format a finobt enabled fs:
>
> mkfs.xfs -m crc=1,finobt=1 <dev>
>
> ... and otherwise there are a couple random fixes for xfs_db and xfs_repair.
>
> Brian
OK, I gave it a try and messed it up. I have updated numbers that look a
little better, and they'll be posted once I get to a Linux PC.
My merge went badly because it went on top of Dave's latest work and Mark's
v4 dirent work. It's like some of the features2 stuff takes one extra argument
in places.
Also, when patching some code, it looks like some constants have been
updated to use cpu_to_be32(), and I don't know why they've been updated.
So it's not a surprise that I can make a v4-finobt filesystem with -finobt=1,
mount it, get all the v5 warnings from the kernel, but there's not enough
v5 code in use to choke on 256-byte v4 inodes. Afterwards, `xfs_repair -n`
suggests that it will correct some areas from each AG.
It's also not a surprise that I don't know what issues are your fault,
which ones
are my fault, and which ones require Mark to update his v4-dirent patches.
Therefore, there's no conclusion to be placed on your userspace code yet. It's
an RFC, anyway...
My "user commands" benchmark succeeded for the 4 XFS filesystem
variants considered previously.
More later...
Michael
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
* Better numbers Re: [RFC PATCH 10/11] xfs: update the finobt on inode free
2013-09-07 12:30 ` Brian Foster
2013-09-08 20:08 ` Michael L. Semon
@ 2013-09-09 2:34 ` Michael L. Semon
1 sibling, 0 replies; 46+ messages in thread
From: Michael L. Semon @ 2013-09-09 2:34 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On 09/07/2013 08:30 AM, Brian Foster wrote:
> On 09/06/2013 05:24 PM, Dave Chinner wrote:
>> On Fri, Sep 06, 2013 at 07:39:18AM -0400, Brian Foster wrote:
>>> On 09/05/2013 08:28 PM, Dave Chinner wrote:
>>>> On Thu, Sep 05, 2013 at 12:19:12PM -0400, Brian Foster wrote:
>>>>> On 09/04/2013 10:54 PM, Dave Chinner wrote:
>>>>>> On Tue, Sep 03, 2013 at 02:25:07PM -0400, Brian Foster wrote:
>>> ...
>>>>>
> ...
>>>> No worries. BTW, can you post your rudimentary userspace support so
>>>> we can run tests that use this code, too?
>>>>
>>>
>>> Sure. My xfsprogs branch currently is the application of a slightly
>>> older version of this set (pre-cleanups I made to make this post-worthy)
>>> with some hacks to make it apply/compile and a few other patches on top
>>> of that for mkfs, xfs_db and xfs_repair to work through some basic
>>> things I ran into when running xfstests.
>>>
>>> Would you prefer I drop the whole thing on the list?
>>
>> Drop it on the list, maybe just a as tarball rather than a patchset
>> if it's not ready for review yet.
>>
>
> Ok, attached. This applies on top of the following commit in the
> xfsprogs tree:
>
> 982e5c7e xfs_db: add header to freesp -d output
>
> Use the following command to format a finobt enabled fs:
>
> mkfs.xfs -m crc=1,finobt=1 <dev>
>
> ... and otherwise there are a couple random fixes for xfs_db and xfs_repair.
>
> Brian
>
>
>> Cheers,
>>
>> Dave.
Hi! Here's a better set of numbers, using the finobt-enabled userspace.
Some cosmetic errors in my chart-generating Perl script were also fixed.
Revised legend:
v4: v4 XFS
v4dirent: v4 with Mark's dirent patches
v4d512bi: same as above, with 512-byte inodes
v5: v5/CRC XFS, no finobt except for "finobt" patch case
Again, a kernel git dir was built, then turned into a source tar.gz
file for these tests.
Files involved in find/sha256/cp/stat tests:
$TEST_DIR/a/linux/ (kernel + xfs-oss/master, built)
$TEST_DIR/b/linux/ (same)
REAL
command patch v4 v4dirent v4d512bi v5
----------+-------+--------+--------+--------+--------
tar -xf normal 103.202 104.951 101.771 104.486
tar -xf patched 104.610 101.705 98.784 101.919
tar -xf finobt 103.753 102.949 102.609 99.847
----------+-------+--------+--------+--------+--------
sha256sum normal 227.456 228.321 231.947 234.127
sha256sum patched 230.233 229.375 231.509 233.253
sha256sum finobt 228.108 228.873 232.464 233.498
----------+-------+--------+--------+--------+--------
cp -r a b normal 239.714 242.754 248.994 249.584
cp -r a b patched 241.273 243.216 248.531 254.501
cp -r a b finobt 240.779 242.838 244.760 253.203
----------+-------+--------+--------+--------+--------
find . normal 11.894 12.370 12.324 12.397
find . patched 12.043 12.310 12.736 13.216
find . finobt 11.882 12.376 13.501 13.827
----------+-------+--------+--------+--------+--------
rm -r a normal 8.556 8.744 11.298 11.774
rm -r a patched 8.904 8.981 10.590 12.057
rm -r a finobt 8.588 9.007 10.841 12.168
----------+-------+--------+--------+--------+--------
cp -r b a normal 262.065 256.448 272.290 272.221
cp -r b a patched 264.116 265.875 267.346 270.811
cp -r b a finobt 265.255 259.111 262.312 268.700
----------+-------+--------+--------+--------+--------
rm -r b normal 8.585 9.258 8.791 10.058
rm -r b patched 9.061 8.345 9.909 9.273
rm -r b finobt 8.326 8.231 10.078 10.464
----------+-------+--------+--------+--------+--------
stat normal 161.853 162.772 163.555 165.046
stat patched 162.641 163.148 163.698 164.015
stat finobt 163.366 162.707 163.256 165.051
----------+-------+--------+--------+--------+--------
sha check normal 133.938 133.016 141.352 142.921
sha check patched 133.885 133.399 138.013 143.315
sha check finobt 135.128 134.900 142.158 141.094
----------+-------+--------+--------+--------+--------
cp tarball normal 44.102 42.812 43.603 43.722
cp tarball patched 43.724 44.187 44.339 42.761
cp tarball finobt 43.930 43.236 42.736 44.000
----------+-------+--------+--------+--------+--------
TOTAL normal 1201.365 1201.446 1235.925 1246.336
TOTAL patched 1210.490 1210.541 1225.455 1245.121
TOTAL finobt 1209.115 1204.228 1224.715 1241.852
USER
command patch v4 v4dirent v4d512bi v5
----------+-------+--------+--------+--------+--------
tar -xf normal 59.223 59.473 58.817 59.640
tar -xf patched 59.420 59.473 58.953 59.893
tar -xf finobt 59.153 59.850 59.643 59.153
----------+-------+--------+--------+--------+--------
sha256sum normal 49.877 49.877 49.787 49.730
sha256sum patched 49.437 49.863 49.583 49.673
sha256sum finobt 49.577 49.580 49.743 49.597
----------+-------+--------+--------+--------+--------
cp -r a b normal 0.697 0.707 0.743 0.800
cp -r a b patched 0.657 0.710 0.677 0.703
cp -r a b finobt 0.737 0.777 0.777 0.780
----------+-------+--------+--------+--------+--------
find . normal 0.257 0.237 0.233 0.223
find . patched 0.283 0.223 0.223 0.203
find . finobt 0.263 0.253 0.237 0.273
----------+-------+--------+--------+--------+--------
rm -r a normal 0.170 0.120 0.147 0.160
rm -r a patched 0.160 0.163 0.130 0.137
rm -r a finobt 0.173 0.157 0.123 0.150
----------+-------+--------+--------+--------+--------
cp -r b a normal 0.817 0.763 0.817 0.763
cp -r b a patched 0.737 0.740 0.787 0.670
cp -r b a finobt 0.783 0.747 0.737 0.687
----------+-------+--------+--------+--------+--------
rm -r b normal 0.170 0.153 0.140 0.133
rm -r b patched 0.140 0.157 0.143 0.163
rm -r b finobt 0.173 0.127 0.193 0.153
----------+-------+--------+--------+--------+--------
stat normal 1.660 1.653 1.570 1.720
stat patched 1.737 1.727 1.700 1.630
stat finobt 1.767 1.640 1.557 1.763
----------+-------+--------+--------+--------+--------
sha check normal 58.467 58.603 58.550 58.370
sha check patched 58.157 58.183 58.620 58.343
sha check finobt 58.530 58.107 58.367 58.300
----------+-------+--------+--------+--------+--------
cp tarball normal 0.023 0.027 0.033 0.037
cp tarball patched 0.017 0.020 0.020 0.020
cp tarball finobt 0.020 0.037 0.033 0.020
----------+-------+--------+--------+--------+--------
TOTAL normal 171.361 171.613 170.837 171.576
TOTAL patched 170.745 171.259 170.836 171.435
TOTAL finobt 171.176 171.275 171.410 170.876
SYS
command patch v4 v4dirent v4d512bi v5
----------+-------+--------+--------+--------+--------
tar -xf normal 19.770 19.800 19.960 20.770
tar -xf patched 19.550 19.930 20.067 20.963
tar -xf finobt 20.010 19.707 19.707 21.397
----------+-------+--------+--------+--------+--------
sha256sum normal 17.157 14.607 14.393 16.053
sha256sum patched 17.277 14.813 14.550 15.007
sha256sum finobt 17.123 14.920 14.667 15.133
----------+-------+--------+--------+--------+--------
cp -r a b normal 18.697 18.973 18.687 19.253
cp -r a b patched 19.033 18.993 18.783 19.703
cp -r a b finobt 19.093 18.863 18.877 19.363
----------+-------+--------+--------+--------+--------
find . normal 0.820 0.573 0.537 0.597
find . patched 0.793 0.593 0.547 0.610
find . finobt 0.800 0.553 0.533 0.543
----------+-------+--------+--------+--------+--------
rm -r a normal 3.883 3.827 3.800 3.967
rm -r a patched 4.053 3.937 4.003 4.143
rm -r a finobt 4.010 4.020 3.983 4.290
----------+-------+--------+--------+--------+--------
cp -r b a normal 19.043 19.083 18.753 19.503
cp -r b a patched 19.203 19.100 19.040 19.680
cp -r b a finobt 19.133 18.973 18.950 19.607
----------+-------+--------+--------+--------+--------
rm -r b normal 4.097 3.947 3.900 4.123
rm -r b patched 4.287 4.067 4.093 4.227
rm -r b finobt 4.223 4.140 4.013 4.480
----------+-------+--------+--------+--------+--------
stat normal 11.337 10.730 10.727 10.680
stat patched 11.080 10.827 10.800 10.457
stat finobt 11.393 10.720 10.680 10.797
----------+-------+--------+--------+--------+--------
sha check normal 8.970 8.920 8.980 9.507
sha check patched 9.053 9.143 8.540 9.420
sha check finobt 8.653 9.020 8.863 9.103
----------+-------+--------+--------+--------+--------
cp tarball normal 5.537 5.397 5.470 5.373
cp tarball patched 5.390 5.313 5.460 5.343
cp tarball finobt 5.520 5.357 5.333 5.603
----------+-------+--------+--------+--------+--------
TOTAL normal 109.311 105.857 105.207 109.826
TOTAL patched 109.719 106.716 105.883 109.553
TOTAL finobt 109.958 106.273 105.606 110.316
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2013-09-09 2:35 UTC | newest]
Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-03 18:24 [RFC PATCH 00/11] xfs: introduce the free inode btree Brian Foster
2013-09-03 18:24 ` [RFC PATCH 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
2013-09-05 0:36 ` Dave Chinner
2013-09-03 18:24 ` [RFC PATCH 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
2013-09-05 0:39 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
2013-09-05 0:54 ` Dave Chinner
2013-09-05 16:17 ` Brian Foster
2013-09-06 0:07 ` Dave Chinner
2013-09-06 11:25 ` Brian Foster
2013-09-06 21:22 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 04/11] xfs: update inode allocation transaction reservations for finobt Brian Foster
2013-09-05 0:59 ` Dave Chinner
2013-09-05 16:17 ` Brian Foster
2013-09-06 0:11 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 05/11] xfs: update ifree " Brian Foster
2013-09-05 1:00 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 06/11] xfs: use correct transaction reservations in xfs_inactive() Brian Foster
2013-09-05 1:35 ` Dave Chinner
2013-09-05 16:18 ` Brian Foster
2013-09-03 18:25 ` [RFC PATCH 07/11] xfs: retry trans reservation on ENOSPC " Brian Foster
2013-09-05 1:40 ` Dave Chinner
2013-09-05 16:18 ` Brian Foster
2013-09-06 0:17 ` Dave Chinner
2013-09-06 11:30 ` Brian Foster
2013-09-03 18:25 ` [RFC PATCH 08/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
2013-09-05 2:10 ` Dave Chinner
2013-09-03 18:25 ` [RFC PATCH 09/11] xfs: use and update the finobt on inode allocation Brian Foster
2013-09-05 2:27 ` Dave Chinner
2013-09-05 16:18 ` Brian Foster
2013-09-03 18:25 ` [RFC PATCH 10/11] xfs: update the finobt on inode free Brian Foster
2013-09-05 2:54 ` Dave Chinner
2013-09-05 16:19 ` Brian Foster
2013-09-06 0:28 ` Dave Chinner
2013-09-06 11:39 ` Brian Foster
2013-09-06 21:24 ` Dave Chinner
2013-09-07 12:30 ` Brian Foster
2013-09-08 20:08 ` Michael L. Semon
2013-09-09 2:34 ` Better numbers " Michael L. Semon
2013-09-03 18:25 ` [RFC PATCH 11/11] xfs: add finobt support to growfs Brian Foster
2013-09-05 2:55 ` Dave Chinner
2013-09-05 21:17 ` [RFC PATCH 00/11] xfs: introduce the free inode btree Michael L. Semon
2013-09-06 11:17 ` Brian Foster
2013-09-06 21:35 ` Dave Chinner
2013-09-07 12:31 ` Brian Foster
2013-09-08 1:04 ` Michael L. Semon
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.