All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 00/18] xfs: sparse inode chunks
@ 2014-07-24 14:22 Brian Foster
  2014-07-24 14:22 ` [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count Brian Foster
                   ` (19 more replies)
  0 siblings, 20 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:22 UTC (permalink / raw)
  To: xfs

Hi all,

This is a first pass at sparse inode chunk support for XFS. Some
background on this work is available here:

http://oss.sgi.com/archives/xfs/2013-08/msg00346.html

The basic idea is to allow the partial allocation of inode chunks into
fragmented regions of free space. This is accomplished through addition
of a holemask field into the inobt record that defines what portion(s)
of an inode chunk are invalid (i.e., holes in the chunk). This work is
not quite complete, but is at a point where I'd like to start getting
feedback on the design and what direction to take for some of the known
gaps.

The basic breakdown of functionality in this set is as follows:

- Patches 1-2 - A couple generic cleanups that are dependencies for later
  patches in the series.
- Patches 3-5 - Basic data structure update, feature bit and minor
  helper introduction.
- Patches 6-7 - Update v5 icreate logging and recovery to handle sparse
  inode records.
- Patches 8-13 - Allocation support for sparse inode records. Physical
  chunk allocation and individual inode allocation.
- Patches 14-16 - Deallocation support for sparse inode chunks. Physical
  chunk deallocation, individual inode free and cluster free.
- Patch 17 - Fixes for bulkstat/inumbers.
- Patch 18 - Activate support for sparse chunk allocation and
  processing.

This work is lightly tested for regression (some xfstests failures due
to repair) and basic functionality. I have a new xfstests test I'll
forward along for demonstration purposes.

Some notes on gaps in the design:

- Sparse inode chunk allocation granularity:

The current minimum sparse chunk allocation granularity is the cluster
size. My initial attempts at this work tried to redefine to the minimum
chunk length based on the holemask granularity (a la the stale macro I
seemingly left in this series ;), but this involves tweaking the
codepaths that use the cluster size (i.e., imap) which proved rather
hairy. This also means we need a solution where an imap can change if an
inode was initially mapped as a sparse chunk and said chunk is
subsequently made full. E.g., we'd perhaps need to invalidate the inode
buffers for sparse chunks at the time where they are made full. Given
that, I landed on using the cluster size and leaving those codepaths as
is for the time being.

There is a tradeoff here for v5 superblocks because we've recently made
a change to scale the cluster size based on the factor increase in the
inode size from the default (see xfsprogs commit 7b5f9801). This means
that effectiveness of sparse chunks is tied to whether the level of free
space fragmentation matches the cluster size. By that I mean effectivess
is good (near 100% utilization possible) if free space fragmentation
leaves free extents around that at least match the cluster size. If
fragmentation is worse than the cluster size, effectiveness is reduced.
This can also be demonstrated with the forthcoming xfstests test.

- On-disk lifecycle of the sparse inode chunks feature bit:

We set an incompatible feature bit once a sparse inode chunk is
allocated because older revisions of code will interpret the non-zero
holemask bits in the higher order bytes of the record freecount. The
feature bit must be removed once all sparse inode chunks are eliminated
one way or another. This series does not currently remove the feature
bit once set simply because I hadn't thought through the mechanism quite
yet. For the next version, I'm thinking about adding an inobt walk
mechanism that can be conditionally invoked (i.e., feature bit is
currently set and a sparse inode chunk has been eliminated) either via
workqueue on an interval or during unmount if necessary. Thoughts or
alternative suggestions on that appreciated.

That's about it for now. Thoughts, reviews, flames appreciated. Thanks.

Brian

Brian Foster (18):
  xfs: refactor xfs_inobt_insert() to eliminate loop and support
    variable count
  xfs: pass xfs_mount directly to xfs_ialloc_cluster_alignment()
  xfs: define sparse inode chunks v5 sb feature bit and helper function
  xfs: introduce inode record hole mask for sparse inode chunks
  xfs: create macros/helpers for dealing with sparse inode chunks
  xfs: pass inode count through ordered icreate log item
  xfs: handle sparse inode chunks in icreate log recovery
  xfs: create helper to manage record overlap for sparse inode chunks
  xfs: allocate sparse inode chunks on full chunk allocation failure
  xfs: set sparse inodes feature bit when a sparse chunk is allocated
  xfs: reduce min. inode allocation space requirement for sparse inode
    chunks
  xfs: helper to convert inobt record holemask to inode alloc. bitmap
  xfs: filter out sparse regions from individual inode allocation
  xfs: update free inode record logic to support sparse inode records
  xfs: only free allocated regions of inode chunks
  xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()
  xfs: use actual inode count for sparse records in bulkstat/inumbers
  xfs: enable sparse inode chunks for v5 superblocks

 fs/xfs/libxfs/xfs_format.h       |  17 +-
 fs/xfs/libxfs/xfs_ialloc.c       | 441 +++++++++++++++++++++++++++++++++------
 fs/xfs/libxfs/xfs_ialloc.h       |  17 +-
 fs/xfs/libxfs/xfs_ialloc_btree.c |   4 +-
 fs/xfs/libxfs/xfs_sb.h           |   9 +-
 fs/xfs/xfs_inode.c               |  28 ++-
 fs/xfs/xfs_itable.c              |  12 +-
 fs/xfs/xfs_log_recover.c         |  23 +-
 8 files changed, 460 insertions(+), 91 deletions(-)

-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
@ 2014-07-24 14:22 ` Brian Foster
  2014-07-24 22:10   ` Dave Chinner
  2014-07-24 14:22 ` [PATCH 02/18] xfs: pass xfs_mount directly to xfs_ialloc_cluster_alignment() Brian Foster
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:22 UTC (permalink / raw)
  To: xfs

Inodes are always allocated in chunks of 64 and thus the loop in
xfs_inobt_insert() is unnecessary. Also replace the use of hardcoded
constants (i.e., XFS_INODES_PER_CHUNK) with values provided by the
caller. This prepares the codepath to support sparse inode chunks, which
might have varying numbers of free inodes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 37 ++++++++++++++++---------------------
 1 file changed, 16 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index b62771f..6e2ccb3 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -134,37 +134,32 @@ xfs_inobt_insert(
 	struct xfs_mount	*mp,
 	struct xfs_trans	*tp,
 	struct xfs_buf		*agbp,
-	xfs_agino_t		newino,
-	xfs_agino_t		newlen,
+	xfs_agino_t		newino,	/* start inode of record */
+	xfs_agino_t		count,	/* inode count */
+	xfs_inofree_t		free,	/* free mask */
 	xfs_btnum_t		btnum)
 {
 	struct xfs_btree_cur	*cur;
 	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
 	xfs_agnumber_t		agno = be32_to_cpu(agi->agi_seqno);
-	xfs_agino_t		thisino;
 	int			i;
 	int			error;
 
 	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
 
-	for (thisino = newino;
-	     thisino < newino + newlen;
-	     thisino += XFS_INODES_PER_CHUNK) {
-		error = xfs_inobt_lookup(cur, thisino, XFS_LOOKUP_EQ, &i);
-		if (error) {
-			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
-			return error;
-		}
-		ASSERT(i == 0);
+	error = xfs_inobt_lookup(cur, newino, XFS_LOOKUP_EQ, &i);
+	if (error) {
+		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+		return error;
+	}
+	ASSERT(i == 0);
 
-		error = xfs_inobt_insert_rec(cur, XFS_INODES_PER_CHUNK,
-					     XFS_INOBT_ALL_FREE, &i);
-		if (error) {
-			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
-			return error;
-		}
-		ASSERT(i == 1);
+	error = xfs_inobt_insert_rec(cur, count, free, &i);
+	if (error) {
+		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+		return error;
 	}
+	ASSERT(i == 1);
 
 	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 
@@ -517,13 +512,13 @@ xfs_ialloc_ag_alloc(
 	 * Insert records describing the new inode chunk into the btrees.
 	 */
 	error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
-				 XFS_BTNUM_INO);
+				 XFS_INOBT_ALL_FREE, XFS_BTNUM_INO);
 	if (error)
 		return error;
 
 	if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
 		error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
-					 XFS_BTNUM_FINO);
+					 XFS_INOBT_ALL_FREE, XFS_BTNUM_FINO);
 		if (error)
 			return error;
 	}
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 02/18] xfs: pass xfs_mount directly to xfs_ialloc_cluster_alignment()
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
  2014-07-24 14:22 ` [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count Brian Foster
@ 2014-07-24 14:22 ` Brian Foster
  2014-07-24 14:22 ` [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function Brian Foster
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:22 UTC (permalink / raw)
  To: xfs

xfs_ialloc_cluster_alignment() takes an args parameter but only uses the
xfs_mount pointer. Replace the args parameter with mp such that we can
use this function from additional contexts.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 6e2ccb3..5448a74 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -48,12 +48,12 @@
  */
 static inline int
 xfs_ialloc_cluster_alignment(
-	xfs_alloc_arg_t	*args)
+	struct xfs_mount *mp)
 {
-	if (xfs_sb_version_hasalign(&args->mp->m_sb) &&
-	    args->mp->m_sb.sb_inoalignmt >=
-	     XFS_B_TO_FSBT(args->mp, args->mp->m_inode_cluster_size))
-		return args->mp->m_sb.sb_inoalignmt;
+	if (xfs_sb_version_hasalign(&mp->m_sb) &&
+	    mp->m_sb.sb_inoalignmt >=
+	     XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size))
+		return mp->m_sb.sb_inoalignmt;
 	return 1;
 }
 
@@ -407,7 +407,7 @@ xfs_ialloc_ag_alloc(
 		 * but not to use them in the actual exact allocation.
 		 */
 		args.alignment = 1;
-		args.minalignslop = xfs_ialloc_cluster_alignment(&args) - 1;
+		args.minalignslop = xfs_ialloc_cluster_alignment(args.mp) - 1;
 
 		/* Allow space for the inode btree to split. */
 		args.minleft = args.mp->m_in_maxlevels - 1;
@@ -443,7 +443,7 @@ xfs_ialloc_ag_alloc(
 			args.alignment = args.mp->m_dalign;
 			isaligned = 1;
 		} else
-			args.alignment = xfs_ialloc_cluster_alignment(&args);
+			args.alignment = xfs_ialloc_cluster_alignment(args.mp);
 		/*
 		 * Need to figure out where to allocate the inode blocks.
 		 * Ideally they should be spaced out through the a.g.
@@ -472,7 +472,7 @@ xfs_ialloc_ag_alloc(
 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
 		args.agbno = be32_to_cpu(agi->agi_root);
 		args.fsbno = XFS_AGB_TO_FSB(args.mp, agno, args.agbno);
-		args.alignment = xfs_ialloc_cluster_alignment(&args);
+		args.alignment = xfs_ialloc_cluster_alignment(args.mp);
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
 	}
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
  2014-07-24 14:22 ` [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count Brian Foster
  2014-07-24 14:22 ` [PATCH 02/18] xfs: pass xfs_mount directly to xfs_ialloc_cluster_alignment() Brian Foster
@ 2014-07-24 14:22 ` Brian Foster
  2014-07-24 17:08   ` Mark Tinguely
  2014-07-24 23:35   ` Dave Chinner
  2014-07-24 14:22 ` [PATCH 04/18] xfs: introduce inode record hole mask for sparse inode chunks Brian Foster
                   ` (16 subsequent siblings)
  19 siblings, 2 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:22 UTC (permalink / raw)
  To: xfs

The sparse inode chunks feature will use the helper function to enable
the allocation of sparse inode chunks. The incompatible feature bit is
set on disk once a sparse inode chunk is allocated to prevent older
drivers from mounting an fs with sparse chunks.

Note that the feature is hardcoded disabled and the feature bit not
included in the all features mask until fully implemented.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_sb.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index c43c2d6..6f48de9 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -509,6 +509,7 @@ xfs_sb_has_ro_compat_feature(
 }
 
 #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
+#define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
 		(XFS_SB_FEAT_INCOMPAT_FTYPE)
 
@@ -558,6 +559,11 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
 }
 
+static inline int xfs_sb_version_hassparseinodes(struct xfs_sb *sbp)
+{
+	return 0; /* not yet enabled */
+}
+
 /*
  * end of superblock version macros
  */
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (2 preceding siblings ...)
  2014-07-24 14:22 ` [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function Brian Foster
@ 2014-07-24 14:22 ` Brian Foster
  2014-07-24 22:14   ` Dave Chinner
  2014-07-24 14:22 ` [PATCH 05/18] xfs: create macros/helpers for dealing with " Brian Foster
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:22 UTC (permalink / raw)
  To: xfs

The inode btrees track 64 inodes per record, regardless of inode size.
Thus, inode chunks on disk vary in size depending on the size of the
inodes. This creates a contiguous allocation requirement for new inode
chunks that can be difficult to satisfy on an aged and fragmented (free
space) filesystem.

The inode record freecount currently uses 4 bytes on disk to track the
free inode count. With a maximum freecount value of 64, only one byte is
required. Convert the freecount field to a single byte and reserve two
of the remaining 3 higher order bytes left to the hole mask field.

The hole mask field tracks potential holes in the chunks of physical
space that the inode record refers to. This facilitates the sparse
allocation of inode chunks when contiguous chunks are not available and
allows the inode btrees to identify what portions of the chunk contain
valid inodes.

Tracking holes means the field is initialized to zero and thus maintains
backwards compatibility with existing filesystems. E.g., the higher
order bytes of a counter with a max value of 64 are already initialized
to 0. Update the inode record management functions to handle the new
field and initialize it to zero for now.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h       | 7 +++++--
 fs/xfs/libxfs/xfs_ialloc.c       | 9 +++++++--
 fs/xfs/libxfs/xfs_ialloc_btree.c | 4 +++-
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 34d85ac..39022d9 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -221,13 +221,16 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
  */
 typedef struct xfs_inobt_rec {
 	__be32		ir_startino;	/* starting inode number */
-	__be32		ir_freecount;	/* count of free inodes (set bits) */
+	__be16		ir_holemask;	/* hole mask for sparse chunks */
+	__u8		ir_pad;
+	__u8		ir_freecount;	/* count of free inodes (set bits) */
 	__be64		ir_free;	/* free inode mask */
 } xfs_inobt_rec_t;
 
 typedef struct xfs_inobt_rec_incore {
 	xfs_agino_t	ir_startino;	/* starting inode number */
-	__int32_t	ir_freecount;	/* count of free inodes (set bits) */
+	__uint16_t	ir_holemask;	/* hole mask for sparse chunks */
+	__uint8_t	ir_freecount;	/* count of free inodes (set bits) */
 	xfs_inofree_t	ir_free;	/* free inode mask */
 } xfs_inobt_rec_incore_t;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 5448a74..8286fda 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -68,6 +68,7 @@ xfs_inobt_lookup(
 	int			*stat)	/* success/failure */
 {
 	cur->bc_rec.i.ir_startino = ino;
+	cur->bc_rec.i.ir_holemask = 0;
 	cur->bc_rec.i.ir_freecount = 0;
 	cur->bc_rec.i.ir_free = 0;
 	return xfs_btree_lookup(cur, dir, stat);
@@ -85,7 +86,9 @@ xfs_inobt_update(
 	union xfs_btree_rec	rec;
 
 	rec.inobt.ir_startino = cpu_to_be32(irec->ir_startino);
-	rec.inobt.ir_freecount = cpu_to_be32(irec->ir_freecount);
+	rec.inobt.ir_holemask = cpu_to_be16(irec->ir_holemask);
+	rec.inobt.ir_pad = 0;
+	rec.inobt.ir_freecount = irec->ir_freecount;
 	rec.inobt.ir_free = cpu_to_be64(irec->ir_free);
 	return xfs_btree_update(cur, &rec);
 }
@@ -105,7 +108,8 @@ xfs_inobt_get_rec(
 	error = xfs_btree_get_rec(cur, &rec, stat);
 	if (!error && *stat == 1) {
 		irec->ir_startino = be32_to_cpu(rec->inobt.ir_startino);
-		irec->ir_freecount = be32_to_cpu(rec->inobt.ir_freecount);
+		irec->ir_holemask = be16_to_cpu(rec->inobt.ir_holemask);
+		irec->ir_freecount = rec->inobt.ir_freecount;
 		irec->ir_free = be64_to_cpu(rec->inobt.ir_free);
 	}
 	return error;
@@ -121,6 +125,7 @@ xfs_inobt_insert_rec(
 	xfs_inofree_t		free,
 	int			*stat)
 {
+	cur->bc_rec.i.ir_holemask = 0;
 	cur->bc_rec.i.ir_freecount = freecount;
 	cur->bc_rec.i.ir_free = free;
 	return xfs_btree_insert(cur, stat);
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index c9b06f3..0c94879 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -169,7 +169,9 @@ xfs_inobt_init_rec_from_cur(
 	union xfs_btree_rec	*rec)
 {
 	rec->inobt.ir_startino = cpu_to_be32(cur->bc_rec.i.ir_startino);
-	rec->inobt.ir_freecount = cpu_to_be32(cur->bc_rec.i.ir_freecount);
+	rec->inobt.ir_holemask = cpu_to_be16(cur->bc_rec.i.ir_holemask);
+	rec->inobt.ir_pad = 0;
+	rec->inobt.ir_freecount = cur->bc_rec.i.ir_freecount;
 	rec->inobt.ir_free = cpu_to_be64(cur->bc_rec.i.ir_free);
 }
 
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 05/18] xfs: create macros/helpers for dealing with sparse inode chunks
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (3 preceding siblings ...)
  2014-07-24 14:22 ` [PATCH 04/18] xfs: introduce inode record hole mask for sparse inode chunks Brian Foster
@ 2014-07-24 14:22 ` Brian Foster
  2014-07-24 22:13   ` Dave Chinner
  2014-07-24 14:22 ` [PATCH 06/18] xfs: pass inode count through ordered icreate log item Brian Foster
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:22 UTC (permalink / raw)
  To: xfs

Sparse inode chunks allow the traditional inode btree record format to
describe an inode chunk that is not fully allocated and/or contiguous.
Define a couple constants that set requirements for allocation and
management of such chunks. Also define a helper to easily detect sparse
inode chunks.

The granularity of a sparse chunk is defined by the the 16-bit holemask
field in the inode record. Assuming 64 inodes per full chunk, a single
holemask bit accounts for 4 inodes. The minimum allocation requirement
for a sparse inode chunk is defined as the minimum number of blocks
required to meet the 4 inode granularity.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 39022d9..0baad50 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -211,6 +211,11 @@ typedef	__uint64_t	xfs_inofree_t;
 #define	XFS_INOBT_ALL_FREE		((xfs_inofree_t)-1)
 #define	XFS_INOBT_MASK(i)		((xfs_inofree_t)1 << (i))
 
+#define XFS_INODES_PER_SPCHUNK		\
+	(XFS_INODES_PER_CHUNK / (NBBY * sizeof(__uint16_t)))
+#define XFS_INOBT_MIN_SPCHUNKLEN(sb)	\
+	(roundup(XFS_INODES_PER_SPCHUNK, sb.sb_inopblock) / sb.sb_inopblock)
+
 static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
 {
 	return ((n >= XFS_INODES_PER_CHUNK ? 0 : XFS_INOBT_MASK(n)) - 1) << i;
@@ -234,6 +239,10 @@ typedef struct xfs_inobt_rec_incore {
 	xfs_inofree_t	ir_free;	/* free inode mask */
 } xfs_inobt_rec_incore_t;
 
+static inline bool xfs_inobt_issparse(struct xfs_inobt_rec_incore *rec)
+{
+	return rec->ir_holemask == 0 ? false : true;
+}
 
 /*
  * Key structure
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 06/18] xfs: pass inode count through ordered icreate log item
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (4 preceding siblings ...)
  2014-07-24 14:22 ` [PATCH 05/18] xfs: create macros/helpers for dealing with " Brian Foster
@ 2014-07-24 14:22 ` Brian Foster
  2014-07-24 14:22 ` [PATCH 07/18] xfs: handle sparse inode chunks in icreate log recovery Brian Foster
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:22 UTC (permalink / raw)
  To: xfs

v5 superblocks use an ordered log item for logging the initialization of
inode chunks. The icreate log item is currently hardcoded to an inode
count of 64 inodes.

The agbno and extent length are used to initialize the inode chunk from
log recovery. While an incorrect inode count does not lead to bad inode
chunk initialization, we should pass the correct inode count such that log
recovery has enough data to perform meaningful validity checks on the
chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 7 ++++---
 fs/xfs/libxfs/xfs_ialloc.h | 2 +-
 fs/xfs/xfs_log_recover.c   | 4 ++--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 8286fda..27d3437 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -223,6 +223,7 @@ xfs_ialloc_inode_init(
 	struct xfs_mount	*mp,
 	struct xfs_trans	*tp,
 	struct list_head	*buffer_list,
+	int			icount,
 	xfs_agnumber_t		agno,
 	xfs_agblock_t		agbno,
 	xfs_agblock_t		length,
@@ -278,7 +279,7 @@ xfs_ialloc_inode_init(
 		 * they track in the AIL as if they were physically logged.
 		 */
 		if (tp)
-			xfs_icreate_log(tp, agno, agbno, mp->m_ialloc_inos,
+			xfs_icreate_log(tp, agno, agbno, icount,
 					mp->m_sb.sb_inodesize, length, gen);
 	} else
 		version = 2;
@@ -497,8 +498,8 @@ xfs_ialloc_ag_alloc(
 	 * rather than a linear progression to prevent the next generation
 	 * number from being easily guessable.
 	 */
-	error = xfs_ialloc_inode_init(args.mp, tp, NULL, agno, args.agbno,
-			args.len, prandom_u32());
+	error = xfs_ialloc_inode_init(args.mp, tp, NULL, newlen, agno,
+			args.agbno, args.len, prandom_u32());
 
 	if (error)
 		return error;
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 95ad1c0..62c1381 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -156,7 +156,7 @@ int xfs_inobt_get_rec(struct xfs_btree_cur *cur,
  * Inode chunk initialisation routine
  */
 int xfs_ialloc_inode_init(struct xfs_mount *mp, struct xfs_trans *tp,
-			  struct list_head *buffer_list,
+			  struct list_head *buffer_list, int icount,
 			  xfs_agnumber_t agno, xfs_agblock_t agbno,
 			  xfs_agblock_t length, unsigned int gen);
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index fbc2362..cbe782a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3236,8 +3236,8 @@ xlog_recover_do_icreate_pass2(
 			XFS_AGB_TO_DADDR(mp, agno, agbno), length, 0))
 		return 0;
 
-	xfs_ialloc_inode_init(mp, NULL, buffer_list, agno, agbno, length,
-					be32_to_cpu(icl->icl_gen));
+	xfs_ialloc_inode_init(mp, NULL, buffer_list, count, agno, agbno, length,
+			      be32_to_cpu(icl->icl_gen));
 	return 0;
 }
 
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 07/18] xfs: handle sparse inode chunks in icreate log recovery
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (5 preceding siblings ...)
  2014-07-24 14:22 ` [PATCH 06/18] xfs: pass inode count through ordered icreate log item Brian Foster
@ 2014-07-24 14:22 ` Brian Foster
  2014-07-24 14:22 ` [PATCH 08/18] xfs: create helper to manage record overlap for sparse inode chunks Brian Foster
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:22 UTC (permalink / raw)
  To: xfs

Recovery of icreate transactions assumes hardcoded values for the inode
count and chunk length.

Sparse inode chunks are allocated in units of the cluster size. Update
the icreate validity checks to allow for appropriately sized inode
chunks and verify the inode count matches what is expected based on the
extent length rather than assuming a hardcoded count.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index cbe782a..5994fb5 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3213,12 +3213,19 @@ xlog_recover_do_icreate_pass2(
 		return -EINVAL;
 	}
 
-	/* existing allocation is fixed value */
-	ASSERT(count == mp->m_ialloc_inos);
-	ASSERT(length == mp->m_ialloc_blks);
-	if (count != mp->m_ialloc_inos ||
-	     length != mp->m_ialloc_blks) {
-		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad count 2");
+	/* inode chunk is either full or sparse (cluster sized) */
+	if (length != mp->m_ialloc_blks &&
+	    length != xfs_icluster_size_fsb(mp)) {
+		xfs_warn(log->l_mp,
+			 "%s: unsupported chunk length", __FUNCTION__);
+		return -EINVAL;
+	}
+
+	/* verify inode count is consistent with extent length */
+	if ((count >> mp->m_sb.sb_inopblog) != length) {
+		xfs_warn(log->l_mp,
+			 "%s: inconsistent inode count and chunk length",
+			 __FUNCTION__);
 		return -EINVAL;
 	}
 
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 08/18] xfs: create helper to manage record overlap for sparse inode chunks
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (6 preceding siblings ...)
  2014-07-24 14:22 ` [PATCH 07/18] xfs: handle sparse inode chunks in icreate log recovery Brian Foster
@ 2014-07-24 14:22 ` Brian Foster
  2014-07-24 22:41   ` Dave Chinner
  2014-07-24 14:22 ` [PATCH 09/18] xfs: allocate sparse inode chunks on full chunk allocation failure Brian Foster
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:22 UTC (permalink / raw)
  To: xfs

Create xfs_spchunk_has_record() to receive the parameters of a new
sparse inode chunk allocation and identify whether a record exists that
is capable of tracking this sparse chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 27d3437..be57b51 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -351,6 +351,61 @@ xfs_ialloc_inode_init(
 }
 
 /*
+ * Determine whether part of a sparse inode chunk that has just been allocated
+ * is covered by an existing inobt record.
+ */
+STATIC int
+xfs_spchunk_has_record(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	struct xfs_buf			*agbp,
+	xfs_agino_t			newino,
+	xfs_agino_t			count,
+	xfs_btnum_t			btnum,
+	struct xfs_inobt_rec_incore	*orec)
+{
+	struct xfs_btree_cur		*cur;
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	xfs_agino_t			previno;
+	int				error;
+	int				i;
+	struct xfs_inobt_rec_incore	rec;
+
+	orec->ir_startino = NULLAGINO;
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
+
+	previno = newino + count - XFS_INODES_PER_CHUNK;
+	error = xfs_inobt_lookup(cur, previno, XFS_LOOKUP_GE, &i);
+	if (error)
+		goto error;
+	if (i == 0)
+		goto out;
+
+	error = xfs_inobt_get_rec(cur, &rec, &i);
+	if (error)
+		goto error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+
+	if (rec.ir_startino > newino)
+		goto out;
+
+	ASSERT(rec.ir_startino <= newino &&
+	       rec.ir_startino + XFS_INODES_PER_CHUNK > newino);
+	ASSERT(rec.ir_freecount + count <= XFS_INODES_PER_CHUNK);
+
+	*orec = rec;
+
+out:
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+error:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/*
  * Allocate new inodes in the allocation group specified by agbp.
  * Return 0 for success, else error code.
  */
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 09/18] xfs: allocate sparse inode chunks on full chunk allocation failure
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (7 preceding siblings ...)
  2014-07-24 14:22 ` [PATCH 08/18] xfs: create helper to manage record overlap for sparse inode chunks Brian Foster
@ 2014-07-24 14:22 ` Brian Foster
  2014-07-24 14:23 ` [PATCH 10/18] xfs: set sparse inodes feature bit when a sparse chunk is allocated Brian Foster
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:22 UTC (permalink / raw)
  To: xfs

xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode
chunk. If all else fails, reduce the allocation to inode cluster size
and attempt to allocate a sparse inode chunk.

If sparse chunk allocation succeeds, check whether an inobt record
already exists that can track the chunk. If so, inherit and update the
existing record. Otherwise, insert a new record for the sparse chunk.

Update xfs_inobt_insert_rec() to take the holemask as a parameter and
set the associated field on disk. Convert xfs_inobt_insert() to
xfs_inobt_update_insert() to handle record insertion or update in a
generic fashion. This facilitates the continued use of the same function
for the inobt and finobt.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 105 +++++++++++++++++++++++++++++++++------------
 1 file changed, 77 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index be57b51..4226b1b 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -121,28 +121,28 @@ xfs_inobt_get_rec(
 STATIC int
 xfs_inobt_insert_rec(
 	struct xfs_btree_cur	*cur,
+	__uint16_t		holemask,
 	__int32_t		freecount,
 	xfs_inofree_t		free,
 	int			*stat)
 {
-	cur->bc_rec.i.ir_holemask = 0;
+	cur->bc_rec.i.ir_holemask = holemask;
 	cur->bc_rec.i.ir_freecount = freecount;
 	cur->bc_rec.i.ir_free = free;
 	return xfs_btree_insert(cur, stat);
 }
 
 /*
- * Insert records describing a newly allocated inode chunk into the inobt.
+ * Update or insert records describing a newly allocated inode chunk into the
+ * specified inobt.
  */
 STATIC int
-xfs_inobt_insert(
-	struct xfs_mount	*mp,
-	struct xfs_trans	*tp,
-	struct xfs_buf		*agbp,
-	xfs_agino_t		newino,	/* start inode of record */
-	xfs_agino_t		count,	/* inode count */
-	xfs_inofree_t		free,	/* free mask */
-	xfs_btnum_t		btnum)
+xfs_inobt_update_insert(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	struct xfs_buf			*agbp,
+	struct xfs_inobt_rec_incore	*rec,	/* record to update/insert */
+	xfs_btnum_t			btnum)
 {
 	struct xfs_btree_cur	*cur;
 	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
@@ -152,23 +152,25 @@ xfs_inobt_insert(
 
 	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
 
-	error = xfs_inobt_lookup(cur, newino, XFS_LOOKUP_EQ, &i);
-	if (error) {
-		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
-		return error;
-	}
-	ASSERT(i == 0);
-
-	error = xfs_inobt_insert_rec(cur, count, free, &i);
-	if (error) {
-		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
-		return error;
+	error = xfs_inobt_lookup(cur, rec->ir_startino, XFS_LOOKUP_EQ, &i);
+	if (i == 1) {
+		error = xfs_inobt_update(cur, rec);
+		if (error)
+			goto error;
+	} else {
+		error = xfs_inobt_insert_rec(cur, rec->ir_holemask,
+				rec->ir_freecount, rec->ir_free, &i);
+		if (error)
+			goto error;
+		ASSERT(i == 1);
 	}
-	ASSERT(i == 1);
 
 	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
-
 	return 0;
+
+error:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
 }
 
 /*
@@ -423,6 +425,10 @@ xfs_ialloc_ag_alloc(
 	xfs_agino_t	newlen;		/* new number of inodes */
 	int		isaligned = 0;	/* inode allocation at stripe unit */
 					/* boundary */
+	uint16_t	allocmask = (uint16_t) -1; /* init. to full chunk */
+	struct xfs_inobt_rec_incore rec;
+	int		offset;
+
 	struct xfs_perag *pag;
 
 	memset(&args, 0, sizeof(args));
@@ -538,6 +544,28 @@ xfs_ialloc_ag_alloc(
 			return error;
 	}
 
+	/*
+	 * Finally, try a sparse allocation if the filesystem supports it.
+	 */
+	if (xfs_sb_version_hassparseinodes(&args.mp->m_sb) &&
+	    args.fsbno == NULLFSBLOCK) {
+		args.type = XFS_ALLOCTYPE_NEAR_BNO;
+		args.agbno = be32_to_cpu(agi->agi_root);
+		args.fsbno = XFS_AGB_TO_FSB(args.mp, agno, args.agbno);
+		args.alignment = xfs_ialloc_cluster_alignment(args.mp);
+
+		/* allocate sparse regions in cluster granularity */
+		args.minlen = xfs_ialloc_cluster_alignment(args.mp);
+		args.maxlen = args.minlen;
+
+		error = xfs_alloc_vextent(&args);
+		if (error)
+			return error;
+
+		newlen = args.len << args.mp->m_sb.sb_inopblog;
+		allocmask = (1 << (newlen / XFS_INODES_PER_SPCHUNK)) - 1;
+	}
+
 	if (args.fsbno == NULLFSBLOCK) {
 		*alloc = 0;
 		return 0;
@@ -572,14 +600,34 @@ xfs_ialloc_ag_alloc(
 	/*
 	 * Insert records describing the new inode chunk into the btrees.
 	 */
-	error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
-				 XFS_INOBT_ALL_FREE, XFS_BTNUM_INO);
+	error = xfs_spchunk_has_record(args.mp, tp, agbp, newino, newlen,
+				       XFS_BTNUM_INO, &rec);
+	if (error)
+		return error;
+	if (rec.ir_startino == NULLAGINO) {
+		/* no existing record, set all fields */
+		rec.ir_startino = newino;
+		rec.ir_holemask = ~allocmask;
+		rec.ir_freecount = newlen;
+		rec.ir_free = XFS_INOBT_ALL_FREE;
+	} else {
+		/* we already have a record, update it */
+		offset = newino - rec.ir_startino;
+		ASSERT(offset % XFS_INODES_PER_SPCHUNK == 0);
+
+		allocmask <<= offset / XFS_INODES_PER_SPCHUNK;
+
+		rec.ir_freecount += newlen;
+		rec.ir_holemask &= ~allocmask;
+	}
+
+	error = xfs_inobt_update_insert(args.mp, tp, agbp, &rec, XFS_BTNUM_INO);
 	if (error)
 		return error;
 
 	if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
-		error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
-					 XFS_INOBT_ALL_FREE, XFS_BTNUM_FINO);
+		error = xfs_inobt_update_insert(args.mp, tp, agbp, &rec,
+						XFS_BTNUM_FINO);
 		if (error)
 			return error;
 	}
@@ -1644,7 +1692,8 @@ xfs_difree_finobt(
 		 */
 		XFS_WANT_CORRUPTED_GOTO(ibtrec->ir_freecount == 1, error);
 
-		error = xfs_inobt_insert_rec(cur, ibtrec->ir_freecount,
+		error = xfs_inobt_insert_rec(cur, ibtrec->ir_holemask,
+					     ibtrec->ir_freecount,
 					     ibtrec->ir_free, &i);
 		if (error)
 			goto error;
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 10/18] xfs: set sparse inodes feature bit when a sparse chunk is allocated
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (8 preceding siblings ...)
  2014-07-24 14:22 ` [PATCH 09/18] xfs: allocate sparse inode chunks on full chunk allocation failure Brian Foster
@ 2014-07-24 14:23 ` Brian Foster
  2014-07-24 22:46   ` Dave Chinner
  2014-07-24 14:23 ` [PATCH 11/18] xfs: reduce min. inode allocation space requirement for sparse inode chunks Brian Foster
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:23 UTC (permalink / raw)
  To: xfs

The sparse chunk inode record format is backwards compatible with old
format inobt records as long as full chunks are allocated. The holemask
field uses higher order bytes of the freecount. While sparse chunks can
be enabled on previously unsupported fs, older kernel drivers cannot
parse sparse inode records.

Set the feature incompatible bit once a sparse inode chunk is allocated
to prevent older XFS drivers from tripping over the new format.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 4226b1b..4dd45c2 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -407,6 +407,27 @@ error:
 	return error;
 }
 
+STATIC void
+xfs_sbversion_add_spinodes(
+	struct xfs_trans	*tp,
+	struct xfs_mount	*mp)
+{
+	if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
+					    XFS_SB_FEAT_INCOMPAT_SPINODES))
+		return;
+
+	spin_lock(&mp->m_sb_lock);
+	if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
+					    XFS_SB_FEAT_INCOMPAT_SPINODES)) {
+		spin_unlock(&mp->m_sb_lock);
+		return;
+	}
+
+	mp->m_sb.sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_SPINODES;
+	spin_unlock(&mp->m_sb_lock);
+	xfs_mod_sb(tp, XFS_SB_FEATURES_INCOMPAT);
+}
+
 /*
  * Allocate new inodes in the allocation group specified by agbp.
  * Return 0 for success, else error code.
@@ -631,6 +652,18 @@ xfs_ialloc_ag_alloc(
 		if (error)
 			return error;
 	}
+
+	/*
+	 * Set an incompat feature bit as old drivers can't parse sparse
+	 * records. Pre-sparse inode chunk drivers will include the
+	 * holemask in the higher order freecount bits, resulting in a
+	 * bogus value.
+	 *
+	 * XXX: when is this bit removed?
+	 */
+	if (xfs_inobt_issparse(&rec))
+		xfs_sbversion_add_spinodes(tp, args.mp);
+
 	/*
 	 * Log allocation group header fields
 	 */
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 11/18] xfs: reduce min. inode allocation space requirement for sparse inode chunks
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (9 preceding siblings ...)
  2014-07-24 14:23 ` [PATCH 10/18] xfs: set sparse inodes feature bit when a sparse chunk is allocated Brian Foster
@ 2014-07-24 14:23 ` Brian Foster
  2014-07-24 22:50   ` Dave Chinner
  2014-07-24 14:23 ` [PATCH 12/18] xfs: helper to convert inobt record holemask to inode alloc. bitmap Brian Foster
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:23 UTC (permalink / raw)
  To: xfs

xfs_ialloc_ag_select() iterates through the allocation groups looking
for free inodes or free space to determine whether to allow an inode
allocation to proceed. If no free inodes are available, it assumes that
an AG must have an extent longer than mp->m_ialloc_blks.

Sparse inode chunk support currently allows for allocations as small as
the cluster size. If sparse inodes are supported, use the cluster
alignment as a minimum extent size limit to determine whether an inode
chunk allocation attempt can proceed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 4dd45c2..4e98a21 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -770,9 +770,15 @@ xfs_ialloc_ag_select(
 
 		/*
 		 * Is there enough free space for the file plus a block of
-		 * inodes? (if we need to allocate some)?
+		 * inodes? (if we need to allocate some)? If sparse inode chunks
+		 * are supported, we only require an extent of length equal to
+		 * the cluster size.
 		 */
-		ineed = mp->m_ialloc_blks;
+		if (xfs_sb_version_hassparseinodes(&mp->m_sb))
+			ineed = xfs_ialloc_cluster_alignment(mp);
+		else
+			ineed = mp->m_ialloc_blks;
+
 		longest = pag->pagf_longest;
 		if (!longest)
 			longest = pag->pagf_flcount > 0;
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 12/18] xfs: helper to convert inobt record holemask to inode alloc. bitmap
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (10 preceding siblings ...)
  2014-07-24 14:23 ` [PATCH 11/18] xfs: reduce min. inode allocation space requirement for sparse inode chunks Brian Foster
@ 2014-07-24 14:23 ` Brian Foster
  2014-07-24 23:21   ` Dave Chinner
  2014-07-24 14:23 ` [PATCH 13/18] xfs: filter out sparse regions from individual inode allocation Brian Foster
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:23 UTC (permalink / raw)
  To: xfs

The inobt record holemask field is a condensed data type designed to fit
into the existing on-disk record and is zero based (allocated regions
are set to 0, sparse regions are set to 1) to provide backwards
compatibility. Thus the type is unnecessarily complex for use in higher
level inode manipulations such as individual inode allocations, etc.

Rather than foist the complexity of dealing with this field to every bit
of logic that requires inode chunk allocation information, create the
xfs_inobt_ialloc_bitmap() helper to convert the inobt record holemask to
an inode allocation bitmap. The inode allocation bitmap is inode
granularity similar to the inobt record free mask and indicates which
inodes of the chunk are physically allocated on disk irrespective of
whether the inode is considered allocated or free by the filesystem.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h |  1 +
 fs/xfs/libxfs/xfs_ialloc.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 0baad50..cbc3296 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -206,6 +206,7 @@ typedef __be32 xfs_alloc_ptr_t;
 #define	XFS_FIBT_CRC_MAGIC	0x46494233	/* 'FIB3' */
 
 typedef	__uint64_t	xfs_inofree_t;
+typedef	__uint64_t	xfs_inoalloc_t;
 #define	XFS_INODES_PER_CHUNK		(NBBY * sizeof(xfs_inofree_t))
 #define	XFS_INODES_PER_CHUNK_LOG	(XFS_NBBYLOG + 3)
 #define	XFS_INOBT_ALL_FREE		((xfs_inofree_t)-1)
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 4e98a21..166602e 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -863,6 +863,73 @@ xfs_ialloc_get_rec(
 }
 
 /*
+ * Convert the inode record holemask to an inode allocation bitmap. The inode
+ * allocation bitmap is inode granularity and specifies whether an inode is
+ * physically allocated on disk (not whether the inode is considered allocated
+ * or free by the fs).
+ */
+STATIC xfs_inoalloc_t
+xfs_inobt_ialloc_bitmap(
+	struct xfs_inobt_rec_incore	*rec)
+{
+	xfs_inofree_t	bitmap = 0;
+	xfs_inofree_t	sparsebits;
+	int		nextbit;
+	int		shift;
+	__uint16_t	allocmask;
+	uint		allocbitmap;
+
+	/*
+	 * Each holemask bit represents XFS_INODES_PER_SPCHUNK inodes. Determine
+	 * the inode bits per holemask bit.
+	 */
+	sparsebits = xfs_mask64lo(XFS_INODES_PER_SPCHUNK);
+
+	/*
+	 * The bit flip and type conversion are intentionally done separately
+	 * here to zero-extend the bitmask.
+	 */
+	allocmask = ~rec->ir_holemask;
+	allocbitmap = allocmask;
+
+	/*
+	 * Each bit of allocbitmap represents an allocated region of the inode
+	 * chunk. Thus, each bit represents XFS_INODES_PER_SPCHUNK physical
+	 * inodes. Walk through allocbitmap and set the associated individual
+	 * inode bits in the inode bitmap for each allocated chunk.
+	 *
+	 * For example, consider a 512b inode fs with a cluster size of 16k.
+	 * Each holemask bit represents 4 inodes and each cluster contains 32
+	 * inodes. Since sparse chunks are allocated at cluster granularity, a
+	 * valid 16-bit holemask (and negated allocbitmap) with this geometry
+	 * might look as follows:
+	 *
+	 * 	holemask		~	allocbitmap
+	 * 	0000 0000 1111 1111	=>	1111 1111 0000 0000
+	 *
+	 * At 4 inodes per bit, this indicates that the first 32 inodes of the
+	 * chunk are not physically allocated inodes. This is a hole from the
+	 * perspective of the inode record. Converting the allocbitmap to a
+	 * 64-bit inode allocation bitmap yields:
+	 *
+	 * 	0xFFFFFFFF00000000
+	 *
+	 * ... where any of the 32 inodes defined by the higher order 32 bits of
+	 * the map could be in use or free. Logically AND this bitmap with the
+	 * record ir_free map to identify which of the physically allocated
+	 * inodes are in use.
+	 */
+	nextbit = xfs_next_bit(&allocbitmap, 1, 0);
+	while (nextbit != -1) {
+		shift = nextbit * XFS_INODES_PER_SPCHUNK;
+		bitmap |= (sparsebits << shift);
+		nextbit = xfs_next_bit(&allocbitmap, 1, nextbit + 1);
+	}
+
+	return bitmap;
+}
+
+/*
  * Allocate an inode using the inobt-only algorithm.
  */
 STATIC int
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 13/18] xfs: filter out sparse regions from individual inode allocation
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (11 preceding siblings ...)
  2014-07-24 14:23 ` [PATCH 12/18] xfs: helper to convert inobt record holemask to inode alloc. bitmap Brian Foster
@ 2014-07-24 14:23 ` Brian Foster
  2014-07-24 14:23 ` [PATCH 14/18] xfs: update free inode record logic to support sparse inode records Brian Foster
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:23 UTC (permalink / raw)
  To: xfs

Inode allocation from an existing record with free inodes traditionally
selects the first inode available according to the ir_free mask. With
sparse inode chunks, the ir_free mask could refer to an unallocated
region. We must mask the unallocated regions out of ir_free before using
it to select a free inode in the chunk.

Create the xfs_inobt_first_free_inode() helper to find the first free
inode available of the allocated regions of the inode chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 166602e..1d74394 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -930,6 +930,27 @@ xfs_inobt_ialloc_bitmap(
 }
 
 /*
+ * Return the offset of the first free inode in the record. If the inode chunk
+ * is sparsely allocated, we convert the record holemask to inode granularity
+ * and mask off the unallocated regions from the inode free mask.
+ */
+STATIC int
+xfs_inobt_first_free_inode(
+	struct xfs_inobt_rec_incore	*rec)
+{
+	xfs_inofree_t	realfree;
+
+	/* if there are no holes, return the first available offset */
+	if (!xfs_inobt_issparse(rec))
+		return xfs_lowbit64(rec->ir_free);
+
+	realfree = xfs_inobt_ialloc_bitmap(rec);
+	realfree &= rec->ir_free;
+
+	return xfs_lowbit64(realfree);
+}
+
+/*
  * Allocate an inode using the inobt-only algorithm.
  */
 STATIC int
@@ -1159,7 +1180,7 @@ newino:
 	}
 
 alloc_inode:
-	offset = xfs_lowbit64(rec.ir_free);
+	offset = xfs_inobt_first_free_inode(&rec);
 	ASSERT(offset >= 0);
 	ASSERT(offset < XFS_INODES_PER_CHUNK);
 	ASSERT((XFS_AGINO_TO_OFFSET(mp, rec.ir_startino) %
@@ -1413,7 +1434,7 @@ xfs_dialloc_ag(
 	if (error)
 		goto error_cur;
 
-	offset = xfs_lowbit64(rec.ir_free);
+	offset = xfs_inobt_first_free_inode(&rec);
 	ASSERT(offset >= 0);
 	ASSERT(offset < XFS_INODES_PER_CHUNK);
 	ASSERT((XFS_AGINO_TO_OFFSET(mp, rec.ir_startino) %
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 14/18] xfs: update free inode record logic to support sparse inode records
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (12 preceding siblings ...)
  2014-07-24 14:23 ` [PATCH 13/18] xfs: filter out sparse regions from individual inode allocation Brian Foster
@ 2014-07-24 14:23 ` Brian Foster
  2014-07-24 14:23 ` [PATCH 15/18] xfs: only free allocated regions of inode chunks Brian Foster
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:23 UTC (permalink / raw)
  To: xfs

xfs_difree_inobt() uses logic in a couple places that assume inobt
records refer to fully allocated chunks. Specifically, the use of
mp->m_ialloc_inos can cause problems for inode chunks that are sparsely
allocated. Sparse inode chunks can, by definition, define a smaller
number of inodes than a full inode chunk.

Fix the logic that determines whether an inode record should be removed
from the inobt to use the ir_free mask rather than ir_freecount.

Fix the agi counters modification to use ir_freecount to add the actual
number of inodes freed rather than assuming a full inode chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 1d74394..f75f191 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1724,7 +1724,7 @@ xfs_difree_inobt(
 	 * When an inode cluster is free, it becomes eligible for removal
 	 */
 	if (!(mp->m_flags & XFS_MOUNT_IKEEP) &&
-	    (rec.ir_freecount == mp->m_ialloc_inos)) {
+	    (rec.ir_free == XFS_INOBT_ALL_FREE)) {
 
 		*deleted = 1;
 		*first_ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino);
@@ -1734,7 +1734,7 @@ xfs_difree_inobt(
 		 * AGI and Superblock inode counts, and mark the disk space
 		 * to be freed when the transaction is committed.
 		 */
-		ilen = mp->m_ialloc_inos;
+		ilen = rec.ir_freecount;
 		be32_add_cpu(&agi->agi_count, -ilen);
 		be32_add_cpu(&agi->agi_freecount, -(ilen - 1));
 		xfs_ialloc_log_agi(tp, agbp, XFS_AGI_COUNT | XFS_AGI_FREECOUNT);
@@ -1856,7 +1856,7 @@ xfs_difree_finobt(
 	 * keeping inode chunks permanently on disk, remove the record.
 	 * Otherwise, update the record with the new information.
 	 */
-	if (rec.ir_freecount == mp->m_ialloc_inos &&
+	if (rec.ir_free == XFS_INOBT_ALL_FREE &&
 	    !(mp->m_flags & XFS_MOUNT_IKEEP)) {
 		error = xfs_btree_delete(cur, &i);
 		if (error)
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 15/18] xfs: only free allocated regions of inode chunks
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (13 preceding siblings ...)
  2014-07-24 14:23 ` [PATCH 14/18] xfs: update free inode record logic to support sparse inode records Brian Foster
@ 2014-07-24 14:23 ` Brian Foster
  2014-07-24 23:24   ` Dave Chinner
  2014-07-24 14:23 ` [PATCH 16/18] xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() Brian Foster
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:23 UTC (permalink / raw)
  To: xfs

An inode chunk is currently added to the transaction free list based on
a simple fsb conversion and hardcoded chunk length. The nature of sparse
chunks is such that the physical chunk of inodes on disk may consist of
one or more discontiguous parts. Blocks that reside in the holes of the
inode chunk are not inodes and could be allocated to any other use or
not allocated at all.

Refactor the existing xfs_bmap_add_free() call into the
xfs_difree_inode_chunk() helper. The new helper uses the existing
calculation if a chunk is not sparse. Otherwise, use the inobt record
holemask to free the contiguous regions of the chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 64 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index f75f191..1be57b1 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1659,6 +1659,66 @@ out_error:
 	return error;
 }
 
+/*
+ * Free the blocks of an inode chunk. We must consider that the inode chunk
+ * might be sparse and only free the regions that are allocated as part of the
+ * chunk.
+ */
+STATIC void
+xfs_difree_inode_chunk(
+	struct xfs_mount		*mp,
+	xfs_agnumber_t			agno,
+	struct xfs_inobt_rec_incore	*rec,
+	struct xfs_bmap_free		*flist)
+{
+	xfs_agblock_t	sagbno = XFS_AGINO_TO_AGBNO(mp, rec->ir_startino);
+	xfs_agblock_t	agbno;
+	int		nextbit;
+	int		contig, contigblk;
+	__uint16_t	allocmask;
+	uint		allocbitmap;
+
+	if (!xfs_inobt_issparse(rec)) {
+		/* not sparse, calculate extent info directly */
+		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp, agno,
+				  XFS_AGINO_TO_AGBNO(mp, rec->ir_startino)),
+				  mp->m_ialloc_blks, flist, mp);
+		return;
+	}
+
+	/*
+	 * The bit flip and type conversion are intentionally done separately
+	 * here to zero-extend the bitmask.
+	 */
+	allocmask = ~rec->ir_holemask;
+	allocbitmap = allocmask;
+
+	/*
+	 * We now have an allocation bitmap in units of inodes at sparse chunk
+	 * granularity (e.g., more than one inode per bit). Use the bitmask
+	 * functions to find each contigious range of bits in the map. For each
+	 * range, convert the start bit and count to block values and use that
+	 * data to add the associated extent to the free list.
+	 */
+	nextbit = xfs_next_bit(&allocbitmap, 1, 0);
+	while (nextbit != -1) {
+		agbno = (nextbit * XFS_INODES_PER_SPCHUNK) /
+				mp->m_sb.sb_inopblock;
+		agbno += sagbno;
+
+		contig = xfs_contig_bits(&allocbitmap, 1, nextbit);
+		contigblk = (contig * XFS_INODES_PER_SPCHUNK) /
+				mp->m_sb.sb_inopblock;
+
+		ASSERT(agbno % xfs_ialloc_cluster_alignment(mp) == 0);
+		ASSERT(contigblk % xfs_ialloc_cluster_alignment(mp) == 0);
+		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp, agno, agbno), contigblk,
+				  flist, mp);
+
+		nextbit = xfs_next_bit(&allocbitmap, 1, nextbit + contig + 1);
+	}
+}
+
 STATIC int
 xfs_difree_inobt(
 	struct xfs_mount		*mp,
@@ -1750,9 +1810,7 @@ xfs_difree_inobt(
 			goto error0;
 		}
 
-		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp, agno,
-				  XFS_AGINO_TO_AGBNO(mp, rec.ir_startino)),
-				  mp->m_ialloc_blks, flist, mp);
+		xfs_difree_inode_chunk(mp, agno, &rec, flist);
 	} else {
 		*deleted = 0;
 
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 16/18] xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (14 preceding siblings ...)
  2014-07-24 14:23 ` [PATCH 15/18] xfs: only free allocated regions of inode chunks Brian Foster
@ 2014-07-24 14:23 ` Brian Foster
  2014-07-24 14:23 ` [PATCH 17/18] xfs: use actual inode count for sparse records in bulkstat/inumbers Brian Foster
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:23 UTC (permalink / raw)
  To: xfs

xfs_ifree_cluster() is called to mark all in-memory inodes and inode
buffers as stale. This occurs after we've removed the inobt records and
dropped any references of inobt data. xfs_ifree_cluster() uses the
starting inode number to walk the namespace of inodes expected for a
single chunk a cluster buffer at a time. The cluster buffer disk
addresses are calculated by decoding the sequential inode numbers
expected from the chunk.

The problem with this approach is that if the inode chunk being removed
is a sparse chunk, not all of the buffer addresses that are calculated
as part of this sequence may be inode clusters. Attempting to acquire
the buffer based on expected inode characterstics (i.e., cluster length)
can lead to errors and is generally incorrect.

We already use a couple variables to carry requisite state from
xfs_difree() to xfs_ifree_cluster(). Rather than add a third, define a
new internal structure to carry the existing parameters through these
functions. Add an alloc field that represents the physical allocation
bitmap of inodes in the chunk being removed. Modify xfs_ifree_cluster()
to check each inode against the bitmap and skip the clusters that were
never allocated as real inodes on disk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 16 +++++++---------
 fs/xfs/libxfs/xfs_ialloc.h | 10 ++++++++--
 fs/xfs/xfs_inode.c         | 28 ++++++++++++++++++++--------
 3 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 1be57b1..86c6ccd 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1726,8 +1726,7 @@ xfs_difree_inobt(
 	struct xfs_buf			*agbp,
 	xfs_agino_t			agino,
 	struct xfs_bmap_free		*flist,
-	int				*deleted,
-	xfs_ino_t			*first_ino,
+	struct xfs_icluster		*xic,
 	struct xfs_inobt_rec_incore	*orec)
 {
 	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
@@ -1786,8 +1785,9 @@ xfs_difree_inobt(
 	if (!(mp->m_flags & XFS_MOUNT_IKEEP) &&
 	    (rec.ir_free == XFS_INOBT_ALL_FREE)) {
 
-		*deleted = 1;
-		*first_ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino);
+		xic->deleted = 1;
+		xic->first_ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino);
+		xic->alloc = xfs_inobt_ialloc_bitmap(&rec);
 
 		/*
 		 * Remove the inode cluster from the AGI B+Tree, adjust the
@@ -1812,7 +1812,7 @@ xfs_difree_inobt(
 
 		xfs_difree_inode_chunk(mp, agno, &rec, flist);
 	} else {
-		*deleted = 0;
+		xic->deleted = 0;
 
 		error = xfs_inobt_update(cur, &rec);
 		if (error) {
@@ -1950,8 +1950,7 @@ xfs_difree(
 	struct xfs_trans	*tp,		/* transaction pointer */
 	xfs_ino_t		inode,		/* inode to be freed */
 	struct xfs_bmap_free	*flist,		/* extents to free */
-	int			*deleted,/* set if inode cluster was deleted */
-	xfs_ino_t		*first_ino)/* first inode in deleted cluster */
+	struct xfs_icluster	*xic)	/* cluster info if deleted */
 {
 	/* REFERENCED */
 	xfs_agblock_t		agbno;	/* block number containing inode */
@@ -2002,8 +2001,7 @@ xfs_difree(
 	/*
 	 * Fix up the inode allocation btree.
 	 */
-	error = xfs_difree_inobt(mp, tp, agbp, agino, flist, deleted, first_ino,
-				 &rec);
+	error = xfs_difree_inobt(mp, tp, agbp, agino, flist, xic, &rec);
 	if (error)
 		goto error0;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 62c1381..5aa8d6f 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -28,6 +28,13 @@ struct xfs_btree_cur;
 /* Move inodes in clusters of this size */
 #define	XFS_INODE_BIG_CLUSTER_SIZE	8192
 
+struct xfs_icluster {
+	bool		deleted;	/* record is deleted */
+	xfs_ino_t	first_ino;	/* first inode number */
+	xfs_inoalloc_t	alloc;		/* inode phys. allocation bitmap for
+					 * sparse chunks */
+};
+
 /* Calculate and return the number of filesystem blocks per inode cluster */
 static inline int
 xfs_icluster_size_fsb(
@@ -90,8 +97,7 @@ xfs_difree(
 	struct xfs_trans *tp,		/* transaction pointer */
 	xfs_ino_t	inode,		/* inode to be freed */
 	struct xfs_bmap_free *flist,	/* extents to free */
-	int		*deleted,	/* set if inode cluster was deleted */
-	xfs_ino_t	*first_ino);	/* first inode in deleted cluster */
+	struct xfs_icluster *ifree);	/* cluster info if deleted */
 
 /*
  * Return the location of the inode in imap, for mapping it into a buffer.
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 1a5e068..641fe34 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2181,9 +2181,9 @@ xfs_iunlink_remove(
  */
 STATIC int
 xfs_ifree_cluster(
-	xfs_inode_t	*free_ip,
-	xfs_trans_t	*tp,
-	xfs_ino_t	inum)
+	xfs_inode_t		*free_ip,
+	xfs_trans_t		*tp,
+	struct xfs_icluster	*xic)
 {
 	xfs_mount_t		*mp = free_ip->i_mount;
 	int			blks_per_cluster;
@@ -2196,13 +2196,26 @@ xfs_ifree_cluster(
 	xfs_inode_log_item_t	*iip;
 	xfs_log_item_t		*lip;
 	struct xfs_perag	*pag;
+	xfs_ino_t		inum;
 
+	inum = xic->first_ino;
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum));
 	blks_per_cluster = xfs_icluster_size_fsb(mp);
 	inodes_per_cluster = blks_per_cluster << mp->m_sb.sb_inopblog;
 	nbufs = mp->m_ialloc_blks / blks_per_cluster;
 
 	for (j = 0; j < nbufs; j++, inum += inodes_per_cluster) {
+		/*
+		 * The allocation bitmap tells us which inodes of the chunk were
+		 * physically allocated. Skip the cluster if an inode falls into
+		 * a sparse region.
+		 */
+		if ((xic->alloc & XFS_INOBT_MASK(inum - xic->first_ino)) == 0) {
+			ASSERT(((inum - xic->first_ino) %
+				inodes_per_cluster) == 0);
+			continue;
+		}
+
 		blkno = XFS_AGB_TO_DADDR(mp, XFS_INO_TO_AGNO(mp, inum),
 					 XFS_INO_TO_AGBNO(mp, inum));
 
@@ -2360,8 +2373,7 @@ xfs_ifree(
 	xfs_bmap_free_t	*flist)
 {
 	int			error;
-	int			delete;
-	xfs_ino_t		first_ino;
+	struct xfs_icluster	xic = { 0 };
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(ip->i_d.di_nlink == 0);
@@ -2377,7 +2389,7 @@ xfs_ifree(
 	if (error)
 		return error;
 
-	error = xfs_difree(tp, ip->i_ino, flist, &delete, &first_ino);
+	error = xfs_difree(tp, ip->i_ino, flist, &xic);
 	if (error)
 		return error;
 
@@ -2394,8 +2406,8 @@ xfs_ifree(
 	ip->i_d.di_gen++;
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 
-	if (delete)
-		error = xfs_ifree_cluster(ip, tp, first_ino);
+	if (xic.deleted)
+		error = xfs_ifree_cluster(ip, tp, &xic);
 
 	return error;
 }
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 17/18] xfs: use actual inode count for sparse records in bulkstat/inumbers
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (15 preceding siblings ...)
  2014-07-24 14:23 ` [PATCH 16/18] xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() Brian Foster
@ 2014-07-24 14:23 ` Brian Foster
  2014-07-24 23:29   ` Dave Chinner
  2014-07-24 14:23 ` [PATCH 18/18] xfs: enable sparse inode chunks for v5 superblocks Brian Foster
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:23 UTC (permalink / raw)
  To: xfs

The bulkstat and inumbers mechanisms make the assumption that inode
records consist of a full 64 inode chunk in several places. E.g., this
is used to track how many inodes have been processed overall as well as
to determine when all in-use inodes of a record have been processed. The
record processing, in particular, increments the record freecount for
each in-use inode until it hits the expected max of 64.

This is invalid for sparse inode records. While all inodes might be
marked free in the free mask regardless of whether they are allocated on
disk, ir_freecount is based on the total number of physically allocated
inodes and thus may be less than 64 inodes on a completely free inode
chunk.

Create the xfs_inobt_count() helper to calculate the total number of
physically allocated inodes based on the holemask. Use the helper in
xfs_bulkstat() and xfs_inumbers() instead of the fixed
XFS_INODE_PER_CHUNK value to ensure correct accounting in the event of
sparse inode records.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 27 +++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_ialloc.h |  5 +++++
 fs/xfs/xfs_itable.c        | 12 +++++++-----
 3 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 86c6ccd..daf317f 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -951,6 +951,33 @@ xfs_inobt_first_free_inode(
 }
 
 /*
+ * Calculate the real count of inodes in a chunk.
+ */
+int
+xfs_inobt_count(
+	struct xfs_inobt_rec_incore	*rec)
+{
+	__uint16_t	allocmask;
+	uint		allocbitmap;
+	int		nextbit;
+	int		count = 0;
+
+	if (!xfs_inobt_issparse(rec))
+		return XFS_INODES_PER_CHUNK;
+
+	allocmask = ~rec->ir_holemask;
+	allocbitmap = allocmask;
+
+	nextbit = xfs_next_bit(&allocbitmap, 1, 0);
+	while (nextbit != -1) {
+		count += XFS_INODES_PER_SPCHUNK;
+		nextbit = xfs_next_bit(&allocbitmap, 1, nextbit + 1);
+	}
+
+	return count;
+}
+
+/*
  * Allocate an inode using the inobt-only algorithm.
  */
 STATIC int
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 5aa8d6f..4230b22 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -166,4 +166,9 @@ int xfs_ialloc_inode_init(struct xfs_mount *mp, struct xfs_trans *tp,
 			  xfs_agnumber_t agno, xfs_agblock_t agbno,
 			  xfs_agblock_t length, unsigned int gen);
 
+/*
+ * Calculate the real count of inodes in a chunk.
+ */
+int xfs_inobt_count(struct xfs_inobt_rec_incore *rec);
+
 #endif	/* __XFS_IALLOC_H__ */
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 7e54992..8ba967c 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -312,11 +312,12 @@ xfs_bulkstat(
 				}
 				r.ir_free |= xfs_inobt_maskn(0, chunkidx);
 				irbp->ir_startino = r.ir_startino;
+				irbp->ir_holemask = r.ir_holemask;
 				irbp->ir_freecount = r.ir_freecount;
 				irbp->ir_free = r.ir_free;
 				irbp++;
 				agino = r.ir_startino + XFS_INODES_PER_CHUNK;
-				icount = XFS_INODES_PER_CHUNK - r.ir_freecount;
+				icount = xfs_inobt_count(&r) - r.ir_freecount;
 			} else {
 				/*
 				 * If any of those tests failed, bump the
@@ -376,7 +377,7 @@ xfs_bulkstat(
 			 * If this chunk has any allocated inodes, save it.
 			 * Also start read-ahead now for this chunk.
 			 */
-			if (r.ir_freecount < XFS_INODES_PER_CHUNK) {
+			if (r.ir_freecount < xfs_inobt_count(&r)) {
 				struct blk_plug	plug;
 				/*
 				 * Loop over all clusters in the next chunk.
@@ -397,10 +398,11 @@ xfs_bulkstat(
 				}
 				blk_finish_plug(&plug);
 				irbp->ir_startino = r.ir_startino;
+				irbp->ir_holemask = r.ir_holemask;
 				irbp->ir_freecount = r.ir_freecount;
 				irbp->ir_free = r.ir_free;
 				irbp++;
-				icount += XFS_INODES_PER_CHUNK - r.ir_freecount;
+				icount += xfs_inobt_count(&r) - r.ir_freecount;
 			}
 			/*
 			 * Set agino to after this chunk and bump the cursor.
@@ -427,7 +429,7 @@ xfs_bulkstat(
 			 */
 			for (agino = irbp->ir_startino, chunkidx = clustidx = 0;
 			     XFS_BULKSTAT_UBLEFT(ubleft) &&
-				irbp->ir_freecount < XFS_INODES_PER_CHUNK;
+				irbp->ir_freecount < xfs_inobt_count(irbp);
 			     chunkidx++, clustidx++, agino++) {
 				ASSERT(chunkidx < XFS_INODES_PER_CHUNK);
 
@@ -654,7 +656,7 @@ xfs_inumbers(
 		buffer[bufidx].xi_startino =
 			XFS_AGINO_TO_INO(mp, agno, r.ir_startino);
 		buffer[bufidx].xi_alloccount =
-			XFS_INODES_PER_CHUNK - r.ir_freecount;
+			xfs_inobt_count(&r) - r.ir_freecount;
 		buffer[bufidx].xi_allocmask = ~r.ir_free;
 		bufidx++;
 		left--;
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 18/18] xfs: enable sparse inode chunks for v5 superblocks
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (16 preceding siblings ...)
  2014-07-24 14:23 ` [PATCH 17/18] xfs: use actual inode count for sparse records in bulkstat/inumbers Brian Foster
@ 2014-07-24 14:23 ` Brian Foster
  2014-07-24 23:34   ` Dave Chinner
  2014-07-24 16:28 ` [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
  2014-07-24 22:32 ` Dave Chinner
  19 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 14:23 UTC (permalink / raw)
  To: xfs

Enable the use and processing of sparse inode chunks.  Fix the
xfs_sb_version_hassparseinodes() helper to allow the allocation of
sparse chunks on v5 format superblocks. Add the incompat. feature bit to
the *_ALL mask such that fs' with sparse chunks can be mounted.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_sb.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 6f48de9..f2299b7 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -511,7 +511,8 @@ xfs_sb_has_ro_compat_feature(
 #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
 #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
-		(XFS_SB_FEAT_INCOMPAT_FTYPE)
+		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
+		 XFS_SB_FEAT_INCOMPAT_SPINODES)
 
 #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
 static inline bool
@@ -561,7 +562,7 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
 
 static inline int xfs_sb_version_hassparseinodes(struct xfs_sb *sbp)
 {
-	return 0; /* not yet enabled */
+	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5;
 }
 
 /*
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 00/18] xfs: sparse inode chunks
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (17 preceding siblings ...)
  2014-07-24 14:23 ` [PATCH 18/18] xfs: enable sparse inode chunks for v5 superblocks Brian Foster
@ 2014-07-24 16:28 ` Brian Foster
  2014-07-24 22:32 ` Dave Chinner
  19 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 16:28 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: Type: text/plain, Size: 6708 bytes --]

On Thu, Jul 24, 2014 at 10:22:50AM -0400, Brian Foster wrote:
> Hi all,
> 
> This is a first pass at sparse inode chunk support for XFS. Some
> background on this work is available here:
> 
> http://oss.sgi.com/archives/xfs/2013-08/msg00346.html
> 
> The basic idea is to allow the partial allocation of inode chunks into
> fragmented regions of free space. This is accomplished through addition
> of a holemask field into the inobt record that defines what portion(s)
> of an inode chunk are invalid (i.e., holes in the chunk). This work is
> not quite complete, but is at a point where I'd like to start getting
> feedback on the design and what direction to take for some of the known
> gaps.
> 

I've attached a tarball to this message with a couple userspace patches
and an xfstests patch to facilitate experimentation. The userspace
patches update the inobt record data structure and add the holemask
field to xfs_db to facilitate poking around. Note that the rest of
userspace is untouched at this point, including repair being broken,
etc., so I don't recommend use beyond xfs_db.

The xfstests test case fragments free space, allocates inodes until
ENOSPC and expects to consume most of the free space available in the
fs. The "fragmentation factor" is currently dynamic and based on the
cluster size due to the cluster size scaling behavior documented below.

Finally, sparse inode chunks are only enabled for v5 superblocks, so a
crc enabled fs is required to test.

Brian

> The basic breakdown of functionality in this set is as follows:
> 
> - Patches 1-2 - A couple generic cleanups that are dependencies for later
>   patches in the series.
> - Patches 3-5 - Basic data structure update, feature bit and minor
>   helper introduction.
> - Patches 6-7 - Update v5 icreate logging and recovery to handle sparse
>   inode records.
> - Patches 8-13 - Allocation support for sparse inode records. Physical
>   chunk allocation and individual inode allocation.
> - Patches 14-16 - Deallocation support for sparse inode chunks. Physical
>   chunk deallocation, individual inode free and cluster free.
> - Patch 17 - Fixes for bulkstat/inumbers.
> - Patch 18 - Activate support for sparse chunk allocation and
>   processing.
> 
> This work is lightly tested for regression (some xfstests failures due
> to repair) and basic functionality. I have a new xfstests test I'll
> forward along for demonstration purposes.
> 
> Some notes on gaps in the design:
> 
> - Sparse inode chunk allocation granularity:
> 
> The current minimum sparse chunk allocation granularity is the cluster
> size. My initial attempts at this work tried to redefine to the minimum
> chunk length based on the holemask granularity (a la the stale macro I
> seemingly left in this series ;), but this involves tweaking the
> codepaths that use the cluster size (i.e., imap) which proved rather
> hairy. This also means we need a solution where an imap can change if an
> inode was initially mapped as a sparse chunk and said chunk is
> subsequently made full. E.g., we'd perhaps need to invalidate the inode
> buffers for sparse chunks at the time where they are made full. Given
> that, I landed on using the cluster size and leaving those codepaths as
> is for the time being.
> 
> There is a tradeoff here for v5 superblocks because we've recently made
> a change to scale the cluster size based on the factor increase in the
> inode size from the default (see xfsprogs commit 7b5f9801). This means
> that effectiveness of sparse chunks is tied to whether the level of free
> space fragmentation matches the cluster size. By that I mean effectivess
> is good (near 100% utilization possible) if free space fragmentation
> leaves free extents around that at least match the cluster size. If
> fragmentation is worse than the cluster size, effectiveness is reduced.
> This can also be demonstrated with the forthcoming xfstests test.
> 
> - On-disk lifecycle of the sparse inode chunks feature bit:
> 
> We set an incompatible feature bit once a sparse inode chunk is
> allocated because older revisions of code will interpret the non-zero
> holemask bits in the higher order bytes of the record freecount. The
> feature bit must be removed once all sparse inode chunks are eliminated
> one way or another. This series does not currently remove the feature
> bit once set simply because I hadn't thought through the mechanism quite
> yet. For the next version, I'm thinking about adding an inobt walk
> mechanism that can be conditionally invoked (i.e., feature bit is
> currently set and a sparse inode chunk has been eliminated) either via
> workqueue on an interval or during unmount if necessary. Thoughts or
> alternative suggestions on that appreciated.
> 
> That's about it for now. Thoughts, reviews, flames appreciated. Thanks.
> 
> Brian
> 
> Brian Foster (18):
>   xfs: refactor xfs_inobt_insert() to eliminate loop and support
>     variable count
>   xfs: pass xfs_mount directly to xfs_ialloc_cluster_alignment()
>   xfs: define sparse inode chunks v5 sb feature bit and helper function
>   xfs: introduce inode record hole mask for sparse inode chunks
>   xfs: create macros/helpers for dealing with sparse inode chunks
>   xfs: pass inode count through ordered icreate log item
>   xfs: handle sparse inode chunks in icreate log recovery
>   xfs: create helper to manage record overlap for sparse inode chunks
>   xfs: allocate sparse inode chunks on full chunk allocation failure
>   xfs: set sparse inodes feature bit when a sparse chunk is allocated
>   xfs: reduce min. inode allocation space requirement for sparse inode
>     chunks
>   xfs: helper to convert inobt record holemask to inode alloc. bitmap
>   xfs: filter out sparse regions from individual inode allocation
>   xfs: update free inode record logic to support sparse inode records
>   xfs: only free allocated regions of inode chunks
>   xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()
>   xfs: use actual inode count for sparse records in bulkstat/inumbers
>   xfs: enable sparse inode chunks for v5 superblocks
> 
>  fs/xfs/libxfs/xfs_format.h       |  17 +-
>  fs/xfs/libxfs/xfs_ialloc.c       | 441 +++++++++++++++++++++++++++++++++------
>  fs/xfs/libxfs/xfs_ialloc.h       |  17 +-
>  fs/xfs/libxfs/xfs_ialloc_btree.c |   4 +-
>  fs/xfs/libxfs/xfs_sb.h           |   9 +-
>  fs/xfs/xfs_inode.c               |  28 ++-
>  fs/xfs/xfs_itable.c              |  12 +-
>  fs/xfs/xfs_log_recover.c         |  23 +-
>  8 files changed, 460 insertions(+), 91 deletions(-)
> 
> -- 
> 1.8.3.1
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

[-- Attachment #2: xfs_spinodes_user.tar.bz2 --]
[-- Type: application/x-bzip2, Size: 4002 bytes --]

[-- Attachment #3: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function
  2014-07-24 14:22 ` [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function Brian Foster
@ 2014-07-24 17:08   ` Mark Tinguely
  2014-07-24 17:37     ` Brian Foster
  2014-07-24 23:35   ` Dave Chinner
  1 sibling, 1 reply; 50+ messages in thread
From: Mark Tinguely @ 2014-07-24 17:08 UTC (permalink / raw)
  To: Brian Foster, xfs-oss

On 07/24/14 09:22, Brian Foster wrote:
> The sparse inode chunks feature will use the helper function to enable
> the allocation of sparse inode chunks. The incompatible feature bit is
> set on disk once a sparse inode chunk is allocated to prevent older
> drivers from mounting an fs with sparse chunks.
>
> Note that the feature is hardcoded disabled and the feature bit not
> included in the all features mask until fully implemented.
>
> Signed-off-by: Brian Foster<bfoster@redhat.com>
> ---

Sorry if I missed it in the series but is there an 
XFS_FSOP_GEOM_FLAGS_SPINODES for xfs_info/growfs?

Thanks,

--Mark.


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function
  2014-07-24 17:08   ` Mark Tinguely
@ 2014-07-24 17:37     ` Brian Foster
  2014-07-24 18:38       ` Mark Tinguely
  0 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-24 17:37 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs-oss

On Thu, Jul 24, 2014 at 12:08:36PM -0500, Mark Tinguely wrote:
> On 07/24/14 09:22, Brian Foster wrote:
> >The sparse inode chunks feature will use the helper function to enable
> >the allocation of sparse inode chunks. The incompatible feature bit is
> >set on disk once a sparse inode chunk is allocated to prevent older
> >drivers from mounting an fs with sparse chunks.
> >
> >Note that the feature is hardcoded disabled and the feature bit not
> >included in the all features mask until fully implemented.
> >
> >Signed-off-by: Brian Foster<bfoster@redhat.com>
> >---
> 
> Sorry if I missed it in the series but is there an
> XFS_FSOP_GEOM_FLAGS_SPINODES for xfs_info/growfs?
> 

Nope, looks like I missed it. It probably slipped my mind as I haven't
got into userspace yet and thus hadn't thought about xfs_info. I'll make
a note to add it, thanks!

Brian

> Thanks,
> 
> --Mark.
> 
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function
  2014-07-24 17:37     ` Brian Foster
@ 2014-07-24 18:38       ` Mark Tinguely
  2014-07-24 19:38         ` Brian Foster
  0 siblings, 1 reply; 50+ messages in thread
From: Mark Tinguely @ 2014-07-24 18:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs-oss

On 07/24/14 12:37, Brian Foster wrote:
> On Thu, Jul 24, 2014 at 12:08:36PM -0500, Mark Tinguely wrote:
>> On 07/24/14 09:22, Brian Foster wrote:
>>> The sparse inode chunks feature will use the helper function to enable
>>> the allocation of sparse inode chunks. The incompatible feature bit is
>>> set on disk once a sparse inode chunk is allocated to prevent older
>>> drivers from mounting an fs with sparse chunks.
>>>
>>> Note that the feature is hardcoded disabled and the feature bit not
>>> included in the all features mask until fully implemented.
>>>
>>> Signed-off-by: Brian Foster<bfoster@redhat.com>
>>> ---
>>
>> Sorry if I missed it in the series but is there an
>> XFS_FSOP_GEOM_FLAGS_SPINODES for xfs_info/growfs?
>>
>
> Nope, looks like I missed it. It probably slipped my mind as I haven't
> got into userspace yet and thus hadn't thought about xfs_info. I'll make
> a note to add it, thanks!
>
> Brian

Again forgive my quick scanning of the series, but am I correct in 
thinking that this does not change the minimum number of reserved blocks 
for create like fs ops. The create/rename do some attempts to continue 
when it cannot get the full number of reserved blocks. Would allocating 
a sparse inode chunk make sense in that case? My gut says the 
complication does not.

Thanks again,

--Mark.


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function
  2014-07-24 18:38       ` Mark Tinguely
@ 2014-07-24 19:38         ` Brian Foster
  0 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-24 19:38 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs-oss

On Thu, Jul 24, 2014 at 01:38:07PM -0500, Mark Tinguely wrote:
> On 07/24/14 12:37, Brian Foster wrote:
> >On Thu, Jul 24, 2014 at 12:08:36PM -0500, Mark Tinguely wrote:
> >>On 07/24/14 09:22, Brian Foster wrote:
> >>>The sparse inode chunks feature will use the helper function to enable
> >>>the allocation of sparse inode chunks. The incompatible feature bit is
> >>>set on disk once a sparse inode chunk is allocated to prevent older
> >>>drivers from mounting an fs with sparse chunks.
> >>>
> >>>Note that the feature is hardcoded disabled and the feature bit not
> >>>included in the all features mask until fully implemented.
> >>>
> >>>Signed-off-by: Brian Foster<bfoster@redhat.com>
> >>>---
> >>
> >>Sorry if I missed it in the series but is there an
> >>XFS_FSOP_GEOM_FLAGS_SPINODES for xfs_info/growfs?
> >>
> >
> >Nope, looks like I missed it. It probably slipped my mind as I haven't
> >got into userspace yet and thus hadn't thought about xfs_info. I'll make
> >a note to add it, thanks!
> >
> >Brian
> 
> Again forgive my quick scanning of the series, but am I correct in thinking
> that this does not change the minimum number of reserved blocks for create
> like fs ops. The create/rename do some attempts to continue when it cannot
> get the full number of reserved blocks. Would allocating a sparse inode
> chunk make sense in that case? My gut says the complication does not.
> 

No problem, thanks for looking.

Correct... I haven't made any changes to transaction reservations or
things like XFS_CREATE_SPACE_RES(), etc. xfs_create() has the kind of
logic you describe where we set resblks = 0 if the reservation fails
with ENOSPC and try again with a flag set to indicate we can't do
allocations. My understanding is that this is justified since we could
always allocate an inode from an existing record and a dentry from an
unused slot in a directory.

Technically, I think we could explicitly allocate a sparse chunk in this
scenario if we wanted to, but we'd have to pass that down to/through the
allocation code in xfs_ialloc_ag_alloc(). I could be wrong, but I think
the reservation failure in this case will always be limited to a case
where we are truly up against ENOSPC (as opposed to failing due to an
extent length or alignment requirement), since the reservation comes out
of the sb counters. Given that, I tend to agree that it might not be
worth the extra complexity there.

Brian

> Thanks again,
> 
> --Mark.
> 
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count
  2014-07-24 14:22 ` [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count Brian Foster
@ 2014-07-24 22:10   ` Dave Chinner
  2014-07-28 16:03     ` Brian Foster
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 22:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:22:51AM -0400, Brian Foster wrote:
> Inodes are always allocated in chunks of 64 and thus the loop in
> xfs_inobt_insert() is unnecessary.

I don't believe this is true. The number of inodes allocated at once
is:

        mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
	                                        sbp->sb_inopblock);

So when the block size is, say, 64k, the number of 512 byte inodes
allocated at once is 128. i.e. 2 chunks. Hence xfs_inobt_insert()
can be called with a inode could of > 64 and therefore the loop is
still necessary...

And, indeed, we might want to increase the allocation size in future
to do entire stripe units or stripe widths of inodes at once:

http://xfs.org/index.php/Improving_inode_Caching#Contiguous_Inode_Allocation

This also means a loop would be required -somewhere-...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 05/18] xfs: create macros/helpers for dealing with sparse inode chunks
  2014-07-24 14:22 ` [PATCH 05/18] xfs: create macros/helpers for dealing with " Brian Foster
@ 2014-07-24 22:13   ` Dave Chinner
  0 siblings, 0 replies; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 22:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:22:55AM -0400, Brian Foster wrote:
> Sparse inode chunks allow the traditional inode btree record format to
> describe an inode chunk that is not fully allocated and/or contiguous.
> Define a couple constants that set requirements for allocation and
> management of such chunks. Also define a helper to easily detect sparse
> inode chunks.
> 
> The granularity of a sparse chunk is defined by the the 16-bit holemask
> field in the inode record. Assuming 64 inodes per full chunk, a single
> holemask bit accounts for 4 inodes. The minimum allocation requirement
> for a sparse inode chunk is defined as the minimum number of blocks
> required to meet the 4 inode granularity.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_format.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 39022d9..0baad50 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -211,6 +211,11 @@ typedef	__uint64_t	xfs_inofree_t;
>  #define	XFS_INOBT_ALL_FREE		((xfs_inofree_t)-1)
>  #define	XFS_INOBT_MASK(i)		((xfs_inofree_t)1 << (i))
>  
> +#define XFS_INODES_PER_SPCHUNK		\
> +	(XFS_INODES_PER_CHUNK / (NBBY * sizeof(__uint16_t)))
> +#define XFS_INOBT_MIN_SPCHUNKLEN(sb)	\
> +	(roundup(XFS_INODES_PER_SPCHUNK, sb.sb_inopblock) / sb.sb_inopblock)

static inline.

-Dave.

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2014-07-24 14:22 ` [PATCH 04/18] xfs: introduce inode record hole mask for sparse inode chunks Brian Foster
@ 2014-07-24 22:14   ` Dave Chinner
  2014-07-28 16:16     ` Brian Foster
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 22:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:22:54AM -0400, Brian Foster wrote:
> The inode btrees track 64 inodes per record, regardless of inode size.
> Thus, inode chunks on disk vary in size depending on the size of the
> inodes. This creates a contiguous allocation requirement for new inode
> chunks that can be difficult to satisfy on an aged and fragmented (free
> space) filesystem.
> 
> The inode record freecount currently uses 4 bytes on disk to track the
> free inode count. With a maximum freecount value of 64, only one byte is
> required. Convert the freecount field to a single byte and reserve two
> of the remaining 3 higher order bytes left to the hole mask field.
> 
> The hole mask field tracks potential holes in the chunks of physical
> space that the inode record refers to. This facilitates the sparse
> allocation of inode chunks when contiguous chunks are not available and
> allows the inode btrees to identify what portions of the chunk contain
> valid inodes.
> 
> Tracking holes means the field is initialized to zero and thus maintains
> backwards compatibility with existing filesystems. E.g., the higher
> order bytes of a counter with a max value of 64 are already initialized
> to 0. Update the inode record management functions to handle the new
> field and initialize it to zero for now.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_format.h       | 7 +++++--
>  fs/xfs/libxfs/xfs_ialloc.c       | 9 +++++++--
>  fs/xfs/libxfs/xfs_ialloc_btree.c | 4 +++-
>  3 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 34d85ac..39022d9 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -221,13 +221,16 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
>   */
>  typedef struct xfs_inobt_rec {
>  	__be32		ir_startino;	/* starting inode number */
> -	__be32		ir_freecount;	/* count of free inodes (set bits) */
> +	__be16		ir_holemask;	/* hole mask for sparse chunks */
> +	__u8		ir_pad;
> +	__u8		ir_freecount;	/* count of free inodes (set bits) */
>  	__be64		ir_free;	/* free inode mask */
>  } xfs_inobt_rec_t;

might we want the number of inodes allocated in the chunk there as
well (i.e. in the pad field) so we can validate the holemask against
the number of inodes allocated in the chunk?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 00/18] xfs: sparse inode chunks
  2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
                   ` (18 preceding siblings ...)
  2014-07-24 16:28 ` [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
@ 2014-07-24 22:32 ` Dave Chinner
  2014-07-25 16:30   ` Brian Foster
  19 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 22:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:22:50AM -0400, Brian Foster wrote:
> Hi all,
> 
> This is a first pass at sparse inode chunk support for XFS. Some
> background on this work is available here:
> 
> http://oss.sgi.com/archives/xfs/2013-08/msg00346.html
> 
> The basic idea is to allow the partial allocation of inode chunks into
> fragmented regions of free space. This is accomplished through addition
> of a holemask field into the inobt record that defines what portion(s)
> of an inode chunk are invalid (i.e., holes in the chunk). This work is
> not quite complete, but is at a point where I'd like to start getting
> feedback on the design and what direction to take for some of the known
> gaps.
> 
> The basic breakdown of functionality in this set is as follows:
> 
> - Patches 1-2 - A couple generic cleanups that are dependencies for later
>   patches in the series.
> - Patches 3-5 - Basic data structure update, feature bit and minor
>   helper introduction.
> - Patches 6-7 - Update v5 icreate logging and recovery to handle sparse
>   inode records.
> - Patches 8-13 - Allocation support for sparse inode records. Physical
>   chunk allocation and individual inode allocation.
> - Patches 14-16 - Deallocation support for sparse inode chunks. Physical
>   chunk deallocation, individual inode free and cluster free.
> - Patch 17 - Fixes for bulkstat/inumbers.
> - Patch 18 - Activate support for sparse chunk allocation and
>   processing.
> 
> This work is lightly tested for regression (some xfstests failures due
> to repair) and basic functionality. I have a new xfstests test I'll
> forward along for demonstration purposes.
> 
> Some notes on gaps in the design:
> 
> - Sparse inode chunk allocation granularity:
> 
> The current minimum sparse chunk allocation granularity is the cluster
> size.

Looking at the patchset (I got to patch 5 that first uses this),
this is problematic. the cluster size is currently a kernel
implementation detail, and not something defined by the on-disk
format. We can change the cluster size in the kernel and not affect
the format on disk. Making the cluster size a part of the disk
format by defining it to be the resolution of sparse inode chunks
changes that - it's now a part of the on-disk inode format, and that
greatly limits what we can do with it.

> My initial attempts at this work tried to redefine to the minimum
> chunk length based on the holemask granularity (a la the stale macro I
> seemingly left in this series ;), but this involves tweaking the
> codepaths that use the cluster size (i.e., imap) which proved rather
> hairy.

This is where we need to head towards, though. The cluster size is
currently the unit of inode IO, so that needs to be influenced by the
sparse inode chunk granularity. Yes, we can define the inode chunk
granularity to be the same as the cluster size, but that simply
means we need to configure the cluster size appropriately at mount.
It doesn't mean we need to change what cluster size means or it's
implementation....

> This also means we need a solution where an imap can change if an
> inode was initially mapped as a sparse chunk and said chunk is
> subsequently made full. E.g., we'd perhaps need to invalidate the inode
> buffers for sparse chunks at the time where they are made full. Given
> that, I landed on using the cluster size and leaving those codepaths as
> is for the time being.

Again, that's kernel inode buffer cache implementaiton details, not
something that matters for the on-disk format. So really these need
to be separated. Probably means we need a "sparse inode allocation
alignment" field in the superblock to define this. Having
the kernel reject sparse alignments it can't support from the
initial implementation means we can improve the kernel
implementation over time and (eventually) support sub-cluster sized
sparse inode allocation.

i.e. initial implementation only supports sparse alignment ==
cluster size, and rejects everything else....

> There is a tradeoff here for v5 superblocks because we've recently made
> a change to scale the cluster size based on the factor increase in the
> inode size from the default (see xfsprogs commit 7b5f9801). This means
> that effectiveness of sparse chunks is tied to whether the level of free
> space fragmentation matches the cluster size. By that I mean effectivess
> is good (near 100% utilization possible) if free space fragmentation
> leaves free extents around that at least match the cluster size. If
> fragmentation is worse than the cluster size, effectiveness is reduced.
> This can also be demonstrated with the forthcoming xfstests test.

Exactly. We don't need to solve every problem with the initial
implementation - we can iteratively improve the code because once
the fields are one disk we only need to change the kernel
implemenation to support finer grained sparse allocation to solve
this allocation chunk < cluster size problem....

> - On-disk lifecycle of the sparse inode chunks feature bit:
> 
> We set an incompatible feature bit once a sparse inode chunk is
> allocated because older revisions of code will interpret the non-zero
> holemask bits in the higher order bytes of the record freecount. The
> feature bit must be removed once all sparse inode chunks are eliminated
> one way or another. This series does not currently remove the feature
> bit once set simply because I hadn't thought through the mechanism quite
> yet. For the next version, I'm thinking about adding an inobt walk
> mechanism that can be conditionally invoked (i.e., feature bit is
> currently set and a sparse inode chunk has been eliminated) either via
> workqueue on an interval or during unmount if necessary. Thoughts or
> alternative suggestions on that appreciated.

I wouldn't bother. Let xfs_repair determine if the bit needs to be
set or not when it does it's final superblock write after it has
scanned and repaired the fs.

I'm even in two minds of whether we want the sb bit added
dynamically, because it means the same upgrade/downdgrade cycle can
have different results simply due to filesystem freespace
fragmentation patterns...

Perhaps an xfs-admin command to turn the feature on dynamicallyi for
existing filesystems, kind of like what we did with lazy superblock
counters when they were introduced?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/18] xfs: create helper to manage record overlap for sparse inode chunks
  2014-07-24 14:22 ` [PATCH 08/18] xfs: create helper to manage record overlap for sparse inode chunks Brian Foster
@ 2014-07-24 22:41   ` Dave Chinner
  2014-07-28 16:19     ` Brian Foster
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 22:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:22:58AM -0400, Brian Foster wrote:
> Create xfs_spchunk_has_record() to receive the parameters of a new
> sparse inode chunk allocation and identify whether a record exists that
> is capable of tracking this sparse chunk.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 27d3437..be57b51 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -351,6 +351,61 @@ xfs_ialloc_inode_init(
>  }
>  
>  /*
> + * Determine whether part of a sparse inode chunk that has just been allocated
> + * is covered by an existing inobt record.
> + */
> +STATIC int
> +xfs_spchunk_has_record(

not sure I like the "spchunk" naming. I see that and I have no idea
what subsystem it belongs to. It's actually an inobt lookup
function, and doesn't really have anything to do with sparse chunks.
So maybe xfs_inobt_rec_exists or xfs_inobt_lookup_exact?

> +	struct xfs_mount		*mp,
> +	struct xfs_trans		*tp,
> +	struct xfs_buf			*agbp,
> +	xfs_agino_t			newino,
> +	xfs_agino_t			count,
> +	xfs_btnum_t			btnum,
> +	struct xfs_inobt_rec_incore	*orec)
> +{
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> +	xfs_agino_t			previno;
> +	int				error;
> +	int				i;
> +	struct xfs_inobt_rec_incore	rec;
> +
> +	orec->ir_startino = NULLAGINO;
> +
> +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
> +
> +	previno = newino + count - XFS_INODES_PER_CHUNK;
> +	error = xfs_inobt_lookup(cur, previno, XFS_LOOKUP_GE, &i);

You want XFS_LOOKUP_EQ, yes? i.e. XFS_LOOKUP_GE won't fail if the
exact record for the inode chunk does not exist - it will return the
next one in the btree.

> +	if (error)
> +		goto error;
> +	if (i == 0)
> +		goto out;
> +
> +	error = xfs_inobt_get_rec(cur, &rec, &i);
> +	if (error)
> +		goto error;
> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> +
> +	if (rec.ir_startino > newino)
> +		goto out;

And so this check would not be necessary...

> +
> +	ASSERT(rec.ir_startino <= newino &&
> +	       rec.ir_startino + XFS_INODES_PER_CHUNK > newino);
> +	ASSERT(rec.ir_freecount + count <= XFS_INODES_PER_CHUNK);
> +
> +	*orec = rec;

/* struct copy */

-Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/18] xfs: set sparse inodes feature bit when a sparse chunk is allocated
  2014-07-24 14:23 ` [PATCH 10/18] xfs: set sparse inodes feature bit when a sparse chunk is allocated Brian Foster
@ 2014-07-24 22:46   ` Dave Chinner
  2014-07-28 16:23     ` Brian Foster
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 22:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:23:00AM -0400, Brian Foster wrote:
> The sparse chunk inode record format is backwards compatible with old
> format inobt records as long as full chunks are allocated. The holemask
> field uses higher order bytes of the freecount. While sparse chunks can
> be enabled on previously unsupported fs, older kernel drivers cannot
> parse sparse inode records.
> 
> Set the feature incompatible bit once a sparse inode chunk is allocated
> to prevent older XFS drivers from tripping over the new format.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 4226b1b..4dd45c2 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -407,6 +407,27 @@ error:
>  	return error;
>  }
>  
> +STATIC void
> +xfs_sbversion_add_spinodes(
> +	struct xfs_trans	*tp,
> +	struct xfs_mount	*mp)
> +{
> +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
> +					    XFS_SB_FEAT_INCOMPAT_SPINODES))

wrong function. xfs_sb_has_incompat_feature().

> +		return;
> +
> +	spin_lock(&mp->m_sb_lock);
> +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
> +					    XFS_SB_FEAT_INCOMPAT_SPINODES)) {
> +		spin_unlock(&mp->m_sb_lock);
> +		return;
> +	}
> +
> +	mp->m_sb.sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_SPINODES;
> +	spin_unlock(&mp->m_sb_lock);
> +	xfs_mod_sb(tp, XFS_SB_FEATURES_INCOMPAT);
> +}
> +
>  /*
>   * Allocate new inodes in the allocation group specified by agbp.
>   * Return 0 for success, else error code.
> @@ -631,6 +652,18 @@ xfs_ialloc_ag_alloc(
>  		if (error)
>  			return error;
>  	}
> +
> +	/*
> +	 * Set an incompat feature bit as old drivers can't parse sparse
> +	 * records. Pre-sparse inode chunk drivers will include the
> +	 * holemask in the higher order freecount bits, resulting in a
> +	 * bogus value.
> +	 *
> +	 * XXX: when is this bit removed?
> +	 */
> +	if (xfs_inobt_issparse(&rec))
> +		xfs_sbversion_add_spinodes(tp, args.mp);

I'd like to avoid this dynamic feature bit adding if possible. Do we
really need it?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 11/18] xfs: reduce min. inode allocation space requirement for sparse inode chunks
  2014-07-24 14:23 ` [PATCH 11/18] xfs: reduce min. inode allocation space requirement for sparse inode chunks Brian Foster
@ 2014-07-24 22:50   ` Dave Chinner
  0 siblings, 0 replies; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 22:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:23:01AM -0400, Brian Foster wrote:
> xfs_ialloc_ag_select() iterates through the allocation groups looking
> for free inodes or free space to determine whether to allow an inode
> allocation to proceed. If no free inodes are available, it assumes that
> an AG must have an extent longer than mp->m_ialloc_blks.
> 
> Sparse inode chunk support currently allows for allocations as small as
> the cluster size. If sparse inodes are supported, use the cluster
> alignment as a minimum extent size limit to determine whether an inode
> chunk allocation attempt can proceed.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 4dd45c2..4e98a21 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -770,9 +770,15 @@ xfs_ialloc_ag_select(
>  
>  		/*
>  		 * Is there enough free space for the file plus a block of
> -		 * inodes? (if we need to allocate some)?
> +		 * inodes? (if we need to allocate some)? If sparse inode chunks
> +		 * are supported, we only require an extent of length equal to
> +		 * the cluster size.
>  		 */
> -		ineed = mp->m_ialloc_blks;
> +		if (xfs_sb_version_hassparseinodes(&mp->m_sb))
> +			ineed = xfs_ialloc_cluster_alignment(mp);
> +		else
> +			ineed = mp->m_ialloc_blks;

Why wouldn't we calculate this once at mount time and dump it in
mp->m_ialloc_min_blks? And if we don't have sparse inodes enabled,
when would be ever select an AG that would trigger a sparse
allocation and then turn the feature bit on?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 12/18] xfs: helper to convert inobt record holemask to inode alloc. bitmap
  2014-07-24 14:23 ` [PATCH 12/18] xfs: helper to convert inobt record holemask to inode alloc. bitmap Brian Foster
@ 2014-07-24 23:21   ` Dave Chinner
  0 siblings, 0 replies; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 23:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:23:02AM -0400, Brian Foster wrote:
> The inobt record holemask field is a condensed data type designed to fit
> into the existing on-disk record and is zero based (allocated regions
> are set to 0, sparse regions are set to 1) to provide backwards
> compatibility. Thus the type is unnecessarily complex for use in higher
> level inode manipulations such as individual inode allocations, etc.
> 
> Rather than foist the complexity of dealing with this field to every bit
> of logic that requires inode chunk allocation information, create the
> xfs_inobt_ialloc_bitmap() helper to convert the inobt record holemask to
> an inode allocation bitmap. The inode allocation bitmap is inode
> granularity similar to the inobt record free mask and indicates which
> inodes of the chunk are physically allocated on disk irrespective of
> whether the inode is considered allocated or free by the filesystem.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_format.h |  1 +
>  fs/xfs/libxfs/xfs_ialloc.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 68 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 0baad50..cbc3296 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -206,6 +206,7 @@ typedef __be32 xfs_alloc_ptr_t;
>  #define	XFS_FIBT_CRC_MAGIC	0x46494233	/* 'FIB3' */
>  
>  typedef	__uint64_t	xfs_inofree_t;
> +typedef	__uint64_t	xfs_inoalloc_t;
>  #define	XFS_INODES_PER_CHUNK		(NBBY * sizeof(xfs_inofree_t))
>  #define	XFS_INODES_PER_CHUNK_LOG	(XFS_NBBYLOG + 3)
>  #define	XFS_INOBT_ALL_FREE		((xfs_inofree_t)-1)
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 4e98a21..166602e 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -863,6 +863,73 @@ xfs_ialloc_get_rec(
>  }
>  
>  /*
> + * Convert the inode record holemask to an inode allocation bitmap. The inode
> + * allocation bitmap is inode granularity and specifies whether an inode is
> + * physically allocated on disk (not whether the inode is considered allocated
> + * or free by the fs).
> + */
> +STATIC xfs_inoalloc_t
> +xfs_inobt_ialloc_bitmap(
> +	struct xfs_inobt_rec_incore	*rec)
> +{
> +	xfs_inofree_t	bitmap = 0;
> +	xfs_inofree_t	sparsebits;
> +	int		nextbit;
> +	int		shift;
> +	__uint16_t	allocmask;
> +	uint		allocbitmap;

allocbitmap should be a 64 bit value?

> +
> +	/*
> +	 * Each holemask bit represents XFS_INODES_PER_SPCHUNK inodes. Determine
> +	 * the inode bits per holemask bit.
> +	 */
> +	sparsebits = xfs_mask64lo(XFS_INODES_PER_SPCHUNK);

Can we just open code that?

	sparsebits = (1ULL << XFS_INODES_PER_SPCHUNK) - 1;


> +	/*
> +	 * The bit flip and type conversion are intentionally done separately
> +	 * here to zero-extend the bitmask.
> +	 */
> +	allocmask = ~rec->ir_holemask;
> +	allocbitmap = allocmask;

urk. That's a landmine.

> +
> +	/*
> +	 * Each bit of allocbitmap represents an allocated region of the inode
> +	 * chunk. Thus, each bit represents XFS_INODES_PER_SPCHUNK physical
> +	 * inodes. Walk through allocbitmap and set the associated individual
> +	 * inode bits in the inode bitmap for each allocated chunk.
> +	 *
> +	 * For example, consider a 512b inode fs with a cluster size of 16k.
> +	 * Each holemask bit represents 4 inodes and each cluster contains 32
> +	 * inodes. Since sparse chunks are allocated at cluster granularity, a
> +	 * valid 16-bit holemask (and negated allocbitmap) with this geometry
> +	 * might look as follows:
> +	 *
> +	 * 	holemask		~	allocbitmap
> +	 * 	0000 0000 1111 1111	=>	1111 1111 0000 0000
> +	 *
> +	 * At 4 inodes per bit, this indicates that the first 32 inodes of the
> +	 * chunk are not physically allocated inodes. This is a hole from the
> +	 * perspective of the inode record. Converting the allocbitmap to a
> +	 * 64-bit inode allocation bitmap yields:
> +	 *
> +	 * 	0xFFFFFFFF00000000
> +	 *
> +	 * ... where any of the 32 inodes defined by the higher order 32 bits of
> +	 * the map could be in use or free. Logically AND this bitmap with the
> +	 * record ir_free map to identify which of the physically allocated
> +	 * inodes are in use.
> +	 */
> +	nextbit = xfs_next_bit(&allocbitmap, 1, 0);
> +	while (nextbit != -1) {
> +		shift = nextbit * XFS_INODES_PER_SPCHUNK;
> +		bitmap |= (sparsebits << shift);
> +		nextbit = xfs_next_bit(&allocbitmap, 1, nextbit + 1);
> +	}

We need to get rid of xfs_next_bit() and friends, not add more
users. The generic bitmap functionality is better suited to this,
I think. A cleaned up version of something like this, perhaps:


	DECLARE_BITMAP(allocmask, 16);
	DECLARE_BITMAP(bitmap, 64);

	bitmap_complement(&allocmask, &rec->ir_holemask, 16);
	bitmap_full(&bitmap, 64);
	for (i = 0; i < 16; i++) {
		if (!test_bit(&allocmask, i)) {
			bitmap_release_region(&bitmap, i * 4,
					      ilog2(XFS_INODES_PER_SPCHUNK));
		}
	}

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 15/18] xfs: only free allocated regions of inode chunks
  2014-07-24 14:23 ` [PATCH 15/18] xfs: only free allocated regions of inode chunks Brian Foster
@ 2014-07-24 23:24   ` Dave Chinner
  0 siblings, 0 replies; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 23:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:23:05AM -0400, Brian Foster wrote:
> An inode chunk is currently added to the transaction free list based on
> a simple fsb conversion and hardcoded chunk length. The nature of sparse
> chunks is such that the physical chunk of inodes on disk may consist of
> one or more discontiguous parts. Blocks that reside in the holes of the
> inode chunk are not inodes and could be allocated to any other use or
> not allocated at all.
> 
> Refactor the existing xfs_bmap_add_free() call into the
> xfs_difree_inode_chunk() helper. The new helper uses the existing
> calculation if a chunk is not sparse. Otherwise, use the inobt record
> holemask to free the contiguous regions of the chunk.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc.c | 64 +++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 61 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index f75f191..1be57b1 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -1659,6 +1659,66 @@ out_error:
>  	return error;
>  }
>  
> +/*
> + * Free the blocks of an inode chunk. We must consider that the inode chunk
> + * might be sparse and only free the regions that are allocated as part of the
> + * chunk.
> + */
> +STATIC void
> +xfs_difree_inode_chunk(
> +	struct xfs_mount		*mp,
> +	xfs_agnumber_t			agno,
> +	struct xfs_inobt_rec_incore	*rec,
> +	struct xfs_bmap_free		*flist)
> +{
> +	xfs_agblock_t	sagbno = XFS_AGINO_TO_AGBNO(mp, rec->ir_startino);
> +	xfs_agblock_t	agbno;
> +	int		nextbit;
> +	int		contig, contigblk;
> +	__uint16_t	allocmask;
> +	uint		allocbitmap;
> +
> +	if (!xfs_inobt_issparse(rec)) {
> +		/* not sparse, calculate extent info directly */
> +		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp, agno,
> +				  XFS_AGINO_TO_AGBNO(mp, rec->ir_startino)),
> +				  mp->m_ialloc_blks, flist, mp);
> +		return;
> +	}
> +
> +	/*
> +	 * The bit flip and type conversion are intentionally done separately
> +	 * here to zero-extend the bitmask.
> +	 */
> +	allocmask = ~rec->ir_holemask;
> +	allocbitmap = allocmask;

If we are going to keep this code, then a helper is absolutely
necessary....

> +
> +	/*
> +	 * We now have an allocation bitmap in units of inodes at sparse chunk
> +	 * granularity (e.g., more than one inode per bit). Use the bitmask
> +	 * functions to find each contigious range of bits in the map. For each
> +	 * range, convert the start bit and count to block values and use that
> +	 * data to add the associated extent to the free list.
> +	 */
> +	nextbit = xfs_next_bit(&allocbitmap, 1, 0);
> +	while (nextbit != -1) {
> +		agbno = (nextbit * XFS_INODES_PER_SPCHUNK) /
> +				mp->m_sb.sb_inopblock;
> +		agbno += sagbno;
> +
> +		contig = xfs_contig_bits(&allocbitmap, 1, nextbit);
> +		contigblk = (contig * XFS_INODES_PER_SPCHUNK) /
> +				mp->m_sb.sb_inopblock;
> +
> +		ASSERT(agbno % xfs_ialloc_cluster_alignment(mp) == 0);
> +		ASSERT(contigblk % xfs_ialloc_cluster_alignment(mp) == 0);
> +		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp, agno, agbno), contigblk,
> +				  flist, mp);
> +
> +		nextbit = xfs_next_bit(&allocbitmap, 1, nextbit + contig + 1);
> +	}

Again, I think that the generic bitmap code is a better way to
implement this...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 17/18] xfs: use actual inode count for sparse records in bulkstat/inumbers
  2014-07-24 14:23 ` [PATCH 17/18] xfs: use actual inode count for sparse records in bulkstat/inumbers Brian Foster
@ 2014-07-24 23:29   ` Dave Chinner
  0 siblings, 0 replies; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 23:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:23:07AM -0400, Brian Foster wrote:
> The bulkstat and inumbers mechanisms make the assumption that inode
> records consist of a full 64 inode chunk in several places. E.g., this
> is used to track how many inodes have been processed overall as well as
> to determine when all in-use inodes of a record have been processed. The
> record processing, in particular, increments the record freecount for
> each in-use inode until it hits the expected max of 64.
> 
> This is invalid for sparse inode records. While all inodes might be
> marked free in the free mask regardless of whether they are allocated on
> disk, ir_freecount is based on the total number of physically allocated
> inodes and thus may be less than 64 inodes on a completely free inode
> chunk.
> 
> Create the xfs_inobt_count() helper to calculate the total number of
> physically allocated inodes based on the holemask. Use the helper in
> xfs_bulkstat() and xfs_inumbers() instead of the fixed
> XFS_INODE_PER_CHUNK value to ensure correct accounting in the event of
> sparse inode records.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc.c | 27 +++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_ialloc.h |  5 +++++
>  fs/xfs/xfs_itable.c        | 12 +++++++-----
>  3 files changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 86c6ccd..daf317f 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -951,6 +951,33 @@ xfs_inobt_first_free_inode(
>  }
>  
>  /*
> + * Calculate the real count of inodes in a chunk.
> + */
> +int
> +xfs_inobt_count(
> +	struct xfs_inobt_rec_incore	*rec)
> +{
> +	__uint16_t	allocmask;
> +	uint		allocbitmap;
> +	int		nextbit;
> +	int		count = 0;
> +
> +	if (!xfs_inobt_issparse(rec))
> +		return XFS_INODES_PER_CHUNK;
> +
> +	allocmask = ~rec->ir_holemask;
> +	allocbitmap = allocmask;
> +
> +	nextbit = xfs_next_bit(&allocbitmap, 1, 0);
> +	while (nextbit != -1) {
> +		count += XFS_INODES_PER_SPCHUNK;
> +		nextbit = xfs_next_bit(&allocbitmap, 1, nextbit + 1);
> +	}

	bitmap_weight() * XFS_INODES_PER_SPCHUNK?

-Dave
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 18/18] xfs: enable sparse inode chunks for v5 superblocks
  2014-07-24 14:23 ` [PATCH 18/18] xfs: enable sparse inode chunks for v5 superblocks Brian Foster
@ 2014-07-24 23:34   ` Dave Chinner
  0 siblings, 0 replies; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 23:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:23:08AM -0400, Brian Foster wrote:
> Enable the use and processing of sparse inode chunks.  Fix the
> xfs_sb_version_hassparseinodes() helper to allow the allocation of
> sparse chunks on v5 format superblocks. Add the incompat. feature bit to
> the *_ALL mask such that fs' with sparse chunks can be mounted.

Interesting way of allowing XFS_SB_FEAT_INCOMPAT_SPINODES to be
optional. :)

I didn't expect it to be used this way, hence my earlier comments
about chicken and egg feature bits. I still think I'd like this to
be explicit rather than just added when the format first changes...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function
  2014-07-24 14:22 ` [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function Brian Foster
  2014-07-24 17:08   ` Mark Tinguely
@ 2014-07-24 23:35   ` Dave Chinner
  1 sibling, 0 replies; 50+ messages in thread
From: Dave Chinner @ 2014-07-24 23:35 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jul 24, 2014 at 10:22:53AM -0400, Brian Foster wrote:
> The sparse inode chunks feature will use the helper function to enable
> the allocation of sparse inode chunks. The incompatible feature bit is
> set on disk once a sparse inode chunk is allocated to prevent older
> drivers from mounting an fs with sparse chunks.
> 
> Note that the feature is hardcoded disabled and the feature bit not
> included in the all features mask until fully implemented.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_sb.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
> index c43c2d6..6f48de9 100644
> --- a/fs/xfs/libxfs/xfs_sb.h
> +++ b/fs/xfs/libxfs/xfs_sb.h
> @@ -509,6 +509,7 @@ xfs_sb_has_ro_compat_feature(
>  }
>  
>  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> +#define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
>  #define XFS_SB_FEAT_INCOMPAT_ALL \
>  		(XFS_SB_FEAT_INCOMPAT_FTYPE)
>  
> @@ -558,6 +559,11 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
>  		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
>  }
>  
> +static inline int xfs_sb_version_hassparseinodes(struct xfs_sb *sbp)

static inline bool ...

> +{
> +	return 0; /* not yet enabled */

	return false;

-Dave.

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 00/18] xfs: sparse inode chunks
  2014-07-24 22:32 ` Dave Chinner
@ 2014-07-25 16:30   ` Brian Foster
  2014-07-26  0:03     ` Dave Chinner
  0 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-25 16:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jul 25, 2014 at 08:32:11AM +1000, Dave Chinner wrote:
> On Thu, Jul 24, 2014 at 10:22:50AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > This is a first pass at sparse inode chunk support for XFS. Some
> > background on this work is available here:
> > 
> > http://oss.sgi.com/archives/xfs/2013-08/msg00346.html
> > 
> > The basic idea is to allow the partial allocation of inode chunks into
> > fragmented regions of free space. This is accomplished through addition
> > of a holemask field into the inobt record that defines what portion(s)
> > of an inode chunk are invalid (i.e., holes in the chunk). This work is
> > not quite complete, but is at a point where I'd like to start getting
> > feedback on the design and what direction to take for some of the known
> > gaps.
> > 
> > The basic breakdown of functionality in this set is as follows:
> > 
> > - Patches 1-2 - A couple generic cleanups that are dependencies for later
> >   patches in the series.
> > - Patches 3-5 - Basic data structure update, feature bit and minor
> >   helper introduction.
> > - Patches 6-7 - Update v5 icreate logging and recovery to handle sparse
> >   inode records.
> > - Patches 8-13 - Allocation support for sparse inode records. Physical
> >   chunk allocation and individual inode allocation.
> > - Patches 14-16 - Deallocation support for sparse inode chunks. Physical
> >   chunk deallocation, individual inode free and cluster free.
> > - Patch 17 - Fixes for bulkstat/inumbers.
> > - Patch 18 - Activate support for sparse chunk allocation and
> >   processing.
> > 
> > This work is lightly tested for regression (some xfstests failures due
> > to repair) and basic functionality. I have a new xfstests test I'll
> > forward along for demonstration purposes.
> > 
> > Some notes on gaps in the design:
> > 
> > - Sparse inode chunk allocation granularity:
> > 
> > The current minimum sparse chunk allocation granularity is the cluster
> > size.
> 
> Looking at the patchset (I got to patch 5 that first uses this),
> this is problematic. the cluster size is currently a kernel
> implementation detail, and not something defined by the on-disk
> format. We can change the cluster size in the kernel and not affect
> the format on disk. Making the cluster size a part of the disk
> format by defining it to be the resolution of sparse inode chunks
> changes that - it's now a part of the on-disk inode format, and that
> greatly limits what we can do with it.
> 

I was going off the inoalignmt bit that's set at mkfs time, but looking
at the code I think I see what you mean...

> > My initial attempts at this work tried to redefine to the minimum
> > chunk length based on the holemask granularity (a la the stale macro I
> > seemingly left in this series ;), but this involves tweaking the
> > codepaths that use the cluster size (i.e., imap) which proved rather
> > hairy.
> 
> This is where we need to head towards, though. The cluster size is
> currently the unit of inode IO, so that needs to be influenced by the
> sparse inode chunk granularity. Yes, we can define the inode chunk
> granularity to be the same as the cluster size, but that simply
> means we need to configure the cluster size appropriately at mount.
> It doesn't mean we need to change what cluster size means or it's
> implementation....
> 

I don't think I've necessarily encoded the cluster size into the disk
format explicitly, but rather used it as a heuristic for how to size the
sparse inode chunk allocations. I think what you're saying is that the
cluster size can change across a mount or code update, while the on-disk
allocation state will not. Thus we can't always go on the assumption
that the on-disk allocations will be sane with regard to the current
cluster size. For example, I suppose if the cluster size increased after
we have some existing sparse allocations we'd end up with some broken
buffer mappings for inode chunks.

> > This also means we need a solution where an imap can change if an
> > inode was initially mapped as a sparse chunk and said chunk is
> > subsequently made full. E.g., we'd perhaps need to invalidate the inode
> > buffers for sparse chunks at the time where they are made full. Given
> > that, I landed on using the cluster size and leaving those codepaths as
> > is for the time being.
> 
> Again, that's kernel inode buffer cache implementaiton details, not
> something that matters for the on-disk format. So really these need
> to be separated. Probably means we need a "sparse inode allocation
> alignment" field in the superblock to define this. Having
> the kernel reject sparse alignments it can't support from the
> initial implementation means we can improve the kernel
> implementation over time and (eventually) support sub-cluster sized
> sparse inode allocation.
> 
> i.e. initial implementation only supports sparse alignment ==
> cluster size, and rejects everything else....
> 

Yeah, that's pretty much what I was trying to accomplish, just not
encoded properly with regard to the superblock options. E.g., I
overloaded the cluster size in a fragile way. Instead, explicitly set
the sparse inode allocation granularity and let that influence the
cluster size. I suspect that means we could even still use the cluster
size heuristic to define the default alloc. granularity, but of course
we aren't married to it so can evaluate that independently. That sounds
much nicer, thanks.

> > There is a tradeoff here for v5 superblocks because we've recently made
> > a change to scale the cluster size based on the factor increase in the
> > inode size from the default (see xfsprogs commit 7b5f9801). This means
> > that effectiveness of sparse chunks is tied to whether the level of free
> > space fragmentation matches the cluster size. By that I mean effectivess
> > is good (near 100% utilization possible) if free space fragmentation
> > leaves free extents around that at least match the cluster size. If
> > fragmentation is worse than the cluster size, effectiveness is reduced.
> > This can also be demonstrated with the forthcoming xfstests test.
> 
> Exactly. We don't need to solve every problem with the initial
> implementation - we can iteratively improve the code because once
> the fields are one disk we only need to change the kernel
> implemenation to support finer grained sparse allocation to solve
> this allocation chunk < cluster size problem....
> 

Right... I didn't want to rule out fixing the imap logic and whatnot at
all by any means. I thought about it a bit, but the solution isn't yet
clear and certainly could involve some more refactoring or rethinking of
abstractions, so I think that's better suited as a follow on effort.

> > - On-disk lifecycle of the sparse inode chunks feature bit:
> > 
> > We set an incompatible feature bit once a sparse inode chunk is
> > allocated because older revisions of code will interpret the non-zero
> > holemask bits in the higher order bytes of the record freecount. The
> > feature bit must be removed once all sparse inode chunks are eliminated
> > one way or another. This series does not currently remove the feature
> > bit once set simply because I hadn't thought through the mechanism quite
> > yet. For the next version, I'm thinking about adding an inobt walk
> > mechanism that can be conditionally invoked (i.e., feature bit is
> > currently set and a sparse inode chunk has been eliminated) either via
> > workqueue on an interval or during unmount if necessary. Thoughts or
> > alternative suggestions on that appreciated.
> 
> I wouldn't bother. Let xfs_repair determine if the bit needs to be
> set or not when it does it's final superblock write after it has
> scanned and repaired the fs.
> 

That sounds good to me.

> I'm even in two minds of whether we want the sb bit added
> dynamically, because it means the same upgrade/downdgrade cycle can
> have different results simply due to filesystem freespace
> fragmentation patterns...
> 
> Perhaps an xfs-admin command to turn the feature on dynamicallyi for
> existing filesystems, kind of like what we did with lazy superblock
> counters when they were introduced?
> 

Hmm, I suppose that does create a new and interesting dynamic with
regard to the feature bit (non-deterministic backwards compatibility).
One could certainly value backwards compatibility over this particular
feature, and there is currently no way to control it. I'll look into
doing something with xfs_admin. In fact, I was thinking of adding
something to tune the cluster size bit to get around the v5 scaling
issue anyways.

Thanks for the feedback. I'm under the weather today so I'll start going
through the rest of it when my head is less foggy.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 00/18] xfs: sparse inode chunks
  2014-07-25 16:30   ` Brian Foster
@ 2014-07-26  0:03     ` Dave Chinner
  2014-07-28 12:14       ` Brian Foster
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2014-07-26  0:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Jul 25, 2014 at 12:30:57PM -0400, Brian Foster wrote:
> On Fri, Jul 25, 2014 at 08:32:11AM +1000, Dave Chinner wrote:
> > On Thu, Jul 24, 2014 at 10:22:50AM -0400, Brian Foster wrote:
> > > Hi all,
> > > 
> > > This is a first pass at sparse inode chunk support for XFS. Some
> > > background on this work is available here:
> > > 
> > > http://oss.sgi.com/archives/xfs/2013-08/msg00346.html
> > > 
> > > The basic idea is to allow the partial allocation of inode chunks into
> > > fragmented regions of free space. This is accomplished through addition
> > > of a holemask field into the inobt record that defines what portion(s)
> > > of an inode chunk are invalid (i.e., holes in the chunk). This work is
> > > not quite complete, but is at a point where I'd like to start getting
> > > feedback on the design and what direction to take for some of the known
> > > gaps.
> > > 
> > > The basic breakdown of functionality in this set is as follows:
> > > 
> > > - Patches 1-2 - A couple generic cleanups that are dependencies for later
> > >   patches in the series.
> > > - Patches 3-5 - Basic data structure update, feature bit and minor
> > >   helper introduction.
> > > - Patches 6-7 - Update v5 icreate logging and recovery to handle sparse
> > >   inode records.
> > > - Patches 8-13 - Allocation support for sparse inode records. Physical
> > >   chunk allocation and individual inode allocation.
> > > - Patches 14-16 - Deallocation support for sparse inode chunks. Physical
> > >   chunk deallocation, individual inode free and cluster free.
> > > - Patch 17 - Fixes for bulkstat/inumbers.
> > > - Patch 18 - Activate support for sparse chunk allocation and
> > >   processing.
> > > 
> > > This work is lightly tested for regression (some xfstests failures due
> > > to repair) and basic functionality. I have a new xfstests test I'll
> > > forward along for demonstration purposes.
> > > 
> > > Some notes on gaps in the design:
> > > 
> > > - Sparse inode chunk allocation granularity:
> > > 
> > > The current minimum sparse chunk allocation granularity is the cluster
> > > size.
> > 
> > Looking at the patchset (I got to patch 5 that first uses this),
> > this is problematic. the cluster size is currently a kernel
> > implementation detail, and not something defined by the on-disk
> > format. We can change the cluster size in the kernel and not affect
> > the format on disk. Making the cluster size a part of the disk
> > format by defining it to be the resolution of sparse inode chunks
> > changes that - it's now a part of the on-disk inode format, and that
> > greatly limits what we can do with it.
> > 
> 
> I was going off the inoalignmt bit that's set at mkfs time, but looking
> at the code I think I see what you mean...
> 
> > > My initial attempts at this work tried to redefine to the minimum
> > > chunk length based on the holemask granularity (a la the stale macro I
> > > seemingly left in this series ;), but this involves tweaking the
> > > codepaths that use the cluster size (i.e., imap) which proved rather
> > > hairy.
> > 
> > This is where we need to head towards, though. The cluster size is
> > currently the unit of inode IO, so that needs to be influenced by the
> > sparse inode chunk granularity. Yes, we can define the inode chunk
> > granularity to be the same as the cluster size, but that simply
> > means we need to configure the cluster size appropriately at mount.
> > It doesn't mean we need to change what cluster size means or it's
> > implementation....
> > 
> 
> I don't think I've necessarily encoded the cluster size into the disk
> format explicitly, but rather used it as a heuristic for how to size the
> sparse inode chunk allocations. I think what you're saying is that the
> cluster size can change across a mount or code update, while the on-disk
> allocation state will not. Thus we can't always go on the assumption
> that the on-disk allocations will be sane with regard to the current
> cluster size. For example, I suppose if the cluster size increased after
> we have some existing sparse allocations we'd end up with some broken
> buffer mappings for inode chunks.

Yup, that's the problem in a nutshell: cluster size is not fixed.

> > Again, that's kernel inode buffer cache implementaiton details, not
> > something that matters for the on-disk format. So really these need
> > to be separated. Probably means we need a "sparse inode allocation
> > alignment" field in the superblock to define this. Having
> > the kernel reject sparse alignments it can't support from the
> > initial implementation means we can improve the kernel
> > implementation over time and (eventually) support sub-cluster sized
> > sparse inode allocation.
> > 
> > i.e. initial implementation only supports sparse alignment ==
> > cluster size, and rejects everything else....
> > 
> 
> Yeah, that's pretty much what I was trying to accomplish, just not
> encoded properly with regard to the superblock options. E.g., I
> overloaded the cluster size in a fragile way. Instead, explicitly set
> the sparse inode allocation granularity and let that influence the
> cluster size. I suspect that means we could even still use the cluster
> size heuristic to define the default alloc. granularity, but of course
> we aren't married to it so can evaluate that independently. That sounds
> much nicer, thanks.

Right. We may not do anything else in the short or medium term, but
having a separate superblock field gives us the flexibility for
change the inode cluster implementation/behaviour in the future...

> > dynamically, because it means the same upgrade/downdgrade cycle can
> > have different results simply due to filesystem freespace
> > fragmentation patterns...
> > 
> > Perhaps an xfs-admin command to turn the feature on dynamicallyi for
> > existing filesystems, kind of like what we did with lazy superblock
> > counters when they were introduced?
> > 
> 
> Hmm, I suppose that does create a new and interesting dynamic with
> regard to the feature bit (non-deterministic backwards compatibility).
> One could certainly value backwards compatibility over this particular
> feature, and there is currently no way to control it. I'll look into
> doing something with xfs_admin. In fact, I was thinking of adding
> something to tune the cluster size bit to get around the v5 scaling
> issue anyways.

What v5 scalability issue is that? I don't recall any outstanding
issues with inode cluster IO....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 00/18] xfs: sparse inode chunks
  2014-07-26  0:03     ` Dave Chinner
@ 2014-07-28 12:14       ` Brian Foster
  2014-07-29  0:26         ` Dave Chinner
  0 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-28 12:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, Jul 26, 2014 at 10:03:35AM +1000, Dave Chinner wrote:
> On Fri, Jul 25, 2014 at 12:30:57PM -0400, Brian Foster wrote:
> > On Fri, Jul 25, 2014 at 08:32:11AM +1000, Dave Chinner wrote:
> > > On Thu, Jul 24, 2014 at 10:22:50AM -0400, Brian Foster wrote:
> > > > Hi all,
> > > > 
> > > > This is a first pass at sparse inode chunk support for XFS. Some
> > > > background on this work is available here:
> > > > 
> > > > http://oss.sgi.com/archives/xfs/2013-08/msg00346.html
> > > > 
> > > > The basic idea is to allow the partial allocation of inode chunks into
> > > > fragmented regions of free space. This is accomplished through addition
> > > > of a holemask field into the inobt record that defines what portion(s)
> > > > of an inode chunk are invalid (i.e., holes in the chunk). This work is
> > > > not quite complete, but is at a point where I'd like to start getting
> > > > feedback on the design and what direction to take for some of the known
> > > > gaps.
> > > > 
> > > > The basic breakdown of functionality in this set is as follows:
> > > > 
> > > > - Patches 1-2 - A couple generic cleanups that are dependencies for later
> > > >   patches in the series.
> > > > - Patches 3-5 - Basic data structure update, feature bit and minor
> > > >   helper introduction.
> > > > - Patches 6-7 - Update v5 icreate logging and recovery to handle sparse
> > > >   inode records.
> > > > - Patches 8-13 - Allocation support for sparse inode records. Physical
> > > >   chunk allocation and individual inode allocation.
> > > > - Patches 14-16 - Deallocation support for sparse inode chunks. Physical
> > > >   chunk deallocation, individual inode free and cluster free.
> > > > - Patch 17 - Fixes for bulkstat/inumbers.
> > > > - Patch 18 - Activate support for sparse chunk allocation and
> > > >   processing.
> > > > 
> > > > This work is lightly tested for regression (some xfstests failures due
> > > > to repair) and basic functionality. I have a new xfstests test I'll
> > > > forward along for demonstration purposes.
> > > > 
> > > > Some notes on gaps in the design:
> > > > 
> > > > - Sparse inode chunk allocation granularity:
> > > > 
> > > > The current minimum sparse chunk allocation granularity is the cluster
> > > > size.
> > > 
> > > Looking at the patchset (I got to patch 5 that first uses this),
> > > this is problematic. the cluster size is currently a kernel
> > > implementation detail, and not something defined by the on-disk
> > > format. We can change the cluster size in the kernel and not affect
> > > the format on disk. Making the cluster size a part of the disk
> > > format by defining it to be the resolution of sparse inode chunks
> > > changes that - it's now a part of the on-disk inode format, and that
> > > greatly limits what we can do with it.
> > > 
> > 
> > I was going off the inoalignmt bit that's set at mkfs time, but looking
> > at the code I think I see what you mean...
> > 
> > > > My initial attempts at this work tried to redefine to the minimum
> > > > chunk length based on the holemask granularity (a la the stale macro I
> > > > seemingly left in this series ;), but this involves tweaking the
> > > > codepaths that use the cluster size (i.e., imap) which proved rather
> > > > hairy.
> > > 
> > > This is where we need to head towards, though. The cluster size is
> > > currently the unit of inode IO, so that needs to be influenced by the
> > > sparse inode chunk granularity. Yes, we can define the inode chunk
> > > granularity to be the same as the cluster size, but that simply
> > > means we need to configure the cluster size appropriately at mount.
> > > It doesn't mean we need to change what cluster size means or it's
> > > implementation....
> > > 
> > 
> > I don't think I've necessarily encoded the cluster size into the disk
> > format explicitly, but rather used it as a heuristic for how to size the
> > sparse inode chunk allocations. I think what you're saying is that the
> > cluster size can change across a mount or code update, while the on-disk
> > allocation state will not. Thus we can't always go on the assumption
> > that the on-disk allocations will be sane with regard to the current
> > cluster size. For example, I suppose if the cluster size increased after
> > we have some existing sparse allocations we'd end up with some broken
> > buffer mappings for inode chunks.
> 
> Yup, that's the problem in a nutshell: cluster size is not fixed.
> 
> > > Again, that's kernel inode buffer cache implementaiton details, not
> > > something that matters for the on-disk format. So really these need
> > > to be separated. Probably means we need a "sparse inode allocation
> > > alignment" field in the superblock to define this. Having
> > > the kernel reject sparse alignments it can't support from the
> > > initial implementation means we can improve the kernel
> > > implementation over time and (eventually) support sub-cluster sized
> > > sparse inode allocation.
> > > 
> > > i.e. initial implementation only supports sparse alignment ==
> > > cluster size, and rejects everything else....
> > > 
> > 
> > Yeah, that's pretty much what I was trying to accomplish, just not
> > encoded properly with regard to the superblock options. E.g., I
> > overloaded the cluster size in a fragile way. Instead, explicitly set
> > the sparse inode allocation granularity and let that influence the
> > cluster size. I suspect that means we could even still use the cluster
> > size heuristic to define the default alloc. granularity, but of course
> > we aren't married to it so can evaluate that independently. That sounds
> > much nicer, thanks.
> 
> Right. We may not do anything else in the short or medium term, but
> having a separate superblock field gives us the flexibility for
> change the inode cluster implementation/behaviour in the future...
> 
> > > dynamically, because it means the same upgrade/downdgrade cycle can
> > > have different results simply due to filesystem freespace
> > > fragmentation patterns...
> > > 
> > > Perhaps an xfs-admin command to turn the feature on dynamicallyi for
> > > existing filesystems, kind of like what we did with lazy superblock
> > > counters when they were introduced?
> > > 
> > 
> > Hmm, I suppose that does create a new and interesting dynamic with
> > regard to the feature bit (non-deterministic backwards compatibility).
> > One could certainly value backwards compatibility over this particular
> > feature, and there is currently no way to control it. I'll look into
> > doing something with xfs_admin. In fact, I was thinking of adding
> > something to tune the cluster size bit to get around the v5 scaling
> > issue anyways.
> 
> What v5 scalability issue is that? I don't recall any outstanding
> issues with inode cluster IO....
> 

There's no scalability issue... I'm just referring to the fact that we
scale the cluster size by the inode size increase factor on v5
superblocks.

E.g., my free space fragmentation xfstests test started out with a fixed
file size based on something close to the worst case with an
implementation that used the allocation granularity of max(<holemask bit
granularity>, <inodes per block>). Once I tied the implementation to the
cluster size due to the aforementioned complexities, it became apparent
the test was less effective with my chosen file size on v5 supers,
particularly as the inode size increased. So from there I was
considering a similar xfs_admin command a user could run to reduce the
cluster size as a backstop should this limitation arise in the real
world. We can start with doing something just to enable the feature as
outlined above and revisit this then...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count
  2014-07-24 22:10   ` Dave Chinner
@ 2014-07-28 16:03     ` Brian Foster
  2014-07-28 23:32       ` Dave Chinner
  0 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-28 16:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jul 25, 2014 at 08:10:38AM +1000, Dave Chinner wrote:
> On Thu, Jul 24, 2014 at 10:22:51AM -0400, Brian Foster wrote:
> > Inodes are always allocated in chunks of 64 and thus the loop in
> > xfs_inobt_insert() is unnecessary.
> 
> I don't believe this is true. The number of inodes allocated at once
> is:
> 
>         mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
> 	                                        sbp->sb_inopblock);
> 

So I'm going on that effectively that the number of inodes per block
will never be larger than 8 (v5) due to a max block size of 4k.

> So when the block size is, say, 64k, the number of 512 byte inodes
> allocated at once is 128. i.e. 2 chunks. Hence xfs_inobt_insert()
> can be called with a inode could of > 64 and therefore the loop is
> still necessary...
> 

Playing with mkfs I see that we actually can format >4k bsize
filesystems and the min and max are set at 512b and 64k. I can't
actually mount such filesystems due to the page size limitation. FWIW,
the default log size params appear to be broken for bsize >= 32k as
well, so I wonder if/how often that format tends to occur.

What's the situation with regard to >PAGE_SIZE block size support? Is
this something we actually could support today? Do we know about any
large page sized arches that could push us into this territory with the
actual page size limitation?

> And, indeed, we might want to increase the allocation size in future
> to do entire stripe units or stripe widths of inodes at once:
> 
> http://xfs.org/index.php/Improving_inode_Caching#Contiguous_Inode_Allocation
> 
> This also means a loop would be required -somewhere-...
> 

Indeed, though I'm less inclined to keep this around for the purposes of
this unimplemented feature. It should be easy enough to add the loop in
the appropriate place according to the code at the time this is
implemented.

I suppose if we have >4k page sized arches that utilize block sizes
outside of the 256b-4k range, that's enough to justify the existence of
the range in the general sense. I just might have to factor this area of
code a bit differently. It would also be nice if there was a means to
test.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2014-07-24 22:14   ` Dave Chinner
@ 2014-07-28 16:16     ` Brian Foster
  2014-08-07 15:18       ` Brian Foster
  0 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-28 16:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jul 25, 2014 at 08:14:53AM +1000, Dave Chinner wrote:
> On Thu, Jul 24, 2014 at 10:22:54AM -0400, Brian Foster wrote:
> > The inode btrees track 64 inodes per record, regardless of inode size.
> > Thus, inode chunks on disk vary in size depending on the size of the
> > inodes. This creates a contiguous allocation requirement for new inode
> > chunks that can be difficult to satisfy on an aged and fragmented (free
> > space) filesystem.
> > 
> > The inode record freecount currently uses 4 bytes on disk to track the
> > free inode count. With a maximum freecount value of 64, only one byte is
> > required. Convert the freecount field to a single byte and reserve two
> > of the remaining 3 higher order bytes left to the hole mask field.
> > 
> > The hole mask field tracks potential holes in the chunks of physical
> > space that the inode record refers to. This facilitates the sparse
> > allocation of inode chunks when contiguous chunks are not available and
> > allows the inode btrees to identify what portions of the chunk contain
> > valid inodes.
> > 
> > Tracking holes means the field is initialized to zero and thus maintains
> > backwards compatibility with existing filesystems. E.g., the higher
> > order bytes of a counter with a max value of 64 are already initialized
> > to 0. Update the inode record management functions to handle the new
> > field and initialize it to zero for now.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h       | 7 +++++--
> >  fs/xfs/libxfs/xfs_ialloc.c       | 9 +++++++--
> >  fs/xfs/libxfs/xfs_ialloc_btree.c | 4 +++-
> >  3 files changed, 15 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 34d85ac..39022d9 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -221,13 +221,16 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
> >   */
> >  typedef struct xfs_inobt_rec {
> >  	__be32		ir_startino;	/* starting inode number */
> > -	__be32		ir_freecount;	/* count of free inodes (set bits) */
> > +	__be16		ir_holemask;	/* hole mask for sparse chunks */
> > +	__u8		ir_pad;
> > +	__u8		ir_freecount;	/* count of free inodes (set bits) */
> >  	__be64		ir_free;	/* free inode mask */
> >  } xfs_inobt_rec_t;
> 
> might we want the number of inodes allocated in the chunk there as
> well (i.e. in the pad field) so we can validate the holemask against
> the number of inodes allocated in the chunk?
> 

So you're suggesting something like this?

-	__be32		ir_freecount;	/* count of free inodes (set bits) */
+	__be16		ir_holemask;	/* hole mask for sparse chunks */
+	__u8		ir_count;	/* total inode count */
+	__u8		ir_freecount;	/* count of free inodes (set bits) */

That's an interesting thought. It might make some of the code more clear
and eliminate the need for the derivation of that value from the
holemask (beyond for validation purposes). I do like the extra
validation and potential debug use given the holemask is not quite as
human friendly as the free mask in terms of having a bit per inode.

As long as there isn't any concern over reserving this space for
something else down the road (I suspect not, since the pad is introduced
by this feature), I'll look to use it as an inode count.

Brian

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/18] xfs: create helper to manage record overlap for sparse inode chunks
  2014-07-24 22:41   ` Dave Chinner
@ 2014-07-28 16:19     ` Brian Foster
  2014-07-29  0:07       ` Dave Chinner
  0 siblings, 1 reply; 50+ messages in thread
From: Brian Foster @ 2014-07-28 16:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jul 25, 2014 at 08:41:12AM +1000, Dave Chinner wrote:
> On Thu, Jul 24, 2014 at 10:22:58AM -0400, Brian Foster wrote:
> > Create xfs_spchunk_has_record() to receive the parameters of a new
> > sparse inode chunk allocation and identify whether a record exists that
> > is capable of tracking this sparse chunk.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_ialloc.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 55 insertions(+)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index 27d3437..be57b51 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -351,6 +351,61 @@ xfs_ialloc_inode_init(
> >  }
> >  
> >  /*
> > + * Determine whether part of a sparse inode chunk that has just been allocated
> > + * is covered by an existing inobt record.
> > + */
> > +STATIC int
> > +xfs_spchunk_has_record(
> 
> not sure I like the "spchunk" naming. I see that and I have no idea
> what subsystem it belongs to. It's actually an inobt lookup
> function, and doesn't really have anything to do with sparse chunks.
> So maybe xfs_inobt_rec_exists or xfs_inobt_lookup_exact?
> 

xfs_inobt_rec_exists() sounds good.

> > +	struct xfs_mount		*mp,
> > +	struct xfs_trans		*tp,
> > +	struct xfs_buf			*agbp,
> > +	xfs_agino_t			newino,
> > +	xfs_agino_t			count,
> > +	xfs_btnum_t			btnum,
> > +	struct xfs_inobt_rec_incore	*orec)
> > +{
> > +	struct xfs_btree_cur		*cur;
> > +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> > +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> > +	xfs_agino_t			previno;
> > +	int				error;
> > +	int				i;
> > +	struct xfs_inobt_rec_incore	rec;
> > +
> > +	orec->ir_startino = NULLAGINO;
> > +
> > +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
> > +
> > +	previno = newino + count - XFS_INODES_PER_CHUNK;
> > +	error = xfs_inobt_lookup(cur, previno, XFS_LOOKUP_GE, &i);
> 
> You want XFS_LOOKUP_EQ, yes? i.e. XFS_LOOKUP_GE won't fail if the
> exact record for the inode chunk does not exist - it will return the
> next one in the btree.
> 

Assuming variable sparse chunk granularity, I don't really know the
start ino of the record that potentially covers the new inode chunk.
Given that, we use the smallest possible start ino that could include
this chunk and search forward from there. As you've noted below, I
wasn't relying on failure here to detect the scenario where there is no
existing record.

Brian

> > +	if (error)
> > +		goto error;
> > +	if (i == 0)
> > +		goto out;
> > +
> > +	error = xfs_inobt_get_rec(cur, &rec, &i);
> > +	if (error)
> > +		goto error;
> > +	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> > +
> > +	if (rec.ir_startino > newino)
> > +		goto out;
> 
> And so this check would not be necessary...
> 
> > +
> > +	ASSERT(rec.ir_startino <= newino &&
> > +	       rec.ir_startino + XFS_INODES_PER_CHUNK > newino);
> > +	ASSERT(rec.ir_freecount + count <= XFS_INODES_PER_CHUNK);
> > +
> > +	*orec = rec;
> 
> /* struct copy */
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/18] xfs: set sparse inodes feature bit when a sparse chunk is allocated
  2014-07-24 22:46   ` Dave Chinner
@ 2014-07-28 16:23     ` Brian Foster
  0 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-28 16:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Jul 25, 2014 at 08:46:28AM +1000, Dave Chinner wrote:
> On Thu, Jul 24, 2014 at 10:23:00AM -0400, Brian Foster wrote:
> > The sparse chunk inode record format is backwards compatible with old
> > format inobt records as long as full chunks are allocated. The holemask
> > field uses higher order bytes of the freecount. While sparse chunks can
> > be enabled on previously unsupported fs, older kernel drivers cannot
> > parse sparse inode records.
> > 
> > Set the feature incompatible bit once a sparse inode chunk is allocated
> > to prevent older XFS drivers from tripping over the new format.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_ialloc.c | 33 +++++++++++++++++++++++++++++++++
> >  1 file changed, 33 insertions(+)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index 4226b1b..4dd45c2 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -407,6 +407,27 @@ error:
> >  	return error;
> >  }
> >  
> > +STATIC void
> > +xfs_sbversion_add_spinodes(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_mount	*mp)
> > +{
> > +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
> > +					    XFS_SB_FEAT_INCOMPAT_SPINODES))
> 
> wrong function. xfs_sb_has_incompat_feature().
> 

Oops, right.

> > +		return;
> > +
> > +	spin_lock(&mp->m_sb_lock);
> > +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
> > +					    XFS_SB_FEAT_INCOMPAT_SPINODES)) {
> > +		spin_unlock(&mp->m_sb_lock);
> > +		return;
> > +	}
> > +
> > +	mp->m_sb.sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_SPINODES;
> > +	spin_unlock(&mp->m_sb_lock);
> > +	xfs_mod_sb(tp, XFS_SB_FEATURES_INCOMPAT);
> > +}
> > +
> >  /*
> >   * Allocate new inodes in the allocation group specified by agbp.
> >   * Return 0 for success, else error code.
> > @@ -631,6 +652,18 @@ xfs_ialloc_ag_alloc(
> >  		if (error)
> >  			return error;
> >  	}
> > +
> > +	/*
> > +	 * Set an incompat feature bit as old drivers can't parse sparse
> > +	 * records. Pre-sparse inode chunk drivers will include the
> > +	 * holemask in the higher order freecount bits, resulting in a
> > +	 * bogus value.
> > +	 *
> > +	 * XXX: when is this bit removed?
> > +	 */
> > +	if (xfs_inobt_issparse(&rec))
> > +		xfs_sbversion_add_spinodes(tp, args.mp);
> 
> I'd like to avoid this dynamic feature bit adding if possible. Do we
> really need it?
> 

Nope. As discussed, I'll move to something purely manual for the next
iteration and we can move forward from there with regard to if/when to
set it by default.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count
  2014-07-28 16:03     ` Brian Foster
@ 2014-07-28 23:32       ` Dave Chinner
  2014-07-29 14:43         ` Brian Foster
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2014-07-28 23:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Mon, Jul 28, 2014 at 12:03:52PM -0400, Brian Foster wrote:
> On Fri, Jul 25, 2014 at 08:10:38AM +1000, Dave Chinner wrote:
> > On Thu, Jul 24, 2014 at 10:22:51AM -0400, Brian Foster wrote:
> > > Inodes are always allocated in chunks of 64 and thus the loop in
> > > xfs_inobt_insert() is unnecessary.
> > 
> > I don't believe this is true. The number of inodes allocated at once
> > is:
> > 
> >         mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
> > 	                                        sbp->sb_inopblock);
> > 
> 
> So I'm going on that effectively that the number of inodes per block
> will never be larger than 8 (v5) due to a max block size of 4k.

The whole world is not x86... ;)

> > So when the block size is, say, 64k, the number of 512 byte inodes
> > allocated at once is 128. i.e. 2 chunks. Hence xfs_inobt_insert()
> > can be called with a inode could of > 64 and therefore the loop is
> > still necessary...
> > 
> 
> Playing with mkfs I see that we actually can format >4k bsize
> filesystems and the min and max are set at 512b and 64k. I can't
> actually mount such filesystems due to the page size limitation.

The whole world is not x86.... ;)

ia64 and power default to 64k page size, so we have to code
everything to work with 64k block sizes.

> FWIW,
> the default log size params appear to be broken for bsize >= 32k as

In what way?

> well, so I wonder if/how often that format tends to occur.

More often than you think.

> What's the situation with regard to >PAGE_SIZE block size support? Is
> this something we actually could support today?

Well, the problem is bufferheads and page cache don't support blocks
large than page size. The metadata side of XFS supports it just fine
through the xfs_buf structures, but the file data side doesn't.
That's one of the things I'm slowly trying to find time to fix (i.e.
kill bufferheads).

> Do we know about any
> large page sized arches that could push us into this territory with the
> actual page size limitation?

Yes, see above. We have always supported 64k block sizes on Linux
ever since ia64 supported 64k page sizes (i.e. for at least 10
years), so we can't now say "we only support 4k block sizes"....

> I suppose if we have >4k page sized arches that utilize block sizes
> outside of the 256b-4k range, that's enough to justify the existence of
> the range in the general sense. I just might have to factor this area of
> code a bit differently. It would also be nice if there was a means to
> test.

Grab a ppc64 box from the RH QE guys or ask them to test it....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/18] xfs: create helper to manage record overlap for sparse inode chunks
  2014-07-28 16:19     ` Brian Foster
@ 2014-07-29  0:07       ` Dave Chinner
  2014-07-29 15:10         ` Brian Foster
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2014-07-29  0:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Mon, Jul 28, 2014 at 12:19:25PM -0400, Brian Foster wrote:
> On Fri, Jul 25, 2014 at 08:41:12AM +1000, Dave Chinner wrote:
> > On Thu, Jul 24, 2014 at 10:22:58AM -0400, Brian Foster wrote:
> > > +	struct xfs_btree_cur		*cur;
> > > +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> > > +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> > > +	xfs_agino_t			previno;
> > > +	int				error;
> > > +	int				i;
> > > +	struct xfs_inobt_rec_incore	rec;
> > > +
> > > +	orec->ir_startino = NULLAGINO;
> > > +
> > > +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
> > > +
> > > +	previno = newino + count - XFS_INODES_PER_CHUNK;
> > > +	error = xfs_inobt_lookup(cur, previno, XFS_LOOKUP_GE, &i);
> > 
> > You want XFS_LOOKUP_EQ, yes? i.e. XFS_LOOKUP_GE won't fail if the
> > exact record for the inode chunk does not exist - it will return the
> > next one in the btree.
> > 
> 
> Assuming variable sparse chunk granularity,

Isn't the granularity fixed for the specific filesystem
configuration as part of the on-disk format?

> I don't really know the
> start ino of the record that potentially covers the new inode chunk.
> Given that, we use the smallest possible start ino that could include
> this chunk and search forward from there. As you've noted below, I
> wasn't relying on failure here to detect the scenario where there is no
> existing record.

Ok, that's not how I thought the code was attempting to implement
the "has record" check. My mistake - a comment explaining how the
match is supposed to work would be helpful, I think.

However, with that in mind, why do you even bother calculating at
"previno"?  If you want the chunk that the "newino" lies in, then
by definition it's going to be the first record at an equal or
lower start inode number than newino.  i.e.:

	xfs_inobt_lookup(cur, newino, XFS_LOOKUP_LE, &i);

Will return either:

	- a match with startino <= newino < startino + XFS_INODES_PER_CHUNK
	- a match with startino + XFS_INODES_PER_CHUNK <= newino
	- a failure due to no record.

i.e. the first case is the chunk record we want, the others are
"does not exist" failures. We don't need to calculate the "previno"
at all.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 00/18] xfs: sparse inode chunks
  2014-07-28 12:14       ` Brian Foster
@ 2014-07-29  0:26         ` Dave Chinner
  2014-07-29 15:25           ` Brian Foster
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2014-07-29  0:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Mon, Jul 28, 2014 at 08:14:01AM -0400, Brian Foster wrote:
> On Sat, Jul 26, 2014 at 10:03:35AM +1000, Dave Chinner wrote:
> > On Fri, Jul 25, 2014 at 12:30:57PM -0400, Brian Foster wrote:
> > > Hmm, I suppose that does create a new and interesting dynamic with
> > > regard to the feature bit (non-deterministic backwards compatibility).
> > > One could certainly value backwards compatibility over this particular
> > > feature, and there is currently no way to control it. I'll look into
> > > doing something with xfs_admin. In fact, I was thinking of adding
> > > something to tune the cluster size bit to get around the v5 scaling
> > > issue anyways.
> > 
> > What v5 scalability issue is that? I don't recall any outstanding
> > issues with inode cluster IO....
> > 
> 
> There's no scalability issue... I'm just referring to the fact that we
> scale the cluster size by the inode size increase factor on v5
> superblocks.
> 
> E.g., my free space fragmentation xfstests test started out with a fixed
> file size based on something close to the worst case with an
> implementation that used the allocation granularity of max(<holemask bit
> granularity>, <inodes per block>). Once I tied the implementation to the
> cluster size due to the aforementioned complexities, it became apparent
> the test was less effective with my chosen file size on v5 supers,
> particularly as the inode size increased. So from there I was
> considering a similar xfs_admin command a user could run to reduce the
> cluster size as a backstop should this limitation arise in the real
> world. We can start with doing something just to enable the feature as
> outlined above and revisit this then...

Right, but I'd suggest that the better long term solution to avoid
the limitations of inode cluster buffer alignment issues is to get
rid of inode clusters and inode buffers altogether. We only need
inode buffers for logging unlinked list modifications, so once we
log those as part of the inode core for for v5 filesystems then we
can do much more dynamic inode IO. That then frees us up to do fine
grained sparse inode allocation because we aren't limited by
in-memory buffering limitations.

http://xfs.org/index.php/Improving_inode_Caching#Food_For_Thought_.28Crazy_Ideas.29

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count
  2014-07-28 23:32       ` Dave Chinner
@ 2014-07-29 14:43         ` Brian Foster
  0 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-29 14:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Jul 29, 2014 at 09:32:19AM +1000, Dave Chinner wrote:
> On Mon, Jul 28, 2014 at 12:03:52PM -0400, Brian Foster wrote:
> > On Fri, Jul 25, 2014 at 08:10:38AM +1000, Dave Chinner wrote:
> > > On Thu, Jul 24, 2014 at 10:22:51AM -0400, Brian Foster wrote:
> > > > Inodes are always allocated in chunks of 64 and thus the loop in
> > > > xfs_inobt_insert() is unnecessary.
> > > 
> > > I don't believe this is true. The number of inodes allocated at once
> > > is:
> > > 
> > >         mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
> > > 	                                        sbp->sb_inopblock);
> > > 
> > 
> > So I'm going on that effectively that the number of inodes per block
> > will never be larger than 8 (v5) due to a max block size of 4k.
> 
> The whole world is not x86... ;)
> 
> > > So when the block size is, say, 64k, the number of 512 byte inodes
> > > allocated at once is 128. i.e. 2 chunks. Hence xfs_inobt_insert()
> > > can be called with a inode could of > 64 and therefore the loop is
> > > still necessary...
> > > 
> > 
> > Playing with mkfs I see that we actually can format >4k bsize
> > filesystems and the min and max are set at 512b and 64k. I can't
> > actually mount such filesystems due to the page size limitation.
> 
> The whole world is not x86.... ;)
> 
> ia64 and power default to 64k page size, so we have to code
> everything to work with 64k block sizes.
> 

Yeah, I was aware there are >4k paged arches. I just wasn't sure which
and how they're used. I'll look into these.

> > FWIW,
> > the default log size params appear to be broken for bsize >= 32k as
> 
> In what way?
> 

# mkfs.xfs -f /dev/test/scratch -bsize=32k
log size 320 blocks too small, minimum size is 512 blocks
Usage: mkfs.xfs
...

This is a 10G lv so I suspect the following code is related:

	...
	} else if (dblocks < GIGABYTES(16, blocklog)) {
	...
		logblocks = MIN(XFS_MIN_LOG_BYTES >> blocklog,
				min_logblocks * XFS_DFL_LOG_FACTOR);
	} else {
	...

E.g., MIN_LOG_BYTES is 10MB and MIN_LOG_BLOCKS is 512 (then multiplied
by 5 here). The latter calculation results in 80MB, so we choose the
former and the subsequent log size validation fails due to not meeting
the minimum block count requirement. It still doesn't make much sense to
me why we use the min here. The minimum log ends up being 16MB for 32k
block size even if we skip the LOG_FACTOR scaling.

> > well, so I wonder if/how often that format tends to occur.
> 
> More often than you think.
> 

Not too surprising. :) FWIW, this is in fact limited to the <16GB fs
case. The small size range probably reduces the chances of hitting this
(as opposed to block size).

> > What's the situation with regard to >PAGE_SIZE block size support? Is
> > this something we actually could support today?
> 
> Well, the problem is bufferheads and page cache don't support blocks
> large than page size. The metadata side of XFS supports it just fine
> through the xfs_buf structures, but the file data side doesn't.
> That's one of the things I'm slowly trying to find time to fix (i.e.
> kill bufferheads).
> 

Ok.

> > Do we know about any
> > large page sized arches that could push us into this territory with the
> > actual page size limitation?
> 
> Yes, see above. We have always supported 64k block sizes on Linux
> ever since ia64 supported 64k page sizes (i.e. for at least 10
> years), so we can't now say "we only support 4k block sizes"....
> 

Indeed, I'd expect to have to support it. I was just looking for more
background.

> > I suppose if we have >4k page sized arches that utilize block sizes
> > outside of the 256b-4k range, that's enough to justify the existence of
> > the range in the general sense. I just might have to factor this area of
> > code a bit differently. It would also be nice if there was a means to
> > test.
> 
> Grab a ppc64 box from the RH QE guys or ask them to test it....
> 

Will do, thanks.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/18] xfs: create helper to manage record overlap for sparse inode chunks
  2014-07-29  0:07       ` Dave Chinner
@ 2014-07-29 15:10         ` Brian Foster
  0 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-29 15:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Jul 29, 2014 at 10:07:33AM +1000, Dave Chinner wrote:
> On Mon, Jul 28, 2014 at 12:19:25PM -0400, Brian Foster wrote:
> > On Fri, Jul 25, 2014 at 08:41:12AM +1000, Dave Chinner wrote:
> > > On Thu, Jul 24, 2014 at 10:22:58AM -0400, Brian Foster wrote:
> > > > +	struct xfs_btree_cur		*cur;
> > > > +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> > > > +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> > > > +	xfs_agino_t			previno;
> > > > +	int				error;
> > > > +	int				i;
> > > > +	struct xfs_inobt_rec_incore	rec;
> > > > +
> > > > +	orec->ir_startino = NULLAGINO;
> > > > +
> > > > +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
> > > > +
> > > > +	previno = newino + count - XFS_INODES_PER_CHUNK;
> > > > +	error = xfs_inobt_lookup(cur, previno, XFS_LOOKUP_GE, &i);
> > > 
> > > You want XFS_LOOKUP_EQ, yes? i.e. XFS_LOOKUP_GE won't fail if the
> > > exact record for the inode chunk does not exist - it will return the
> > > next one in the btree.
> > > 
> > 
> > Assuming variable sparse chunk granularity,
> 
> Isn't the granularity fixed for the specific filesystem
> configuration as part of the on-disk format?
> 

Sort of, but I'm thinking of that as a limitation of the imap code and
such. I'd like to avoid introducing more of such assumptions where
possible in the implementation. That's what I meant before about not
explicitly encoding it. I wanted to use the cluster size (now the
"spinoalignmt") only in the few places that needed the allocation size
and let the rest of the code make no assumptions and work against the
minimum granularity defined by the on-disk format (i.e., inodes per
holemask bit, inodes per record).

The only reason I had to base the sparse alloc. granularity on the
cluster size is so I don't have to go through and figure out how to fix
that inode buffer code as a dependency to get a basic mechanism working.

There's also the scenario where if the granularity can end up small
enough, I'm not sure we can reliably calculate the starting inode of a
record (unless we make changes in the allocation path). TBH, even if we
could, I'd rather keep the code flexible and warn/assert/error on the
failed assumption with more information.

> > I don't really know the
> > start ino of the record that potentially covers the new inode chunk.
> > Given that, we use the smallest possible start ino that could include
> > this chunk and search forward from there. As you've noted below, I
> > wasn't relying on failure here to detect the scenario where there is no
> > existing record.
> 
> Ok, that's not how I thought the code was attempting to implement
> the "has record" check. My mistake - a comment explaining how the
> match is supposed to work would be helpful, I think.
> 

Indeed, I'll add a comment with some context.

> However, with that in mind, why do you even bother calculating at
> "previno"?  If you want the chunk that the "newino" lies in, then
> by definition it's going to be the first record at an equal or
> lower start inode number than newino.  i.e.:
> 
> 	xfs_inobt_lookup(cur, newino, XFS_LOOKUP_LE, &i);
> 
> Will return either:
> 
> 	- a match with startino <= newino < startino + XFS_INODES_PER_CHUNK
> 	- a match with startino + XFS_INODES_PER_CHUNK <= newino
> 	- a failure due to no record.
> 
> i.e. the first case is the chunk record we want, the others are
> "does not exist" failures. We don't need to calculate the "previno"
> at all.
> 

Yeah, that might be nicer. I'll try the search in the other direction.
Thanks.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 00/18] xfs: sparse inode chunks
  2014-07-29  0:26         ` Dave Chinner
@ 2014-07-29 15:25           ` Brian Foster
  0 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-07-29 15:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Jul 29, 2014 at 10:26:50AM +1000, Dave Chinner wrote:
> On Mon, Jul 28, 2014 at 08:14:01AM -0400, Brian Foster wrote:
> > On Sat, Jul 26, 2014 at 10:03:35AM +1000, Dave Chinner wrote:
> > > On Fri, Jul 25, 2014 at 12:30:57PM -0400, Brian Foster wrote:
> > > > Hmm, I suppose that does create a new and interesting dynamic with
> > > > regard to the feature bit (non-deterministic backwards compatibility).
> > > > One could certainly value backwards compatibility over this particular
> > > > feature, and there is currently no way to control it. I'll look into
> > > > doing something with xfs_admin. In fact, I was thinking of adding
> > > > something to tune the cluster size bit to get around the v5 scaling
> > > > issue anyways.
> > > 
> > > What v5 scalability issue is that? I don't recall any outstanding
> > > issues with inode cluster IO....
> > > 
> > 
> > There's no scalability issue... I'm just referring to the fact that we
> > scale the cluster size by the inode size increase factor on v5
> > superblocks.
> > 
> > E.g., my free space fragmentation xfstests test started out with a fixed
> > file size based on something close to the worst case with an
> > implementation that used the allocation granularity of max(<holemask bit
> > granularity>, <inodes per block>). Once I tied the implementation to the
> > cluster size due to the aforementioned complexities, it became apparent
> > the test was less effective with my chosen file size on v5 supers,
> > particularly as the inode size increased. So from there I was
> > considering a similar xfs_admin command a user could run to reduce the
> > cluster size as a backstop should this limitation arise in the real
> > world. We can start with doing something just to enable the feature as
> > outlined above and revisit this then...
> 
> Right, but I'd suggest that the better long term solution to avoid
> the limitations of inode cluster buffer alignment issues is to get
> rid of inode clusters and inode buffers altogether. We only need
> inode buffers for logging unlinked list modifications, so once we
> log those as part of the inode core for for v5 filesystems then we
> can do much more dynamic inode IO. That then frees us up to do fine
> grained sparse inode allocation because we aren't limited by
> in-memory buffering limitations.
> 
> http://xfs.org/index.php/Improving_inode_Caching#Food_For_Thought_.28Crazy_Ideas.29
> 

Interesting, thanks. I guess removing the need for the code that's
incompatible with the sub-cluster size granularity is a nice option. :)
I'll have to read into this some more once the basic sparse inode
feature is more hashed out.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2014-07-28 16:16     ` Brian Foster
@ 2014-08-07 15:18       ` Brian Foster
  0 siblings, 0 replies; 50+ messages in thread
From: Brian Foster @ 2014-08-07 15:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Jul 28, 2014 at 12:16:33PM -0400, Brian Foster wrote:
> On Fri, Jul 25, 2014 at 08:14:53AM +1000, Dave Chinner wrote:
> > On Thu, Jul 24, 2014 at 10:22:54AM -0400, Brian Foster wrote:
> > > The inode btrees track 64 inodes per record, regardless of inode size.
> > > Thus, inode chunks on disk vary in size depending on the size of the
> > > inodes. This creates a contiguous allocation requirement for new inode
> > > chunks that can be difficult to satisfy on an aged and fragmented (free
> > > space) filesystem.
> > > 
> > > The inode record freecount currently uses 4 bytes on disk to track the
> > > free inode count. With a maximum freecount value of 64, only one byte is
> > > required. Convert the freecount field to a single byte and reserve two
> > > of the remaining 3 higher order bytes left to the hole mask field.
> > > 
> > > The hole mask field tracks potential holes in the chunks of physical
> > > space that the inode record refers to. This facilitates the sparse
> > > allocation of inode chunks when contiguous chunks are not available and
> > > allows the inode btrees to identify what portions of the chunk contain
> > > valid inodes.
> > > 
> > > Tracking holes means the field is initialized to zero and thus maintains
> > > backwards compatibility with existing filesystems. E.g., the higher
> > > order bytes of a counter with a max value of 64 are already initialized
> > > to 0. Update the inode record management functions to handle the new
> > > field and initialize it to zero for now.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h       | 7 +++++--
> > >  fs/xfs/libxfs/xfs_ialloc.c       | 9 +++++++--
> > >  fs/xfs/libxfs/xfs_ialloc_btree.c | 4 +++-
> > >  3 files changed, 15 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 34d85ac..39022d9 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -221,13 +221,16 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
> > >   */
> > >  typedef struct xfs_inobt_rec {
> > >  	__be32		ir_startino;	/* starting inode number */
> > > -	__be32		ir_freecount;	/* count of free inodes (set bits) */
> > > +	__be16		ir_holemask;	/* hole mask for sparse chunks */
> > > +	__u8		ir_pad;
> > > +	__u8		ir_freecount;	/* count of free inodes (set bits) */
> > >  	__be64		ir_free;	/* free inode mask */
> > >  } xfs_inobt_rec_t;
> > 
> > might we want the number of inodes allocated in the chunk there as
> > well (i.e. in the pad field) so we can validate the holemask against
> > the number of inodes allocated in the chunk?
> > 
> 
> So you're suggesting something like this?
> 
> -	__be32		ir_freecount;	/* count of free inodes (set bits) */
> +	__be16		ir_holemask;	/* hole mask for sparse chunks */
> +	__u8		ir_count;	/* total inode count */
> +	__u8		ir_freecount;	/* count of free inodes (set bits) */
> 
> That's an interesting thought. It might make some of the code more clear
> and eliminate the need for the derivation of that value from the
> holemask (beyond for validation purposes). I do like the extra
> validation and potential debug use given the holemask is not quite as
> human friendly as the free mask in terms of having a bit per inode.
> 
> As long as there isn't any concern over reserving this space for
> something else down the road (I suspect not, since the pad is introduced
> by this feature), I'll look to use it as an inode count.
> 

The ir_count field along with changing the way we handle the feature bit
to be fixed rather than dynamic introduces an interesting design point
with regard to backwards compatibility.

The feature bit is now fixed at mkfs time and we have an assumption that
it could be removed via some userspace mechanism (e.g., repair,
xfs_admin, etc.). To do that safely, the mechanism requires some kind of
verification that sparse inode chunks do not exist, either as part of
the repair tracking or some kind of new counter somewhere that xfs_admin
could refer to.

The ir_count field helps with verification of the holemask and
simplifies calculation of a "real" inode count for such records (e.g.,
for bulkstat). We could set ir_count only for inode records that are
sparse, but that seems slightly unfortunate if we expect this record
format to eventually be default. Alternatively (and what I have in my
tree at the moment), we can use ir_count unconditionally so long as the
feature bit is set. That obviously breaks the backwards compatibility of
the record format since this is a higher order byte of ir_freecount on
older kernels.

So if the feature bit is mkfs time and there is no intent to enable it
thereafter, it seems there isn't much benefit to record backwards
compatibility. If we go that route, perhaps it's better to have an
ir_alloc field or some such instead of ir_holemask, invert the bits and
eliminate that bit of complexity. On the flipside, limiting usage of
ir_count and retaining ir_holemask in its current form leaves open the
possibility for enabling sparse inode chunks without a reformat (given
certain requirements, e.g., a v5 superblock). Thoughts?

Brian

> Brian
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2014-08-07 15:18 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
2014-07-24 14:22 ` [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count Brian Foster
2014-07-24 22:10   ` Dave Chinner
2014-07-28 16:03     ` Brian Foster
2014-07-28 23:32       ` Dave Chinner
2014-07-29 14:43         ` Brian Foster
2014-07-24 14:22 ` [PATCH 02/18] xfs: pass xfs_mount directly to xfs_ialloc_cluster_alignment() Brian Foster
2014-07-24 14:22 ` [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function Brian Foster
2014-07-24 17:08   ` Mark Tinguely
2014-07-24 17:37     ` Brian Foster
2014-07-24 18:38       ` Mark Tinguely
2014-07-24 19:38         ` Brian Foster
2014-07-24 23:35   ` Dave Chinner
2014-07-24 14:22 ` [PATCH 04/18] xfs: introduce inode record hole mask for sparse inode chunks Brian Foster
2014-07-24 22:14   ` Dave Chinner
2014-07-28 16:16     ` Brian Foster
2014-08-07 15:18       ` Brian Foster
2014-07-24 14:22 ` [PATCH 05/18] xfs: create macros/helpers for dealing with " Brian Foster
2014-07-24 22:13   ` Dave Chinner
2014-07-24 14:22 ` [PATCH 06/18] xfs: pass inode count through ordered icreate log item Brian Foster
2014-07-24 14:22 ` [PATCH 07/18] xfs: handle sparse inode chunks in icreate log recovery Brian Foster
2014-07-24 14:22 ` [PATCH 08/18] xfs: create helper to manage record overlap for sparse inode chunks Brian Foster
2014-07-24 22:41   ` Dave Chinner
2014-07-28 16:19     ` Brian Foster
2014-07-29  0:07       ` Dave Chinner
2014-07-29 15:10         ` Brian Foster
2014-07-24 14:22 ` [PATCH 09/18] xfs: allocate sparse inode chunks on full chunk allocation failure Brian Foster
2014-07-24 14:23 ` [PATCH 10/18] xfs: set sparse inodes feature bit when a sparse chunk is allocated Brian Foster
2014-07-24 22:46   ` Dave Chinner
2014-07-28 16:23     ` Brian Foster
2014-07-24 14:23 ` [PATCH 11/18] xfs: reduce min. inode allocation space requirement for sparse inode chunks Brian Foster
2014-07-24 22:50   ` Dave Chinner
2014-07-24 14:23 ` [PATCH 12/18] xfs: helper to convert inobt record holemask to inode alloc. bitmap Brian Foster
2014-07-24 23:21   ` Dave Chinner
2014-07-24 14:23 ` [PATCH 13/18] xfs: filter out sparse regions from individual inode allocation Brian Foster
2014-07-24 14:23 ` [PATCH 14/18] xfs: update free inode record logic to support sparse inode records Brian Foster
2014-07-24 14:23 ` [PATCH 15/18] xfs: only free allocated regions of inode chunks Brian Foster
2014-07-24 23:24   ` Dave Chinner
2014-07-24 14:23 ` [PATCH 16/18] xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() Brian Foster
2014-07-24 14:23 ` [PATCH 17/18] xfs: use actual inode count for sparse records in bulkstat/inumbers Brian Foster
2014-07-24 23:29   ` Dave Chinner
2014-07-24 14:23 ` [PATCH 18/18] xfs: enable sparse inode chunks for v5 superblocks Brian Foster
2014-07-24 23:34   ` Dave Chinner
2014-07-24 16:28 ` [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
2014-07-24 22:32 ` Dave Chinner
2014-07-25 16:30   ` Brian Foster
2014-07-26  0:03     ` Dave Chinner
2014-07-28 12:14       ` Brian Foster
2014-07-29  0:26         ` Dave Chinner
2014-07-29 15:25           ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.