All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 00/18] xfs: sparse inode chunks
@ 2015-02-06 19:52 Brian Foster
  2015-02-06 19:52 ` [PATCH v3 01/18] xfs: add sparse inode chunk alignment superblock field Brian Foster
                   ` (17 more replies)
  0 siblings, 18 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

Hi all,

Here's v3 of sparse inode chunk suport for XFS. The primary update for
this version is an update to how inodes are aligned when sparse inode
support is enabled. Inode chunks are currently aligned to cluster size
to support single I/O per inode cluster. This means that the minimum
block range between two non-adjacent inode chunks is the cluster size.
Cluster size is also the granularity of sparse allocation. Therefore, it
is possible to allocate a cluster size chunk that cannot be converted to
an inode record due to overlap on both sides (ambiguous metadata). The
only recourse in this situation is to undo the allocation and likely
return ENOSPC. Given the added complexity of that and the already
complicated inode allocation path, an approach that avoids this
potential condition in the first place is preferred.

To address this situation, inode alignment is increased from cluster
size to chunk size (by mkfs) when sparse inode chunks are enabled. This
guarantees that the minimum block range between two non-adjacent inode
chunks is at least big enough for one full chunk. This greatly
simplifies sparse inode record management. Allocations occur at cluster
size granularity and are shifted into inode records that align to chunk
size. In other words, for any particular sparse allocation, a well known
record startino is determined by aligning the agbno of the allocation to
the chunk size.

The increased inode chunk alignment does limit the ability to allocate
full inode chunks on a significantly populated fs, but what is lost in
that regard is regained by the ability to allocate sparse records in any
AG that can satisfy the minimum free space requirement[1].

Other changes in this version include marking the feature as
experimental, a block allocation agbno range limit to avoid invalid
inode records at AG boundaries, DEBUG mode allocation logic to improve
test coverage, etc. This series survives xfstests regression runs on
basic v5 configurations as well as some longer term debug mode fsstress
testing. Thoughts, reviews, flames appreciated!

Brian

[1] - This can also be mitigated by future work to consider allocation
of inode chunks in batches rather than one at a time.

v3:
- Rebase to latest for-next (bulkstat rework, data structure shuffling,
  etc.).
- Fix issparse helper logic.
- Update inode alignment model w/ spinodes enabled. All inode records
  are chunk size aligned, sparse allocations cluster size aligned (both
  enforced on mount).
- Reworked sparse inode record merge logic to coincide w/ new alignment
  model.
- Mark feature as experimental (warn on mount).
- Include and use block allocation agbno range limit to prevent
  allocation of invalid inode records.
- Add some DEBUG bits to improve sparse alloc. test coverage.
v2: http://oss.sgi.com/archives/xfs/2014-11/msg00007.html
- Use a manually set feature bit instead of dynamic based on the
  existence of sparse inode chunks.
- Add sb/mp fields for sparse alloc. granularity (use instead of cluster
  size).
- Undo xfs_inobt_insert() loop removal to avoid breakage of larger page
  size arches.
- Rename sparse record overlap helper and do XFS_LOOKUP_LE search.
- Use byte of pad space in inobt record for inode count field.
- Convert bitmap mgmt to use generic bitmap code.
- Rename XFS_INODES_PER_SPCHUNK to XFS_INODES_PER_HOLEMASK_BIT.
- Add fs geometry bit for sparse inodes.
- Rebase to latest for-next (bulkstat refactor).
v1: http://oss.sgi.com/archives/xfs/2014-07/msg00355.html

Brian Foster (18):
  xfs: add sparse inode chunk alignment superblock field
  xfs: use sparse chunk alignment for min. inode allocation requirement
  xfs: sparse inode chunks feature helpers and mount requirements
  xfs: introduce inode record hole mask for sparse inode chunks
  xfs: create macros/helpers for dealing with sparse inode chunks
  xfs: pass inode count through ordered icreate log item
  xfs: handle sparse inode chunks in icreate log recovery
  xfs: helpers to convert holemask to/from generic bitmap
  xfs: support min/max agbno args in block allocator
  xfs: allocate sparse inode chunks on full chunk allocation failure
  xfs: randomly do sparse inode allocations in DEBUG mode
  xfs: filter out sparse regions from individual inode allocation
  xfs: update free inode record logic to support sparse inode records
  xfs: only free allocated regions of inode chunks
  xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()
  xfs: use actual inode count for sparse records in bulkstat/inumbers
  xfs: add fs geometry bit for sparse inode chunks
  xfs: enable sparse inode chunks for v5 superblocks

 fs/xfs/libxfs/xfs_alloc.c        |  42 ++-
 fs/xfs/libxfs/xfs_alloc.h        |   2 +
 fs/xfs/libxfs/xfs_format.h       |  33 +-
 fs/xfs/libxfs/xfs_fs.h           |   1 +
 fs/xfs/libxfs/xfs_ialloc.c       | 651 ++++++++++++++++++++++++++++++++++++---
 fs/xfs/libxfs/xfs_ialloc.h       |  17 +-
 fs/xfs/libxfs/xfs_ialloc_btree.c |   4 +-
 fs/xfs/libxfs/xfs_sb.c           |  31 +-
 fs/xfs/xfs_fsops.c               |   4 +-
 fs/xfs/xfs_inode.c               |  28 +-
 fs/xfs/xfs_itable.c              |  14 +-
 fs/xfs/xfs_log_recover.c         |  23 +-
 fs/xfs/xfs_mount.c               |  16 +
 fs/xfs/xfs_mount.h               |   2 +
 fs/xfs/xfs_trace.h               |  47 +++
 15 files changed, 836 insertions(+), 79 deletions(-)

-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v3 01/18] xfs: add sparse inode chunk alignment superblock field
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-06 22:40   ` Dave Chinner
  2015-02-06 19:52 ` [PATCH v3 02/18] xfs: use sparse chunk alignment for min. inode allocation requirement Brian Foster
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

Add sb_spinoalignmt to the superblock to specify sparse inode chunk
alignment. This also currently represents the minimum allowable sparse
chunk allocation size.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h | 6 +++---
 fs/xfs/libxfs/xfs_sb.c     | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 8eb7189..051c24d 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -170,7 +170,7 @@ typedef struct xfs_sb {
 	__uint32_t	sb_features_log_incompat;
 
 	__uint32_t	sb_crc;		/* superblock crc */
-	__uint32_t	sb_pad;
+	xfs_extlen_t	sb_spinoalignmt;/* sparse inode chunk alignment */
 
 	xfs_ino_t	sb_pquotino;	/* project quota inode */
 	xfs_lsn_t	sb_lsn;		/* last write sequence */
@@ -256,7 +256,7 @@ typedef struct xfs_dsb {
 	__be32		sb_features_log_incompat;
 
 	__le32		sb_crc;		/* superblock crc */
-	__be32		sb_pad;
+	__be32		sb_spinoalignmt;/* sparse inode chunk alignment */
 
 	__be64		sb_pquotino;	/* project quota inode */
 	__be64		sb_lsn;		/* last write sequence */
@@ -282,7 +282,7 @@ typedef enum {
 	XFS_SBS_LOGSECTLOG, XFS_SBS_LOGSECTSIZE, XFS_SBS_LOGSUNIT,
 	XFS_SBS_FEATURES2, XFS_SBS_BAD_FEATURES2, XFS_SBS_FEATURES_COMPAT,
 	XFS_SBS_FEATURES_RO_COMPAT, XFS_SBS_FEATURES_INCOMPAT,
-	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_PAD,
+	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_SPINOALIGNMT,
 	XFS_SBS_PQUOTINO, XFS_SBS_LSN,
 	XFS_SBS_FIELDCOUNT
 } xfs_sb_field_t;
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index b0a5fe9..eb22b5f 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -382,7 +382,7 @@ __xfs_sb_from_disk(
 				be32_to_cpu(from->sb_features_log_incompat);
 	/* crc is only used on disk, not in memory; just init to 0 here. */
 	to->sb_crc = 0;
-	to->sb_pad = 0;
+	to->sb_spinoalignmt = be32_to_cpu(from->sb_spinoalignmt);
 	to->sb_pquotino = be64_to_cpu(from->sb_pquotino);
 	to->sb_lsn = be64_to_cpu(from->sb_lsn);
 	/* Convert on-disk flags to in-memory flags? */
@@ -524,7 +524,7 @@ xfs_sb_to_disk(
 				cpu_to_be32(from->sb_features_incompat);
 		to->sb_features_log_incompat =
 				cpu_to_be32(from->sb_features_log_incompat);
-		to->sb_pad = 0;
+		to->sb_spinoalignmt = cpu_to_be32(from->sb_spinoalignmt);
 		to->sb_lsn = cpu_to_be64(from->sb_lsn);
 	}
 }
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 02/18] xfs: use sparse chunk alignment for min. inode allocation requirement
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
  2015-02-06 19:52 ` [PATCH v3 01/18] xfs: add sparse inode chunk alignment superblock field Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-06 19:52 ` [PATCH v3 03/18] xfs: sparse inode chunks feature helpers and mount requirements Brian Foster
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

xfs_ialloc_ag_select() iterates through the allocation groups looking
for free inodes or free space to determine whether to allow an inode
allocation to proceed. If no free inodes are available, it assumes that
an AG must have an extent longer than mp->m_ialloc_blks.

Sparse inode chunk support currently allows for allocations smaller than
the traditional inode chunk size specified in m_ialloc_blks. The current
minimum sparse allocation is set in the superblock sb_spinoalignmt field
at mkfs time. Create a new m_ialloc_min_blks field in xfs_mount and use
this to represent the minimum supported allocation size for inode
chunks. Initialize m_ialloc_min_blks at mount time based on whether
sparse inodes are supported.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 2 +-
 fs/xfs/libxfs/xfs_sb.c     | 5 +++++
 fs/xfs/xfs_mount.h         | 2 ++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 116ef1d..6f2153e 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -644,7 +644,7 @@ xfs_ialloc_ag_select(
 		 * if we fail allocation due to alignment issues then it is most
 		 * likely a real ENOSPC condition.
 		 */
-		ineed = mp->m_ialloc_blks;
+		ineed = mp->m_ialloc_min_blks;
 		if (flags && ineed > 1)
 			ineed += xfs_ialloc_cluster_alignment(mp);
 		longest = pag->pagf_longest;
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index eb22b5f..d6b4dbeb 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -697,6 +697,11 @@ xfs_sb_mount_common(
 	mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
 					sbp->sb_inopblock);
 	mp->m_ialloc_blks = mp->m_ialloc_inos >> sbp->sb_inopblog;
+
+	if (sbp->sb_spinoalignmt)
+		mp->m_ialloc_min_blks = sbp->sb_spinoalignmt;
+	else
+		mp->m_ialloc_min_blks = mp->m_ialloc_blks;
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a5b2ff8..97f17d5 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -136,6 +136,8 @@ typedef struct xfs_mount {
 	__uint64_t		m_flags;	/* global mount flags */
 	int			m_ialloc_inos;	/* inodes in inode allocation */
 	int			m_ialloc_blks;	/* blocks in inode allocation */
+	int			m_ialloc_min_blks;/* min blocks in sparse inode
+						   * allocation */
 	int			m_inoalign_mask;/* mask sb_inoalignmt if used */
 	uint			m_qflags;	/* quota status flags */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 03/18] xfs: sparse inode chunks feature helpers and mount requirements
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
  2015-02-06 19:52 ` [PATCH v3 01/18] xfs: add sparse inode chunk alignment superblock field Brian Foster
  2015-02-06 19:52 ` [PATCH v3 02/18] xfs: use sparse chunk alignment for min. inode allocation requirement Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-06 22:54   ` Dave Chinner
  2015-02-06 19:52 ` [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks Brian Foster
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

The sparse inode chunks feature uses the helper function to enable the
allocation of sparse inode chunks. The incompatible feature bit is set
on disk at mkfs time to prevent mount from unsupported kernels.

Also, enforce the inode alignment requirements required for sparse inode
chunks at mount time. When enabled, full inode chunks (and all inode
record) alignment is increased from cluster size to inode chunk size.
Sparse inode alignment must match the cluster size of the fs. Both
superblock alignment fields are set as such by mkfs when sparse inode
support is enabled.

Finally, warn that sparse inode chunks is an experimental feature until
further notice.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h |  7 +++++++
 fs/xfs/libxfs/xfs_sb.c     | 22 ++++++++++++++++++++++
 fs/xfs/xfs_mount.c         | 16 ++++++++++++++++
 3 files changed, 45 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 051c24d..26e5d92 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -519,6 +519,7 @@ xfs_sb_has_ro_compat_feature(
 }
 
 #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
+#define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
 		(XFS_SB_FEAT_INCOMPAT_FTYPE)
 
@@ -568,6 +569,12 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
 }
 
+static inline bool xfs_sb_version_hassparseinodes(struct xfs_sb *sbp)
+{
+	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_SPINODES);
+}
+
 /*
  * end of superblock version macros
  */
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index d6b4dbeb..90d252a 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -182,6 +182,28 @@ xfs_mount_validate_sb(
 			return -EFSCORRUPTED;
 	}
 
+	/*
+	 * Full inode chunks must be aligned to inode chunk size when
+	 * sparse inodes are enabled to support the sparse chunk
+	 * allocation algorithm and prevent overlapping inode records.
+	 */
+	if (xfs_sb_version_hassparseinodes(sbp)) {
+		uint32_t	align;
+
+		xfs_alert(mp,
+		"v5 superblock with sparse inode chunk support detected.\n"
+		"This feature is EXPERIMENTAL. Use at your own risk!");
+
+		align = XFS_INODES_PER_CHUNK * sbp->sb_inodesize
+				>> sbp->sb_blocklog;
+		if (sbp->sb_inoalignmt != align) {
+			xfs_warn(mp, "Invalid inode alignment (%u blks). "
+		"Must match inode chunk size (%u blks) with sparse inodes.",
+				 sbp->sb_inoalignmt, align);
+			return -EINVAL;
+		}
+	}
+
 	if (unlikely(
 	    sbp->sb_logstart == 0 && mp->m_logdev_targp == mp->m_ddev_targp)) {
 		xfs_warn(mp,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 4fa80e6..7ac52a3 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -738,6 +738,22 @@ xfs_mountfs(
 	}
 
 	/*
+	 * If enabled, sparse inode chunk alignment is expected to match the
+	 * cluster size. Full inode chunk alignment must match the chunk size,
+	 * but that is checked on sb read verification...
+	 */
+	if (xfs_sb_version_hassparseinodes(&mp->m_sb) &&
+	    mp->m_sb.sb_spinoalignmt !=
+	    XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size)) {
+		xfs_warn(mp, "Invalid sparse inode chunk alignment (%u blks). "
+			 "Must match cluster size (%llu blks).",
+			 mp->m_sb.sb_spinoalignmt,
+			 XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size));
+		error = -EINVAL;
+		goto out_remove_uuid;
+	}
+
+	/*
 	 * Set inode alignment fields
 	 */
 	xfs_set_inoalignment(mp);
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (2 preceding siblings ...)
  2015-02-06 19:52 ` [PATCH v3 03/18] xfs: sparse inode chunks feature helpers and mount requirements Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-06 23:16   ` Dave Chinner
  2015-02-06 19:52 ` [PATCH v3 05/18] xfs: create macros/helpers for dealing with " Brian Foster
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

The inode btrees track 64 inodes per record, regardless of inode size.
Thus, inode chunks on disk vary in size depending on the size of the
inodes. This creates a contiguous allocation requirement for new inode
chunks that can be difficult to satisfy on an aged and fragmented (free
space) filesystem.

The inode record freecount currently uses 4 bytes on disk to track the
free inode count. With a maximum freecount value of 64, only one byte is
required. Convert the freecount field to a single byte and use two of
the remaining 3 higher order bytes left for the hole mask field. Use
the final leftover byte for the total count field.

The hole mask field tracks holes in the chunks of physical space that
the inode record refers to. This facilitates the sparse allocation of
inode chunks when contiguous chunks are not available and allows the
inode btrees to identify what portions of the chunk contain valid
inodes. The total count field contains the total number of valid inodes
referred to by the record. This can also be deduced from the hole mask.
The count field provides clarity and redundancy for internal record
verification.

Note that both fields are initialized to zero to maintain backwards
compatibility with existing filesystems (e.g., the higher order bytes of
freecount are always 0). Tracking holes means that the hole mask is
initialized to zero and thus remains "valid" in accordance with a
non-sparse inode fs when no sparse chunks are physically allocated.
Update the inode record management functions to handle the new fields
and initialize to zero.

[XXX: The count field breaks backwards compatibility with !sparseinode
fs. Should we reconsider the addition of total count or the idea of
converting back and forth between sparse inode fs with a feature bit?]

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h       |  8 ++++++--
 fs/xfs/libxfs/xfs_ialloc.c       | 12 ++++++++++--
 fs/xfs/libxfs/xfs_ialloc_btree.c |  4 +++-
 3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 26e5d92..6c2f1be 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1295,13 +1295,17 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
  */
 typedef struct xfs_inobt_rec {
 	__be32		ir_startino;	/* starting inode number */
-	__be32		ir_freecount;	/* count of free inodes (set bits) */
+	__be16		ir_holemask;	/* hole mask for sparse chunks */
+	__u8		ir_count;	/* total inode count */
+	__u8		ir_freecount;	/* count of free inodes (set bits) */
 	__be64		ir_free;	/* free inode mask */
 } xfs_inobt_rec_t;
 
 typedef struct xfs_inobt_rec_incore {
 	xfs_agino_t	ir_startino;	/* starting inode number */
-	__int32_t	ir_freecount;	/* count of free inodes (set bits) */
+	__uint16_t	ir_holemask;	/* hole mask for sparse chunks */
+	__uint8_t	ir_count;	/* total inode count */
+	__uint8_t	ir_freecount;	/* count of free inodes (set bits) */
 	xfs_inofree_t	ir_free;	/* free inode mask */
 } xfs_inobt_rec_incore_t;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 6f2153e..32fdb7c 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -65,6 +65,8 @@ xfs_inobt_lookup(
 	int			*stat)	/* success/failure */
 {
 	cur->bc_rec.i.ir_startino = ino;
+	cur->bc_rec.i.ir_holemask = 0;
+	cur->bc_rec.i.ir_count = 0;
 	cur->bc_rec.i.ir_freecount = 0;
 	cur->bc_rec.i.ir_free = 0;
 	return xfs_btree_lookup(cur, dir, stat);
@@ -82,7 +84,9 @@ xfs_inobt_update(
 	union xfs_btree_rec	rec;
 
 	rec.inobt.ir_startino = cpu_to_be32(irec->ir_startino);
-	rec.inobt.ir_freecount = cpu_to_be32(irec->ir_freecount);
+	rec.inobt.ir_holemask = cpu_to_be16(irec->ir_holemask);
+	rec.inobt.ir_count = irec->ir_count;
+	rec.inobt.ir_freecount = irec->ir_freecount;
 	rec.inobt.ir_free = cpu_to_be64(irec->ir_free);
 	return xfs_btree_update(cur, &rec);
 }
@@ -102,7 +106,9 @@ xfs_inobt_get_rec(
 	error = xfs_btree_get_rec(cur, &rec, stat);
 	if (!error && *stat == 1) {
 		irec->ir_startino = be32_to_cpu(rec->inobt.ir_startino);
-		irec->ir_freecount = be32_to_cpu(rec->inobt.ir_freecount);
+		irec->ir_holemask = be16_to_cpu(rec->inobt.ir_holemask);
+		irec->ir_count = rec->inobt.ir_count;
+		irec->ir_freecount = rec->inobt.ir_freecount;
 		irec->ir_free = be64_to_cpu(rec->inobt.ir_free);
 	}
 	return error;
@@ -118,6 +124,8 @@ xfs_inobt_insert_rec(
 	xfs_inofree_t		free,
 	int			*stat)
 {
+	cur->bc_rec.i.ir_holemask = 0;
+	cur->bc_rec.i.ir_count = 0; /* zero for backwards compatibility */
 	cur->bc_rec.i.ir_freecount = freecount;
 	cur->bc_rec.i.ir_free = free;
 	return xfs_btree_insert(cur, stat);
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 964c465..c9378e8 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -167,7 +167,9 @@ xfs_inobt_init_rec_from_cur(
 	union xfs_btree_rec	*rec)
 {
 	rec->inobt.ir_startino = cpu_to_be32(cur->bc_rec.i.ir_startino);
-	rec->inobt.ir_freecount = cpu_to_be32(cur->bc_rec.i.ir_freecount);
+	rec->inobt.ir_holemask = cpu_to_be16(cur->bc_rec.i.ir_holemask);
+	rec->inobt.ir_count = cur->bc_rec.i.ir_count;
+	rec->inobt.ir_freecount = cur->bc_rec.i.ir_freecount;
 	rec->inobt.ir_free = cpu_to_be64(cur->bc_rec.i.ir_free);
 }
 
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 05/18] xfs: create macros/helpers for dealing with sparse inode chunks
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (3 preceding siblings ...)
  2015-02-06 19:52 ` [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-06 19:52 ` [PATCH v3 06/18] xfs: pass inode count through ordered icreate log item Brian Foster
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

Sparse inode chunks allow the traditional inode btree record format to
describe an inode chunk that is not fully allocated and/or contiguous.
The granularity of a sparse chunk is defined by the the 16-bit holemask
field in the inode record. Assuming 64 inodes per full chunk, a single
holemask bit accounts for 4 inodes.

Define a constant for the number of inodes per holemask bit and a helper
function to easily detect sparse inode chunks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 6c2f1be..b2faced 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1285,6 +1285,10 @@ typedef	__uint64_t	xfs_inofree_t;
 #define	XFS_INOBT_ALL_FREE		((xfs_inofree_t)-1)
 #define	XFS_INOBT_MASK(i)		((xfs_inofree_t)1 << (i))
 
+#define XFS_INOBT_HOLEMASK_BITS		(NBBY * sizeof(__uint16_t))
+#define XFS_INODES_PER_HOLEMASK_BIT	\
+	(XFS_INODES_PER_CHUNK / (NBBY * sizeof(__uint16_t)))
+
 static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
 {
 	return ((n >= XFS_INODES_PER_CHUNK ? 0 : XFS_INOBT_MASK(n)) - 1) << i;
@@ -1309,6 +1313,11 @@ typedef struct xfs_inobt_rec_incore {
 	xfs_inofree_t	ir_free;	/* free inode mask */
 } xfs_inobt_rec_incore_t;
 
+static inline bool xfs_inobt_issparse(uint16_t holemask)
+{
+	/* non-zero holemask represents a sparse rec. */
+	return holemask;
+}
 
 /*
  * Key structure
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 06/18] xfs: pass inode count through ordered icreate log item
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (4 preceding siblings ...)
  2015-02-06 19:52 ` [PATCH v3 05/18] xfs: create macros/helpers for dealing with " Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-06 19:52 ` [PATCH v3 07/18] xfs: handle sparse inode chunks in icreate log recovery Brian Foster
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

v5 superblocks use an ordered log item for logging the initialization of
inode chunks. The icreate log item is currently hardcoded to an inode
count of 64 inodes.

The agbno and extent length are used to initialize the inode chunk from
log recovery. While an incorrect inode count does not lead to bad inode
chunk initialization, we should pass the correct inode count such that log
recovery has enough data to perform meaningful validity checks on the
chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 7 ++++---
 fs/xfs/libxfs/xfs_ialloc.h | 2 +-
 fs/xfs/xfs_log_recover.c   | 4 ++--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 32fdb7c..72ade0e 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -228,6 +228,7 @@ xfs_ialloc_inode_init(
 	struct xfs_mount	*mp,
 	struct xfs_trans	*tp,
 	struct list_head	*buffer_list,
+	int			icount,
 	xfs_agnumber_t		agno,
 	xfs_agblock_t		agbno,
 	xfs_agblock_t		length,
@@ -283,7 +284,7 @@ xfs_ialloc_inode_init(
 		 * they track in the AIL as if they were physically logged.
 		 */
 		if (tp)
-			xfs_icreate_log(tp, agno, agbno, mp->m_ialloc_inos,
+			xfs_icreate_log(tp, agno, agbno, icount,
 					mp->m_sb.sb_inodesize, length, gen);
 	} else
 		version = 2;
@@ -502,8 +503,8 @@ xfs_ialloc_ag_alloc(
 	 * rather than a linear progression to prevent the next generation
 	 * number from being easily guessable.
 	 */
-	error = xfs_ialloc_inode_init(args.mp, tp, NULL, agno, args.agbno,
-			args.len, prandom_u32());
+	error = xfs_ialloc_inode_init(args.mp, tp, NULL, newlen, agno,
+			args.agbno, args.len, prandom_u32());
 
 	if (error)
 		return error;
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 100007d..4d4b702 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -156,7 +156,7 @@ int xfs_inobt_get_rec(struct xfs_btree_cur *cur,
  * Inode chunk initialisation routine
  */
 int xfs_ialloc_inode_init(struct xfs_mount *mp, struct xfs_trans *tp,
-			  struct list_head *buffer_list,
+			  struct list_head *buffer_list, int icount,
 			  xfs_agnumber_t agno, xfs_agblock_t agbno,
 			  xfs_agblock_t length, unsigned int gen);
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index a5a945f..ecc73d5 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3091,8 +3091,8 @@ xlog_recover_do_icreate_pass2(
 			XFS_AGB_TO_DADDR(mp, agno, agbno), length, 0))
 		return 0;
 
-	xfs_ialloc_inode_init(mp, NULL, buffer_list, agno, agbno, length,
-					be32_to_cpu(icl->icl_gen));
+	xfs_ialloc_inode_init(mp, NULL, buffer_list, count, agno, agbno, length,
+			      be32_to_cpu(icl->icl_gen));
 	return 0;
 }
 
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 07/18] xfs: handle sparse inode chunks in icreate log recovery
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (5 preceding siblings ...)
  2015-02-06 19:52 ` [PATCH v3 06/18] xfs: pass inode count through ordered icreate log item Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-06 23:19   ` Dave Chinner
  2015-02-06 19:52 ` [PATCH v3 08/18] xfs: helpers to convert holemask to/from generic bitmap Brian Foster
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

Recovery of icreate transactions assumes hardcoded values for the inode
count and chunk length.

Sparse inode chunks are allocated in units of m_ialloc_min_blks. Update
the icreate validity checks to allow for appropriately sized inode
chunks and verify the inode count matches what is expected based on the
extent length rather than assuming a hardcoded count.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index ecc73d5..5a5ee20 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3068,12 +3068,19 @@ xlog_recover_do_icreate_pass2(
 		return -EINVAL;
 	}
 
-	/* existing allocation is fixed value */
-	ASSERT(count == mp->m_ialloc_inos);
-	ASSERT(length == mp->m_ialloc_blks);
-	if (count != mp->m_ialloc_inos ||
-	     length != mp->m_ialloc_blks) {
-		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad count 2");
+	/* inode chunk is either full or sparse */
+	if (length != mp->m_ialloc_blks &&
+	    length != mp->m_ialloc_min_blks) {
+		xfs_warn(log->l_mp,
+			 "%s: unsupported chunk length", __FUNCTION__);
+		return -EINVAL;
+	}
+
+	/* verify inode count is consistent with extent length */
+	if ((count >> mp->m_sb.sb_inopblog) != length) {
+		xfs_warn(log->l_mp,
+			 "%s: inconsistent inode count and chunk length",
+			 __FUNCTION__);
 		return -EINVAL;
 	}
 
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 08/18] xfs: helpers to convert holemask to/from generic bitmap
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (6 preceding siblings ...)
  2015-02-06 19:52 ` [PATCH v3 07/18] xfs: handle sparse inode chunks in icreate log recovery Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-08 23:54   ` Dave Chinner
  2015-02-06 19:52 ` [PATCH v3 09/18] xfs: support min/max agbno args in block allocator Brian Foster
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

The inobt record holemask field is a condensed data type designed to fit
into the existing on-disk record and is zero based (allocated regions
are set to 0, sparse regions are set to 1) to provide backwards
compatibility. This makes the type somewhat complex for use in higher
level inode manipulations such as individual inode allocation, etc.

Rather than foist the complexity of dealing with this field to every bit
of logic that requires inode granular information, create the
xfs_inobt_ialloc_bitmap() helper to convert the holemask to an inode
allocation bitmap. The inode allocation bitmap is inode granularity
similar to the inobt record free mask and indicates which inodes of the
chunk are physically allocated on disk, irrespective of whether the
inode is considered allocated or free by the filesystem. The generic
bitmap type requires the use of generic kernel bitmap APIs.

The bitmap_to_u64() helper is provided to convert the bitmap back to a
native 64-bit type (for native bitwise operations). This is required
because the generic bitmap code represents a bitmap as an array of
unsigned long in a little endian style (though each array value is cpu
order). To ensure compatibility with various word sizes and endianness',
bitmap_to_u64() exports the bitmap to little endian and swaps back to
cpu byte order.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 72ade0e..fc001d9 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -39,6 +39,8 @@
 #include "xfs_icache.h"
 #include "xfs_trace.h"
 
+STATIC void
+xfs_inobt_ialloc_bitmap(unsigned long *, struct xfs_inobt_rec_incore *);
 
 /*
  * Allocation group level functions.
@@ -739,6 +741,73 @@ xfs_ialloc_get_rec(
 	return 0;
 }
 
+static inline uint64_t
+bitmap_to_u64(
+	const unsigned long	*bitmap,
+	int			nbits)
+{
+	__le64			lebitmap;
+
+	ASSERT(nbits <= 64);
+
+	/*
+	 * The bitmap format depends on the native word size. E.g., bit 0 could
+	 * land in the middle of the 64 bits on a big endian 32-bit arch (see
+	 * arch/powerpc/include/asm/bitops.h). To handle this, export the bitmap
+	 * as 64-bit little endian and convert back to native byte order.
+	 */
+	bitmap_copy_le(&lebitmap, bitmap, nbits);
+	return le64_to_cpu(lebitmap);
+}
+
+/*
+ * Convert the inode record holemask to an inode allocation bitmap. The inode
+ * allocation bitmap is inode granularity and specifies whether an inode is
+ * physically allocated on disk (not whether the inode is considered allocated
+ * or free by the fs).
+ *
+ * We have to be careful with regard to byte order and word size since the
+ * generic bitmap code uses primitive types.
+ */
+STATIC void
+xfs_inobt_ialloc_bitmap(
+	unsigned long			*allocbmap,
+	struct xfs_inobt_rec_incore	*rec)
+{
+	int				nextbit;
+	DECLARE_BITMAP(holemask, 16);
+
+	bitmap_zero(allocbmap, 64);
+
+	/*
+	 * bitmaps are represented as an array of unsigned long (each in cpu
+	 * byte order). ir_holemask is only 16 bits, so direct assignment is
+	 * safe.
+	 */
+	ASSERT(sizeof(rec->ir_holemask) <= sizeof(holemask[0]));
+	holemask[0] = rec->ir_holemask;
+
+	/*
+	 * holemask is a 16-bit bitmap and inode records span 64 inodes. Thus
+	 * each holemask bit represents multiple (XFS_INODES_PER_HOLEMASK_BIT)
+	 * inodes. Since holemask bits represent holes, each set bit represents
+	 * a region of the physical chunk that is not tracked by the record.
+	 *
+	 * We want an inode granularity allocation bitmap. We have to invert the
+	 * holemask and set XFS_INODES_PER_HOLEMASK_BIT bits for each set bit.
+	 * We invert and expand simultaneously by walking the unset (zeroed)
+	 * holemask bits. For each zero bit in holemask, set the corresponding
+	 * XFS_INODES_PER_HOLEMASK_BIT bits in the allocation bitmap.
+	 */
+	nextbit = find_first_zero_bit(holemask, 16);
+	while (nextbit < 16) {
+		bitmap_set(allocbmap, nextbit * XFS_INODES_PER_HOLEMASK_BIT,
+			   XFS_INODES_PER_HOLEMASK_BIT);
+
+		nextbit = find_next_zero_bit(holemask, 16, nextbit + 1);
+	}
+}
+
 /*
  * Allocate an inode using the inobt-only algorithm.
  */
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 09/18] xfs: support min/max agbno args in block allocator
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (7 preceding siblings ...)
  2015-02-06 19:52 ` [PATCH v3 08/18] xfs: helpers to convert holemask to/from generic bitmap Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-09  0:01   ` Dave Chinner
  2015-02-06 19:52 ` [PATCH v3 10/18] xfs: allocate sparse inode chunks on full chunk allocation failure Brian Foster
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

The block allocator supports various arguments to tweak block allocation
behavior and set allocation requirements. The sparse inode chunk feature
introduces a new requirement not supported by the current arguments.
Sparse inode allocations must convert or merge into an inode record that
describes a fixed length chunk (64 inodes x inodesize). Full inode chunk
allocations by definition always result in valid inode records. Sparse
chunk allocations are smaller and the associated records can refer to
blocks not owned by the inode chunk. This model can result in invalid
inode records in certain cases.

For example, if a sparse allocation occurs near the start of an AG, the
aligned inode record for that chunk might refer to agbno 0. If an
allocation occurs towards the end of the AG and the AG size is not
aligned, the inode record could refer to blocks beyond the end of the
AG. While neither of these scenarios directly result in corruption, they
both insert invalid inode records and at minimum cause repair to
complain, are unlikely to merge into full chunks over time and set land
mines for other areas of code.

To guarantee sparse inode chunk allocation creates valid inode records,
support the ability to specify an agbno range limit for
XFS_ALLOCTYPE_NEAR_BNO block allocations. The min/max agbno's are
specified in the allocation arguments and limit the block allocation
algorithms to that range. The starting 'agbno' hint is clamped to the
range if the specified agbno is out of range. If no sufficient extent is
available within the range, the allocation fails. For backwards
compatibility, the min/max fields can be initialized to 0 to disable
range limiting (e.g., equivalent to min=0,max=agsize).

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 42 +++++++++++++++++++++++++++++++++++++-----
 fs/xfs/libxfs/xfs_alloc.h |  2 ++
 2 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index a6fbf44..0ddf6c9 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -149,13 +149,27 @@ xfs_alloc_compute_aligned(
 {
 	xfs_agblock_t	bno;
 	xfs_extlen_t	len;
+	xfs_extlen_t	diff;
 
 	/* Trim busy sections out of found extent */
 	xfs_extent_busy_trim(args, foundbno, foundlen, &bno, &len);
 
+	/*
+	 * If we have a largish extent that happens to start before min_agbno,
+	 * see if we can shift it into range...
+	 */
+	if (bno < args->min_agbno && bno + len > args->min_agbno) {
+		diff = args->min_agbno - bno;
+		if (len > diff) {
+			bno += diff;
+			len -= diff;
+		}
+	}
+
 	if (args->alignment > 1 && len >= args->minlen) {
 		xfs_agblock_t	aligned_bno = roundup(bno, args->alignment);
-		xfs_extlen_t	diff = aligned_bno - bno;
+
+		diff = aligned_bno - bno;
 
 		*resbno = aligned_bno;
 		*reslen = diff >= len ? 0 : len - diff;
@@ -790,9 +804,13 @@ xfs_alloc_find_best_extent(
 		 * The good extent is closer than this one.
 		 */
 		if (!dir) {
+			if (*sbnoa > args->max_agbno)
+				goto out_use_good;
 			if (*sbnoa >= args->agbno + gdiff)
 				goto out_use_good;
 		} else {
+			if (*sbnoa < args->min_agbno)
+				goto out_use_good;
 			if (*sbnoa <= args->agbno - gdiff)
 				goto out_use_good;
 		}
@@ -879,6 +897,17 @@ xfs_alloc_ag_vextent_near(
 	dofirst = prandom_u32() & 1;
 #endif
 
+	/* handle unitialized agbno range so caller doesn't have to */
+	if (!args->min_agbno && !args->max_agbno)
+		args->max_agbno = args->mp->m_sb.sb_agblocks - 1;
+	ASSERT(args->min_agbno <= args->max_agbno);
+
+	/* clamp agbno to the range if it's outside */
+	if (args->agbno < args->min_agbno)
+		args->agbno = args->min_agbno;
+	if (args->agbno > args->max_agbno)
+		args->agbno = args->max_agbno;
+
 restart:
 	bno_cur_lt = NULL;
 	bno_cur_gt = NULL;
@@ -971,6 +1000,8 @@ restart:
 						  &ltbnoa, &ltlena);
 			if (ltlena < args->minlen)
 				continue;
+			if (ltbnoa < args->min_agbno || ltbnoa > args->max_agbno)
+				continue;
 			args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
 			xfs_alloc_fix_len(args);
 			ASSERT(args->len >= args->minlen);
@@ -1091,11 +1122,11 @@ restart:
 			XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
 			xfs_alloc_compute_aligned(args, ltbno, ltlen,
 						  &ltbnoa, &ltlena);
-			if (ltlena >= args->minlen)
+			if (ltlena >= args->minlen && ltbnoa >= args->min_agbno)
 				break;
 			if ((error = xfs_btree_decrement(bno_cur_lt, 0, &i)))
 				goto error0;
-			if (!i) {
+			if (!i || ltbnoa < args->min_agbno) {
 				xfs_btree_del_cursor(bno_cur_lt,
 						     XFS_BTREE_NOERROR);
 				bno_cur_lt = NULL;
@@ -1107,11 +1138,11 @@ restart:
 			XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
 			xfs_alloc_compute_aligned(args, gtbno, gtlen,
 						  &gtbnoa, &gtlena);
-			if (gtlena >= args->minlen)
+			if (gtlena >= args->minlen && gtbnoa <= args->max_agbno)
 				break;
 			if ((error = xfs_btree_increment(bno_cur_gt, 0, &i)))
 				goto error0;
-			if (!i) {
+			if (!i || gtbnoa > args->max_agbno) {
 				xfs_btree_del_cursor(bno_cur_gt,
 						     XFS_BTREE_NOERROR);
 				bno_cur_gt = NULL;
@@ -1211,6 +1242,7 @@ restart:
 	ASSERT(ltnew >= ltbno);
 	ASSERT(ltnew + rlen <= ltbnoa + ltlena);
 	ASSERT(ltnew + rlen <= be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length));
+	ASSERT(ltnew >= args->min_agbno && ltnew <= args->max_agbno);
 	args->agbno = ltnew;
 
 	if ((error = xfs_alloc_fixup_trees(cnt_cur, bno_cur_lt, ltbno, ltlen,
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index d1b4b6a..29f27b2 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -112,6 +112,8 @@ typedef struct xfs_alloc_arg {
 	xfs_extlen_t	total;		/* total blocks needed in xaction */
 	xfs_extlen_t	alignment;	/* align answer to multiple of this */
 	xfs_extlen_t	minalignslop;	/* slop for minlen+alignment calcs */
+	xfs_agblock_t	min_agbno;	/* set an agbno range for NEAR allocs */
+	xfs_agblock_t	max_agbno;	/* ... */
 	xfs_extlen_t	len;		/* output: actual size of extent */
 	xfs_alloctype_t	type;		/* allocation type XFS_ALLOCTYPE_... */
 	xfs_alloctype_t	otype;		/* original allocation type */
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 10/18] xfs: allocate sparse inode chunks on full chunk allocation failure
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (8 preceding siblings ...)
  2015-02-06 19:52 ` [PATCH v3 09/18] xfs: support min/max agbno args in block allocator Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-09  1:25   ` Dave Chinner
  2015-02-06 19:52 ` [PATCH v3 11/18] xfs: randomly do sparse inode allocations in DEBUG mode Brian Foster
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode
chunk. If all else fails, reduce the allocation to the minimum sparse
granularity and attempt to allocate a sparse inode chunk.

If sparse chunk allocation succeeds, check whether an inobt record
already exists that can track the chunk. If so, inherit and update the
existing record. Otherwise, insert a new record for the sparse chunk.

Update xfs_inobt_insert_rec() to take the holemask as a parameter and
set the associated field on disk. Create the xfs_inobt_update_insert()
helper to handle the sparse chunk allocation case - insert or update an
existing record depending on whether it already exists.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 397 +++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_trace.h         |  47 ++++++
 2 files changed, 426 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index fc001d9..090d114 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -122,12 +122,16 @@ xfs_inobt_get_rec(
 STATIC int
 xfs_inobt_insert_rec(
 	struct xfs_btree_cur	*cur,
+	__uint16_t		holemask,
+	__uint8_t		count,
 	__int32_t		freecount,
 	xfs_inofree_t		free,
 	int			*stat)
 {
-	cur->bc_rec.i.ir_holemask = 0;
-	cur->bc_rec.i.ir_count = 0; /* zero for backwards compatibility */
+	ASSERT(count == 0 || xfs_sb_version_hassparseinodes(&cur->bc_mp->m_sb));
+
+	cur->bc_rec.i.ir_holemask = holemask;
+	cur->bc_rec.i.ir_count = count;
 	cur->bc_rec.i.ir_freecount = freecount;
 	cur->bc_rec.i.ir_free = free;
 	return xfs_btree_insert(cur, stat);
@@ -151,6 +155,19 @@ xfs_inobt_insert(
 	xfs_agino_t		thisino;
 	int			i;
 	int			error;
+	uint8_t			count;
+
+	/*
+	 * Only set ir_count in the inobt record if the sparse inodes feature is
+	 * enabled. If disabled, we must maintain backwards compatibility with
+	 * the older inobt record format where the current count and holemask
+	 * fields map to the higher order bytes of freecount and thus must be
+	 * zeroed.
+	 */
+	if (xfs_sb_version_hassparseinodes(&mp->m_sb))
+		count = XFS_INODES_PER_CHUNK;
+	else
+		count = 0;
 
 	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
 
@@ -164,7 +181,7 @@ xfs_inobt_insert(
 		}
 		ASSERT(i == 0);
 
-		error = xfs_inobt_insert_rec(cur, XFS_INODES_PER_CHUNK,
+		error = xfs_inobt_insert_rec(cur, 0, count, XFS_INODES_PER_CHUNK,
 					     XFS_INOBT_ALL_FREE, &i);
 		if (error) {
 			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
@@ -174,8 +191,58 @@ xfs_inobt_insert(
 	}
 
 	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+}
 
+/*
+ * Update or insert a new record based on a sparse inode chunk allocation.
+ *
+ * If a record already exists, the new record is an updated version of that
+ * record based on a merge of sparse inode chunks. Update the record in place.
+ * Otherwise, insert a new record in the tree. Note that the record to insert
+ * must already have been aligned and merged, if necessary.
+ */
+STATIC int
+xfs_inobt_update_insert(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	struct xfs_buf			*agbp,
+	struct xfs_inobt_rec_incore	*rec,
+	xfs_btnum_t			btnum)
+{
+	struct xfs_btree_cur		*cur;
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	int				i;
+	int				error;
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
+
+	error = xfs_inobt_lookup(cur, rec->ir_startino, XFS_LOOKUP_EQ, &i);
+	if (error)
+		goto error;
+	if (i == 1) {
+		/* found a record, update it with the merged record */
+		error = xfs_inobt_update(cur, rec);
+		if (error)
+			goto error;
+		goto out;
+	}
+
+	/* no existing record, insert a new one */
+	error = xfs_inobt_insert_rec(cur, rec->ir_holemask, rec->ir_count,
+				     rec->ir_freecount, rec->ir_free, &i);
+	if (error)
+		goto error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+
+out:
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 	return 0;
+
+error:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
 }
 
 /*
@@ -215,8 +282,36 @@ xfs_check_agi_freecount(
 	}
 	return 0;
 }
+
+/*
+ * Verify that an inode record has a valid inode count. With sparse inode chunk
+ * support enabled, the count must be consistent with the holemask. Otherwise,
+ * the count is set to 0.
+ */
+STATIC int
+xfs_inobt_rec_check_count(
+	struct xfs_mount		*mp,
+	struct xfs_inobt_rec_incore	*rec)
+{
+	int	inocount;
+	DECLARE_BITMAP(allocbmap, XFS_INODES_PER_CHUNK);
+
+	if (!xfs_sb_version_hassparseinodes(&mp->m_sb)) {
+		if (rec->ir_count)
+			return -EFSCORRUPTED;
+		return 0;
+	}
+
+	xfs_inobt_ialloc_bitmap(allocbmap, rec);
+	inocount = bitmap_weight(allocbmap, XFS_INODES_PER_CHUNK);
+	if (inocount != rec->ir_count)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
 #else
 #define xfs_check_agi_freecount(cur, agi)	0
+#define xfs_inobt_rec_check_count(mp, rec)	0
 #endif
 
 /*
@@ -358,6 +453,183 @@ xfs_ialloc_inode_init(
 }
 
 /*
+ * Align a record for a recently allocated sparse chunk. The input is a record
+ * that describes the unaligned chunk. The record is aligned such that it is fit
+ * for insertion (or merge) into the on-disk inode btrees.
+ */
+STATIC void
+xfs_align_sparse_rec(
+	struct xfs_mount		*mp,
+	struct xfs_inobt_rec_incore	*rec)
+{
+	xfs_agblock_t			agbno;
+	xfs_agblock_t			mod;
+	int				offset;
+	uint16_t			allocmask;
+
+	agbno = XFS_AGINO_TO_AGBNO(mp, rec->ir_startino);
+	mod = agbno % mp->m_sb.sb_inoalignmt;
+	if (!mod)
+		return;
+
+	/* calculate the inode offset and align startino */
+	offset = mod << mp->m_sb.sb_inopblog;
+	rec->ir_startino -= offset;
+
+	/*
+	 * Since startino has been aligned down, we have to left shift
+	 * ir_holemask such that it continues to represent the same physical
+	 * inodes as the unaligned record. The unaligned record by definition
+	 * tracks the allocated inodes with the lowest order bits.
+	 *
+	 * ir_holemask is inverted before the shift such that set bits represent
+	 * allocated inodes. This makes it safe for the bit-shift to introduce
+	 * zeroes in the lower order bits without corrupting the record.
+	 *
+	 * Note that no change is required for ir_count, ir_freecount or
+	 * ir_free. The count values are not affected by alignment and ir_free
+	 * is initialized to 1s for all inodes, sparse or otherwise.
+	 */
+	allocmask = ~rec->ir_holemask;
+	allocmask <<= offset / XFS_INODES_PER_HOLEMASK_BIT;
+	rec->ir_holemask = ~allocmask;
+}
+
+/*
+ * Determine whether two sparse inode records can be merged. The inode ranges
+ * must match and there must be no allocation overlap between the records.
+ */
+STATIC bool
+__xfs_inobt_can_merge(
+	struct xfs_inobt_rec_incore	*trec,	/* tgt record */
+	struct xfs_inobt_rec_incore	*srec)	/* src record */
+{
+	DECLARE_BITMAP(talloc, 64);
+	DECLARE_BITMAP(salloc, 64);
+	DECLARE_BITMAP(tmp, 64);
+
+	/* records must cover the same inode range */
+	if (trec->ir_startino != srec->ir_startino)
+		return false;
+
+	/* both records must be sparse */
+	if (!xfs_inobt_issparse(trec->ir_holemask) ||
+	    !xfs_inobt_issparse(srec->ir_holemask))
+		return false;
+
+	/* can't exceed capacity of a full record */
+	if (trec->ir_count + srec->ir_count > XFS_INODES_PER_CHUNK)
+		return false;
+
+	/* verify there is no allocation overlap */
+	xfs_inobt_ialloc_bitmap(talloc, trec);
+	xfs_inobt_ialloc_bitmap(salloc, srec);
+
+	bitmap_and(tmp, salloc, talloc, 64);
+	if (!bitmap_empty(tmp, 64))
+		return false;
+
+	return true;
+}
+
+/*
+ * Merge two sparse inode records. The caller must call __xfs_inobt_can_merge()
+ * to ensure the merge is valid.
+ */
+STATIC void
+__xfs_inobt_rec_merge(
+	struct xfs_inobt_rec_incore	*trec,	/* target */
+	struct xfs_inobt_rec_incore	*srec)	/* src */
+{
+	ASSERT(trec->ir_startino == srec->ir_startino);
+
+	/* combine the counts */
+	trec->ir_count += srec->ir_count;
+	trec->ir_freecount += srec->ir_freecount;
+
+	/* merge the holemask */
+	trec->ir_holemask &= srec->ir_holemask;
+
+	/* merge the free mask */
+	trec->ir_free &= srec->ir_free;
+}
+
+/*
+ * Determine whether a newly allocated sparse inode chunk record overlaps with
+ * an existing sparse record in the inobt. When sparse inode chunks are enabled,
+ * all inode chunk alignment is increased from cluster size to physical inode
+ * chunk size. This means that the smallest, non-zero gap between two inode
+ * chunks is at least one full inode chunk. When a sparse inode chunk is
+ * allocated, the containing record is also aligned in this manner such that
+ * future sparse allocations within that same range all align to the same record
+ * startino. This alignment policy supports the ability to merge sparse chunks
+ * into complete chunks over time.
+ *
+ * Given an newly allocated/aligned sparse inode record, look up whether a
+ * sparse record already exists at this startino. If so, merge the two records
+ * and return the merged record in nrec.
+ *
+ * An error is returned if records overlap but a merge is not possible. Given
+ * the alignment constraints described above, this should never happen and thus
+ * is treated as fs corruption.
+ */
+STATIC int
+xfs_inobt_rec_merge(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	struct xfs_buf			*agbp,
+	xfs_btnum_t			btnum,
+	struct xfs_inobt_rec_incore	*nrec)	/* in/out: new/merged rec. */
+{
+	struct xfs_btree_cur		*cur;
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
+	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
+	int				error;
+	int				i;
+	struct xfs_inobt_rec_incore	rec;
+
+	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
+
+	/* the new record is pre-aligned so we know where to look */
+	error = xfs_inobt_lookup(cur, nrec->ir_startino, XFS_LOOKUP_EQ, &i);
+	if (error)
+		goto error;
+	/* if nothing there, we're done */
+	if (i == 0)
+		goto out;
+
+	error = xfs_inobt_get_rec(cur, &rec, &i);
+	if (error)
+		goto error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
+	ASSERT(rec.ir_startino == nrec->ir_startino);
+
+	/*
+	 * This should never happen. If we have coexisting records that cannot
+	 * merge, something is seriously wrong.
+	 */
+	if (!__xfs_inobt_can_merge(nrec, &rec)) {
+		error = -EFSCORRUPTED;
+		goto error;
+	}
+
+	trace_xfs_irec_merge_pre(mp, agno, rec.ir_startino, rec.ir_holemask,
+				 nrec->ir_startino, nrec->ir_holemask);
+
+	__xfs_inobt_rec_merge(nrec, &rec);
+
+	trace_xfs_irec_merge_post(mp, agno, nrec->ir_startino,
+				  nrec->ir_holemask);
+
+out:
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+error:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/*
  * Allocate new inodes in the allocation group specified by agbp.
  * Return 0 for success, else error code.
  */
@@ -375,6 +647,9 @@ xfs_ialloc_ag_alloc(
 	xfs_agino_t	newlen;		/* new number of inodes */
 	int		isaligned = 0;	/* inode allocation at stripe unit */
 					/* boundary */
+	uint16_t	allocmask = (uint16_t) -1; /* init. to full chunk */
+	struct xfs_inobt_rec_incore rec;
+
 	struct xfs_perag *pag;
 
 	memset(&args, 0, sizeof(args));
@@ -490,6 +765,45 @@ xfs_ialloc_ag_alloc(
 			return error;
 	}
 
+	/*
+	 * Finally, try a sparse allocation if the filesystem supports it and
+	 * the sparse allocation length is smaller than a full chunk.
+	 */
+	if (xfs_sb_version_hassparseinodes(&args.mp->m_sb) &&
+	    args.mp->m_ialloc_min_blks < args.mp->m_ialloc_blks &&
+	    args.fsbno == NULLFSBLOCK) {
+		args.type = XFS_ALLOCTYPE_NEAR_BNO;
+		args.agbno = be32_to_cpu(agi->agi_root);
+		args.fsbno = XFS_AGB_TO_FSB(args.mp, agno, args.agbno);
+		args.alignment = args.mp->m_sb.sb_spinoalignmt;
+		args.prod = 1;
+
+		args.minlen = args.mp->m_ialloc_min_blks;
+		args.maxlen = args.minlen;
+
+		/*
+		 * The inode record will be aligned to full chunk size. We must
+		 * prevent sparse allocation from AG boundaries that result in
+		 * invalid inode records, such as records that start at agbno 0
+		 * or extend beyond the AG.
+		 *
+		 * Set min agbno to the first aligned, non-zero agbno and max to
+		 * the last aligned agbno that is at least one full chunk from
+		 * the end of the AG.
+		 */
+		args.min_agbno = args.mp->m_sb.sb_inoalignmt;
+		args.max_agbno = round_down(args.mp->m_sb.sb_agblocks,
+					    args.mp->m_sb.sb_inoalignmt) -
+				 args.mp->m_ialloc_blks;
+
+		error = xfs_alloc_vextent(&args);
+		if (error)
+			return error;
+
+		newlen = args.len << args.mp->m_sb.sb_inopblog;
+		allocmask = (1 << (newlen / XFS_INODES_PER_HOLEMASK_BIT)) - 1;
+	}
+
 	if (args.fsbno == NULLFSBLOCK) {
 		*alloc = 0;
 		return 0;
@@ -514,6 +828,65 @@ xfs_ialloc_ag_alloc(
 	 * Convert the results.
 	 */
 	newino = XFS_OFFBNO_TO_AGINO(args.mp, args.agbno, 0);
+
+	if (xfs_inobt_issparse(~allocmask)) {
+		/*
+		 * We've allocated a sparse chunk...
+		 */
+		rec.ir_startino = newino;
+		rec.ir_holemask = ~allocmask;
+		rec.ir_count = newlen;
+		rec.ir_freecount = newlen;
+		rec.ir_free = XFS_INOBT_ALL_FREE;
+
+		/* align record and update newino for agi_newino */
+		xfs_align_sparse_rec(args.mp, &rec);
+		newino = rec.ir_startino;
+
+		error = xfs_inobt_rec_merge(args.mp, tp, agbp, XFS_BTNUM_INO,
+					    &rec);
+		if (!error)
+			error = xfs_inobt_rec_check_count(args.mp, &rec);
+		if (error == -EFSCORRUPTED) {
+			xfs_alert(args.mp,
+	"invalid sparse inode record: ino 0x%llx holemask 0x%x count %u",
+				  XFS_AGINO_TO_INO(args.mp, agno,
+						   rec.ir_startino),
+				  rec.ir_holemask, rec.ir_count);
+			xfs_force_shutdown(args.mp, SHUTDOWN_CORRUPT_INCORE);
+		}
+		if (error)
+			return error;
+
+		error = xfs_inobt_update_insert(args.mp, tp, agbp, &rec,
+						XFS_BTNUM_INO);
+		if (error)
+			return error;
+
+		if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
+			error = xfs_inobt_update_insert(args.mp, tp, agbp, &rec,
+							XFS_BTNUM_FINO);
+			if (error)
+				return error;
+		}
+	} else {
+		/* full chunk - insert new records to both btrees */
+		error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
+					 XFS_BTNUM_INO);
+		if (error)
+			return error;
+
+		if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
+			error = xfs_inobt_insert(args.mp, tp, agbp, newino,
+						 newlen, XFS_BTNUM_FINO);
+			if (error)
+				return error;
+		}
+	}
+
+	/*
+	 * Update AGI counts and newino.
+	 */
 	be32_add_cpu(&agi->agi_count, newlen);
 	be32_add_cpu(&agi->agi_freecount, newlen);
 	pag = xfs_perag_get(args.mp, agno);
@@ -522,20 +895,6 @@ xfs_ialloc_ag_alloc(
 	agi->agi_newino = cpu_to_be32(newino);
 
 	/*
-	 * Insert records describing the new inode chunk into the btrees.
-	 */
-	error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
-				 XFS_BTNUM_INO);
-	if (error)
-		return error;
-
-	if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
-		error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
-					 XFS_BTNUM_FINO);
-		if (error)
-			return error;
-	}
-	/*
 	 * Log allocation group header fields
 	 */
 	xfs_ialloc_log_agi(tp, agbp,
@@ -1672,7 +2031,9 @@ xfs_difree_finobt(
 		 */
 		XFS_WANT_CORRUPTED_GOTO(ibtrec->ir_freecount == 1, error);
 
-		error = xfs_inobt_insert_rec(cur, ibtrec->ir_freecount,
+		error = xfs_inobt_insert_rec(cur, ibtrec->ir_holemask,
+					     ibtrec->ir_count,
+					     ibtrec->ir_freecount,
 					     ibtrec->ir_free, &i);
 		if (error)
 			goto error;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 51372e3..12a4bf4 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -734,6 +734,53 @@ TRACE_EVENT(xfs_iomap_prealloc_size,
 		  __entry->blocks, __entry->shift, __entry->writeio_blocks)
 )
 
+TRACE_EVENT(xfs_irec_merge_pre,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agino_t agino,
+		 uint16_t holemask, xfs_agino_t nagino, uint16_t nholemask),
+	TP_ARGS(mp, agno, agino, holemask, nagino, nholemask),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, agino)
+		__field(uint16_t, holemask)
+		__field(xfs_agino_t, nagino)
+		__field(uint16_t, nholemask)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agino = agino;
+		__entry->holemask = holemask;
+		__entry->nagino = nagino;
+		__entry->nholemask = holemask;
+	),
+	TP_printk("dev %d:%d agno %d inobt (%u:0x%x) new (%u:0x%x)",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
+		  __entry->agino, __entry->holemask, __entry->nagino,
+		  __entry->nholemask)
+)
+
+TRACE_EVENT(xfs_irec_merge_post,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agino_t agino,
+		 uint16_t holemask),
+	TP_ARGS(mp, agno, agino, holemask),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, agino)
+		__field(uint16_t, holemask)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agino = agino;
+		__entry->holemask = holemask;
+	),
+	TP_printk("dev %d:%d agno %d inobt (%u:0x%x)", MAJOR(__entry->dev),
+		  MINOR(__entry->dev), __entry->agno, __entry->agino,
+		  __entry->holemask)
+)
+
 #define DEFINE_IREF_EVENT(name) \
 DEFINE_EVENT(xfs_iref_class, name, \
 	TP_PROTO(struct xfs_inode *ip, unsigned long caller_ip), \
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 11/18] xfs: randomly do sparse inode allocations in DEBUG mode
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (9 preceding siblings ...)
  2015-02-06 19:52 ` [PATCH v3 10/18] xfs: allocate sparse inode chunks on full chunk allocation failure Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-08 23:02   ` Dave Chinner
  2015-02-06 19:52 ` [PATCH v3 12/18] xfs: filter out sparse regions from individual inode allocation Brian Foster
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

Sparse inode allocations generally only occur when full inode chunk
allocation fails. This requires some level of filesystem space usage and
fragmentation.

For filesystems formatted with sparse inode chunks enabled, do random
sparse inode chunk allocs when compiled in DEBUG mode to increase test
coverage.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 090d114..3e5d3eb 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -652,9 +652,18 @@ xfs_ialloc_ag_alloc(
 
 	struct xfs_perag *pag;
 
+#ifdef DEBUG
+	int		do_sparse = 0;
+
+	/* randomly do sparse inode allocations */
+	if (xfs_sb_version_hassparseinodes(&tp->t_mountp->m_sb))
+		do_sparse = prandom_u32() & 1;
+#endif
+
 	memset(&args, 0, sizeof(args));
 	args.tp = tp;
 	args.mp = tp->t_mountp;
+	args.fsbno = NULLFSBLOCK;
 
 	/*
 	 * Locking will ensure that we don't have two callers in here
@@ -675,6 +684,10 @@ xfs_ialloc_ag_alloc(
 	agno = be32_to_cpu(agi->agi_seqno);
 	args.agbno = XFS_AGINO_TO_AGBNO(args.mp, newino) +
 		     args.mp->m_ialloc_blks;
+#ifdef DEBUG
+	if (do_sparse)
+		goto sparse_alloc;
+#endif
 	if (likely(newino != NULLAGINO &&
 		  (args.agbno < be32_to_cpu(agi->agi_length)))) {
 		args.fsbno = XFS_AGB_TO_FSB(args.mp, agno, args.agbno);
@@ -713,8 +726,7 @@ xfs_ialloc_ag_alloc(
 		 * subsequent requests.
 		 */
 		args.minalignslop = 0;
-	} else
-		args.fsbno = NULLFSBLOCK;
+	}
 
 	if (unlikely(args.fsbno == NULLFSBLOCK)) {
 		/*
@@ -769,6 +781,9 @@ xfs_ialloc_ag_alloc(
 	 * Finally, try a sparse allocation if the filesystem supports it and
 	 * the sparse allocation length is smaller than a full chunk.
 	 */
+#ifdef DEBUG
+sparse_alloc:
+#endif
 	if (xfs_sb_version_hassparseinodes(&args.mp->m_sb) &&
 	    args.mp->m_ialloc_min_blks < args.mp->m_ialloc_blks &&
 	    args.fsbno == NULLFSBLOCK) {
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 12/18] xfs: filter out sparse regions from individual inode allocation
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (10 preceding siblings ...)
  2015-02-06 19:52 ` [PATCH v3 11/18] xfs: randomly do sparse inode allocations in DEBUG mode Brian Foster
@ 2015-02-06 19:52 ` Brian Foster
  2015-02-08 22:49   ` Dave Chinner
  2015-02-06 19:53 ` [PATCH v3 13/18] xfs: update free inode record logic to support sparse inode records Brian Foster
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:52 UTC (permalink / raw)
  To: xfs

Inode allocation from an existing record with free inodes traditionally
selects the first inode available according to the ir_free mask. With
sparse inode chunks, the ir_free mask could refer to an unallocated
region. We must mask the unallocated regions out of ir_free before using
it to select a free inode in the chunk.

Create the xfs_inobt_first_free_inode() helper to find the first free
inode available of the allocated regions of the inode chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 3e5d3eb..2691d3e 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1183,6 +1183,30 @@ xfs_inobt_ialloc_bitmap(
 }
 
 /*
+ * Return the offset of the first free inode in the record. If the inode chunk
+ * is sparsely allocated, we convert the record holemask to inode granularity
+ * and mask off the unallocated regions from the inode free mask.
+ */
+STATIC int
+xfs_inobt_first_free_inode(
+	struct xfs_inobt_rec_incore	*rec)
+{
+	xfs_inofree_t	realfree;
+	DECLARE_BITMAP(alloc, 64);
+
+	/* if there are no holes, return the first available offset */
+	if (!xfs_inobt_issparse(rec->ir_holemask))
+		return xfs_lowbit64(rec->ir_free);
+
+	xfs_inobt_ialloc_bitmap(alloc, rec);
+	realfree = bitmap_to_u64(alloc, 64);
+
+	realfree &= rec->ir_free;
+
+	return xfs_lowbit64(realfree);
+}
+
+/*
  * Allocate an inode using the inobt-only algorithm.
  */
 STATIC int
@@ -1412,7 +1436,7 @@ newino:
 	}
 
 alloc_inode:
-	offset = xfs_lowbit64(rec.ir_free);
+	offset = xfs_inobt_first_free_inode(&rec);
 	ASSERT(offset >= 0);
 	ASSERT(offset < XFS_INODES_PER_CHUNK);
 	ASSERT((XFS_AGINO_TO_OFFSET(mp, rec.ir_startino) %
@@ -1661,7 +1685,7 @@ xfs_dialloc_ag(
 	if (error)
 		goto error_cur;
 
-	offset = xfs_lowbit64(rec.ir_free);
+	offset = xfs_inobt_first_free_inode(&rec);
 	ASSERT(offset >= 0);
 	ASSERT(offset < XFS_INODES_PER_CHUNK);
 	ASSERT((XFS_AGINO_TO_OFFSET(mp, rec.ir_startino) %
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 13/18] xfs: update free inode record logic to support sparse inode records
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (11 preceding siblings ...)
  2015-02-06 19:52 ` [PATCH v3 12/18] xfs: filter out sparse regions from individual inode allocation Brian Foster
@ 2015-02-06 19:53 ` Brian Foster
  2015-02-08 22:33   ` Dave Chinner
  2015-02-06 19:53 ` [PATCH v3 14/18] xfs: only free allocated regions of inode chunks Brian Foster
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:53 UTC (permalink / raw)
  To: xfs

xfs_difree_inobt() uses logic in a couple places that assume inobt
records refer to fully allocated chunks. Specifically, the use of
mp->m_ialloc_inos can cause problems for inode chunks that are sparsely
allocated. Sparse inode chunks can, by definition, define a smaller
number of inodes than a full inode chunk.

Fix the logic that determines whether an inode record should be removed
from the inobt to use the ir_free mask rather than ir_freecount.

Fix the agi counters modification to use ir_freecount to add the actual
number of inodes freed rather than assuming a full inode chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 2691d3e..e50e57a 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1975,7 +1975,7 @@ xfs_difree_inobt(
 	 * When an inode cluster is free, it becomes eligible for removal
 	 */
 	if (!(mp->m_flags & XFS_MOUNT_IKEEP) &&
-	    (rec.ir_freecount == mp->m_ialloc_inos)) {
+	    (rec.ir_free == XFS_INOBT_ALL_FREE)) {
 
 		*deleted = 1;
 		*first_ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino);
@@ -1985,7 +1985,7 @@ xfs_difree_inobt(
 		 * AGI and Superblock inode counts, and mark the disk space
 		 * to be freed when the transaction is committed.
 		 */
-		ilen = mp->m_ialloc_inos;
+		ilen = rec.ir_freecount;
 		be32_add_cpu(&agi->agi_count, -ilen);
 		be32_add_cpu(&agi->agi_freecount, -(ilen - 1));
 		xfs_ialloc_log_agi(tp, agbp, XFS_AGI_COUNT | XFS_AGI_FREECOUNT);
@@ -2108,7 +2108,7 @@ xfs_difree_finobt(
 	 * keeping inode chunks permanently on disk, remove the record.
 	 * Otherwise, update the record with the new information.
 	 */
-	if (rec.ir_freecount == mp->m_ialloc_inos &&
+	if (rec.ir_free == XFS_INOBT_ALL_FREE &&
 	    !(mp->m_flags & XFS_MOUNT_IKEEP)) {
 		error = xfs_btree_delete(cur, &i);
 		if (error)
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 14/18] xfs: only free allocated regions of inode chunks
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (12 preceding siblings ...)
  2015-02-06 19:53 ` [PATCH v3 13/18] xfs: update free inode record logic to support sparse inode records Brian Foster
@ 2015-02-06 19:53 ` Brian Foster
  2015-02-06 19:53 ` [PATCH v3 15/18] xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() Brian Foster
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:53 UTC (permalink / raw)
  To: xfs

An inode chunk is currently added to the transaction free list based on
a simple fsb conversion and hardcoded chunk length. The nature of sparse
chunks is such that the physical chunk of inodes on disk may consist of
one or more discontiguous parts. Blocks that reside in the holes of the
inode chunk are not inodes and could be allocated to any other use or
not allocated at all.

Refactor the existing xfs_bmap_add_free() call into the
xfs_difree_inode_chunk() helper. The new helper uses the existing
calculation if a chunk is not sparse. Otherwise, use the inobt record
holemask to free the contiguous regions of the chunk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 81 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index e50e57a..2bac038 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1910,6 +1910,83 @@ out_error:
 	return error;
 }
 
+/*
+ * Free the blocks of an inode chunk. We must consider that the inode chunk
+ * might be sparse and only free the regions that are allocated as part of the
+ * chunk.
+ */
+STATIC void
+xfs_difree_inode_chunk(
+	struct xfs_mount		*mp,
+	xfs_agnumber_t			agno,
+	struct xfs_inobt_rec_incore	*rec,
+	struct xfs_bmap_free		*flist)
+{
+	xfs_agblock_t	sagbno = XFS_AGINO_TO_AGBNO(mp, rec->ir_startino);
+	int		startidx, endidx;
+	int		nextbit;
+	xfs_agblock_t	agbno;
+	int		contigblk;
+	DECLARE_BITMAP(holemask, XFS_INOBT_HOLEMASK_BITS);
+
+	if (!xfs_inobt_issparse(rec->ir_holemask)) {
+		/* not sparse, calculate extent info directly */
+		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp, agno,
+				  XFS_AGINO_TO_AGBNO(mp, rec->ir_startino)),
+				  mp->m_ialloc_blks, flist, mp);
+		return;
+	}
+
+	/* holemask is only 16-bits (fits in an unsigned long) */
+	ASSERT(sizeof(rec->ir_holemask) <= sizeof(holemask[0]));
+	holemask[0] = rec->ir_holemask;
+
+	/*
+	 * Find contiguous ranges of zeroes (i.e., allocated regions) in the
+	 * holemask and convert the start/end index of each range to an extent.
+	 * We start with the start and end index both pointing at the first 0 in
+	 * the mask.
+	 */
+	startidx = endidx = find_first_zero_bit(holemask,
+						XFS_INOBT_HOLEMASK_BITS);
+	nextbit = startidx + 1;
+	while (startidx < XFS_INOBT_HOLEMASK_BITS) {
+		nextbit = find_next_zero_bit(holemask, XFS_INOBT_HOLEMASK_BITS,
+					     nextbit);
+		/*
+		 * If the next zero bit is contiguous, update the end index of
+		 * the current range and continue.
+		 */
+		if (nextbit != XFS_INOBT_HOLEMASK_BITS &&
+		    nextbit == endidx + 1) {
+			endidx = nextbit;
+			goto next;
+		}
+
+		/*
+		 * nextbit is not contiguous with the current end index. Convert
+		 * the current start/end to an extent and add it to the free
+		 * list.
+		 */
+		agbno = sagbno + (startidx * XFS_INODES_PER_HOLEMASK_BIT) /
+				  mp->m_sb.sb_inopblock;
+		contigblk = ((endidx - startidx + 1) *
+			     XFS_INODES_PER_HOLEMASK_BIT) /
+			    mp->m_sb.sb_inopblock;
+
+		ASSERT(agbno % mp->m_sb.sb_spinoalignmt == 0);
+		ASSERT(contigblk % mp->m_sb.sb_spinoalignmt == 0);
+		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp, agno, agbno), contigblk,
+				  flist, mp);
+
+		/* reset range to current bit and carry on... */
+		startidx = endidx = nextbit;
+
+next:
+		nextbit++;
+	}
+}
+
 STATIC int
 xfs_difree_inobt(
 	struct xfs_mount		*mp,
@@ -2001,9 +2078,7 @@ xfs_difree_inobt(
 			goto error0;
 		}
 
-		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp, agno,
-				  XFS_AGINO_TO_AGBNO(mp, rec.ir_startino)),
-				  mp->m_ialloc_blks, flist, mp);
+		xfs_difree_inode_chunk(mp, agno, &rec, flist);
 	} else {
 		*deleted = 0;
 
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 15/18] xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (13 preceding siblings ...)
  2015-02-06 19:53 ` [PATCH v3 14/18] xfs: only free allocated regions of inode chunks Brian Foster
@ 2015-02-06 19:53 ` Brian Foster
  2015-02-06 19:53 ` [PATCH v3 16/18] xfs: use actual inode count for sparse records in bulkstat/inumbers Brian Foster
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:53 UTC (permalink / raw)
  To: xfs

xfs_ifree_cluster() is called to mark all in-memory inodes and inode
buffers as stale. This occurs after we've removed the inobt records and
dropped any references of inobt data. xfs_ifree_cluster() uses the
starting inode number to walk the namespace of inodes expected for a
single chunk a cluster buffer at a time. The cluster buffer disk
addresses are calculated by decoding the sequential inode numbers
expected from the chunk.

The problem with this approach is that if the inode chunk being removed
is a sparse chunk, not all of the buffer addresses that are calculated
as part of this sequence may be inode clusters. Attempting to acquire
the buffer based on expected inode characterstics (i.e., cluster length)
can lead to errors and is generally incorrect.

We already use a couple variables to carry requisite state from
xfs_difree() to xfs_ifree_cluster(). Rather than add a third, define a
new internal structure to carry the existing parameters through these
functions. Add an alloc field that represents the physical allocation
bitmap of inodes in the chunk being removed. Modify xfs_ifree_cluster()
to check each inode against the bitmap and skip the clusters that were
never allocated as real inodes on disk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 18 +++++++++---------
 fs/xfs/libxfs/xfs_ialloc.h | 10 ++++++++--
 fs/xfs/xfs_inode.c         | 28 ++++++++++++++++++++--------
 3 files changed, 37 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 2bac038..c104ee5 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1994,8 +1994,7 @@ xfs_difree_inobt(
 	struct xfs_buf			*agbp,
 	xfs_agino_t			agino,
 	struct xfs_bmap_free		*flist,
-	int				*deleted,
-	xfs_ino_t			*first_ino,
+	struct xfs_icluster		*xic,
 	struct xfs_inobt_rec_incore	*orec)
 {
 	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
@@ -2053,9 +2052,12 @@ xfs_difree_inobt(
 	 */
 	if (!(mp->m_flags & XFS_MOUNT_IKEEP) &&
 	    (rec.ir_free == XFS_INOBT_ALL_FREE)) {
+		DECLARE_BITMAP(alloc, 64);
 
-		*deleted = 1;
-		*first_ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino);
+		xic->deleted = 1;
+		xic->first_ino = XFS_AGINO_TO_INO(mp, agno, rec.ir_startino);
+		xfs_inobt_ialloc_bitmap(alloc, &rec);
+		xic->alloc = bitmap_to_u64(alloc, 64);
 
 		/*
 		 * Remove the inode cluster from the AGI B+Tree, adjust the
@@ -2080,7 +2082,7 @@ xfs_difree_inobt(
 
 		xfs_difree_inode_chunk(mp, agno, &rec, flist);
 	} else {
-		*deleted = 0;
+		xic->deleted = 0;
 
 		error = xfs_inobt_update(cur, &rec);
 		if (error) {
@@ -2219,8 +2221,7 @@ xfs_difree(
 	struct xfs_trans	*tp,		/* transaction pointer */
 	xfs_ino_t		inode,		/* inode to be freed */
 	struct xfs_bmap_free	*flist,		/* extents to free */
-	int			*deleted,/* set if inode cluster was deleted */
-	xfs_ino_t		*first_ino)/* first inode in deleted cluster */
+	struct xfs_icluster	*xic)	/* cluster info if deleted */
 {
 	/* REFERENCED */
 	xfs_agblock_t		agbno;	/* block number containing inode */
@@ -2271,8 +2272,7 @@ xfs_difree(
 	/*
 	 * Fix up the inode allocation btree.
 	 */
-	error = xfs_difree_inobt(mp, tp, agbp, agino, flist, deleted, first_ino,
-				 &rec);
+	error = xfs_difree_inobt(mp, tp, agbp, agino, flist, xic, &rec);
 	if (error)
 		goto error0;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 4d4b702..12401fe 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -28,6 +28,13 @@ struct xfs_btree_cur;
 /* Move inodes in clusters of this size */
 #define	XFS_INODE_BIG_CLUSTER_SIZE	8192
 
+struct xfs_icluster {
+	bool		deleted;	/* record is deleted */
+	xfs_ino_t	first_ino;	/* first inode number */
+	uint64_t	alloc;		/* inode phys. allocation bitmap for
+					 * sparse chunks */
+};
+
 /* Calculate and return the number of filesystem blocks per inode cluster */
 static inline int
 xfs_icluster_size_fsb(
@@ -90,8 +97,7 @@ xfs_difree(
 	struct xfs_trans *tp,		/* transaction pointer */
 	xfs_ino_t	inode,		/* inode to be freed */
 	struct xfs_bmap_free *flist,	/* extents to free */
-	int		*deleted,	/* set if inode cluster was deleted */
-	xfs_ino_t	*first_ino);	/* first inode in deleted cluster */
+	struct xfs_icluster *ifree);	/* cluster info if deleted */
 
 /*
  * Return the location of the inode in imap, for mapping it into a buffer.
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index daafa1f..a054110 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2182,9 +2182,9 @@ xfs_iunlink_remove(
  */
 STATIC int
 xfs_ifree_cluster(
-	xfs_inode_t	*free_ip,
-	xfs_trans_t	*tp,
-	xfs_ino_t	inum)
+	xfs_inode_t		*free_ip,
+	xfs_trans_t		*tp,
+	struct xfs_icluster	*xic)
 {
 	xfs_mount_t		*mp = free_ip->i_mount;
 	int			blks_per_cluster;
@@ -2197,13 +2197,26 @@ xfs_ifree_cluster(
 	xfs_inode_log_item_t	*iip;
 	xfs_log_item_t		*lip;
 	struct xfs_perag	*pag;
+	xfs_ino_t		inum;
 
+	inum = xic->first_ino;
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum));
 	blks_per_cluster = xfs_icluster_size_fsb(mp);
 	inodes_per_cluster = blks_per_cluster << mp->m_sb.sb_inopblog;
 	nbufs = mp->m_ialloc_blks / blks_per_cluster;
 
 	for (j = 0; j < nbufs; j++, inum += inodes_per_cluster) {
+		/*
+		 * The allocation bitmap tells us which inodes of the chunk were
+		 * physically allocated. Skip the cluster if an inode falls into
+		 * a sparse region.
+		 */
+		if ((xic->alloc & XFS_INOBT_MASK(inum - xic->first_ino)) == 0) {
+			ASSERT(((inum - xic->first_ino) %
+				inodes_per_cluster) == 0);
+			continue;
+		}
+
 		blkno = XFS_AGB_TO_DADDR(mp, XFS_INO_TO_AGNO(mp, inum),
 					 XFS_INO_TO_AGBNO(mp, inum));
 
@@ -2361,8 +2374,7 @@ xfs_ifree(
 	xfs_bmap_free_t	*flist)
 {
 	int			error;
-	int			delete;
-	xfs_ino_t		first_ino;
+	struct xfs_icluster	xic = { 0 };
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(ip->i_d.di_nlink == 0);
@@ -2378,7 +2390,7 @@ xfs_ifree(
 	if (error)
 		return error;
 
-	error = xfs_difree(tp, ip->i_ino, flist, &delete, &first_ino);
+	error = xfs_difree(tp, ip->i_ino, flist, &xic);
 	if (error)
 		return error;
 
@@ -2395,8 +2407,8 @@ xfs_ifree(
 	ip->i_d.di_gen++;
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 
-	if (delete)
-		error = xfs_ifree_cluster(ip, tp, first_ino);
+	if (xic.deleted)
+		error = xfs_ifree_cluster(ip, tp, &xic);
 
 	return error;
 }
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 16/18] xfs: use actual inode count for sparse records in bulkstat/inumbers
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (14 preceding siblings ...)
  2015-02-06 19:53 ` [PATCH v3 15/18] xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() Brian Foster
@ 2015-02-06 19:53 ` Brian Foster
  2015-02-08 22:29   ` Dave Chinner
  2015-02-06 19:53 ` [PATCH v3 17/18] xfs: add fs geometry bit for sparse inode chunks Brian Foster
  2015-02-06 19:53 ` [PATCH v3 18/18] xfs: enable sparse inode chunks for v5 superblocks Brian Foster
  17 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:53 UTC (permalink / raw)
  To: xfs

The bulkstat and inumbers mechanisms make the assumption that inode
records consist of a full 64 inode chunk in several places. For example,
this is used to track how many inodes have been processed overall as
well as to determine whether a record has allocated inodes that must be
handled.

This assumption is invalid for sparse inode records. While sparse inodes
will be marked as free in the ir_free mask, they are not accounted as
free in ir_freecount because they cannot be allocated. Therefore,
ir_freecount may be less than 64 inodes in an inode record for which all
physically allocated inodes are free (and in turn ir_freecount < 64 does
not signify that the record has allocated inodes).

Create the xfs_inobt_count() helper to calculate the total number of
physically allocated inodes based on the holemask. Use the helper in
xfs_bulkstat() and xfs_inumbers() instead of the fixed
XFS_INODE_PER_CHUNK value to ensure correct accounting in the event of
sparse inode records.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 16 ++++++++++++++++
 fs/xfs/libxfs/xfs_ialloc.h |  5 +++++
 fs/xfs/xfs_itable.c        | 14 ++++++++++----
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index c104ee5..4d151ed 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1207,6 +1207,22 @@ xfs_inobt_first_free_inode(
 }
 
 /*
+ * Calculate the real count of inodes in a chunk.
+ */
+int
+xfs_inobt_count(
+	struct xfs_mount		*mp,
+	struct xfs_inobt_rec_incore	*rec)
+{
+	ASSERT(!xfs_inobt_rec_check_count(mp, rec));
+
+	if (!xfs_inobt_issparse(rec->ir_holemask))
+		return XFS_INODES_PER_CHUNK;
+
+	return rec->ir_count;
+}
+
+/*
  * Allocate an inode using the inobt-only algorithm.
  */
 STATIC int
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 12401fe..065abfc 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -170,4 +170,9 @@ int xfs_read_agi(struct xfs_mount *mp, struct xfs_trans *tp,
 		xfs_agnumber_t agno, struct xfs_buf **bpp);
 
 
+/*
+ * Calculate the real count of inodes in a chunk.
+ */
+int xfs_inobt_count(struct xfs_mount *mp, struct xfs_inobt_rec_incore *rec);
+
 #endif	/* __XFS_IALLOC_H__ */
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 82e3142..34eecdf 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -252,7 +252,8 @@ xfs_bulkstat_grab_ichunk(
 		}
 
 		irec->ir_free |= xfs_inobt_maskn(0, idx);
-		*icount = XFS_INODES_PER_CHUNK - irec->ir_freecount;
+		*icount = xfs_inobt_count(cur->bc_mp, irec) -
+			  irec->ir_freecount;
 	}
 
 	return 0;
@@ -415,6 +416,8 @@ xfs_bulkstat(
 				goto del_cursor;
 			if (icount) {
 				irbp->ir_startino = r.ir_startino;
+				irbp->ir_holemask = r.ir_holemask;
+				irbp->ir_count = r.ir_count;
 				irbp->ir_freecount = r.ir_freecount;
 				irbp->ir_free = r.ir_free;
 				irbp++;
@@ -447,13 +450,16 @@ xfs_bulkstat(
 			 * If this chunk has any allocated inodes, save it.
 			 * Also start read-ahead now for this chunk.
 			 */
-			if (r.ir_freecount < XFS_INODES_PER_CHUNK) {
+			if (r.ir_freecount < xfs_inobt_count(mp, &r)) {
 				xfs_bulkstat_ichunk_ra(mp, agno, &r);
 				irbp->ir_startino = r.ir_startino;
+				irbp->ir_holemask = r.ir_holemask;
+				irbp->ir_count = r.ir_count;
 				irbp->ir_freecount = r.ir_freecount;
 				irbp->ir_free = r.ir_free;
 				irbp++;
-				icount += XFS_INODES_PER_CHUNK - r.ir_freecount;
+				icount += xfs_inobt_count(mp, &r) -
+					  r.ir_freecount;
 			}
 			error = xfs_btree_increment(cur, 0, &stat);
 			if (error || stat == 0) {
@@ -600,7 +606,7 @@ xfs_inumbers(
 		buffer[bufidx].xi_startino =
 			XFS_AGINO_TO_INO(mp, agno, r.ir_startino);
 		buffer[bufidx].xi_alloccount =
-			XFS_INODES_PER_CHUNK - r.ir_freecount;
+			xfs_inobt_count(mp, &r) - r.ir_freecount;
 		buffer[bufidx].xi_allocmask = ~r.ir_free;
 		if (++bufidx == bcount) {
 			long	written;
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 17/18] xfs: add fs geometry bit for sparse inode chunks
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (15 preceding siblings ...)
  2015-02-06 19:53 ` [PATCH v3 16/18] xfs: use actual inode count for sparse records in bulkstat/inumbers Brian Foster
@ 2015-02-06 19:53 ` Brian Foster
  2015-02-06 19:53 ` [PATCH v3 18/18] xfs: enable sparse inode chunks for v5 superblocks Brian Foster
  17 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:53 UTC (permalink / raw)
  To: xfs

Define an fs geometry bit for sparse inode chunks such that the
characteristic of the fs can be identified by userspace.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_fs.h | 1 +
 fs/xfs/xfs_fsops.c     | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 18dc721..89689c6 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_V5SB	0x8000	/* version 5 superblock */
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
 #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
+#define XFS_FSOP_GEOM_FLAGS_SPINODES	0x40000	/* sparse inode chunks	*/
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index fba6532..ea390ef 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -101,7 +101,9 @@ xfs_fs_geometry(
 			(xfs_sb_version_hasftype(&mp->m_sb) ?
 				XFS_FSOP_GEOM_FLAGS_FTYPE : 0) |
 			(xfs_sb_version_hasfinobt(&mp->m_sb) ?
-				XFS_FSOP_GEOM_FLAGS_FINOBT : 0);
+				XFS_FSOP_GEOM_FLAGS_FINOBT : 0) |
+			(xfs_sb_version_hassparseinodes(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_SPINODES : 0);
 		geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
 				mp->m_sb.sb_logsectsize : BBSIZE;
 		geo->rtsectsize = mp->m_sb.sb_blocksize;
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 18/18] xfs: enable sparse inode chunks for v5 superblocks
  2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
                   ` (16 preceding siblings ...)
  2015-02-06 19:53 ` [PATCH v3 17/18] xfs: add fs geometry bit for sparse inode chunks Brian Foster
@ 2015-02-06 19:53 ` Brian Foster
  17 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-06 19:53 UTC (permalink / raw)
  To: xfs

Enable mounting of filesystems with sparse inode support enabled. Add
the incompat. feature bit to the *_ALL mask.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index b2faced..f1ce389 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -521,7 +521,8 @@ xfs_sb_has_ro_compat_feature(
 #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
 #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
-		(XFS_SB_FEAT_INCOMPAT_FTYPE)
+		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
+		 XFS_SB_FEAT_INCOMPAT_SPINODES)
 
 #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
 static inline bool
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 01/18] xfs: add sparse inode chunk alignment superblock field
  2015-02-06 19:52 ` [PATCH v3 01/18] xfs: add sparse inode chunk alignment superblock field Brian Foster
@ 2015-02-06 22:40   ` Dave Chinner
  2015-02-08 16:04     ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2015-02-06 22:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:52:48PM -0500, Brian Foster wrote:
> Add sb_spinoalignmt to the superblock to specify sparse inode chunk
> alignment. This also currently represents the minimum allowable sparse
> chunk allocation size.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_format.h | 6 +++---
>  fs/xfs/libxfs/xfs_sb.c     | 4 ++--
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 8eb7189..051c24d 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -170,7 +170,7 @@ typedef struct xfs_sb {
>  	__uint32_t	sb_features_log_incompat;
>  
>  	__uint32_t	sb_crc;		/* superblock crc */
> -	__uint32_t	sb_pad;
> +	xfs_extlen_t	sb_spinoalignmt;/* sparse inode chunk alignment */

That's a mounthful. sb_spino_align is a bit easier to read, IMO.

> @@ -282,7 +282,7 @@ typedef enum {
>  	XFS_SBS_LOGSECTLOG, XFS_SBS_LOGSECTSIZE, XFS_SBS_LOGSUNIT,
>  	XFS_SBS_FEATURES2, XFS_SBS_BAD_FEATURES2, XFS_SBS_FEATURES_COMPAT,
>  	XFS_SBS_FEATURES_RO_COMPAT, XFS_SBS_FEATURES_INCOMPAT,
> -	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_PAD,
> +	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_SPINOALIGNMT,
>  	XFS_SBS_PQUOTINO, XFS_SBS_LSN,
>  	XFS_SBS_FIELDCOUNT
>  } xfs_sb_field_t;

These are gone in the for-next tree.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 03/18] xfs: sparse inode chunks feature helpers and mount requirements
  2015-02-06 19:52 ` [PATCH v3 03/18] xfs: sparse inode chunks feature helpers and mount requirements Brian Foster
@ 2015-02-06 22:54   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2015-02-06 22:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:52:50PM -0500, Brian Foster wrote:
> The sparse inode chunks feature uses the helper function to enable the
> allocation of sparse inode chunks. The incompatible feature bit is set
> on disk at mkfs time to prevent mount from unsupported kernels.
> 
> Also, enforce the inode alignment requirements required for sparse inode
> chunks at mount time. When enabled, full inode chunks (and all inode
> record) alignment is increased from cluster size to inode chunk size.
> Sparse inode alignment must match the cluster size of the fs. Both
> superblock alignment fields are set as such by mkfs when sparse inode
> support is enabled.
> 
> Finally, warn that sparse inode chunks is an experimental feature until
> further notice.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
....
> @@ -182,6 +182,28 @@ xfs_mount_validate_sb(
>  			return -EFSCORRUPTED;
>  	}
>  
> +	/*
> +	 * Full inode chunks must be aligned to inode chunk size when
> +	 * sparse inodes are enabled to support the sparse chunk
> +	 * allocation algorithm and prevent overlapping inode records.
> +	 */
> +	if (xfs_sb_version_hassparseinodes(sbp)) {
> +		uint32_t	align;
> +
> +		xfs_alert(mp,
> +		"v5 superblock with sparse inode chunk support detected.\n"
> +		"This feature is EXPERIMENTAL. Use at your own risk!");

No need to mention v5 superblocks here - we've already got that in
the initial mount output. Say:

"EXPERIMENTAL sparse inode chunk feature enabled. Use at your own risk!"

> +
> +		align = XFS_INODES_PER_CHUNK * sbp->sb_inodesize
> +				>> sbp->sb_blocklog;
> +		if (sbp->sb_inoalignmt != align) {
> +			xfs_warn(mp, "Invalid inode alignment (%u blks). "
> +		"Must match inode chunk size (%u blks) with sparse inodes.",

"Inode block alignment (%u) must match chunk size (%u) for sparse inodes."

That makes both error messages a single line a easy to grep for.

> @@ -738,6 +738,22 @@ xfs_mountfs(
>  	}
>  
>  	/*
> +	 * If enabled, sparse inode chunk alignment is expected to match the
> +	 * cluster size. Full inode chunk alignment must match the chunk size,
> +	 * but that is checked on sb read verification...
> +	 */
> +	if (xfs_sb_version_hassparseinodes(&mp->m_sb) &&
> +	    mp->m_sb.sb_spinoalignmt !=
> +	    XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size)) {

Indent the second line of the comaprison so it's distinct from the
logic comparison. i.e. at first glance the above looks like three
logic conditions being compared, when in fact it is only two.

	if (xfs_sb_version_hassparseinodes(&mp->m_sb) &&
	    mp->m_sb.sb_spinoalignmt !=
			XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size)) {

This makes it clear at a glance there are only two logic conditions
in the if() statement.

> +		xfs_warn(mp, "Invalid sparse inode chunk alignment (%u blks). "
> +			 "Must match cluster size (%llu blks).",

"Sparse inode chunk block alignment (%u) must match cluster size (%u)."

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2015-02-06 19:52 ` [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks Brian Foster
@ 2015-02-06 23:16   ` Dave Chinner
  2015-02-08 16:06     ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2015-02-06 23:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:52:51PM -0500, Brian Foster wrote:
> The inode btrees track 64 inodes per record, regardless of inode size.
> Thus, inode chunks on disk vary in size depending on the size of the
> inodes. This creates a contiguous allocation requirement for new inode
> chunks that can be difficult to satisfy on an aged and fragmented (free
> space) filesystem.
> 
> The inode record freecount currently uses 4 bytes on disk to track the
> free inode count. With a maximum freecount value of 64, only one byte is
> required. Convert the freecount field to a single byte and use two of
> the remaining 3 higher order bytes left for the hole mask field. Use
> the final leftover byte for the total count field.
> 
> The hole mask field tracks holes in the chunks of physical space that
> the inode record refers to. This facilitates the sparse allocation of
> inode chunks when contiguous chunks are not available and allows the
> inode btrees to identify what portions of the chunk contain valid
> inodes. The total count field contains the total number of valid inodes
> referred to by the record. This can also be deduced from the hole mask.
> The count field provides clarity and redundancy for internal record
> verification.
> 
> Note that both fields are initialized to zero to maintain backwards
> compatibility with existing filesystems (e.g., the higher order bytes of
> freecount are always 0). Tracking holes means that the hole mask is
> initialized to zero and thus remains "valid" in accordance with a
> non-sparse inode fs when no sparse chunks are physically allocated.
> Update the inode record management functions to handle the new fields
> and initialize to zero.
> 
> [XXX: The count field breaks backwards compatibility with !sparseinode
> fs. Should we reconsider the addition of total count or the idea of
> converting back and forth between sparse inode fs with a feature bit?]
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_format.h       |  8 ++++++--
>  fs/xfs/libxfs/xfs_ialloc.c       | 12 ++++++++++--
>  fs/xfs/libxfs/xfs_ialloc_btree.c |  4 +++-
>  3 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 26e5d92..6c2f1be 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -1295,13 +1295,17 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
>   */
>  typedef struct xfs_inobt_rec {
>  	__be32		ir_startino;	/* starting inode number */
> -	__be32		ir_freecount;	/* count of free inodes (set bits) */
> +	__be16		ir_holemask;	/* hole mask for sparse chunks */
> +	__u8		ir_count;	/* total inode count */
> +	__u8		ir_freecount;	/* count of free inodes (set bits) */
>  	__be64		ir_free;	/* free inode mask */
>  } xfs_inobt_rec_t;

I think I'd prefer to see a union here so that we explicitly state
what the difference in the on-disk format is. i.e. similar to how we
express the difference in long and short btree block headers.

struct xfs_inobt_rec {
	__be32		ir_startino;	/* starting inode number */
	__be32		ir_freecount;	/* count of free inodes (set bits) */
	union {
		struct {
			__be32	ir_freecount;
		} f;
		struct {
			__be16	ir_holemask;	/* hole mask for sparse chunks */
			__u8	ir_count;	/* total inode count */
			__u8	ir_freecount;	/* count of free inodes (set bits) */
		} sp;
	}
	__be64		ir_free;	/* free inode mask */
};

This will prevent us from using the wrong method of
referencing/modifying the record because we now need to be explicit
in how we modify it...

>  typedef struct xfs_inobt_rec_incore {
>  	xfs_agino_t	ir_startino;	/* starting inode number */
> -	__int32_t	ir_freecount;	/* count of free inodes (set bits) */
> +	__uint16_t	ir_holemask;	/* hole mask for sparse chunks */
> +	__uint8_t	ir_count;	/* total inode count */
> +	__uint8_t	ir_freecount;	/* count of free inodes (set bits) */
>  	xfs_inofree_t	ir_free;	/* free inode mask */
>  } xfs_inobt_rec_incore_t;

Though this is still fine - it doesn't need to explicitly follow the
on-disk format structure, but it would be nice to be explicit on
conversion to disk format records what we are actually using from
this record.

> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 6f2153e..32fdb7c 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -65,6 +65,8 @@ xfs_inobt_lookup(
>  	int			*stat)	/* success/failure */
>  {
>  	cur->bc_rec.i.ir_startino = ino;
> +	cur->bc_rec.i.ir_holemask = 0;
> +	cur->bc_rec.i.ir_count = 0;
>  	cur->bc_rec.i.ir_freecount = 0;
>  	cur->bc_rec.i.ir_free = 0;
>  	return xfs_btree_lookup(cur, dir, stat);
> @@ -82,7 +84,9 @@ xfs_inobt_update(
>  	union xfs_btree_rec	rec;
>  
>  	rec.inobt.ir_startino = cpu_to_be32(irec->ir_startino);
> -	rec.inobt.ir_freecount = cpu_to_be32(irec->ir_freecount);
> +	rec.inobt.ir_holemask = cpu_to_be16(irec->ir_holemask);
> +	rec.inobt.ir_count = irec->ir_count;
> +	rec.inobt.ir_freecount = irec->ir_freecount;
>  	rec.inobt.ir_free = cpu_to_be64(irec->ir_free);
>  	return xfs_btree_update(cur, &rec);
>  }

Hmmm - perhaps a similar set of helpers for sparse inode enabled
filesystems

> @@ -118,6 +124,8 @@ xfs_inobt_insert_rec(
>  	xfs_inofree_t		free,
>  	int			*stat)
>  {
> +	cur->bc_rec.i.ir_holemask = 0;
> +	cur->bc_rec.i.ir_count = 0; /* zero for backwards compatibility */
>  	cur->bc_rec.i.ir_freecount = freecount;
>  	cur->bc_rec.i.ir_free = free;
>  	return xfs_btree_insert(cur, stat);

That would make this sort of thing very clear - this doesn't look
like it would work for a sparse chunk record...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 07/18] xfs: handle sparse inode chunks in icreate log recovery
  2015-02-06 19:52 ` [PATCH v3 07/18] xfs: handle sparse inode chunks in icreate log recovery Brian Foster
@ 2015-02-06 23:19   ` Dave Chinner
  2015-02-08 16:06     ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2015-02-06 23:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:52:54PM -0500, Brian Foster wrote:
> Recovery of icreate transactions assumes hardcoded values for the inode
> count and chunk length.
> 
> Sparse inode chunks are allocated in units of m_ialloc_min_blks. Update
> the icreate validity checks to allow for appropriately sized inode
> chunks and verify the inode count matches what is expected based on the
> extent length rather than assuming a hardcoded count.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_log_recover.c | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index ecc73d5..5a5ee20 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3068,12 +3068,19 @@ xlog_recover_do_icreate_pass2(
>  		return -EINVAL;
>  	}
>  
> -	/* existing allocation is fixed value */
> -	ASSERT(count == mp->m_ialloc_inos);
> -	ASSERT(length == mp->m_ialloc_blks);
> -	if (count != mp->m_ialloc_inos ||
> -	     length != mp->m_ialloc_blks) {
> -		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad count 2");
> +	/* inode chunk is either full or sparse */
> +	if (length != mp->m_ialloc_blks &&
> +	    length != mp->m_ialloc_min_blks) {
> +		xfs_warn(log->l_mp,
> +			 "%s: unsupported chunk length", __FUNCTION__);
> +		return -EINVAL;
> +	}

Hmmm - this would prevent recovery of sparse inode chunk allocation
in multiples of mp->m_ialloc_min_blks, right? Surely we can allow
any sub-chunk extent size to be allocated as long as alignment and
size restrictions are met?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 01/18] xfs: add sparse inode chunk alignment superblock field
  2015-02-06 22:40   ` Dave Chinner
@ 2015-02-08 16:04     ` Brian Foster
  2015-02-08 21:43       ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-08 16:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, Feb 07, 2015 at 09:40:47AM +1100, Dave Chinner wrote:
> On Fri, Feb 06, 2015 at 02:52:48PM -0500, Brian Foster wrote:
> > Add sb_spinoalignmt to the superblock to specify sparse inode chunk
> > alignment. This also currently represents the minimum allowable sparse
> > chunk allocation size.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h | 6 +++---
> >  fs/xfs/libxfs/xfs_sb.c     | 4 ++--
> >  2 files changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 8eb7189..051c24d 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -170,7 +170,7 @@ typedef struct xfs_sb {
> >  	__uint32_t	sb_features_log_incompat;
> >  
> >  	__uint32_t	sb_crc;		/* superblock crc */
> > -	__uint32_t	sb_pad;
> > +	xfs_extlen_t	sb_spinoalignmt;/* sparse inode chunk alignment */
> 
> That's a mounthful. sb_spino_align is a bit easier to read, IMO.
> 

Ok.

> > @@ -282,7 +282,7 @@ typedef enum {
> >  	XFS_SBS_LOGSECTLOG, XFS_SBS_LOGSECTSIZE, XFS_SBS_LOGSUNIT,
> >  	XFS_SBS_FEATURES2, XFS_SBS_BAD_FEATURES2, XFS_SBS_FEATURES_COMPAT,
> >  	XFS_SBS_FEATURES_RO_COMPAT, XFS_SBS_FEATURES_INCOMPAT,
> > -	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_PAD,
> > +	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_SPINOALIGNMT,
> >  	XFS_SBS_PQUOTINO, XFS_SBS_LSN,
> >  	XFS_SBS_FIELDCOUNT
> >  } xfs_sb_field_t;
> 
> These are gone in the for-next tree.
> 

The per-field logging stuff is gone... this apparently still exists. It
looks like it goes away as part of the icsb rework so this will drop
naturally whenever that goes in.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2015-02-06 23:16   ` Dave Chinner
@ 2015-02-08 16:06     ` Brian Foster
  2015-02-08 21:57       ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-08 16:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, Feb 07, 2015 at 10:16:15AM +1100, Dave Chinner wrote:
> On Fri, Feb 06, 2015 at 02:52:51PM -0500, Brian Foster wrote:
> > The inode btrees track 64 inodes per record, regardless of inode size.
> > Thus, inode chunks on disk vary in size depending on the size of the
> > inodes. This creates a contiguous allocation requirement for new inode
> > chunks that can be difficult to satisfy on an aged and fragmented (free
> > space) filesystem.
> > 
> > The inode record freecount currently uses 4 bytes on disk to track the
> > free inode count. With a maximum freecount value of 64, only one byte is
> > required. Convert the freecount field to a single byte and use two of
> > the remaining 3 higher order bytes left for the hole mask field. Use
> > the final leftover byte for the total count field.
> > 
> > The hole mask field tracks holes in the chunks of physical space that
> > the inode record refers to. This facilitates the sparse allocation of
> > inode chunks when contiguous chunks are not available and allows the
> > inode btrees to identify what portions of the chunk contain valid
> > inodes. The total count field contains the total number of valid inodes
> > referred to by the record. This can also be deduced from the hole mask.
> > The count field provides clarity and redundancy for internal record
> > verification.
> > 
> > Note that both fields are initialized to zero to maintain backwards
> > compatibility with existing filesystems (e.g., the higher order bytes of
> > freecount are always 0). Tracking holes means that the hole mask is
> > initialized to zero and thus remains "valid" in accordance with a
> > non-sparse inode fs when no sparse chunks are physically allocated.
> > Update the inode record management functions to handle the new fields
> > and initialize to zero.
> > 
> > [XXX: The count field breaks backwards compatibility with !sparseinode
> > fs. Should we reconsider the addition of total count or the idea of
> > converting back and forth between sparse inode fs with a feature bit?]
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h       |  8 ++++++--
> >  fs/xfs/libxfs/xfs_ialloc.c       | 12 ++++++++++--
> >  fs/xfs/libxfs/xfs_ialloc_btree.c |  4 +++-
> >  3 files changed, 19 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 26e5d92..6c2f1be 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -1295,13 +1295,17 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
> >   */
> >  typedef struct xfs_inobt_rec {
> >  	__be32		ir_startino;	/* starting inode number */
> > -	__be32		ir_freecount;	/* count of free inodes (set bits) */
> > +	__be16		ir_holemask;	/* hole mask for sparse chunks */
> > +	__u8		ir_count;	/* total inode count */
> > +	__u8		ir_freecount;	/* count of free inodes (set bits) */
> >  	__be64		ir_free;	/* free inode mask */
> >  } xfs_inobt_rec_t;
> 
> I think I'd prefer to see a union here so that we explicitly state
> what the difference in the on-disk format is. i.e. similar to how we
> express the difference in long and short btree block headers.
> 
> struct xfs_inobt_rec {
> 	__be32		ir_startino;	/* starting inode number */
> 	__be32		ir_freecount;	/* count of free inodes (set bits) */
> 	union {
> 		struct {
> 			__be32	ir_freecount;
> 		} f;
> 		struct {
> 			__be16	ir_holemask;	/* hole mask for sparse chunks */
> 			__u8	ir_count;	/* total inode count */
> 			__u8	ir_freecount;	/* count of free inodes (set bits) */
> 		} sp;
> 	}
> 	__be64		ir_free;	/* free inode mask */
> };
> 
> This will prevent us from using the wrong method of
> referencing/modifying the record because we now need to be explicit
> in how we modify it...
> 

I agree in principle. This is definitely explicit in that there are two
variants of this structure that must be handled in a particular way.
That said, 'sparse' and 'full' aren't quite logical differentiators
given that we now have the ir_count field and that it is used whenever
sparse inode support is enabled. In other words, a sparse enabled fs'
always uses the 'sp' variant of the inobt record, regardless of whether
the record happens to sparse.

Ultimately, I think the benefit of doing something like this depends on
its utility...

> >  typedef struct xfs_inobt_rec_incore {
> >  	xfs_agino_t	ir_startino;	/* starting inode number */
> > -	__int32_t	ir_freecount;	/* count of free inodes (set bits) */
> > +	__uint16_t	ir_holemask;	/* hole mask for sparse chunks */
> > +	__uint8_t	ir_count;	/* total inode count */
> > +	__uint8_t	ir_freecount;	/* count of free inodes (set bits) */
> >  	xfs_inofree_t	ir_free;	/* free inode mask */
> >  } xfs_inobt_rec_incore_t;
> 
> Though this is still fine - it doesn't need to explicitly follow the
> on-disk format structure, but it would be nice to be explicit on
> conversion to disk format records what we are actually using from
> this record.
> 

Ok.

> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index 6f2153e..32fdb7c 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -65,6 +65,8 @@ xfs_inobt_lookup(
> >  	int			*stat)	/* success/failure */
> >  {
> >  	cur->bc_rec.i.ir_startino = ino;
> > +	cur->bc_rec.i.ir_holemask = 0;
> > +	cur->bc_rec.i.ir_count = 0;
> >  	cur->bc_rec.i.ir_freecount = 0;
> >  	cur->bc_rec.i.ir_free = 0;
> >  	return xfs_btree_lookup(cur, dir, stat);
> > @@ -82,7 +84,9 @@ xfs_inobt_update(
> >  	union xfs_btree_rec	rec;
> >  
> >  	rec.inobt.ir_startino = cpu_to_be32(irec->ir_startino);
> > -	rec.inobt.ir_freecount = cpu_to_be32(irec->ir_freecount);
> > +	rec.inobt.ir_holemask = cpu_to_be16(irec->ir_holemask);
> > +	rec.inobt.ir_count = irec->ir_count;
> > +	rec.inobt.ir_freecount = irec->ir_freecount;
> >  	rec.inobt.ir_free = cpu_to_be64(irec->ir_free);
> >  	return xfs_btree_update(cur, &rec);
> >  }
> 
> Hmmm - perhaps a similar set of helpers for sparse inode enabled
> filesystems
> 

We could do that for the insert/update helpers, but note again the
separate helpers would not split along the lines of sparse vs full
chunks. Even if we were to change the design such that they do, that
would have th effect of complicating some of the subsequent code. For
example, the merge/update code currently has no reason to care about
whether it has created a full chunk out of multiple sparse chunks. This
would require more code to make that distinction simply to pick the
correct api, for something that can easily be done with a simple generic
helper. The same goes for things like the finobt, where now
lookup/update/insert has to make distinctions depending on the type of
inode record.

An alternate approach could be to stick with the generic helpers, but
try and tweak them to use the appropriate on-disk format depending on
the record. Then again, how can one really identify one from the other
in the lookup or _get_rec() scenarios?

> > @@ -118,6 +124,8 @@ xfs_inobt_insert_rec(
> >  	xfs_inofree_t		free,
> >  	int			*stat)
> >  {
> > +	cur->bc_rec.i.ir_holemask = 0;
> > +	cur->bc_rec.i.ir_count = 0; /* zero for backwards compatibility */
> >  	cur->bc_rec.i.ir_freecount = freecount;
> >  	cur->bc_rec.i.ir_free = free;
> >  	return xfs_btree_insert(cur, stat);
> 
> That would make this sort of thing very clear - this doesn't look
> like it would work for a sparse chunk record...
> 

Right... but this is just a plumbing patch and effectively a placeholder
for the subsequent patches that implement the mechanism. I suppose the
api could have been pulled back sooner into this patch, but I'd rather
not reshuffle things just for that intermediate state. That context
probably wasn't quite clear to me at the time.

Before I go one way or another here with regard to the on-disk data
structure, care to take a look at the subsequent patch(es) that use
these helpers? Patch 10 in particular pretty much sums up how these
helpers are used for sparse inode record management.

FWIW, comments on the generic bitmap stuff is also appreciated as that's
another area I'm not totally convinved is done right. :)

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 07/18] xfs: handle sparse inode chunks in icreate log recovery
  2015-02-06 23:19   ` Dave Chinner
@ 2015-02-08 16:06     ` Brian Foster
  0 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-08 16:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, Feb 07, 2015 at 10:19:46AM +1100, Dave Chinner wrote:
> On Fri, Feb 06, 2015 at 02:52:54PM -0500, Brian Foster wrote:
> > Recovery of icreate transactions assumes hardcoded values for the inode
> > count and chunk length.
> > 
> > Sparse inode chunks are allocated in units of m_ialloc_min_blks. Update
> > the icreate validity checks to allow for appropriately sized inode
> > chunks and verify the inode count matches what is expected based on the
> > extent length rather than assuming a hardcoded count.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_log_recover.c | 19 +++++++++++++------
> >  1 file changed, 13 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index ecc73d5..5a5ee20 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -3068,12 +3068,19 @@ xlog_recover_do_icreate_pass2(
> >  		return -EINVAL;
> >  	}
> >  
> > -	/* existing allocation is fixed value */
> > -	ASSERT(count == mp->m_ialloc_inos);
> > -	ASSERT(length == mp->m_ialloc_blks);
> > -	if (count != mp->m_ialloc_inos ||
> > -	     length != mp->m_ialloc_blks) {
> > -		xfs_warn(log->l_mp, "xlog_recover_do_icreate_trans: bad count 2");
> > +	/* inode chunk is either full or sparse */
> > +	if (length != mp->m_ialloc_blks &&
> > +	    length != mp->m_ialloc_min_blks) {
> > +		xfs_warn(log->l_mp,
> > +			 "%s: unsupported chunk length", __FUNCTION__);
> > +		return -EINVAL;
> > +	}
> 
> Hmmm - this would prevent recovery of sparse inode chunk allocation
> in multiples of mp->m_ialloc_min_blks, right? Surely we can allow
> any sub-chunk extent size to be allocated as long as alignment and
> size restrictions are met?
> 

Indeed it does prevent recovery of allocations in multiples of
m_ialloc_min_blks, but that is not supported at the moment so this alone
shouldn't cause any problems. The reason for the limitation is the same
general issue of dealing with allocations that may not be guaranteed to
convert directly to valid inode records. In other words, this is the
minimum allocation we can guarantee will span only a single record.

Now that I think about it after doing the whole agbno range thing,
perhaps an 'end_align' allocation argument might be a way to deal with
that one? For example, sparse allocs could request a cluster aligned
start bno and a chunk aligned end bno..? That might require multiple
sparse alloc attempts to take full advantage, probably requires more
thought either way...

Also note that for all fs' that this feature supports (i.e., v5), the
next multiple of the m_ialloc_min_blks corresponds to a full chunk (due
to the cluster size scaling). Given all of that, I'd at least prefer to
defer more advanced allocation support from this initial series as an
enhancement (provided there's value).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 01/18] xfs: add sparse inode chunk alignment superblock field
  2015-02-08 16:04     ` Brian Foster
@ 2015-02-08 21:43       ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2015-02-08 21:43 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Sun, Feb 08, 2015 at 11:04:03AM -0500, Brian Foster wrote:
> On Sat, Feb 07, 2015 at 09:40:47AM +1100, Dave Chinner wrote:
> > On Fri, Feb 06, 2015 at 02:52:48PM -0500, Brian Foster wrote:
> > > Add sb_spinoalignmt to the superblock to specify sparse inode chunk
> > > alignment. This also currently represents the minimum allowable sparse
> > > chunk allocation size.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h | 6 +++---
> > >  fs/xfs/libxfs/xfs_sb.c     | 4 ++--
> > >  2 files changed, 5 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 8eb7189..051c24d 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -170,7 +170,7 @@ typedef struct xfs_sb {
> > >  	__uint32_t	sb_features_log_incompat;
> > >  
> > >  	__uint32_t	sb_crc;		/* superblock crc */
> > > -	__uint32_t	sb_pad;
> > > +	xfs_extlen_t	sb_spinoalignmt;/* sparse inode chunk alignment */
> > 
> > That's a mounthful. sb_spino_align is a bit easier to read, IMO.
> > 
> 
> Ok.
> 
> > > @@ -282,7 +282,7 @@ typedef enum {
> > >  	XFS_SBS_LOGSECTLOG, XFS_SBS_LOGSECTSIZE, XFS_SBS_LOGSUNIT,
> > >  	XFS_SBS_FEATURES2, XFS_SBS_BAD_FEATURES2, XFS_SBS_FEATURES_COMPAT,
> > >  	XFS_SBS_FEATURES_RO_COMPAT, XFS_SBS_FEATURES_INCOMPAT,
> > > -	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_PAD,
> > > +	XFS_SBS_FEATURES_LOG_INCOMPAT, XFS_SBS_CRC, XFS_SBS_SPINOALIGNMT,
> > >  	XFS_SBS_PQUOTINO, XFS_SBS_LSN,
> > >  	XFS_SBS_FIELDCOUNT
> > >  } xfs_sb_field_t;
> > 
> > These are gone in the for-next tree.
> > 
> 
> The per-field logging stuff is gone... this apparently still exists. It
> looks like it goes away as part of the icsb rework so this will drop
> naturally whenever that goes in.

Ah, I can't even keep up with what I'm doing! :)

Yeah, that will go into the tree as soon as the 3.20 merge
window is done.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2015-02-08 16:06     ` Brian Foster
@ 2015-02-08 21:57       ` Dave Chinner
  2015-02-09 15:15         ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2015-02-08 21:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Sun, Feb 08, 2015 at 11:06:01AM -0500, Brian Foster wrote:
> On Sat, Feb 07, 2015 at 10:16:15AM +1100, Dave Chinner wrote:
> > On Fri, Feb 06, 2015 at 02:52:51PM -0500, Brian Foster wrote:
> > > The inode btrees track 64 inodes per record, regardless of inode size.
> > > Thus, inode chunks on disk vary in size depending on the size of the
> > > inodes. This creates a contiguous allocation requirement for new inode
> > > chunks that can be difficult to satisfy on an aged and fragmented (free
> > > space) filesystem.
> > > 
> > > The inode record freecount currently uses 4 bytes on disk to track the
> > > free inode count. With a maximum freecount value of 64, only one byte is
> > > required. Convert the freecount field to a single byte and use two of
> > > the remaining 3 higher order bytes left for the hole mask field. Use
> > > the final leftover byte for the total count field.
> > > 
> > > The hole mask field tracks holes in the chunks of physical space that
> > > the inode record refers to. This facilitates the sparse allocation of
> > > inode chunks when contiguous chunks are not available and allows the
> > > inode btrees to identify what portions of the chunk contain valid
> > > inodes. The total count field contains the total number of valid inodes
> > > referred to by the record. This can also be deduced from the hole mask.
> > > The count field provides clarity and redundancy for internal record
> > > verification.
> > > 
> > > Note that both fields are initialized to zero to maintain backwards
> > > compatibility with existing filesystems (e.g., the higher order bytes of
> > > freecount are always 0). Tracking holes means that the hole mask is
> > > initialized to zero and thus remains "valid" in accordance with a
> > > non-sparse inode fs when no sparse chunks are physically allocated.
> > > Update the inode record management functions to handle the new fields
> > > and initialize to zero.
> > > 
> > > [XXX: The count field breaks backwards compatibility with !sparseinode
> > > fs. Should we reconsider the addition of total count or the idea of
> > > converting back and forth between sparse inode fs with a feature bit?]
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h       |  8 ++++++--
> > >  fs/xfs/libxfs/xfs_ialloc.c       | 12 ++++++++++--
> > >  fs/xfs/libxfs/xfs_ialloc_btree.c |  4 +++-
> > >  3 files changed, 19 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 26e5d92..6c2f1be 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -1295,13 +1295,17 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
> > >   */
> > >  typedef struct xfs_inobt_rec {
> > >  	__be32		ir_startino;	/* starting inode number */
> > > -	__be32		ir_freecount;	/* count of free inodes (set bits) */
> > > +	__be16		ir_holemask;	/* hole mask for sparse chunks */
> > > +	__u8		ir_count;	/* total inode count */
> > > +	__u8		ir_freecount;	/* count of free inodes (set bits) */
> > >  	__be64		ir_free;	/* free inode mask */
> > >  } xfs_inobt_rec_t;
> > 
> > I think I'd prefer to see a union here so that we explicitly state
> > what the difference in the on-disk format is. i.e. similar to how we
> > express the difference in long and short btree block headers.
> > 
> > struct xfs_inobt_rec {
> > 	__be32		ir_startino;	/* starting inode number */
> > 	__be32		ir_freecount;	/* count of free inodes (set bits) */
> > 	union {
> > 		struct {
> > 			__be32	ir_freecount;
> > 		} f;
> > 		struct {
> > 			__be16	ir_holemask;	/* hole mask for sparse chunks */
> > 			__u8	ir_count;	/* total inode count */
> > 			__u8	ir_freecount;	/* count of free inodes (set bits) */
> > 		} sp;
> > 	}
> > 	__be64		ir_free;	/* free inode mask */
> > };
> > 
> > This will prevent us from using the wrong method of
> > referencing/modifying the record because we now need to be explicit
> > in how we modify it...
> > 
> 
> I agree in principle. This is definitely explicit in that there are two
> variants of this structure that must be handled in a particular way.
> That said, 'sparse' and 'full' aren't quite logical differentiators
> given that we now have the ir_count field and that it is used whenever
> sparse inode support is enabled. In other words, a sparse enabled fs'
> always uses the 'sp' variant of the inobt record, regardless of whether
> the record happens to sparse.

Sure, that's fine for the in memory code, but we've always been
explicit about the disk format and marshalling to/from it,
especially in the cases where the structure on disk can have
multiple formats.

> > > @@ -82,7 +84,9 @@ xfs_inobt_update(
> > >  	union xfs_btree_rec	rec;
> > >  
> > >  	rec.inobt.ir_startino = cpu_to_be32(irec->ir_startino);
> > > -	rec.inobt.ir_freecount = cpu_to_be32(irec->ir_freecount);
> > > +	rec.inobt.ir_holemask = cpu_to_be16(irec->ir_holemask);
> > > +	rec.inobt.ir_count = irec->ir_count;
> > > +	rec.inobt.ir_freecount = irec->ir_freecount;
> > >  	rec.inobt.ir_free = cpu_to_be64(irec->ir_free);
> > >  	return xfs_btree_update(cur, &rec);
> > >  }
> > 
> > Hmmm - perhaps a similar set of helpers for sparse inode enabled
> > filesystems
> > 
> 
> We could do that for the insert/update helpers, but note again the
> separate helpers would not split along the lines of sparse vs full
> chunks. Even if we were to change the design such that they do, that
> would have th effect of complicating some of the subsequent code. For

I suspect it might, but I really don't like the idea of writing
fields that don't actually exist (and hence are always zero) in
the on-disk format to disk. See, for example, the v5 superblock
field writes are conditional in xfs_sb_to_disk, the v3 inode field
writes are conditional in xfs_dinode_to_disk, etc.

> example, the merge/update code currently has no reason to care about
> whether it has created a full chunk out of multiple sparse chunks. This
> would require more code to make that distinction simply to pick the
> correct api, for something that can easily be done with a simple generic
> helper. The same goes for things like the finobt, where now
> lookup/update/insert has to make distinctions depending on the type of
> inode record.

That was another thing that wasn't clear to me - does the finobt
record format change on disk? I don't think it does, as it only
holds a free count and a free inode bitmask, so it doesn't care
which bits of the chunk are allocated or not....

> An alternate approach could be to stick with the generic helpers, but
> try and tweak them to use the appropriate on-disk format depending on
> the record. Then again, how can one really identify one from the other
> in the lookup or _get_rec() scenarios?

Solving these problem was exactly why I suggested different helpers
- the btree ops structure that is chosen for the cursor is in a
place where we know what format we are using, and that avoids
needing "if (xfs_sb_version...)" checks in the helpers to determine
what format we need to use.

> > > @@ -118,6 +124,8 @@ xfs_inobt_insert_rec(
> > >  	xfs_inofree_t		free,
> > >  	int			*stat)
> > >  {
> > > +	cur->bc_rec.i.ir_holemask = 0;
> > > +	cur->bc_rec.i.ir_count = 0; /* zero for backwards compatibility */
> > >  	cur->bc_rec.i.ir_freecount = freecount;
> > >  	cur->bc_rec.i.ir_free = free;
> > >  	return xfs_btree_insert(cur, stat);
> > 
> > That would make this sort of thing very clear - this doesn't look
> > like it would work for a sparse chunk record...
> 
> Right... but this is just a plumbing patch and effectively a placeholder
> for the subsequent patches that implement the mechanism. I suppose the
> api could have been pulled back sooner into this patch, but I'd rather
> not reshuffle things just for that intermediate state. That context
> probably wasn't quite clear to me at the time.
> 
> Before I go one way or another here with regard to the on-disk data
> structure, care to take a look at the subsequent patch(es) that use
> these helpers? Patch 10 in particular pretty much sums up how these
> helpers are used for sparse inode record management.

Yup, I'll get to it.

> FWIW, comments on the generic bitmap stuff is also appreciated as that's
> another area I'm not totally convinved is done right. :)

Why use the generic bitmap stuff rather than the existing XFS code?
If that's just an in-memory change to the free mask processing, then
it can be done separately to the sparse inode functionality, and
probably should be done first in the series....

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 16/18] xfs: use actual inode count for sparse records in bulkstat/inumbers
  2015-02-06 19:53 ` [PATCH v3 16/18] xfs: use actual inode count for sparse records in bulkstat/inumbers Brian Foster
@ 2015-02-08 22:29   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2015-02-08 22:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:53:03PM -0500, Brian Foster wrote:
> The bulkstat and inumbers mechanisms make the assumption that inode
> records consist of a full 64 inode chunk in several places. For example,
> this is used to track how many inodes have been processed overall as
> well as to determine whether a record has allocated inodes that must be
> handled.
> 
> This assumption is invalid for sparse inode records. While sparse inodes
> will be marked as free in the ir_free mask, they are not accounted as
> free in ir_freecount because they cannot be allocated. Therefore,
> ir_freecount may be less than 64 inodes in an inode record for which all
> physically allocated inodes are free (and in turn ir_freecount < 64 does
> not signify that the record has allocated inodes).
> 
> Create the xfs_inobt_count() helper to calculate the total number of
> physically allocated inodes based on the holemask. Use the helper in
> xfs_bulkstat() and xfs_inumbers() instead of the fixed
> XFS_INODE_PER_CHUNK value to ensure correct accounting in the event of
> sparse inode records.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

This should be moved to early in the patchset as a preparation
patch. i.e. xfs_inobt_issparse() can be introduced without any other
dependencies on the sparse inode changes, and then this factoring
can be added before any of the actual sparse inode changes.

That way anywhere that uses the number of allocated inodes in a
chunk will be correct for sparse inodes even before sparse inode
support is added...

> ---
>  fs/xfs/libxfs/xfs_ialloc.c | 16 ++++++++++++++++
>  fs/xfs/libxfs/xfs_ialloc.h |  5 +++++
>  fs/xfs/xfs_itable.c        | 14 ++++++++++----
>  3 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index c104ee5..4d151ed 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -1207,6 +1207,22 @@ xfs_inobt_first_free_inode(
>  }
>  
>  /*
> + * Calculate the real count of inodes in a chunk.
> + */
> +int
> +xfs_inobt_count(
> +	struct xfs_mount		*mp,
> +	struct xfs_inobt_rec_incore	*rec)
> +{
> +	ASSERT(!xfs_inobt_rec_check_count(mp, rec));
> +
> +	if (!xfs_inobt_issparse(rec->ir_holemask))
> +		return XFS_INODES_PER_CHUNK;
> +
> +	return rec->ir_count;
> +}

This btree record state check should probably be added to
xfs_ialloc_btree.c.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 13/18] xfs: update free inode record logic to support sparse inode records
  2015-02-06 19:53 ` [PATCH v3 13/18] xfs: update free inode record logic to support sparse inode records Brian Foster
@ 2015-02-08 22:33   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2015-02-08 22:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:53:00PM -0500, Brian Foster wrote:
> xfs_difree_inobt() uses logic in a couple places that assume inobt
> records refer to fully allocated chunks. Specifically, the use of
> mp->m_ialloc_inos can cause problems for inode chunks that are sparsely
> allocated. Sparse inode chunks can, by definition, define a smaller
> number of inodes than a full inode chunk.
> 
> Fix the logic that determines whether an inode record should be removed
> from the inobt to use the ir_free mask rather than ir_freecount.
> 
> Fix the agi counters modification to use ir_freecount to add the actual
> number of inodes freed rather than assuming a full inode chunk.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

This should be moved to the start of the series, along with any
other code that does rec.ir_freecount checks to see if the chunk
has inodes in use or not.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 12/18] xfs: filter out sparse regions from individual inode allocation
  2015-02-06 19:52 ` [PATCH v3 12/18] xfs: filter out sparse regions from individual inode allocation Brian Foster
@ 2015-02-08 22:49   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2015-02-08 22:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:52:59PM -0500, Brian Foster wrote:
> Inode allocation from an existing record with free inodes traditionally
> selects the first inode available according to the ir_free mask. With
> sparse inode chunks, the ir_free mask could refer to an unallocated
> region. We must mask the unallocated regions out of ir_free before using
> it to select a free inode in the chunk.
> 
> Create the xfs_inobt_first_free_inode() helper to find the first free
> inode available of the allocated regions of the inode chunk.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Helper can go early on in the patch set, and then when the sparse
inode support is added the bitmap stuff can added to the helepr.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 11/18] xfs: randomly do sparse inode allocations in DEBUG mode
  2015-02-06 19:52 ` [PATCH v3 11/18] xfs: randomly do sparse inode allocations in DEBUG mode Brian Foster
@ 2015-02-08 23:02   ` Dave Chinner
  2015-02-09 15:20     ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2015-02-08 23:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:52:58PM -0500, Brian Foster wrote:
> Sparse inode allocations generally only occur when full inode chunk
> allocation fails. This requires some level of filesystem space usage and
> fragmentation.
> 
> For filesystems formatted with sparse inode chunks enabled, do random
> sparse inode chunk allocs when compiled in DEBUG mode to increase test
> coverage.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 090d114..3e5d3eb 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -652,9 +652,18 @@ xfs_ialloc_ag_alloc(
>  
>  	struct xfs_perag *pag;
>  
> +#ifdef DEBUG
> +	int		do_sparse = 0;
> +
> +	/* randomly do sparse inode allocations */
> +	if (xfs_sb_version_hassparseinodes(&tp->t_mountp->m_sb))
> +		do_sparse = prandom_u32() & 1;
> +#endif
> +

Bit ugly with all the ifdefs. If you define the do_sparse variable
outside the ifdef, then the rest of the code other than this check
doesn't need ifdefs.

>  	memset(&args, 0, sizeof(args));
>  	args.tp = tp;
>  	args.mp = tp->t_mountp;
> +	args.fsbno = NULLFSBLOCK;
>  
>  	/*
>  	 * Locking will ensure that we don't have two callers in here
> @@ -675,6 +684,10 @@ xfs_ialloc_ag_alloc(
>  	agno = be32_to_cpu(agi->agi_seqno);
>  	args.agbno = XFS_AGINO_TO_AGBNO(args.mp, newino) +
>  		     args.mp->m_ialloc_blks;
> +#ifdef DEBUG
> +	if (do_sparse)
> +		goto sparse_alloc;
> +#endif
>  	if (likely(newino != NULLAGINO &&
>  		  (args.agbno < be32_to_cpu(agi->agi_length)))) {
>  		args.fsbno = XFS_AGB_TO_FSB(args.mp, agno, args.agbno);
> @@ -713,8 +726,7 @@ xfs_ialloc_ag_alloc(
>  		 * subsequent requests.
>  		 */
>  		args.minalignslop = 0;
> -	} else
> -		args.fsbno = NULLFSBLOCK;
> +	}
>  
>  	if (unlikely(args.fsbno == NULLFSBLOCK)) {
>  		/*
> @@ -769,6 +781,9 @@ xfs_ialloc_ag_alloc(
>  	 * Finally, try a sparse allocation if the filesystem supports it and
>  	 * the sparse allocation length is smaller than a full chunk.
>  	 */
> +#ifdef DEBUG
> +sparse_alloc:
> +#endif
>  	if (xfs_sb_version_hassparseinodes(&args.mp->m_sb) &&
>  	    args.mp->m_ialloc_min_blks < args.mp->m_ialloc_blks &&
>  	    args.fsbno == NULLFSBLOCK) {

The label can go after the if() statement, right? We've already
guaranteed all those other parameters are true, though I suspect
there's a case where that m_ialloc_min_blks < m_ialloc_blks will
fail: block size larger than inode chunk size (e.g. 64k block size, 512
byte inodes) so that would result in the code above failing to
allocate any inode chunks at all...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 08/18] xfs: helpers to convert holemask to/from generic bitmap
  2015-02-06 19:52 ` [PATCH v3 08/18] xfs: helpers to convert holemask to/from generic bitmap Brian Foster
@ 2015-02-08 23:54   ` Dave Chinner
  2015-02-09 15:15     ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2015-02-08 23:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:52:55PM -0500, Brian Foster wrote:
> The inobt record holemask field is a condensed data type designed to fit
> into the existing on-disk record and is zero based (allocated regions
> are set to 0, sparse regions are set to 1) to provide backwards
> compatibility. This makes the type somewhat complex for use in higher
> level inode manipulations such as individual inode allocation, etc.
> 
> Rather than foist the complexity of dealing with this field to every bit
> of logic that requires inode granular information, create the
> xfs_inobt_ialloc_bitmap() helper to convert the holemask to an inode
> allocation bitmap. The inode allocation bitmap is inode granularity
> similar to the inobt record free mask and indicates which inodes of the
> chunk are physically allocated on disk, irrespective of whether the
> inode is considered allocated or free by the filesystem. The generic
> bitmap type requires the use of generic kernel bitmap APIs.
> 
> The bitmap_to_u64() helper is provided to convert the bitmap back to a
> native 64-bit type (for native bitwise operations). This is required
> because the generic bitmap code represents a bitmap as an array of
> unsigned long in a little endian style (though each array value is cpu
> order). To ensure compatibility with various word sizes and endianness',
> bitmap_to_u64() exports the bitmap to little endian and swaps back to
> cpu byte order.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 69 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 72ade0e..fc001d9 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -39,6 +39,8 @@
>  #include "xfs_icache.h"
>  #include "xfs_trace.h"
>  
> +STATIC void
> +xfs_inobt_ialloc_bitmap(unsigned long *, struct xfs_inobt_rec_incore *);

xfs_inobt_irec_to_allocmap() seems more appropriate for what it does.
I'd also suggest it should be in xfs_ialloc_btree.c, too.

>  /*
>   * Allocation group level functions.
> @@ -739,6 +741,73 @@ xfs_ialloc_get_rec(
>  	return 0;
>  }
>  
> +static inline uint64_t
> +bitmap_to_u64(
> +	const unsigned long	*bitmap,
> +	int			nbits)
> +{
> +	__le64			lebitmap;
> +
> +	ASSERT(nbits <= 64);
> +
> +	/*
> +	 * The bitmap format depends on the native word size. E.g., bit 0 could
> +	 * land in the middle of the 64 bits on a big endian 32-bit arch (see
> +	 * arch/powerpc/include/asm/bitops.h). To handle this, export the bitmap
> +	 * as 64-bit little endian and convert back to native byte order.
> +	 */
> +	bitmap_copy_le(&lebitmap, bitmap, nbits);
> +	return le64_to_cpu(lebitmap);
> +}

Ok, so this is for doing logic operations on the bitmap, such as
inverting or masking out a bunch of bits?

The first use of this function is xfs_inobt_first_free_inode(), and
the other use is in xfs_difree_inobt() to build the mask of inodes
in the chunk that need to be freed, and in both those cases they do

	xfs_inobt_ialloc_bitmap(alloc, rec)
	allocmask = bitmap_to_u64(alloc, 64);

and the local stack bitmap is otherwise unused. So, wouldn't a
helper like:

/*
 * Return a host format mask of all the allocated inodes in the
 * chunk. The bitmap format depends on the native word size (e.g.
 * see arch/powerpc/include/asm/bitops.h) and so we have to marshall
 * the bitmap through a defined endian conversion so that we can
 * perform host native 64 bit logic operations on the resultant
 * allocation mask.
 *
 * A bit value of 1 means the inode is allocated, a value of 0 means it
 * is free.
 */
u64
xfs_inobt_irec_to_allocmask(
	struct xfs_inobt_rec_incore *irec)
{
	DECLARE_BITMAP(allocmap, 64),
	__le64			lebitmap;

	xfs_inobt_rec_to_allocmap(&allocmap, irec);
	bitmap_copy_le(&lebitmap, allocmap, 64);
	return le64_to_cpu(lebitmap);
}

> +
> +/*
> + * Convert the inode record holemask to an inode allocation bitmap. The inode
> + * allocation bitmap is inode granularity and specifies whether an inode is
> + * physically allocated on disk (not whether the inode is considered allocated
> + * or free by the fs).
> + *
> + * We have to be careful with regard to byte order and word size since the
> + * generic bitmap code uses primitive types.

" a bit value of 1 means the inode is allocated, a value of 0 means
it is free"

> + */
> +STATIC void
> +xfs_inobt_ialloc_bitmap(
> +	unsigned long			*allocbmap,
> +	struct xfs_inobt_rec_incore	*rec)
> +{
> +	int				nextbit;
> +	DECLARE_BITMAP(holemask, 16);
> +
> +	bitmap_zero(allocbmap, 64);
> +
> +	/*
> +	 * bitmaps are represented as an array of unsigned long (each in cpu
> +	 * byte order). ir_holemask is only 16 bits, so direct assignment is
> +	 * safe.
> +	 */
> +	ASSERT(sizeof(rec->ir_holemask) <= sizeof(holemask[0]));

BUILD_BUG_ON(sizeof(rec->ir_holemask) <= sizeof(holemask[0]));

i.e. if we come across a platform where this fails, break the build
rather than waiting for the unlikely event of someone running a
debug kernel on that platform...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 09/18] xfs: support min/max agbno args in block allocator
  2015-02-06 19:52 ` [PATCH v3 09/18] xfs: support min/max agbno args in block allocator Brian Foster
@ 2015-02-09  0:01   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2015-02-09  0:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:52:56PM -0500, Brian Foster wrote:
> The block allocator supports various arguments to tweak block allocation
> behavior and set allocation requirements. The sparse inode chunk feature
> introduces a new requirement not supported by the current arguments.
> Sparse inode allocations must convert or merge into an inode record that
> describes a fixed length chunk (64 inodes x inodesize). Full inode chunk
> allocations by definition always result in valid inode records. Sparse
> chunk allocations are smaller and the associated records can refer to
> blocks not owned by the inode chunk. This model can result in invalid
> inode records in certain cases.
> 
> For example, if a sparse allocation occurs near the start of an AG, the
> aligned inode record for that chunk might refer to agbno 0. If an
> allocation occurs towards the end of the AG and the AG size is not
> aligned, the inode record could refer to blocks beyond the end of the
> AG. While neither of these scenarios directly result in corruption, they
> both insert invalid inode records and at minimum cause repair to
> complain, are unlikely to merge into full chunks over time and set land
> mines for other areas of code.
> 
> To guarantee sparse inode chunk allocation creates valid inode records,
> support the ability to specify an agbno range limit for
> XFS_ALLOCTYPE_NEAR_BNO block allocations. The min/max agbno's are
> specified in the allocation arguments and limit the block allocation
> algorithms to that range. The starting 'agbno' hint is clamped to the
> range if the specified agbno is out of range. If no sufficient extent is
> available within the range, the allocation fails. For backwards
> compatibility, the min/max fields can be initialized to 0 to disable
> range limiting (e.g., equivalent to min=0,max=agsize).
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Seems sane on reading the implementation. I'll see if I still think
that way when I read the code that uses it ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 10/18] xfs: allocate sparse inode chunks on full chunk allocation failure
  2015-02-06 19:52 ` [PATCH v3 10/18] xfs: allocate sparse inode chunks on full chunk allocation failure Brian Foster
@ 2015-02-09  1:25   ` Dave Chinner
  2015-02-09 15:20     ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2015-02-09  1:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 06, 2015 at 02:52:57PM -0500, Brian Foster wrote:
> xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode
> chunk. If all else fails, reduce the allocation to the minimum sparse
> granularity and attempt to allocate a sparse inode chunk.
> 
> If sparse chunk allocation succeeds, check whether an inobt record
> already exists that can track the chunk. If so, inherit and update the
> existing record. Otherwise, insert a new record for the sparse chunk.
> 
> Update xfs_inobt_insert_rec() to take the holemask as a parameter and
> set the associated field on disk. Create the xfs_inobt_update_insert()
> helper to handle the sparse chunk allocation case - insert or update an
> existing record depending on whether it already exists.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc.c | 397 +++++++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_trace.h         |  47 ++++++
>  2 files changed, 426 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index fc001d9..090d114 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -122,12 +122,16 @@ xfs_inobt_get_rec(
>  STATIC int
>  xfs_inobt_insert_rec(
>  	struct xfs_btree_cur	*cur,
> +	__uint16_t		holemask,
> +	__uint8_t		count,
>  	__int32_t		freecount,
>  	xfs_inofree_t		free,
>  	int			*stat)
>  {
> -	cur->bc_rec.i.ir_holemask = 0;
> -	cur->bc_rec.i.ir_count = 0; /* zero for backwards compatibility */
> +	ASSERT(count == 0 || xfs_sb_version_hassparseinodes(&cur->bc_mp->m_sb));
> +
> +	cur->bc_rec.i.ir_holemask = holemask;
> +	cur->bc_rec.i.ir_count = count;
>  	cur->bc_rec.i.ir_freecount = freecount;
>  	cur->bc_rec.i.ir_free = free;
>  	return xfs_btree_insert(cur, stat);

Ok, so I'd definitely prefer two separate helpers here so that we
don't need asserts or validity checking....

> @@ -151,6 +155,19 @@ xfs_inobt_insert(
>  	xfs_agino_t		thisino;
>  	int			i;
>  	int			error;
> +	uint8_t			count;
> +
> +	/*
> +	 * Only set ir_count in the inobt record if the sparse inodes feature is
> +	 * enabled. If disabled, we must maintain backwards compatibility with
> +	 * the older inobt record format where the current count and holemask
> +	 * fields map to the higher order bytes of freecount and thus must be
> +	 * zeroed.
> +	 */
> +	if (xfs_sb_version_hassparseinodes(&mp->m_sb))
> +		count = XFS_INODES_PER_CHUNK;
> +	else
> +		count = 0;

I'd strongly suggest that we pass the holemask, count and freecount
from the caller, for reasons that will become obvious later.

> @@ -164,7 +181,7 @@ xfs_inobt_insert(
>  		}
>  		ASSERT(i == 0);
>  
> -		error = xfs_inobt_insert_rec(cur, XFS_INODES_PER_CHUNK,
> +		error = xfs_inobt_insert_rec(cur, 0, count, XFS_INODES_PER_CHUNK,
>  					     XFS_INOBT_ALL_FREE, &i);
>  		if (error) {
>  			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);

And this hunk would become:

		if (xfs_sb_version_hassparseinodes(&mp->m_sb)) {
			error = xfs_inobt_insert_sprec(cur, holemask, count,
						       freecount, &i);
						       holemask,
		else
			error = xfs_inobt_insert_rec(cur, count,
						     freecount, &i);

> @@ -174,8 +191,58 @@ xfs_inobt_insert(
>  	}
>  
>  	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	return 0;
> +}
>  
> +/*
> + * Update or insert a new record based on a sparse inode chunk allocation.
> + *
> + * If a record already exists, the new record is an updated version of that
> + * record based on a merge of sparse inode chunks. Update the record in place.
> + * Otherwise, insert a new record in the tree. Note that the record to insert
> + * must already have been aligned and merged, if necessary.
> + */
> +STATIC int
> +xfs_inobt_update_insert(
> +	struct xfs_mount		*mp,
> +	struct xfs_trans		*tp,
> +	struct xfs_buf			*agbp,
> +	struct xfs_inobt_rec_incore	*rec,
> +	xfs_btnum_t			btnum)
> +{
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> +	int				i;
> +	int				error;
> +
> +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
> +
> +	error = xfs_inobt_lookup(cur, rec->ir_startino, XFS_LOOKUP_EQ, &i);
> +	if (error)
> +		goto error;
> +	if (i == 1) {
> +		/* found a record, update it with the merged record */
> +		error = xfs_inobt_update(cur, rec);
> +		if (error)
> +			goto error;
> +		goto out;
> +	}
> +
> +	/* no existing record, insert a new one */
> +	error = xfs_inobt_insert_rec(cur, rec->ir_holemask, rec->ir_count,
> +				     rec->ir_freecount, rec->ir_free, &i);
> +	if (error)
> +		goto error;
> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> +
> +out:
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>  	return 0;
> +
> +error:
> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	return error;
>  }

I think that the way you've factored out the sparse inode record
update code can do with some work. I'll come back to this.

> @@ -358,6 +453,183 @@ xfs_ialloc_inode_init(
>  }
>  
>  /*
> + * Align a record for a recently allocated sparse chunk. The input is a record
> + * that describes the unaligned chunk. The record is aligned such that it is fit
> + * for insertion (or merge) into the on-disk inode btrees.
> + */
> +STATIC void
> +xfs_align_sparse_rec(
> +	struct xfs_mount		*mp,
> +	struct xfs_inobt_rec_incore	*rec)
> +{
> +	xfs_agblock_t			agbno;
> +	xfs_agblock_t			mod;
> +	int				offset;
> +	uint16_t			allocmask;
> +
> +	agbno = XFS_AGINO_TO_AGBNO(mp, rec->ir_startino);
> +	mod = agbno % mp->m_sb.sb_inoalignmt;
> +	if (!mod)
> +		return;
> +
> +	/* calculate the inode offset and align startino */
> +	offset = mod << mp->m_sb.sb_inopblog;
> +	rec->ir_startino -= offset;
> +
> +	/*
> +	 * Since startino has been aligned down, we have to left shift
> +	 * ir_holemask such that it continues to represent the same physical
> +	 * inodes as the unaligned record. The unaligned record by definition
> +	 * tracks the allocated inodes with the lowest order bits.
> +	 *
> +	 * ir_holemask is inverted before the shift such that set bits represent
> +	 * allocated inodes. This makes it safe for the bit-shift to introduce
> +	 * zeroes in the lower order bits without corrupting the record.
> +	 *
> +	 * Note that no change is required for ir_count, ir_freecount or
> +	 * ir_free. The count values are not affected by alignment and ir_free
> +	 * is initialized to 1s for all inodes, sparse or otherwise.
> +	 */
> +	allocmask = ~rec->ir_holemask;
> +	allocmask <<= offset / XFS_INODES_PER_HOLEMASK_BIT;
> +	rec->ir_holemask = ~allocmask;
> +}

I don't understand why the holemask would need changing. I would
have expected this to be done before the hole mask has been placed
inside the record. Indeed, putting it into the record first means we
have to deal with inversions and jump through other hoops.

static void
xfs_align_sparse_ino(
	struct xfs_mount	*mp,
	xfs_agino_t		*startino,
	uint16_t		*allocmask)
{
	agbno = XFS_AGINO_TO_AGBNO(mp, *startino);
	mod = agbno % mp->m_sb.sb_inoalignmt;
	if (!mod)
		return;

	offset = mod << mp->m_sb.sb_inopblog;
	*startino -= offset;
	*allocmask <<= offset / XFS_INODES_PER_HOLEMASK_BIT;
}

And then place the result in the irec....

> +/*
> + * Determine whether a sparse inode records can be merged. The inode ranges
> + * must match and there must be no allocation overlap between the records.
> + */

Hmmm - src + target is not great for an operation where both records
are a source. Perhaps be explict in the comment? e.g.

/*
 * Determine whether the source inode record can be merged into a
 * target record. Both records must be sparse, the inode ranges
 * must match and there must be no allocation overlap between the
 * records.
 */

> +STATIC bool
> +__xfs_inobt_can_merge(
> +	struct xfs_inobt_rec_incore	*trec,	/* tgt record */
> +	struct xfs_inobt_rec_incore	*srec)	/* src record */

If we detect overlap, isn't that an indication of corruption?

> +{
> +	DECLARE_BITMAP(talloc, 64);
> +	DECLARE_BITMAP(salloc, 64);
> +	DECLARE_BITMAP(tmp, 64);
> +
> +	/* records must cover the same inode range */
> +	if (trec->ir_startino != srec->ir_startino)
> +		return false;
> +
> +	/* both records must be sparse */
> +	if (!xfs_inobt_issparse(trec->ir_holemask) ||
> +	    !xfs_inobt_issparse(srec->ir_holemask))
> +		return false;
> +
> +	/* can't exceed capacity of a full record */
> +	if (trec->ir_count + srec->ir_count > XFS_INODES_PER_CHUNK)
> +		return false;

probably also need to check that both records have inodes allocated
in them. ir_count == 0 is indicative of a corrupt record, right?

> +
> +	/* verify there is no allocation overlap */
> +	xfs_inobt_ialloc_bitmap(talloc, trec);
> +	xfs_inobt_ialloc_bitmap(salloc, srec);
> +
> +	bitmap_and(tmp, salloc, talloc, 64);
> +	if (!bitmap_empty(tmp, 64))
> +		return false;

And if one has an ir_count of zero, the bitmap will always be
zero....

> +/*
> + * Merge two sparse inode records. The caller must call __xfs_inobt_can_merge()
> + * to ensure the merge is valid.
> + */

same thing about src/target here.

> +STATIC void
> +__xfs_inobt_rec_merge(
> +	struct xfs_inobt_rec_incore	*trec,	/* target */
> +	struct xfs_inobt_rec_incore	*srec)	/* src */
> +{
> +	ASSERT(trec->ir_startino == srec->ir_startino);
> +
> +	/* combine the counts */
> +	trec->ir_count += srec->ir_count;
> +	trec->ir_freecount += srec->ir_freecount;
> +
> +	/* merge the holemask */
> +	trec->ir_holemask &= srec->ir_holemask;

Comments in this function are not very useful. Please remind me:
does the holemask use bit value of 1 as allocated or as a hole?
My head is full of allocmask stuff and so I can't tell if a merge
should use |= or &= to get the correct value.

> +	/* merge the free mask */
> +	trec->ir_free &= srec->ir_free;

Same here.

> +}
> +
> +/*
> + * Determine whether a newly allocated sparse inode chunk record overlaps with
> + * an existing sparse record in the inobt. When sparse inode chunks are enabled,
> + * all inode chunk alignment is increased from cluster size to physical inode
> + * chunk size. This means that the smallest, non-zero gap between two inode
> + * chunks is at least one full inode chunk. When a sparse inode chunk is
> + * allocated, the containing record is also aligned in this manner such that
> + * future sparse allocations within that same range all align to the same record
> + * startino. This alignment policy supports the ability to merge sparse chunks
> + * into complete chunks over time.
> + *
> + * Given an newly allocated/aligned sparse inode record, look up whether a
> + * sparse record already exists at this startino. If so, merge the two records
> + * and return the merged record in nrec.
> + *
> + * An error is returned if records overlap but a merge is not possible. Given
> + * the alignment constraints described above, this should never happen and thus
> + * is treated as fs corruption.
> + */
> +STATIC int
> +xfs_inobt_rec_merge(
> +	struct xfs_mount		*mp,
> +	struct xfs_trans		*tp,
> +	struct xfs_buf			*agbp,
> +	xfs_btnum_t			btnum,
> +	struct xfs_inobt_rec_incore	*nrec)	/* in/out: new/merged rec. */
> +{
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> +	int				error;
> +	int				i;
> +	struct xfs_inobt_rec_incore	rec;
> +
> +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
> +
> +	/* the new record is pre-aligned so we know where to look */
> +	error = xfs_inobt_lookup(cur, nrec->ir_startino, XFS_LOOKUP_EQ, &i);
> +	if (error)
> +		goto error;
> +	/* if nothing there, we're done */
> +	if (i == 0)
> +		goto out;
> +
> +	error = xfs_inobt_get_rec(cur, &rec, &i);
> +	if (error)
> +		goto error;
> +	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> +	ASSERT(rec.ir_startino == nrec->ir_startino);

XFS_WANT_CORRUPTED_GOTO() is better than an assert here. On a debug
kernel, it will still assert, but on a production kernel it will
trigger a corruption error rather than potentially propagating a
corruption further.

> +	/*
> +	 * This should never happen. If we have coexisting records that cannot
> +	 * merge, something is seriously wrong.
> +	 */
> +	if (!__xfs_inobt_can_merge(nrec, &rec)) {
> +		error = -EFSCORRUPTED;
> +		goto error;
> +	}

Ah, so it is a corruption issue if merges can't occur. ;)

	XFS_WANT_CORRUPTED_GOTO(!__xfs_inobt_can_merge(nrec, &rec));

> @@ -514,6 +828,65 @@ xfs_ialloc_ag_alloc(
>  	 * Convert the results.
>  	 */
>  	newino = XFS_OFFBNO_TO_AGINO(args.mp, args.agbno, 0);
> +
> +	if (xfs_inobt_issparse(~allocmask)) {
> +		/*
> +		 * We've allocated a sparse chunk...
> +		 */
> +		rec.ir_startino = newino;
> +		rec.ir_holemask = ~allocmask;
> +		rec.ir_count = newlen;
> +		rec.ir_freecount = newlen;
> +		rec.ir_free = XFS_INOBT_ALL_FREE;
> +
> +		/* align record and update newino for agi_newino */
> +		xfs_align_sparse_rec(args.mp, &rec);
> +		newino = rec.ir_startino;

See my earlier comments about aligning before putting stuff into
the record, especially as you are having to pull modified
information straight back out of the record.


> +		error = xfs_inobt_rec_merge(args.mp, tp, agbp, XFS_BTNUM_INO,
> +					    &rec);
> +		if (!error)
> +			error = xfs_inobt_rec_check_count(args.mp, &rec);
> +		if (error == -EFSCORRUPTED) {
> +			xfs_alert(args.mp,
> +	"invalid sparse inode record: ino 0x%llx holemask 0x%x count %u",
> +				  XFS_AGINO_TO_INO(args.mp, agno,
> +						   rec.ir_startino),
> +				  rec.ir_holemask, rec.ir_count);
> +			xfs_force_shutdown(args.mp, SHUTDOWN_CORRUPT_INCORE);
> +		}
> +		if (error)
> +			return error;
> +
> +		error = xfs_inobt_update_insert(args.mp, tp, agbp, &rec,
> +						XFS_BTNUM_INO);
> +		if (error)
> +			return error;
> +
> +		if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
> +			error = xfs_inobt_update_insert(args.mp, tp, agbp, &rec,
> +							XFS_BTNUM_FINO);
> +			if (error)
> +				return error;
> +		}

I think this needs to become a helper function, and the
implementation refactored. The reason I say this is that the
xfs_inobt_rec_merge() function already looks up the record that is
to be updated, and if it exists if calculates the merged record
value. If then drops all that state and returns the merged record.

Then we call xfs_inobt_update_insert() to determine if we have to
update the existing record with the merged record (i.e. build all
the state again), and then call xfs_inobt_update() on that record.

Essentially, if xfs_inobt_rec_merge() determines we have a merge
candidate, we should do the btree update there immediately, as well
as fix up the finobt record.

If we aren't doing a record merge and update, then it's just an
insert operation, and we can use the modified xfs_inobt_insert()
function I mentioned earlier to pass the partial chunk record
information for insertion. Perhaps even just pass the irec
structure...

> +	} else {
> +		/* full chunk - insert new records to both btrees */
> +		error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
> +					 XFS_BTNUM_INO);
> +		if (error)
> +			return error;
> +
> +		if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
> +			error = xfs_inobt_insert(args.mp, tp, agbp, newino,
> +						 newlen, XFS_BTNUM_FINO);
> +			if (error)
> +				return error;
> +		}

And this could pass an irec that indicates a full inode chunk rather
than just the subset of inofmration needed to describe a full inode
chunk.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2015-02-08 21:57       ` Dave Chinner
@ 2015-02-09 15:15         ` Brian Foster
  2015-02-09 21:48           ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Brian Foster @ 2015-02-09 15:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Feb 09, 2015 at 08:57:10AM +1100, Dave Chinner wrote:
> On Sun, Feb 08, 2015 at 11:06:01AM -0500, Brian Foster wrote:
> > On Sat, Feb 07, 2015 at 10:16:15AM +1100, Dave Chinner wrote:
> > > On Fri, Feb 06, 2015 at 02:52:51PM -0500, Brian Foster wrote:
> > > > The inode btrees track 64 inodes per record, regardless of inode size.
> > > > Thus, inode chunks on disk vary in size depending on the size of the
> > > > inodes. This creates a contiguous allocation requirement for new inode
> > > > chunks that can be difficult to satisfy on an aged and fragmented (free
> > > > space) filesystem.
> > > > 
> > > > The inode record freecount currently uses 4 bytes on disk to track the
> > > > free inode count. With a maximum freecount value of 64, only one byte is
> > > > required. Convert the freecount field to a single byte and use two of
> > > > the remaining 3 higher order bytes left for the hole mask field. Use
> > > > the final leftover byte for the total count field.
> > > > 
> > > > The hole mask field tracks holes in the chunks of physical space that
> > > > the inode record refers to. This facilitates the sparse allocation of
> > > > inode chunks when contiguous chunks are not available and allows the
> > > > inode btrees to identify what portions of the chunk contain valid
> > > > inodes. The total count field contains the total number of valid inodes
> > > > referred to by the record. This can also be deduced from the hole mask.
> > > > The count field provides clarity and redundancy for internal record
> > > > verification.
> > > > 
> > > > Note that both fields are initialized to zero to maintain backwards
> > > > compatibility with existing filesystems (e.g., the higher order bytes of
> > > > freecount are always 0). Tracking holes means that the hole mask is
> > > > initialized to zero and thus remains "valid" in accordance with a
> > > > non-sparse inode fs when no sparse chunks are physically allocated.
> > > > Update the inode record management functions to handle the new fields
> > > > and initialize to zero.
> > > > 
> > > > [XXX: The count field breaks backwards compatibility with !sparseinode
> > > > fs. Should we reconsider the addition of total count or the idea of
> > > > converting back and forth between sparse inode fs with a feature bit?]
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_format.h       |  8 ++++++--
> > > >  fs/xfs/libxfs/xfs_ialloc.c       | 12 ++++++++++--
> > > >  fs/xfs/libxfs/xfs_ialloc_btree.c |  4 +++-
> > > >  3 files changed, 19 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > index 26e5d92..6c2f1be 100644
> > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > @@ -1295,13 +1295,17 @@ static inline xfs_inofree_t xfs_inobt_maskn(int i, int n)
> > > >   */
> > > >  typedef struct xfs_inobt_rec {
> > > >  	__be32		ir_startino;	/* starting inode number */
> > > > -	__be32		ir_freecount;	/* count of free inodes (set bits) */
> > > > +	__be16		ir_holemask;	/* hole mask for sparse chunks */
> > > > +	__u8		ir_count;	/* total inode count */
> > > > +	__u8		ir_freecount;	/* count of free inodes (set bits) */
> > > >  	__be64		ir_free;	/* free inode mask */
> > > >  } xfs_inobt_rec_t;
> > > 
> > > I think I'd prefer to see a union here so that we explicitly state
> > > what the difference in the on-disk format is. i.e. similar to how we
> > > express the difference in long and short btree block headers.
> > > 
> > > struct xfs_inobt_rec {
> > > 	__be32		ir_startino;	/* starting inode number */
> > > 	__be32		ir_freecount;	/* count of free inodes (set bits) */
> > > 	union {
> > > 		struct {
> > > 			__be32	ir_freecount;
> > > 		} f;
> > > 		struct {
> > > 			__be16	ir_holemask;	/* hole mask for sparse chunks */
> > > 			__u8	ir_count;	/* total inode count */
> > > 			__u8	ir_freecount;	/* count of free inodes (set bits) */
> > > 		} sp;
> > > 	}
> > > 	__be64		ir_free;	/* free inode mask */
> > > };
> > > 
> > > This will prevent us from using the wrong method of
> > > referencing/modifying the record because we now need to be explicit
> > > in how we modify it...
> > > 
> > 
> > I agree in principle. This is definitely explicit in that there are two
> > variants of this structure that must be handled in a particular way.
> > That said, 'sparse' and 'full' aren't quite logical differentiators
> > given that we now have the ir_count field and that it is used whenever
> > sparse inode support is enabled. In other words, a sparse enabled fs'
> > always uses the 'sp' variant of the inobt record, regardless of whether
> > the record happens to sparse.
> 
> Sure, that's fine for the in memory code, but we've always been
> explicit about the disk format and marshalling to/from it,
> especially in the cases where the structure on disk can have
> multiple formats.
> 

Right, this makes sense...

> > > > @@ -82,7 +84,9 @@ xfs_inobt_update(
> > > >  	union xfs_btree_rec	rec;
> > > >  
> > > >  	rec.inobt.ir_startino = cpu_to_be32(irec->ir_startino);
> > > > -	rec.inobt.ir_freecount = cpu_to_be32(irec->ir_freecount);
> > > > +	rec.inobt.ir_holemask = cpu_to_be16(irec->ir_holemask);
> > > > +	rec.inobt.ir_count = irec->ir_count;
> > > > +	rec.inobt.ir_freecount = irec->ir_freecount;
> > > >  	rec.inobt.ir_free = cpu_to_be64(irec->ir_free);
> > > >  	return xfs_btree_update(cur, &rec);
> > > >  }
> > > 
> > > Hmmm - perhaps a similar set of helpers for sparse inode enabled
> > > filesystems
> > > 
> > 
> > We could do that for the insert/update helpers, but note again the
> > separate helpers would not split along the lines of sparse vs full
> > chunks. Even if we were to change the design such that they do, that
> > would have th effect of complicating some of the subsequent code. For
> 
> I suspect it might, but I really don't like the idea of writing
> fields that don't actually exist (and hence are always zero) in
> the on-disk format to disk. See, for example, the v5 superblock
> field writes are conditional in xfs_sb_to_disk, the v3 inode field
> writes are conditional in xfs_dinode_to_disk, etc.
> 

Indeed. The point here is just that a v1/v2 differentiation might be
more relevant than full vs. sparse. As it is, a sparse inode enabled fs
uses all of the fields (e.g., ir_count == 64 on a full chunk). So this
is more about providing clean backwards compatibility than deciding
between two different "allocation types," so to speak.

Beyond that, I'm more just probing for a clean use of the new on-disk
format distinction. The references above actually use the associated
feature checks in a single helper as opposed to a separate one, and that
seems a bit more clean an approach to me.

> > example, the merge/update code currently has no reason to care about
> > whether it has created a full chunk out of multiple sparse chunks. This
> > would require more code to make that distinction simply to pick the
> > correct api, for something that can easily be done with a simple generic
> > helper. The same goes for things like the finobt, where now
> > lookup/update/insert has to make distinctions depending on the type of
> > inode record.
> 
> That was another thing that wasn't clear to me - does the finobt
> record format change on disk? I don't think it does, as it only
> holds a free count and a free inode bitmask, so it doesn't care
> which bits of the chunk are allocated or not....
> 

No, it shouldn't change the format. But if we have a separate set of
inobt helpers, it has to use one or the other depending on the type of
record holding the inode that is freed or allocated.

> > An alternate approach could be to stick with the generic helpers, but
> > try and tweak them to use the appropriate on-disk format depending on
> > the record. Then again, how can one really identify one from the other
> > in the lookup or _get_rec() scenarios?
> 
> Solving these problem was exactly why I suggested different helpers
> - the btree ops structure that is chosen for the cursor is in a
> place where we know what format we are using, and that avoids
> needing "if (xfs_sb_version...)" checks in the helpers to determine
> what format we need to use.
> 

Hmm, I'm not sure I follow the train of thought here. The examples above
use exactly that kind of logic within the helper. Further, some of the
inobt helpers (e.g., lookup, insert_rec() shown below) use the in-core
record, so there really is no separate format at that level.

> > > > @@ -118,6 +124,8 @@ xfs_inobt_insert_rec(
> > > >  	xfs_inofree_t		free,
> > > >  	int			*stat)
> > > >  {
> > > > +	cur->bc_rec.i.ir_holemask = 0;
> > > > +	cur->bc_rec.i.ir_count = 0; /* zero for backwards compatibility */
> > > >  	cur->bc_rec.i.ir_freecount = freecount;
> > > >  	cur->bc_rec.i.ir_free = free;
> > > >  	return xfs_btree_insert(cur, stat);
> > > 
> > > That would make this sort of thing very clear - this doesn't look
> > > like it would work for a sparse chunk record...
> > 
> > Right... but this is just a plumbing patch and effectively a placeholder
> > for the subsequent patches that implement the mechanism. I suppose the
> > api could have been pulled back sooner into this patch, but I'd rather
> > not reshuffle things just for that intermediate state. That context
> > probably wasn't quite clear to me at the time.
> > 
> > Before I go one way or another here with regard to the on-disk data
> > structure, care to take a look at the subsequent patch(es) that use
> > these helpers? Patch 10 in particular pretty much sums up how these
> > helpers are used for sparse inode record management.
> 
> Yup, I'll get to it.
> 

Thanks. I've made a quick pass and most of it makes sense save some of
the helper business we're discussing here. I can start into some of the
higher level refactoring, try to incorporate the on-disk record format
union as clean as I can and we can move forward using the code of the
next version, if that's easier. ;)

> > FWIW, comments on the generic bitmap stuff is also appreciated as that's
> > another area I'm not totally convinved is done right. :)
> 
> Why use the generic bitmap stuff rather than the existing XFS code?
> If that's just an in-memory change to the free mask processing, then
> it can be done separately to the sparse inode functionality, and
> probably should be done first in the series....
> 

http://oss.sgi.com/archives/xfs/2014-07/msg00380.html

Last comment in that mail. ;) To be honest, I liked the original code
better because it didn't require the bitmap_to_u64() helper hack and
ports directly to userspace (so far I've managed to get away with
dropping that stuff entirely in userspace). Alternatively, I've done
significantly more testing with the latest code than the initial version
so I'd rather not change it back unless a real need arises.

Brian

> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 08/18] xfs: helpers to convert holemask to/from generic bitmap
  2015-02-08 23:54   ` Dave Chinner
@ 2015-02-09 15:15     ` Brian Foster
  0 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-09 15:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Feb 09, 2015 at 10:54:48AM +1100, Dave Chinner wrote:
> On Fri, Feb 06, 2015 at 02:52:55PM -0500, Brian Foster wrote:
> > The inobt record holemask field is a condensed data type designed to fit
> > into the existing on-disk record and is zero based (allocated regions
> > are set to 0, sparse regions are set to 1) to provide backwards
> > compatibility. This makes the type somewhat complex for use in higher
> > level inode manipulations such as individual inode allocation, etc.
> > 
> > Rather than foist the complexity of dealing with this field to every bit
> > of logic that requires inode granular information, create the
> > xfs_inobt_ialloc_bitmap() helper to convert the holemask to an inode
> > allocation bitmap. The inode allocation bitmap is inode granularity
> > similar to the inobt record free mask and indicates which inodes of the
> > chunk are physically allocated on disk, irrespective of whether the
> > inode is considered allocated or free by the filesystem. The generic
> > bitmap type requires the use of generic kernel bitmap APIs.
> > 
> > The bitmap_to_u64() helper is provided to convert the bitmap back to a
> > native 64-bit type (for native bitwise operations). This is required
> > because the generic bitmap code represents a bitmap as an array of
> > unsigned long in a little endian style (though each array value is cpu
> > order). To ensure compatibility with various word sizes and endianness',
> > bitmap_to_u64() exports the bitmap to little endian and swaps back to
> > cpu byte order.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_ialloc.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 69 insertions(+)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index 72ade0e..fc001d9 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -39,6 +39,8 @@
> >  #include "xfs_icache.h"
> >  #include "xfs_trace.h"
> >  
> > +STATIC void
> > +xfs_inobt_ialloc_bitmap(unsigned long *, struct xfs_inobt_rec_incore *);
> 
> xfs_inobt_irec_to_allocmap() seems more appropriate for what it does.
> I'd also suggest it should be in xfs_ialloc_btree.c, too.
> 

Ok.

> >  /*
> >   * Allocation group level functions.
> > @@ -739,6 +741,73 @@ xfs_ialloc_get_rec(
> >  	return 0;
> >  }
> >  
> > +static inline uint64_t
> > +bitmap_to_u64(
> > +	const unsigned long	*bitmap,
> > +	int			nbits)
> > +{
> > +	__le64			lebitmap;
> > +
> > +	ASSERT(nbits <= 64);
> > +
> > +	/*
> > +	 * The bitmap format depends on the native word size. E.g., bit 0 could
> > +	 * land in the middle of the 64 bits on a big endian 32-bit arch (see
> > +	 * arch/powerpc/include/asm/bitops.h). To handle this, export the bitmap
> > +	 * as 64-bit little endian and convert back to native byte order.
> > +	 */
> > +	bitmap_copy_le(&lebitmap, bitmap, nbits);
> > +	return le64_to_cpu(lebitmap);
> > +}
> 
> Ok, so this is for doing logic operations on the bitmap, such as
> inverting or masking out a bunch of bits?
> 

Yes, generally to convert to inode granularity for whatever operations
require it (e.g., find a "real" free inode from ir_free).

> The first use of this function is xfs_inobt_first_free_inode(), and
> the other use is in xfs_difree_inobt() to build the mask of inodes
> in the chunk that need to be freed, and in both those cases they do
> 
> 	xfs_inobt_ialloc_bitmap(alloc, rec)
> 	allocmask = bitmap_to_u64(alloc, 64);
> 
> and the local stack bitmap is otherwise unused. So, wouldn't a
> helper like:
> 
> /*
>  * Return a host format mask of all the allocated inodes in the
>  * chunk. The bitmap format depends on the native word size (e.g.
>  * see arch/powerpc/include/asm/bitops.h) and so we have to marshall
>  * the bitmap through a defined endian conversion so that we can
>  * perform host native 64 bit logic operations on the resultant
>  * allocation mask.
>  *
>  * A bit value of 1 means the inode is allocated, a value of 0 means it
>  * is free.
>  */
> u64
> xfs_inobt_irec_to_allocmask(
> 	struct xfs_inobt_rec_incore *irec)
> {
> 	DECLARE_BITMAP(allocmap, 64),
> 	__le64			lebitmap;
> 
> 	xfs_inobt_rec_to_allocmap(&allocmap, irec);
> 	bitmap_copy_le(&lebitmap, allocmap, 64);
> 	return le64_to_cpu(lebitmap);
> }
> 

Yeah, I went back and forth with doing the conversion within the
original helper. We ultimately end up with a couple more callers that do
use the generic bitmap, but this could be helpful to those that don't.

> > +
> > +/*
> > + * Convert the inode record holemask to an inode allocation bitmap. The inode
> > + * allocation bitmap is inode granularity and specifies whether an inode is
> > + * physically allocated on disk (not whether the inode is considered allocated
> > + * or free by the fs).
> > + *
> > + * We have to be careful with regard to byte order and word size since the
> > + * generic bitmap code uses primitive types.
> 
> " a bit value of 1 means the inode is allocated, a value of 0 means
> it is free"
> 

Ok.

> > + */
> > +STATIC void
> > +xfs_inobt_ialloc_bitmap(
> > +	unsigned long			*allocbmap,
> > +	struct xfs_inobt_rec_incore	*rec)
> > +{
> > +	int				nextbit;
> > +	DECLARE_BITMAP(holemask, 16);
> > +
> > +	bitmap_zero(allocbmap, 64);
> > +
> > +	/*
> > +	 * bitmaps are represented as an array of unsigned long (each in cpu
> > +	 * byte order). ir_holemask is only 16 bits, so direct assignment is
> > +	 * safe.
> > +	 */
> > +	ASSERT(sizeof(rec->ir_holemask) <= sizeof(holemask[0]));
> 
> BUILD_BUG_ON(sizeof(rec->ir_holemask) <= sizeof(holemask[0]));
> 
> i.e. if we come across a platform where this fails, break the build
> rather than waiting for the unlikely event of someone running a
> debug kernel on that platform...
> 

I assume the logic has to be inverted, but otherwise that's probably a
better idea. Thanks.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 10/18] xfs: allocate sparse inode chunks on full chunk allocation failure
  2015-02-09  1:25   ` Dave Chinner
@ 2015-02-09 15:20     ` Brian Foster
  0 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-09 15:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Feb 09, 2015 at 12:25:19PM +1100, Dave Chinner wrote:
> On Fri, Feb 06, 2015 at 02:52:57PM -0500, Brian Foster wrote:
> > xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode
> > chunk. If all else fails, reduce the allocation to the minimum sparse
> > granularity and attempt to allocate a sparse inode chunk.
> > 
> > If sparse chunk allocation succeeds, check whether an inobt record
> > already exists that can track the chunk. If so, inherit and update the
> > existing record. Otherwise, insert a new record for the sparse chunk.
> > 
> > Update xfs_inobt_insert_rec() to take the holemask as a parameter and
> > set the associated field on disk. Create the xfs_inobt_update_insert()
> > helper to handle the sparse chunk allocation case - insert or update an
> > existing record depending on whether it already exists.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_ialloc.c | 397 +++++++++++++++++++++++++++++++++++++++++++--
> >  fs/xfs/xfs_trace.h         |  47 ++++++
> >  2 files changed, 426 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index fc001d9..090d114 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -122,12 +122,16 @@ xfs_inobt_get_rec(
> >  STATIC int
> >  xfs_inobt_insert_rec(
> >  	struct xfs_btree_cur	*cur,
> > +	__uint16_t		holemask,
> > +	__uint8_t		count,
> >  	__int32_t		freecount,
> >  	xfs_inofree_t		free,
> >  	int			*stat)
> >  {
> > -	cur->bc_rec.i.ir_holemask = 0;
> > -	cur->bc_rec.i.ir_count = 0; /* zero for backwards compatibility */
> > +	ASSERT(count == 0 || xfs_sb_version_hassparseinodes(&cur->bc_mp->m_sb));
> > +
> > +	cur->bc_rec.i.ir_holemask = holemask;
> > +	cur->bc_rec.i.ir_count = count;
> >  	cur->bc_rec.i.ir_freecount = freecount;
> >  	cur->bc_rec.i.ir_free = free;
> >  	return xfs_btree_insert(cur, stat);
> 
> Ok, so I'd definitely prefer two separate helpers here so that we
> don't need asserts or validity checking....
> 

I think that's doable, but goes back to my earlier concern. For one,
this uses the in-core record so I'm not sure what the difference would
be beyond function signature and asserts. Second, this complicates
callers like xfs_difree_finobt(). Which one should it use?

> > @@ -151,6 +155,19 @@ xfs_inobt_insert(
> >  	xfs_agino_t		thisino;
> >  	int			i;
> >  	int			error;
> > +	uint8_t			count;
> > +
> > +	/*
> > +	 * Only set ir_count in the inobt record if the sparse inodes feature is
> > +	 * enabled. If disabled, we must maintain backwards compatibility with
> > +	 * the older inobt record format where the current count and holemask
> > +	 * fields map to the higher order bytes of freecount and thus must be
> > +	 * zeroed.
> > +	 */
> > +	if (xfs_sb_version_hassparseinodes(&mp->m_sb))
> > +		count = XFS_INODES_PER_CHUNK;
> > +	else
> > +		count = 0;
> 
> I'd strongly suggest that we pass the holemask, count and freecount
> from the caller, for reasons that will become obvious later.
> 
> > @@ -164,7 +181,7 @@ xfs_inobt_insert(
> >  		}
> >  		ASSERT(i == 0);
> >  
> > -		error = xfs_inobt_insert_rec(cur, XFS_INODES_PER_CHUNK,
> > +		error = xfs_inobt_insert_rec(cur, 0, count, XFS_INODES_PER_CHUNK,
> >  					     XFS_INOBT_ALL_FREE, &i);
> >  		if (error) {
> >  			xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> 
> And this hunk would become:
> 
> 		if (xfs_sb_version_hassparseinodes(&mp->m_sb)) {
> 			error = xfs_inobt_insert_sprec(cur, holemask, count,
> 						       freecount, &i);
> 						       holemask,
> 		else
> 			error = xfs_inobt_insert_rec(cur, count,
> 						     freecount, &i);
> 

Ok, but as alluded to previously with regard to the union discussion, I
find the naming a little confusing. I see now that you refer to the
record format as sparse whereas I refer to the record content as
(potentially) sparse, so that clears up a bit. ;)

I guess I could get used to that. Thoughts on a v1/v2 naming scheme?
Would we expect the new format to become ubiquitous eventually?

> > @@ -174,8 +191,58 @@ xfs_inobt_insert(
> >  	}
> >  
> >  	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > +	return 0;
> > +}
> >  
> > +/*
> > + * Update or insert a new record based on a sparse inode chunk allocation.
> > + *
> > + * If a record already exists, the new record is an updated version of that
> > + * record based on a merge of sparse inode chunks. Update the record in place.
> > + * Otherwise, insert a new record in the tree. Note that the record to insert
> > + * must already have been aligned and merged, if necessary.
> > + */
> > +STATIC int
> > +xfs_inobt_update_insert(
> > +	struct xfs_mount		*mp,
> > +	struct xfs_trans		*tp,
> > +	struct xfs_buf			*agbp,
> > +	struct xfs_inobt_rec_incore	*rec,
> > +	xfs_btnum_t			btnum)
> > +{
> > +	struct xfs_btree_cur		*cur;
> > +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> > +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> > +	int				i;
> > +	int				error;
> > +
> > +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
> > +
> > +	error = xfs_inobt_lookup(cur, rec->ir_startino, XFS_LOOKUP_EQ, &i);
> > +	if (error)
> > +		goto error;
> > +	if (i == 1) {
> > +		/* found a record, update it with the merged record */
> > +		error = xfs_inobt_update(cur, rec);
> > +		if (error)
> > +			goto error;
> > +		goto out;
> > +	}
> > +
> > +	/* no existing record, insert a new one */
> > +	error = xfs_inobt_insert_rec(cur, rec->ir_holemask, rec->ir_count,
> > +				     rec->ir_freecount, rec->ir_free, &i);
> > +	if (error)
> > +		goto error;
> > +	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> > +
> > +out:
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> >  	return 0;
> > +
> > +error:
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> > +	return error;
> >  }
> 
> I think that the way you've factored out the sparse inode record
> update code can do with some work. I'll come back to this.
> 
> > @@ -358,6 +453,183 @@ xfs_ialloc_inode_init(
> >  }
> >  
> >  /*
> > + * Align a record for a recently allocated sparse chunk. The input is a record
> > + * that describes the unaligned chunk. The record is aligned such that it is fit
> > + * for insertion (or merge) into the on-disk inode btrees.
> > + */
> > +STATIC void
> > +xfs_align_sparse_rec(
> > +	struct xfs_mount		*mp,
> > +	struct xfs_inobt_rec_incore	*rec)
> > +{
> > +	xfs_agblock_t			agbno;
> > +	xfs_agblock_t			mod;
> > +	int				offset;
> > +	uint16_t			allocmask;
> > +
> > +	agbno = XFS_AGINO_TO_AGBNO(mp, rec->ir_startino);
> > +	mod = agbno % mp->m_sb.sb_inoalignmt;
> > +	if (!mod)
> > +		return;
> > +
> > +	/* calculate the inode offset and align startino */
> > +	offset = mod << mp->m_sb.sb_inopblog;
> > +	rec->ir_startino -= offset;
> > +
> > +	/*
> > +	 * Since startino has been aligned down, we have to left shift
> > +	 * ir_holemask such that it continues to represent the same physical
> > +	 * inodes as the unaligned record. The unaligned record by definition
> > +	 * tracks the allocated inodes with the lowest order bits.
> > +	 *
> > +	 * ir_holemask is inverted before the shift such that set bits represent
> > +	 * allocated inodes. This makes it safe for the bit-shift to introduce
> > +	 * zeroes in the lower order bits without corrupting the record.
> > +	 *
> > +	 * Note that no change is required for ir_count, ir_freecount or
> > +	 * ir_free. The count values are not affected by alignment and ir_free
> > +	 * is initialized to 1s for all inodes, sparse or otherwise.
> > +	 */
> > +	allocmask = ~rec->ir_holemask;
> > +	allocmask <<= offset / XFS_INODES_PER_HOLEMASK_BIT;
> > +	rec->ir_holemask = ~allocmask;
> > +}
> 
> I don't understand why the holemask would need changing. I would
> have expected this to be done before the hole mask has been placed
> inside the record. Indeed, putting it into the record first means we
> have to deal with inversions and jump through other hoops.
> 

Yes, the interface was designed around the implementation in v2 and was
further enhanced to handle left or right merges. The latter was not part
of a release of this code, but was implemented such that the merged
record bits were abstracted from record insertion/update. The
implementation was simplified a great deal with the alignment changes we
discussed on irc, but the interface mostly remained as a remnant of the
more complex implementation.

> static void
> xfs_align_sparse_ino(
> 	struct xfs_mount	*mp,
> 	xfs_agino_t		*startino,
> 	uint16_t		*allocmask)
> {
> 	agbno = XFS_AGINO_TO_AGBNO(mp, *startino);
> 	mod = agbno % mp->m_sb.sb_inoalignmt;
> 	if (!mod)
> 		return;
> 
> 	offset = mod << mp->m_sb.sb_inopblog;
> 	*startino -= offset;
> 	*allocmask <<= offset / XFS_INODES_PER_HOLEMASK_BIT;
> }
> 
> And then place the result in the irec....
> 
> > +/*
> > + * Determine whether a sparse inode records can be merged. The inode ranges
> > + * must match and there must be no allocation overlap between the records.
> > + */
> 
> Hmmm - src + target is not great for an operation where both records
> are a source. Perhaps be explict in the comment? e.g.
> 

Another remnant of the old implementation where the src/tgt had more
significance.

> /*
>  * Determine whether the source inode record can be merged into a
>  * target record. Both records must be sparse, the inode ranges
>  * must match and there must be no allocation overlap between the
>  * records.
>  */
> 
> > +STATIC bool
> > +__xfs_inobt_can_merge(
> > +	struct xfs_inobt_rec_incore	*trec,	/* tgt record */
> > +	struct xfs_inobt_rec_incore	*srec)	/* src record */
> 
> If we detect overlap, isn't that an indication of corruption?
> 

Yeah.

> > +{
> > +	DECLARE_BITMAP(talloc, 64);
> > +	DECLARE_BITMAP(salloc, 64);
> > +	DECLARE_BITMAP(tmp, 64);
> > +
> > +	/* records must cover the same inode range */
> > +	if (trec->ir_startino != srec->ir_startino)
> > +		return false;
> > +
> > +	/* both records must be sparse */
> > +	if (!xfs_inobt_issparse(trec->ir_holemask) ||
> > +	    !xfs_inobt_issparse(srec->ir_holemask))
> > +		return false;
> > +
> > +	/* can't exceed capacity of a full record */
> > +	if (trec->ir_count + srec->ir_count > XFS_INODES_PER_CHUNK)
> > +		return false;
> 
> probably also need to check that both records have inodes allocated
> in them. ir_count == 0 is indicative of a corrupt record, right?
> 

Yeah, good point.

> > +
> > +	/* verify there is no allocation overlap */
> > +	xfs_inobt_ialloc_bitmap(talloc, trec);
> > +	xfs_inobt_ialloc_bitmap(salloc, srec);
> > +
> > +	bitmap_and(tmp, salloc, talloc, 64);
> > +	if (!bitmap_empty(tmp, 64))
> > +		return false;
> 
> And if one has an ir_count of zero, the bitmap will always be
> zero....
> 

This is based on holemask which is indirectly checked by the issparse()
checks above. The ir_count check still makes sense, regardless.

> > +/*
> > + * Merge two sparse inode records. The caller must call __xfs_inobt_can_merge()
> > + * to ensure the merge is valid.
> > + */
> 
> same thing about src/target here.
> 

Ok.

> > +STATIC void
> > +__xfs_inobt_rec_merge(
> > +	struct xfs_inobt_rec_incore	*trec,	/* target */
> > +	struct xfs_inobt_rec_incore	*srec)	/* src */
> > +{
> > +	ASSERT(trec->ir_startino == srec->ir_startino);
> > +
> > +	/* combine the counts */
> > +	trec->ir_count += srec->ir_count;
> > +	trec->ir_freecount += srec->ir_freecount;
> > +
> > +	/* merge the holemask */
> > +	trec->ir_holemask &= srec->ir_holemask;
> 
> Comments in this function are not very useful. Please remind me:
> does the holemask use bit value of 1 as allocated or as a hole?
> My head is full of allocmask stuff and so I can't tell if a merge
> should use |= or &= to get the correct value.
> 

Ok.

> > +	/* merge the free mask */
> > +	trec->ir_free &= srec->ir_free;
> 
> Same here.
> 
> > +}
> > +
> > +/*
> > + * Determine whether a newly allocated sparse inode chunk record overlaps with
> > + * an existing sparse record in the inobt. When sparse inode chunks are enabled,
> > + * all inode chunk alignment is increased from cluster size to physical inode
> > + * chunk size. This means that the smallest, non-zero gap between two inode
> > + * chunks is at least one full inode chunk. When a sparse inode chunk is
> > + * allocated, the containing record is also aligned in this manner such that
> > + * future sparse allocations within that same range all align to the same record
> > + * startino. This alignment policy supports the ability to merge sparse chunks
> > + * into complete chunks over time.
> > + *
> > + * Given an newly allocated/aligned sparse inode record, look up whether a
> > + * sparse record already exists at this startino. If so, merge the two records
> > + * and return the merged record in nrec.
> > + *
> > + * An error is returned if records overlap but a merge is not possible. Given
> > + * the alignment constraints described above, this should never happen and thus
> > + * is treated as fs corruption.
> > + */
> > +STATIC int
> > +xfs_inobt_rec_merge(
> > +	struct xfs_mount		*mp,
> > +	struct xfs_trans		*tp,
> > +	struct xfs_buf			*agbp,
> > +	xfs_btnum_t			btnum,
> > +	struct xfs_inobt_rec_incore	*nrec)	/* in/out: new/merged rec. */
> > +{
> > +	struct xfs_btree_cur		*cur;
> > +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agbp);
> > +	xfs_agnumber_t			agno = be32_to_cpu(agi->agi_seqno);
> > +	int				error;
> > +	int				i;
> > +	struct xfs_inobt_rec_incore	rec;
> > +
> > +	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
> > +
> > +	/* the new record is pre-aligned so we know where to look */
> > +	error = xfs_inobt_lookup(cur, nrec->ir_startino, XFS_LOOKUP_EQ, &i);
> > +	if (error)
> > +		goto error;
> > +	/* if nothing there, we're done */
> > +	if (i == 0)
> > +		goto out;
> > +
> > +	error = xfs_inobt_get_rec(cur, &rec, &i);
> > +	if (error)
> > +		goto error;
> > +	XFS_WANT_CORRUPTED_GOTO(i == 1, error);
> > +	ASSERT(rec.ir_startino == nrec->ir_startino);
> 
> XFS_WANT_CORRUPTED_GOTO() is better than an assert here. On a debug
> kernel, it will still assert, but on a production kernel it will
> trigger a corruption error rather than potentially propagating a
> corruption further.
> 

Indeed.

> > +	/*
> > +	 * This should never happen. If we have coexisting records that cannot
> > +	 * merge, something is seriously wrong.
> > +	 */
> > +	if (!__xfs_inobt_can_merge(nrec, &rec)) {
> > +		error = -EFSCORRUPTED;
> > +		goto error;
> > +	}
> 
> Ah, so it is a corruption issue if merges can't occur. ;)
> 
> 	XFS_WANT_CORRUPTED_GOTO(!__xfs_inobt_can_merge(nrec, &rec));
> 
> > @@ -514,6 +828,65 @@ xfs_ialloc_ag_alloc(
> >  	 * Convert the results.
> >  	 */
> >  	newino = XFS_OFFBNO_TO_AGINO(args.mp, args.agbno, 0);
> > +
> > +	if (xfs_inobt_issparse(~allocmask)) {
> > +		/*
> > +		 * We've allocated a sparse chunk...
> > +		 */
> > +		rec.ir_startino = newino;
> > +		rec.ir_holemask = ~allocmask;
> > +		rec.ir_count = newlen;
> > +		rec.ir_freecount = newlen;
> > +		rec.ir_free = XFS_INOBT_ALL_FREE;
> > +
> > +		/* align record and update newino for agi_newino */
> > +		xfs_align_sparse_rec(args.mp, &rec);
> > +		newino = rec.ir_startino;
> 
> See my earlier comments about aligning before putting stuff into
> the record, especially as you are having to pull modified
> information straight back out of the record.
> 
> 
> > +		error = xfs_inobt_rec_merge(args.mp, tp, agbp, XFS_BTNUM_INO,
> > +					    &rec);
> > +		if (!error)
> > +			error = xfs_inobt_rec_check_count(args.mp, &rec);
> > +		if (error == -EFSCORRUPTED) {
> > +			xfs_alert(args.mp,
> > +	"invalid sparse inode record: ino 0x%llx holemask 0x%x count %u",
> > +				  XFS_AGINO_TO_INO(args.mp, agno,
> > +						   rec.ir_startino),
> > +				  rec.ir_holemask, rec.ir_count);
> > +			xfs_force_shutdown(args.mp, SHUTDOWN_CORRUPT_INCORE);
> > +		}
> > +		if (error)
> > +			return error;
> > +
> > +		error = xfs_inobt_update_insert(args.mp, tp, agbp, &rec,
> > +						XFS_BTNUM_INO);
> > +		if (error)
> > +			return error;
> > +
> > +		if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
> > +			error = xfs_inobt_update_insert(args.mp, tp, agbp, &rec,
> > +							XFS_BTNUM_FINO);
> > +			if (error)
> > +				return error;
> > +		}
> 
> I think this needs to become a helper function, and the
> implementation refactored. The reason I say this is that the
> xfs_inobt_rec_merge() function already looks up the record that is
> to be updated, and if it exists if calculates the merged record
> value. If then drops all that state and returns the merged record.
> 
> Then we call xfs_inobt_update_insert() to determine if we have to
> update the existing record with the merged record (i.e. build all
> the state again), and then call xfs_inobt_update() on that record.
> 
> Essentially, if xfs_inobt_rec_merge() determines we have a merge
> candidate, we should do the btree update there immediately, as well
> as fix up the finobt record.
> 

This also sort of derives from the previous implementation where the
merge could have adjusted the startino of the record via left and right
searching, etc.

Perhaps we can use a 'merged' parameter or something of that nature here
to identify whether the record was merged and updated. Note that a merge
to the inobt does not guarantee a merge to the finobt, however. An
insert is required if the existing record had no free inodes. That was
part of the reason for the update or insert helper. Given that, I'm not
totally sure we can refactor in exactly this fashion, but I'll give it a
try and see what falls out.

> If we aren't doing a record merge and update, then it's just an
> insert operation, and we can use the modified xfs_inobt_insert()
> function I mentioned earlier to pass the partial chunk record
> information for insertion. Perhaps even just pass the irec
> structure...
> 

In general, I think creating the irec asap and passing that around as
necessary (for shifting/merging/updates/etc.) is probably cleaner than
sending around all the individual values.

Brian

> > +	} else {
> > +		/* full chunk - insert new records to both btrees */
> > +		error = xfs_inobt_insert(args.mp, tp, agbp, newino, newlen,
> > +					 XFS_BTNUM_INO);
> > +		if (error)
> > +			return error;
> > +
> > +		if (xfs_sb_version_hasfinobt(&args.mp->m_sb)) {
> > +			error = xfs_inobt_insert(args.mp, tp, agbp, newino,
> > +						 newlen, XFS_BTNUM_FINO);
> > +			if (error)
> > +				return error;
> > +		}
> 
> And this could pass an irec that indicates a full inode chunk rather
> than just the subset of inofmration needed to describe a full inode
> chunk.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 11/18] xfs: randomly do sparse inode allocations in DEBUG mode
  2015-02-08 23:02   ` Dave Chinner
@ 2015-02-09 15:20     ` Brian Foster
  0 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-09 15:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Feb 09, 2015 at 10:02:32AM +1100, Dave Chinner wrote:
> On Fri, Feb 06, 2015 at 02:52:58PM -0500, Brian Foster wrote:
> > Sparse inode allocations generally only occur when full inode chunk
> > allocation fails. This requires some level of filesystem space usage and
> > fragmentation.
> > 
> > For filesystems formatted with sparse inode chunks enabled, do random
> > sparse inode chunk allocs when compiled in DEBUG mode to increase test
> > coverage.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_ialloc.c | 19 +++++++++++++++++--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index 090d114..3e5d3eb 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -652,9 +652,18 @@ xfs_ialloc_ag_alloc(
> >  
> >  	struct xfs_perag *pag;
> >  
> > +#ifdef DEBUG
> > +	int		do_sparse = 0;
> > +
> > +	/* randomly do sparse inode allocations */
> > +	if (xfs_sb_version_hassparseinodes(&tp->t_mountp->m_sb))
> > +		do_sparse = prandom_u32() & 1;
> > +#endif
> > +
> 
> Bit ugly with all the ifdefs. If you define the do_sparse variable
> outside the ifdef, then the rest of the code other than this check
> doesn't need ifdefs.
> 

Ok.

> >  	memset(&args, 0, sizeof(args));
> >  	args.tp = tp;
> >  	args.mp = tp->t_mountp;
> > +	args.fsbno = NULLFSBLOCK;
> >  
> >  	/*
> >  	 * Locking will ensure that we don't have two callers in here
> > @@ -675,6 +684,10 @@ xfs_ialloc_ag_alloc(
> >  	agno = be32_to_cpu(agi->agi_seqno);
> >  	args.agbno = XFS_AGINO_TO_AGBNO(args.mp, newino) +
> >  		     args.mp->m_ialloc_blks;
> > +#ifdef DEBUG
> > +	if (do_sparse)
> > +		goto sparse_alloc;
> > +#endif
> >  	if (likely(newino != NULLAGINO &&
> >  		  (args.agbno < be32_to_cpu(agi->agi_length)))) {
> >  		args.fsbno = XFS_AGB_TO_FSB(args.mp, agno, args.agbno);
> > @@ -713,8 +726,7 @@ xfs_ialloc_ag_alloc(
> >  		 * subsequent requests.
> >  		 */
> >  		args.minalignslop = 0;
> > -	} else
> > -		args.fsbno = NULLFSBLOCK;
> > +	}
> >  
> >  	if (unlikely(args.fsbno == NULLFSBLOCK)) {
> >  		/*
> > @@ -769,6 +781,9 @@ xfs_ialloc_ag_alloc(
> >  	 * Finally, try a sparse allocation if the filesystem supports it and
> >  	 * the sparse allocation length is smaller than a full chunk.
> >  	 */
> > +#ifdef DEBUG
> > +sparse_alloc:
> > +#endif
> >  	if (xfs_sb_version_hassparseinodes(&args.mp->m_sb) &&
> >  	    args.mp->m_ialloc_min_blks < args.mp->m_ialloc_blks &&
> >  	    args.fsbno == NULLFSBLOCK) {
> 
> The label can go after the if() statement, right? We've already
> guaranteed all those other parameters are true, though I suspect
> there's a case where that m_ialloc_min_blks < m_ialloc_blks will
> fail: block size larger than inode chunk size (e.g. 64k block size, 512
> byte inodes) so that would result in the code above failing to
> allocate any inode chunks at all...
> 

Hmm, yeah I added that blks comparison recently simply to prevent
another allocation attempt on such systems where the block size is >=
the chunk size.

This changes behavior of such systems either way, but debug mode has
that effect regardless I suppose. We hit a codepath that we wouldn't
normally hit in that large bsize configuration, but it's debug mode and
the alignment/size of the allocation should be equivalent to the
original allocation code so I think it should be fine to move the label.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2015-02-09 15:15         ` Brian Foster
@ 2015-02-09 21:48           ` Dave Chinner
  2015-02-10 12:58             ` Brian Foster
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2015-02-09 21:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Mon, Feb 09, 2015 at 10:15:02AM -0500, Brian Foster wrote:
> On Mon, Feb 09, 2015 at 08:57:10AM +1100, Dave Chinner wrote:
> > On Sun, Feb 08, 2015 at 11:06:01AM -0500, Brian Foster wrote:
> > > > > @@ -82,7 +84,9 @@ xfs_inobt_update(
> > > > >  	union xfs_btree_rec	rec;
> > > > >  
> > > > >  	rec.inobt.ir_startino = cpu_to_be32(irec->ir_startino);
> > > > > -	rec.inobt.ir_freecount = cpu_to_be32(irec->ir_freecount);
> > > > > +	rec.inobt.ir_holemask = cpu_to_be16(irec->ir_holemask);
> > > > > +	rec.inobt.ir_count = irec->ir_count;
> > > > > +	rec.inobt.ir_freecount = irec->ir_freecount;
> > > > >  	rec.inobt.ir_free = cpu_to_be64(irec->ir_free);
> > > > >  	return xfs_btree_update(cur, &rec);
> > > > >  }
> > > > 
> > > > Hmmm - perhaps a similar set of helpers for sparse inode enabled
> > > > filesystems
> > > > 
> > > 
> > > We could do that for the insert/update helpers, but note again the
> > > separate helpers would not split along the lines of sparse vs full
> > > chunks. Even if we were to change the design such that they do, that
> > > would have th effect of complicating some of the subsequent code. For
> > 
> > I suspect it might, but I really don't like the idea of writing
> > fields that don't actually exist (and hence are always zero) in
> > the on-disk format to disk. See, for example, the v5 superblock
> > field writes are conditional in xfs_sb_to_disk, the v3 inode field
> > writes are conditional in xfs_dinode_to_disk, etc.
> > 
> 
> Indeed. The point here is just that a v1/v2 differentiation might be
> more relevant than full vs. sparse. As it is, a sparse inode enabled fs
> uses all of the fields (e.g., ir_count == 64 on a full chunk). So this
> is more about providing clean backwards compatibility than deciding
> between two different "allocation types," so to speak.
> 
> Beyond that, I'm more just probing for a clean use of the new on-disk
> format distinction. The references above actually use the associated
> feature checks in a single helper as opposed to a separate one, and that
> seems a bit more clean an approach to me.

BUt that's because those object have enough information internally
to determine what the correct action to take is. An inobt record
doesn't have the internal information to say whay format it is
supposed to be in. Hence we need to enforce the use of the correct
marshalling logic in some way...

> > > example, the merge/update code currently has no reason to care about
> > > whether it has created a full chunk out of multiple sparse chunks. This
> > > would require more code to make that distinction simply to pick the
> > > correct api, for something that can easily be done with a simple generic
> > > helper. The same goes for things like the finobt, where now
> > > lookup/update/insert has to make distinctions depending on the type of
> > > inode record.
> > 
> > That was another thing that wasn't clear to me - does the finobt
> > record format change on disk? I don't think it does, as it only
> > holds a free count and a free inode bitmask, so it doesn't care
> > which bits of the chunk are allocated or not....
> > 
> 
> No, it shouldn't change the format. But if we have a separate set of
> inobt helpers, it has to use one or the other depending on the type of
> record holding the inode that is freed or allocated.

Ok, so we now have 3 instances of inode btrees that use the current
format, and one that uses the new sparse record format. (i.e.
existing inobt, existing finobt and sparse inode enable finobt vs
sparse inode inobt). That really, to me, says we need clear
demarcation....

> > > An alternate approach could be to stick with the generic helpers, but
> > > try and tweak them to use the appropriate on-disk format depending on
> > > the record. Then again, how can one really identify one from the other
> > > in the lookup or _get_rec() scenarios?
> > 
> > Solving these problem was exactly why I suggested different helpers
> > - the btree ops structure that is chosen for the cursor is in a
> > place where we know what format we are using, and that avoids
> > needing "if (xfs_sb_version...)" checks in the helpers to determine
> > what format we need to use.
> > 
> 
> Hmm, I'm not sure I follow the train of thought here. The examples above
> use exactly that kind of logic within the helper. Further, some of the

They use internal object state (i.e. inode version number,
superblock flag field in sb being translated) to make that decision.
the inobt changes use external structure state to do the
conversion...

> > > Before I go one way or another here with regard to the on-disk data
> > > structure, care to take a look at the subsequent patch(es) that use
> > > these helpers? Patch 10 in particular pretty much sums up how these
> > > helpers are used for sparse inode record management.
> > 
> > Yup, I'll get to it.
> > 
> 
> Thanks. I've made a quick pass and most of it makes sense save some of
> the helper business we're discussing here. I can start into some of the
> higher level refactoring, try to incorporate the on-disk record format
> union as clean as I can and we can move forward using the code of the
> next version, if that's easier. ;)

Yup, that's fine - refactoring the code might make a difference ;)

> > > FWIW, comments on the generic bitmap stuff is also appreciated as that's
> > > another area I'm not totally convinved is done right. :)
> > 
> > Why use the generic bitmap stuff rather than the existing XFS code?
> > If that's just an in-memory change to the free mask processing, then
> > it can be done separately to the sparse inode functionality, and
> > probably should be done first in the series....
> > 
> 
> http://oss.sgi.com/archives/xfs/2014-07/msg00380.html
> 
> Last comment in that mail. ;) To be honest, I liked the original code
> better because it didn't require the bitmap_to_u64() helper hack and

I didn't realise there were arch-specific warts in the generic
bitmap code.

> ports directly to userspace (so far I've managed to get away with
> dropping that stuff entirely in userspace). Alternatively, I've done
> significantly more testing with the latest code than the initial version
> so I'd rather not change it back unless a real need arises.

Well, either way I'd like to avoid nasty conversions if it can be
avoided. Perhaps look at the in-memory bitmaps being held in LE
format (i.e. __le64) and using the primitives in
include/asm-generic/bitops/le.h to manipulate them (i.e
find_next_zero_bit_le() and friends)? Maybe that would avoid all the
problems of architectural layout of the bitmap fields and hence make
conversion simple (i.e. just a le64_to_cpu() call?)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks
  2015-02-09 21:48           ` Dave Chinner
@ 2015-02-10 12:58             ` Brian Foster
  0 siblings, 0 replies; 41+ messages in thread
From: Brian Foster @ 2015-02-10 12:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Feb 10, 2015 at 08:48:15AM +1100, Dave Chinner wrote:
> On Mon, Feb 09, 2015 at 10:15:02AM -0500, Brian Foster wrote:
> > On Mon, Feb 09, 2015 at 08:57:10AM +1100, Dave Chinner wrote:
> > > On Sun, Feb 08, 2015 at 11:06:01AM -0500, Brian Foster wrote:
> > > > > > @@ -82,7 +84,9 @@ xfs_inobt_update(
> > > > > >  	union xfs_btree_rec	rec;
> > > > > >  
> > > > > >  	rec.inobt.ir_startino = cpu_to_be32(irec->ir_startino);
> > > > > > -	rec.inobt.ir_freecount = cpu_to_be32(irec->ir_freecount);
> > > > > > +	rec.inobt.ir_holemask = cpu_to_be16(irec->ir_holemask);
> > > > > > +	rec.inobt.ir_count = irec->ir_count;
> > > > > > +	rec.inobt.ir_freecount = irec->ir_freecount;
> > > > > >  	rec.inobt.ir_free = cpu_to_be64(irec->ir_free);
> > > > > >  	return xfs_btree_update(cur, &rec);
> > > > > >  }
> > > > > 
> > > > > Hmmm - perhaps a similar set of helpers for sparse inode enabled
> > > > > filesystems
> > > > > 
> > > > 
> > > > We could do that for the insert/update helpers, but note again the
> > > > separate helpers would not split along the lines of sparse vs full
> > > > chunks. Even if we were to change the design such that they do, that
> > > > would have th effect of complicating some of the subsequent code. For
> > > 
> > > I suspect it might, but I really don't like the idea of writing
> > > fields that don't actually exist (and hence are always zero) in
> > > the on-disk format to disk. See, for example, the v5 superblock
> > > field writes are conditional in xfs_sb_to_disk, the v3 inode field
> > > writes are conditional in xfs_dinode_to_disk, etc.
> > > 
> > 
> > Indeed. The point here is just that a v1/v2 differentiation might be
> > more relevant than full vs. sparse. As it is, a sparse inode enabled fs
> > uses all of the fields (e.g., ir_count == 64 on a full chunk). So this
> > is more about providing clean backwards compatibility than deciding
> > between two different "allocation types," so to speak.
> > 
> > Beyond that, I'm more just probing for a clean use of the new on-disk
> > format distinction. The references above actually use the associated
> > feature checks in a single helper as opposed to a separate one, and that
> > seems a bit more clean an approach to me.
> 
> BUt that's because those object have enough information internally
> to determine what the correct action to take is. An inobt record
> doesn't have the internal information to say whay format it is
> supposed to be in. Hence we need to enforce the use of the correct
> marshalling logic in some way...
> 

Sure, some structures have version information that is used to translate
the in-core structure appropriately, but I don't see the existence or
lack of such information as a trigger for using one pattern or another
to do so. xfs_sb_to_disk(), xfs_bmdr_to_bmbt(), are a couple examples
that use the same kind of pattern based on a feature bit.

> > > > example, the merge/update code currently has no reason to care about
> > > > whether it has created a full chunk out of multiple sparse chunks. This
> > > > would require more code to make that distinction simply to pick the
> > > > correct api, for something that can easily be done with a simple generic
> > > > helper. The same goes for things like the finobt, where now
> > > > lookup/update/insert has to make distinctions depending on the type of
> > > > inode record.
> > > 
> > > That was another thing that wasn't clear to me - does the finobt
> > > record format change on disk? I don't think it does, as it only
> > > holds a free count and a free inode bitmask, so it doesn't care
> > > which bits of the chunk are allocated or not....
> > > 
> > 
> > No, it shouldn't change the format. But if we have a separate set of
> > inobt helpers, it has to use one or the other depending on the type of
> > record holding the inode that is freed or allocated.
> 
> Ok, so we now have 3 instances of inode btrees that use the current
> format, and one that uses the new sparse record format. (i.e.
> existing inobt, existing finobt and sparse inode enable finobt vs
> sparse inode inobt). That really, to me, says we need clear
> demarcation....
> 
> > > > An alternate approach could be to stick with the generic helpers, but
> > > > try and tweak them to use the appropriate on-disk format depending on
> > > > the record. Then again, how can one really identify one from the other
> > > > in the lookup or _get_rec() scenarios?
> > > 
> > > Solving these problem was exactly why I suggested different helpers
> > > - the btree ops structure that is chosen for the cursor is in a
> > > place where we know what format we are using, and that avoids
> > > needing "if (xfs_sb_version...)" checks in the helpers to determine
> > > what format we need to use.
> > > 
> > 
> > Hmm, I'm not sure I follow the train of thought here. The examples above
> > use exactly that kind of logic within the helper. Further, some of the
> 
> They use internal object state (i.e. inode version number,
> superblock flag field in sb being translated) to make that decision.
> the inobt changes use external structure state to do the
> conversion...
> 
> > > > Before I go one way or another here with regard to the on-disk data
> > > > structure, care to take a look at the subsequent patch(es) that use
> > > > these helpers? Patch 10 in particular pretty much sums up how these
> > > > helpers are used for sparse inode record management.
> > > 
> > > Yup, I'll get to it.
> > > 
> > 
> > Thanks. I've made a quick pass and most of it makes sense save some of
> > the helper business we're discussing here. I can start into some of the
> > higher level refactoring, try to incorporate the on-disk record format
> > union as clean as I can and we can move forward using the code of the
> > next version, if that's easier. ;)
> 
> Yup, that's fine - refactoring the code might make a difference ;)
> 
> > > > FWIW, comments on the generic bitmap stuff is also appreciated as that's
> > > > another area I'm not totally convinved is done right. :)
> > > 
> > > Why use the generic bitmap stuff rather than the existing XFS code?
> > > If that's just an in-memory change to the free mask processing, then
> > > it can be done separately to the sparse inode functionality, and
> > > probably should be done first in the series....
> > > 
> > 
> > http://oss.sgi.com/archives/xfs/2014-07/msg00380.html
> > 
> > Last comment in that mail. ;) To be honest, I liked the original code
> > better because it didn't require the bitmap_to_u64() helper hack and
> 
> I didn't realise there were arch-specific warts in the generic
> bitmap code.
> 
> > ports directly to userspace (so far I've managed to get away with
> > dropping that stuff entirely in userspace). Alternatively, I've done
> > significantly more testing with the latest code than the initial version
> > so I'd rather not change it back unless a real need arises.
> 
> Well, either way I'd like to avoid nasty conversions if it can be
> avoided. Perhaps look at the in-memory bitmaps being held in LE
> format (i.e. __le64) and using the primitives in
> include/asm-generic/bitops/le.h to manipulate them (i.e
> find_next_zero_bit_le() and friends)? Maybe that would avoid all the
> problems of architectural layout of the bitmap fields and hence make
> conversion simple (i.e. just a le64_to_cpu() call?)
> 

Hmm, interesting. On a quick pass, it doesn't appear to provide some of
the additional functions (e.g., weight, and) beyond set/clear/test/find.
I suspect that means we'd end up open coding such things, possibly
introducing more cpu<->le conversions to go back and forth between
native bitwise ops and the ops provided by the utility functions.

I'll see how it looks with the changes suggested so far and we can
consider whether it's worth converting back from there...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2015-02-10 12:58 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-06 19:52 [PATCH v3 00/18] xfs: sparse inode chunks Brian Foster
2015-02-06 19:52 ` [PATCH v3 01/18] xfs: add sparse inode chunk alignment superblock field Brian Foster
2015-02-06 22:40   ` Dave Chinner
2015-02-08 16:04     ` Brian Foster
2015-02-08 21:43       ` Dave Chinner
2015-02-06 19:52 ` [PATCH v3 02/18] xfs: use sparse chunk alignment for min. inode allocation requirement Brian Foster
2015-02-06 19:52 ` [PATCH v3 03/18] xfs: sparse inode chunks feature helpers and mount requirements Brian Foster
2015-02-06 22:54   ` Dave Chinner
2015-02-06 19:52 ` [PATCH v3 04/18] xfs: introduce inode record hole mask for sparse inode chunks Brian Foster
2015-02-06 23:16   ` Dave Chinner
2015-02-08 16:06     ` Brian Foster
2015-02-08 21:57       ` Dave Chinner
2015-02-09 15:15         ` Brian Foster
2015-02-09 21:48           ` Dave Chinner
2015-02-10 12:58             ` Brian Foster
2015-02-06 19:52 ` [PATCH v3 05/18] xfs: create macros/helpers for dealing with " Brian Foster
2015-02-06 19:52 ` [PATCH v3 06/18] xfs: pass inode count through ordered icreate log item Brian Foster
2015-02-06 19:52 ` [PATCH v3 07/18] xfs: handle sparse inode chunks in icreate log recovery Brian Foster
2015-02-06 23:19   ` Dave Chinner
2015-02-08 16:06     ` Brian Foster
2015-02-06 19:52 ` [PATCH v3 08/18] xfs: helpers to convert holemask to/from generic bitmap Brian Foster
2015-02-08 23:54   ` Dave Chinner
2015-02-09 15:15     ` Brian Foster
2015-02-06 19:52 ` [PATCH v3 09/18] xfs: support min/max agbno args in block allocator Brian Foster
2015-02-09  0:01   ` Dave Chinner
2015-02-06 19:52 ` [PATCH v3 10/18] xfs: allocate sparse inode chunks on full chunk allocation failure Brian Foster
2015-02-09  1:25   ` Dave Chinner
2015-02-09 15:20     ` Brian Foster
2015-02-06 19:52 ` [PATCH v3 11/18] xfs: randomly do sparse inode allocations in DEBUG mode Brian Foster
2015-02-08 23:02   ` Dave Chinner
2015-02-09 15:20     ` Brian Foster
2015-02-06 19:52 ` [PATCH v3 12/18] xfs: filter out sparse regions from individual inode allocation Brian Foster
2015-02-08 22:49   ` Dave Chinner
2015-02-06 19:53 ` [PATCH v3 13/18] xfs: update free inode record logic to support sparse inode records Brian Foster
2015-02-08 22:33   ` Dave Chinner
2015-02-06 19:53 ` [PATCH v3 14/18] xfs: only free allocated regions of inode chunks Brian Foster
2015-02-06 19:53 ` [PATCH v3 15/18] xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() Brian Foster
2015-02-06 19:53 ` [PATCH v3 16/18] xfs: use actual inode count for sparse records in bulkstat/inumbers Brian Foster
2015-02-08 22:29   ` Dave Chinner
2015-02-06 19:53 ` [PATCH v3 17/18] xfs: add fs geometry bit for sparse inode chunks Brian Foster
2015-02-06 19:53 ` [PATCH v3 18/18] xfs: enable sparse inode chunks for v5 superblocks Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.