* [PATCH] xfs: byte range buffer dirty region tracking
@ 2018-02-01  1:05 Dave Chinner
  2018-02-01  5:11 ` Darrick J. Wong
  2018-02-05  0:34 ` [PATCH v2] " Dave Chinner
  0 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2018-02-01  1:05 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

One of the biggest performance problems with large directory block
sizes is the CPU overhead in maintaining the buffer log item dirty
region bitmap.  The bit manipulations and buffer region mapping
calls are right at the top of the profiles when running tests on 64k
directory buffers:

  14.65%  [kernel]             [k] memcpy
   8.57%  [kernel]             [k] xfs_next_bit
   4.96%  [kernel]             [k] xfs_buf_item_format
   4.83%  [kernel]             [k] xfs_buf_item_size_segment.isra.4
   4.44%  [kernel]             [k] xfs_buf_offset

The memcpy is the copying of the dirty regions into the log vec
array, but almost twice as much CPU time is spent working out what
needs to be copied and where it needs to be copied from. As a
result, on a debug kernel running a parallel fsmark file create
workload we see performance like this on a 64k block size directory:

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     175994.3         13120040
     0      3200000            0     167829.7         14089107
     0      4800000            0     159274.7         15217029
     0      6400000            0     150097.3         16543647
....

In contrast, a 4k directory block size returns create rates around
310,000 files/s - almost 3x faster for the same CPU burn.

This patch switches the dirty range tracking to just the first and
last modified bytes in 4 separate regions on the buffer. This only
gets converted to a bitmap when the item is formatted into the CIL
log vector array.  Hence the profile of the relevant formatting
functions now looks like:

  22.21%  [kernel]  [k] memcpy
   0.51%  [kernel]  [k] xfs_buf_item_init
   0.49%  [kernel]  [k] xfs_buf_item_unlock
   0.39%  [kernel]  [k] xfs_buf_item_size
   0.29%  [kernel]  [k] xfs_buf_item_format
   0.20%  [kernel]  [k] xfs_buf_item_log
   0.14%  [kernel]  [k] xfs_buf_item_committing

And the performance is:

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     224963.5         12631894
     0      3200000            0     226142.4         12608851
     0      4800000            0     237453.1         12509915
     0      6400000            0     233356.8         12939907

Substantially higher.

The memcpy time is higher because that's where we spend most of
the CPU we saved - in the buffer formatting routine:

....
       __xfs_trans_commit
        xfs_log_commit_cil
        xfs_buf_item_format
        xfs_buf_item_format_segment
        xfs_buf_iomove
        memcpy

Hence we can see that there is a major reduction in buffer formatting
overhead that translates to improved performance.

The current implementation tracks, at most, four dirty regions per
buffer.  The nature of directory operations results in almost every
operation modifying a header in the buffer, a tail section in the
buffer and then some number of bytes/regions in the middle of the
buffer.

If we just track a single region, it will almost always cover the
entire directory buffer as we do updates to both the head and tail
of most directory buffers.  That's a fairly large cost in terms of
log space and CPU overhead for random individual operations.
Similarly, increasing the number of regions to 8 (from 4) reduces
performance by 5-10%, so the gains from tracking multiple regions
tail off very quickly.
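
To make that concrete, here is a minimal userspace sketch (not part of
this patch - the names and numbers are purely illustrative) of what
single-range tracking degenerates to when one operation touches both
the header and the tail of a 64k directory buffer:

#include <stdint.h>
#include <stdio.h>

struct range {
	uint32_t	first;
	uint32_t	last;
};

/* extend a single tracked range to cover a newly dirtied byte span */
static void
log_single_range(struct range *r, uint32_t first, uint32_t last)
{
	if (first < r->first)
		r->first = first;
	if (last > r->last)
		r->last = last;
}

int
main(void)
{
	struct range	r = { UINT32_MAX, 0 };

	log_single_range(&r, 0, 63);		/* directory header update */
	log_single_range(&r, 65472, 65535);	/* tail/freespace update */

	/* prints "dirty range: 0..65535" - the whole 64k buffer */
	printf("dirty range: %u..%u\n",
			(unsigned)r.first, (unsigned)r.last);
	return 0;
}

With four separate range slots, the header and tail updates stay in
their own small ranges and only the bytes that were actually modified
end up in the log vector.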

We also have to consider non-directory buffer modification patterns.
Freespace, inode and extent btrees are the other major types of
buffers that get logged, but they also have modification patterns
that lend themselves well to a small number of ranges for dirty
tracking. That is, each btree block is kept compact, so when we
insert or remove a record or pointer we shift the higher
records/ptrs up or down as a block, and then log the lot of them.
They also often have a header that is dirtied with each
insert/delete, so there are typically only one or two dirty
ranges in a btree block.

The only metadata type that really seems to benefit from fine
grained dirty range logging is the inode buffer. Specifically, for
v4 superblocks the create transaction only dirties the regions of
the inode core, so for 256 byte inodes it only dirties every
alternate bitmap segment.  Dirty range tracking will double the
required log bandwidth of inode buffers during create (roughly a
25% increase on a 4k directory block size filesystem). Typically
this won't result in a noticeable performance differential (except
in inode creation benchmarks) because the log is generally far from
being bandwidth bound.
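
As a rough back of the envelope illustration of that bandwidth
doubling (assuming 128 byte dirty bitmap chunks, 256 byte inodes and a
4k inode buffer - numbers chosen for illustration, not taken from
measurement):

#include <stdio.h>

int
main(void)
{
	int chunk = 128;	/* dirty bitmap granularity */
	int inode_size = 256;
	int buf_size = 4096;
	int ninodes = buf_size / inode_size;

	/* bitmap tracking: only the first chunk of each inode is dirty */
	int bitmap_bytes = ninodes * chunk;

	/* range tracking: first..last covers the whole buffer */
	int range_bytes = buf_size;

	printf("bitmap: %d bytes, range: %d bytes\n",
			bitmap_bytes, range_bytes);
	return 0;
}

This prints 2048 bytes for bitmap tracking versus 4096 bytes for range
tracking - roughly double the logged data for that buffer type.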

For v5 filesystems, even this isn't an issue because the initialised
inode buffers are XFS_BLI_ORDERED buffers and so their contents
aren't logged.

The same problem happens with unlinks due to the unlinked list being
logged via the inode buffer. Again this results in an increase
in log bandwidth on both v4 and v5 filesystems, but no performance
differential occurs because, again, the log isn't bandwidth bound.
There is also an existing plan to improve the unlinked list logging
(moving it into the inode core transaction), and that will avoid any
extra overhead here as well.

Hence, overall, the CPU reduction benefits of minimal dirty range
tracking versus fine grained dirty bit tracking are going to
be beneficial to performance and throughput on current (v5) format
filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c      |   2 +
 fs/xfs/xfs_buf_item.c | 431 +++++++++++++++++++++++++-------------------------
 fs/xfs/xfs_buf_item.h |  19 +++
 3 files changed, 238 insertions(+), 214 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index d1da2ee9e6db..7621fabeb505 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1583,6 +1583,8 @@ xfs_buf_iomove(
 		page = bp->b_pages[page_index];
 		csize = min_t(size_t, PAGE_SIZE - page_offset,
 				      BBTOB(bp->b_io_length) - boff);
+		if (boff + csize > bend)
+			csize = bend - boff;
 
 		ASSERT((csize + page_offset) <= PAGE_SIZE);
 
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 270ddb4d2313..0629d09406a4 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -66,50 +66,12 @@ xfs_buf_item_size_segment(
 	int				*nvecs,
 	int				*nbytes)
 {
-	struct xfs_buf			*bp = bip->bli_buf;
-	int				next_bit;
-	int				last_bit;
-
-	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
-	if (last_bit == -1)
-		return;
-
 	/*
 	 * initial count for a dirty buffer is 2 vectors - the format structure
-	 * and the first dirty region.
+	 * and the dirty region. Dirty region is accounted for separately.
 	 */
 	*nvecs += 2;
-	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
-
-	while (last_bit != -1) {
-		/*
-		 * This takes the bit number to start looking from and
-		 * returns the next set bit from there.  It returns -1
-		 * if there are no more bits set or the start bit is
-		 * beyond the end of the bitmap.
-		 */
-		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
-					last_bit + 1);
-		/*
-		 * If we run out of bits, leave the loop,
-		 * else if we find a new set of bits bump the number of vecs,
-		 * else keep scanning the current set of bits.
-		 */
-		if (next_bit == -1) {
-			break;
-		} else if (next_bit != last_bit + 1) {
-			last_bit = next_bit;
-			(*nvecs)++;
-		} else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) !=
-			   (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) +
-			    XFS_BLF_CHUNK)) {
-			last_bit = next_bit;
-			(*nvecs)++;
-		} else {
-			last_bit++;
-		}
-		*nbytes += XFS_BLF_CHUNK;
-	}
+	*nbytes += xfs_buf_log_format_size(blfp);
 }
 
 /*
@@ -136,7 +98,9 @@ xfs_buf_item_size(
 	int			*nbytes)
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
-	int			i;
+	struct xfs_buf	*bp = bip->bli_buf;
+	uint			offset;
+	int			i, j;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
@@ -155,6 +119,7 @@ xfs_buf_item_size(
 	}
 
 	ASSERT(bip->bli_flags & XFS_BLI_LOGGED);
+	ASSERT(bip->bli_flags & XFS_BLI_DIRTY);
 
 	if (bip->bli_flags & XFS_BLI_ORDERED) {
 		/*
@@ -169,17 +134,45 @@ xfs_buf_item_size(
 
 	/*
 	 * the vector count is based on the number of buffer vectors we have
-	 * dirty bits in. This will only be greater than one when we have a
+	 * dirty ranges in. This will only be greater than one when we have a
 	 * compound buffer with more than one segment dirty. Hence for compound
-	 * buffers we need to track which segment the dirty bits correspond to,
-	 * and when we move from one segment to the next increment the vector
-	 * count for the extra buf log format structure that will need to be
-	 * written.
+	 * buffers we need to track which segment the dirty ranges correspond
+	 * to, and when we move from one segment to the next increment the
+	 * vector count for the extra buf log format structure that will need to
+	 * be written.
 	 */
-	for (i = 0; i < bip->bli_format_count; i++) {
-		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
-					  nvecs, nbytes);
+	ASSERT(bip->bli_range[0].last != 0);
+	if (bip->bli_range[0].last == 0) {
+		/* clean! */
+		ASSERT(bip->bli_range[0].first == 0);
+		return;
 	}
+
+	for (i = 0, offset = 0;
+	     i < bip->bli_format_count;
+	     i++, offset += BBTOB(bp->b_maps[i].bm_len)) {
+		/* Only format dirty regions */
+		for (j = 0; j < bip->bli_ranges; j++) {
+			struct xfs_bli_range *rp = &bip->bli_range[j];
+
+			/* range ends before segment start, check next range */
+			if (rp->last < offset)
+				continue;
+
+			/* range beyond segment end, check next segment */
+			if (rp->first > offset + BBTOB(bp->b_maps[i].bm_len))
+				break;
+
+			/* dirty range overlaps segment, need headers */
+			xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
+						  nvecs, nbytes);
+		}
+	}
+
+	for (j = 0; j < bip->bli_ranges; j++)
+		*nbytes += bip->bli_range[j].last - bip->bli_range[j].first;
+
+
 	trace_xfs_buf_item_size(bip);
 }
 
@@ -192,7 +185,6 @@ xfs_buf_item_copy_iovec(
 	int			first_bit,
 	uint			nbits)
 {
-	offset += first_bit * XFS_BLF_CHUNK;
 	xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BCHUNK,
 			xfs_buf_offset(bp, offset),
 			nbits * XFS_BLF_CHUNK);
@@ -215,14 +207,18 @@ xfs_buf_item_format_segment(
 	struct xfs_buf_log_item	*bip,
 	struct xfs_log_vec	*lv,
 	struct xfs_log_iovec	**vecp,
+	struct xfs_bli_range	*rp,
 	uint			offset,
+	uint			length,
 	struct xfs_buf_log_format *blfp)
 {
 	struct xfs_buf		*bp = bip->bli_buf;
+	char			*buf;
 	uint			base_size;
+	uint			start;
+	uint			end;
 	int			first_bit;
 	int			last_bit;
-	int			next_bit;
 	uint			nbits;
 
 	/* copy the flags across from the base format item */
@@ -234,16 +230,6 @@ xfs_buf_item_format_segment(
 	 * memory structure.
 	 */
 	base_size = xfs_buf_log_format_size(blfp);
-
-	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
-	if (!(bip->bli_flags & XFS_BLI_STALE) && first_bit == -1) {
-		/*
-		 * If the map is not be dirty in the transaction, mark
-		 * the size as zero and do not advance the vector pointer.
-		 */
-		return;
-	}
-
 	blfp = xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BFORMAT, blfp, base_size);
 	blfp->blf_size = 1;
 
@@ -258,46 +244,40 @@ xfs_buf_item_format_segment(
 		return;
 	}
 
+	blfp->blf_size++;
 
 	/*
-	 * Fill in an iovec for each set of contiguous chunks.
+	 * Now we need to set the bits in the bitmap and set up the iovecs
+	 * appropriately. We know there is a contiguous range in this buffer
+	 * than needs to be set, so find the first bit, the last bit, and
+	 * go from there.
 	 */
-	last_bit = first_bit;
-	nbits = 1;
-	for (;;) {
-		/*
-		 * This takes the bit number to start looking from and
-		 * returns the next set bit from there.  It returns -1
-		 * if there are no more bits set or the start bit is
-		 * beyond the end of the bitmap.
-		 */
-		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
-					(uint)last_bit + 1);
-		/*
-		 * If we run out of bits fill in the last iovec and get out of
-		 * the loop.  Else if we start a new set of bits then fill in
-		 * the iovec for the series we were looking at and start
-		 * counting the bits in the new one.  Else we're still in the
-		 * same set of bits so just keep counting and scanning.
-		 */
-		if (next_bit == -1) {
-			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
-						first_bit, nbits);
-			blfp->blf_size++;
-			break;
-		} else if (next_bit != last_bit + 1 ||
-		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
-			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
-						first_bit, nbits);
-			blfp->blf_size++;
-			first_bit = next_bit;
-			last_bit = next_bit;
-			nbits = 1;
-		} else {
-			last_bit++;
-			nbits++;
-		}
-	}
+	start = 0;
+	if (offset < rp->first)
+		start = rp->first - offset;
+	end = length - 1;
+	if (offset + length > rp->last)
+		end = rp->last - offset - 1;
+
+	start &= ~((1 << XFS_BLF_SHIFT) - 1);
+	first_bit = start >> XFS_BLF_SHIFT;
+	last_bit = end >> XFS_BLF_SHIFT;
+	nbits = last_bit - first_bit + 1;
+	bitmap_set((unsigned long *)blfp->blf_data_map, first_bit, nbits);
+
+	ASSERT(end <= length);
+	ASSERT(start <= length);
+	ASSERT(length >= nbits * XFS_BLF_CHUNK);
+	/*
+	 * Copy needs to be done a buffer page at a time as we can be logging
+	 * unmapped buffers. Hence we have to use xfs_buf_iomove() rather than a
+	 * straight memcpy here.
+	 */
+	offset += first_bit * XFS_BLF_CHUNK;
+	length = nbits * XFS_BLF_CHUNK;
+	buf = xlog_prepare_iovec(lv, vecp, XLOG_REG_TYPE_BCHUNK);
+	xfs_buf_iomove(bp, offset, length, buf, XBRW_READ);
+	xlog_finish_iovec(lv, *vecp, length);
 }
 
 /*
@@ -314,8 +294,8 @@ xfs_buf_item_format(
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
 	struct xfs_buf		*bp = bip->bli_buf;
 	struct xfs_log_iovec	*vecp = NULL;
-	uint			offset = 0;
-	int			i;
+	uint			offset;
+	int			i, j;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	ASSERT((bip->bli_flags & XFS_BLI_LOGGED) ||
@@ -326,7 +306,6 @@ xfs_buf_item_format(
 	ASSERT(!(bip->bli_flags & XFS_BLI_ORDERED) ||
 	       (bip->bli_flags & XFS_BLI_STALE));
 
-
 	/*
 	 * If it is an inode buffer, transfer the in-memory state to the
 	 * format flags and clear the in-memory state.
@@ -349,10 +328,36 @@ xfs_buf_item_format(
 		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
 	}
 
-	for (i = 0; i < bip->bli_format_count; i++) {
-		xfs_buf_item_format_segment(bip, lv, &vecp, offset,
-					    &bip->bli_formats[i]);
-		offset += BBTOB(bp->b_maps[i].bm_len);
+	for (i = 0, offset = 0;
+	     i < bip->bli_format_count;
+	     i++, offset += BBTOB(bp->b_maps[i].bm_len)) {
+
+		/* stale regions cover the entire segment */
+		if (bip->bli_flags & XFS_BLI_STALE) {
+			xfs_buf_item_format_segment(bip, lv, &vecp, NULL, offset,
+						    BBTOB(bp->b_maps[i].bm_len),
+						    &bip->bli_formats[i]);
+			continue;
+		}
+
+		/* only format dirty ranges over the current segment */
+		for (j = 0; j < bip->bli_ranges; j++) {
+			struct xfs_bli_range *rp = &bip->bli_range[j];
+
+			/* range ends before segment start, check next range */
+			if (rp->last < offset)
+				continue;
+
+			/* range beyond segment end, check next segment */
+			if (rp->first > offset + BBTOB(bp->b_maps[i].bm_len))
+				break;
+
+			/* dirty range overlaps segment, need headers */
+			xfs_buf_item_format_segment(bip, lv, &vecp, rp, offset,
+						    BBTOB(bp->b_maps[i].bm_len),
+						    &bip->bli_formats[i]);
+
+		}
 	}
 
 	/*
@@ -737,6 +742,9 @@ xfs_buf_item_init(
 	int			error;
 	int			i;
 
+	for (i = 0; i < XFS_BLI_RANGES; i++)
+		bip->bli_range[i].first = UINT_MAX;
+
 	/*
 	 * Check to see if there is already a buf log item for
 	 * this buffer. If we do already have one, there is
@@ -788,133 +796,136 @@ xfs_buf_item_init(
 
 /*
  * Mark bytes first through last inclusive as dirty in the buf
- * item's bitmap.
+ * record dirty regions on the buffer.
  */
-static void
-xfs_buf_item_log_segment(
+void
+xfs_buf_item_log(
+	struct xfs_buf_log_item	*bip,
 	uint			first,
-	uint			last,
-	uint			*map)
+	uint			last)
 {
-	uint		first_bit;
-	uint		last_bit;
-	uint		bits_to_set;
-	uint		bits_set;
-	uint		word_num;
-	uint		*wordp;
-	uint		bit;
-	uint		end_bit;
-	uint		mask;
+	struct xfs_bli_range	*rp = NULL;
+	int			i;
+	ASSERT(last != 0);
+	ASSERT(first <= last);
+	ASSERT(last < BBTOB(bip->bli_buf->b_length));
+
+	/* simple case - first range being stored */
+	if (!bip->bli_ranges) {
+		bip->bli_ranges = 1;
+		bip->bli_range[0].first = rounddown(first, XFS_BLF_CHUNK);
+		bip->bli_range[0].last = roundup(last, XFS_BLF_CHUNK);
+		ASSERT(bip->bli_range[0].last != 0);
+		ASSERT(bip->bli_range[0].first <= bip->bli_range[0].last);
+		return;
+	}
 
-	/*
-	 * Convert byte offsets to bit numbers.
-	 */
-	first_bit = first >> XFS_BLF_SHIFT;
-	last_bit = last >> XFS_BLF_SHIFT;
+	/* 2nd case: search for overlaps and extend */
+	for (i = 0; i < bip->bli_ranges; i++) {
+		rp = &bip->bli_range[i];
 
-	/*
-	 * Calculate the total number of bits to be set.
-	 */
-	bits_to_set = last_bit - first_bit + 1;
+		/* wholly within an existing dirty range, we're done */
+		if (first >= rp->first && last <= rp->last)
+			return;
+		/* no overlap, continue */
+		if (first > rp->last || last < rp->first)
+			continue;
 
-	/*
-	 * Get a pointer to the first word in the bitmap
-	 * to set a bit in.
-	 */
-	word_num = first_bit >> BIT_TO_WORD_SHIFT;
-	wordp = &map[word_num];
+		/* left edge overlap, extend */
+		if (first < rp->first)
+			rp->first = rounddown(first, XFS_BLF_CHUNK);
 
-	/*
-	 * Calculate the starting bit in the first word.
-	 */
-	bit = first_bit & (uint)(NBWORD - 1);
+		/* right edge overlap, extend */
+		if (last > rp->last)
+			rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
 
-	/*
-	 * First set any bits in the first word of our range.
-	 * If it starts at bit 0 of the word, it will be
-	 * set below rather than here.  That is what the variable
-	 * bit tells us. The variable bits_set tracks the number
-	 * of bits that have been set so far.  End_bit is the number
-	 * of the last bit to be set in this word plus one.
-	 */
-	if (bit) {
-		end_bit = MIN(bit + bits_to_set, (uint)NBWORD);
-		mask = ((1U << (end_bit - bit)) - 1) << bit;
-		*wordp |= mask;
-		wordp++;
-		bits_set = end_bit - bit;
-	} else {
-		bits_set = 0;
+		goto merge;
 	}
 
-	/*
-	 * Now set bits a whole word at a time that are between
-	 * first_bit and last_bit.
-	 */
-	while ((bits_to_set - bits_set) >= NBWORD) {
-		*wordp |= 0xffffffff;
-		bits_set += NBWORD;
-		wordp++;
-	}
+	/* 3rd case: not found, insert or extend */
+	ASSERT(i == bip->bli_ranges);
 
 	/*
 	 * Finally, set any bits left to be set in one last partial word.
+	 * Case 3a: Extend last slot.
+	 *
+	 * If the range is beyond the last slot, extend the last slot to
+	 * cover it. This is treated the same as if an overlap existed with
+	 * the last range.
 	 */
-	end_bit = bits_to_set - bits_set;
-	if (end_bit) {
-		mask = (1U << end_bit) - 1;
-		*wordp |= mask;
+	if (i == XFS_BLI_RANGES) {
+		ASSERT(bip->bli_ranges == XFS_BLI_RANGES);
+		rp = &bip->bli_range[XFS_BLI_RANGES - 1];
+
+		if (first < rp->first)
+			rp->first = rounddown(first, XFS_BLF_CHUNK);
+		if (last > rp->last)
+			rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
+		goto merge;
 	}
-}
 
-/*
- * Mark bytes first through last inclusive as dirty in the buf
- * item's bitmap.
- */
-void
-xfs_buf_item_log(
-	struct xfs_buf_log_item	*bip,
-	uint			first,
-	uint			last)
-{
-	int			i;
-	uint			start;
-	uint			end;
-	struct xfs_buf		*bp = bip->bli_buf;
+	/* Case 3b: insert new range.
+	 *
+	 * Find the insertion point for the new range, then make a hole
+	 * and insert the new range.
+	 */
+	for (i = 0; i < bip->bli_ranges; i++) {
+		rp = &bip->bli_range[i];
 
+		/* order ranges by ascending offset */
+		if (last < rp->first)
+			break;
+	}
+	/* shift down and insert */
+	ASSERT(i < XFS_BLI_RANGES);
+	rp = &bip->bli_range[i];
+	if (i < XFS_BLI_RANGES - 1)
+		memmove(rp + 1, rp, sizeof(*rp) * (bip->bli_ranges - i));
+	bip->bli_ranges++;
+	rp->first = rounddown(first, XFS_BLF_CHUNK);
+	rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
+
+merge:
 	/*
-	 * walk each buffer segment and mark them dirty appropriately.
+	 * Check for overlapping ranges and merge them. If there is only one
+	 * range, there is nothing to merge so bail early.
 	 */
-	start = 0;
-	for (i = 0; i < bip->bli_format_count; i++) {
-		if (start > last)
-			break;
-		end = start + BBTOB(bp->b_maps[i].bm_len) - 1;
+	if (bip->bli_ranges == 1)
+		return;
+
+	for (i = 0; i < bip->bli_ranges - 1; i++) {
+		struct xfs_bli_range *rp_next;
+
+		rp = &bip->bli_range[i];
+		rp_next = &bip->bli_range[i + 1];
 
-		/* skip to the map that includes the first byte to log */
-		if (first > end) {
-			start += BBTOB(bp->b_maps[i].bm_len);
+
+check_merge:
+		ASSERT(rp->last != 0);
+		ASSERT(rp->first <= rp->last);
+
+		/* no overlap or adjacent, move on */
+		if (rp->last < rp_next->first - 1)
 			continue;
-		}
 
 		/*
-		 * Trim the range to this segment and mark it in the bitmap.
-		 * Note that we must convert buffer offsets to segment relative
-		 * offsets (e.g., the first byte of each segment is byte 0 of
-		 * that segment).
+		 * overlap: select lowest first, highest last, remove the merged
+		 * range (rp_next) and then go back and check the next range for
+		 * whether it can be merged (e.g. we have 4 separate ranges,
+		 * then something logs the buffer entirely. This merges all
+		 * ranges into one).
 		 */
-		if (first < start)
-			first = start;
-		if (end > last)
-			end = last;
-		xfs_buf_item_log_segment(first - start, end - start,
-					 &bip->bli_formats[i].blf_data_map[0]);
-
-		start += BBTOB(bp->b_maps[i].bm_len);
+		rp->first = min(rp->first, rp_next->first);
+		rp->last = max(rp->last, rp_next->last);
+		if (i + 2 < bip->bli_ranges)
+			memmove(rp_next, rp_next + 1, sizeof(*rp) *
+						(bip->bli_ranges - i - 2));
+		bip->bli_ranges--;
+		if (i < bip->bli_ranges - 1)
+			goto check_merge;
 	}
 }
 
-
 /*
  * Return true if the buffer has any ranges logged/dirtied by a transaction,
  * false otherwise.
@@ -923,15 +934,7 @@ bool
 xfs_buf_item_dirty_format(
 	struct xfs_buf_log_item	*bip)
 {
-	int			i;
-
-	for (i = 0; i < bip->bli_format_count; i++) {
-		if (!xfs_bitmap_empty(bip->bli_formats[i].blf_data_map,
-			     bip->bli_formats[i].blf_map_size))
-			return true;
-	}
-
-	return false;
+	return bip->bli_ranges > 0;
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 643f53dcfe51..9b278c3a2db9 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -57,6 +57,25 @@ struct xfs_buf_log_item {
 	unsigned int		bli_recur;	/* lock recursion count */
 	atomic_t		bli_refcount;	/* cnt of tp refs */
 	int			bli_format_count;	/* count of headers */
+
+	/*
+	 * logging ranges. Keep a small number of distinct ranges rather than a
+	 * bitmap which is expensive to maintain.
+	 * 4 separate ranges s probably optimal so that we
+	 * can log separate header, tail and content changes (e.g. for dir
+	 * structures) without capturing the entire buffer unnecessarily for
+	 * isolated changes.
+	 *
+	 * Note: ranges are 32 bit values because we have to support an end
+	 * range value of 0x10000....
+	 */
+#define XFS_BLI_RANGES	4
+	struct xfs_bli_range {
+		uint32_t	first;
+		uint32_t	last;
+	}			bli_range[XFS_BLI_RANGES];
+	int			bli_ranges;
+
 	struct xfs_buf_log_format *bli_formats;	/* array of in-log header ptrs */
 	struct xfs_buf_log_format __bli_format;	/* embedded in-log header */
 };
-- 
2.15.1



* Re: [PATCH] xfs: byte range buffer dirty region tracking
  2018-02-01  1:05 [PATCH] xfs: byte range buffer dirty region tracking Dave Chinner
@ 2018-02-01  5:11 ` Darrick J. Wong
  2018-02-01  8:14   ` Dave Chinner
  2018-02-05  0:34 ` [PATCH v2] " Dave Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2018-02-01  5:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Feb 01, 2018 at 12:05:14PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> One of the biggest performance problems with large directory block
> sizes is the CPU overhead in maintaining the buffer log item direty
> region bitmap.  The bit manipulations and buffer region mapping
> calls are right at the top of the profiles when running tests on 64k
> directory buffers:
> 
>   14.65%  [kernel]             [k] memcpy
>    8.57%  [kernel]             [k] xfs_next_bit
>    4.96%  [kernel]             [k] xfs_buf_item_format
>    4.83%  [kernel]             [k] xfs_buf_item_size_segment.isra.4
>    4.44%  [kernel]             [k] xfs_buf_offset
> 
> 
> 
> The memcpy is the copying of the dirty regions into the log vec
> array, but almost twice as much CPU time is spent working out what
> needs to be copied and where it needs to be copied from. As a
> result, a debug kernel running a parallel fsmark file create
> workload we see performance like this on a 64k block size directory:
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0      1600000            0     175994.3         13120040
>      0      3200000            0     167829.7         14089107
>      0      4800000            0     159274.7         15217029
>      0      6400000            0     150097.3         16543647
> ....
> 
> In contrast, a 4k directory block size returns create rates around
> 310,000 files/s - almost 3x faster for the same CPU burn.
> 
> This patch switching the dirty range tracking to just the first and
> last modified bytes in 4 separate regions on the buffer. This only
> gets converted to a bitmap when the item is formatted into the CIL
> log vector array.  Hence the profile of the relevant formatting
> functions now looks like:
> 
>   22.21%  [kernel]  [k] memcpy
>    0.51%  [kernel]  [k] xfs_buf_item_init
>    0.49%  [kernel]  [k] xfs_buf_item_unlock
>    0.39%  [kernel]  [k] xfs_buf_item_size
>    0.29%  [kernel]  [k] xfs_buf_item_format
>    0.20%  [kernel]  [k] xfs_buf_item_log
>    0.14%  [kernel]  [k] xfs_buf_item_committing
> 
> And the performance is:
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0      1600000            0     224963.5         12631894
>      0      3200000            0     226142.4         12608851
>      0      4800000            0     237453.1         12509915
>      0      6400000            0     233356.8         12939907
> 
> Substantially higher.
> `
> The memcpy time is higher because that's where we spend most of
> the CPU we saved - in the buffer formatting routine:
> 
> ....
>        __xfs_trans_commit
>         xfs_log_commit_cil
>         xfs_buf_item_format
>         xfs_buf_item_format_segment
>         xfs_buf_iomove
>         memcpy
> 
> Hence we can see that there is major reduction in buffer formatting
> overhead that translates to improved performance.
> 
> The current implementation tracks, at most, four dirty regions per
> buffer.  The nature of directory operations result in almost
> operation modifying a header in the buffer, a tail section in the
> buffer and then some number of bytes/regions in the middle of the
> buffer.
> 
> If we just track a single region, it will almost always cover the
> entire directory buffer as we do updates to both the head and tail
> of most directory buffers.  That's a fairly large cost in terms of
> log space and CPU overhead for random individual operations.
> Similarly, increasing the number of regions to 8 (from 4) reduces
> performance by 5-10%, so the gains from tracking multiple regions
> tail off very quickly.
> 
> We also have to consider non-directory buffer modification patterns.
> freespace, inode and extent btrees are the other major types of
> buffers that get logged, but they also have modification patterns
> that lend themselves well to a small number of ranges for dirty
> tracking. That is, each btree block is kept compact, so when we
> insert or remove a record or pointer we shift then higher
> records/ptrs up or down as a block, and then log the lot of them.
> And they also often have a header that is dirtied with each
> insert/delete, so typically there are usually only one or two dirty
> ranges in a btree block.
> 
> The only metadata type that really seems to benefit from fine
> grained dirty range logging is the inode buffers. Specifically, for
> v4 superblocks the create transaction only dirties the regions of
> the inode core, so for 256 byte inodes only dirties every alternate
> bitmap segment.  Dirty range tracking will double the required log
> bandwidth of inode buffers during create (roughly 25% increase on a
> 4k directory block size filesystem). Typically this won't result in
> a noticable performance differential (except in inode creation
> benchmarks) on typical systems because the log is generally far from
> being bandwidth bound.
> 
> For v5 filesystems, even this isn't an issue because the initialised
> inode buffers are XFS_BLI_ORDERED buffers and so their contents
> aren't logged.
> 
> The same problem happens with unlinks due to the unlinked list being
> logged via the inode buffer. Again this results in an increase
> in log bandwidth on both v4 and v5 filesystems, but there isn't any
> performance differential that occurs because, again, the log isn't
> bandwidth bound. As it is, there is an existing plan of improvement
> to the unlinked list logging (moving the unlinked list logging into
> the inode core transaction) and hence that will avoid any extra
> overhead here as well.
> 
> Hence the overall CPU reduction benefits of minimal dirty range
> tracking versus fine grained dirty bit tracking are overall going to
> be beneficial to performance and throughput on current (v5) format
> filesystems.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c      |   2 +
>  fs/xfs/xfs_buf_item.c | 431 +++++++++++++++++++++++++-------------------------
>  fs/xfs/xfs_buf_item.h |  19 +++
>  3 files changed, 238 insertions(+), 214 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index d1da2ee9e6db..7621fabeb505 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1583,6 +1583,8 @@ xfs_buf_iomove(
>  		page = bp->b_pages[page_index];
>  		csize = min_t(size_t, PAGE_SIZE - page_offset,
>  				      BBTOB(bp->b_io_length) - boff);
> +		if (boff + csize > bend)
> +			csize = bend - boff;

How often does csize exceed bend?

>  		ASSERT((csize + page_offset) <= PAGE_SIZE);
>  
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 270ddb4d2313..0629d09406a4 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -66,50 +66,12 @@ xfs_buf_item_size_segment(
>  	int				*nvecs,
>  	int				*nbytes)
>  {
> -	struct xfs_buf			*bp = bip->bli_buf;
> -	int				next_bit;
> -	int				last_bit;
> -
> -	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> -	if (last_bit == -1)
> -		return;
> -
>  	/*
>  	 * initial count for a dirty buffer is 2 vectors - the format structure
> -	 * and the first dirty region.
> +	 * and the dirty region. Dirty region is accounted for separately.
>  	 */
>  	*nvecs += 2;
> -	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
> -
> -	while (last_bit != -1) {
> -		/*
> -		 * This takes the bit number to start looking from and
> -		 * returns the next set bit from there.  It returns -1
> -		 * if there are no more bits set or the start bit is
> -		 * beyond the end of the bitmap.
> -		 */
> -		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> -					last_bit + 1);
> -		/*
> -		 * If we run out of bits, leave the loop,
> -		 * else if we find a new set of bits bump the number of vecs,
> -		 * else keep scanning the current set of bits.
> -		 */
> -		if (next_bit == -1) {
> -			break;
> -		} else if (next_bit != last_bit + 1) {
> -			last_bit = next_bit;
> -			(*nvecs)++;
> -		} else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) !=
> -			   (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) +
> -			    XFS_BLF_CHUNK)) {
> -			last_bit = next_bit;
> -			(*nvecs)++;
> -		} else {
> -			last_bit++;
> -		}
> -		*nbytes += XFS_BLF_CHUNK;
> -	}
> +	*nbytes += xfs_buf_log_format_size(blfp);
>  }
>  
>  /*
> @@ -136,7 +98,9 @@ xfs_buf_item_size(
>  	int			*nbytes)
>  {
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
> -	int			i;
> +	struct xfs_buf	*bp = bip->bli_buf;

Indentation before '*bp'...

> +	uint			offset;
> +	int			i, j;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  	if (bip->bli_flags & XFS_BLI_STALE) {
> @@ -155,6 +119,7 @@ xfs_buf_item_size(
>  	}
>  
>  	ASSERT(bip->bli_flags & XFS_BLI_LOGGED);
> +	ASSERT(bip->bli_flags & XFS_BLI_DIRTY);
>  
>  	if (bip->bli_flags & XFS_BLI_ORDERED) {
>  		/*
> @@ -169,17 +134,45 @@ xfs_buf_item_size(
>  
>  	/*
>  	 * the vector count is based on the number of buffer vectors we have
> -	 * dirty bits in. This will only be greater than one when we have a
> +	 * dirty ranges in. This will only be greater than one when we have a
>  	 * compound buffer with more than one segment dirty. Hence for compound
> -	 * buffers we need to track which segment the dirty bits correspond to,
> -	 * and when we move from one segment to the next increment the vector
> -	 * count for the extra buf log format structure that will need to be
> -	 * written.
> +	 * buffers we need to track which segment the dirty ranges correspond
> +	 * to, and when we move from one segment to the next increment the
> +	 * vector count for the extra buf log format structure that will need to
> +	 * be written.
>  	 */
> -	for (i = 0; i < bip->bli_format_count; i++) {
> -		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> -					  nvecs, nbytes);
> +	ASSERT(bip->bli_range[0].last != 0);
> +	if (bip->bli_range[0].last == 0) {
> +		/* clean! */
> +		ASSERT(bip->bli_range[0].first == 0);

Hm, so given that the firsts are initialized to UINT_MAX, this only
happens if the first (only?) range we log is ... (0, 0) ?

Mildly confused about what these asserts are going after, since the
first one implies that this shouldn't happen anyway.

> +		return;
>  	}
> +
> +	for (i = 0, offset = 0;
> +	     i < bip->bli_format_count;
> +	     i++, offset += BBTOB(bp->b_maps[i].bm_len)) {
> +		/* Only format dirty regions */
> +		for (j = 0; j < bip->bli_ranges; j++) {
> +			struct xfs_bli_range *rp = &bip->bli_range[j];
> +
> +			/* range ends before segment start, check next range */
> +			if (rp->last < offset)
> +				continue;
> +
> +			/* range beyond segment end, check next segment */
> +			if (rp->first > offset + BBTOB(bp->b_maps[i].bm_len))
> +				break;
> +
> +			/* dirty range overlaps segment, need headers */
> +			xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> +						  nvecs, nbytes);
> +		}
> +	}
> +
> +	for (j = 0; j < bip->bli_ranges; j++)
> +		*nbytes += bip->bli_range[j].last - bip->bli_range[j].first;
> +
> +
>  	trace_xfs_buf_item_size(bip);
>  }
>  
> @@ -192,7 +185,6 @@ xfs_buf_item_copy_iovec(
>  	int			first_bit,
>  	uint			nbits)
>  {
> -	offset += first_bit * XFS_BLF_CHUNK;
>  	xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BCHUNK,
>  			xfs_buf_offset(bp, offset),
>  			nbits * XFS_BLF_CHUNK);
> @@ -215,14 +207,18 @@ xfs_buf_item_format_segment(
>  	struct xfs_buf_log_item	*bip,
>  	struct xfs_log_vec	*lv,
>  	struct xfs_log_iovec	**vecp,
> +	struct xfs_bli_range	*rp,
>  	uint			offset,
> +	uint			length,
>  	struct xfs_buf_log_format *blfp)
>  {
>  	struct xfs_buf		*bp = bip->bli_buf;
> +	char			*buf;
>  	uint			base_size;
> +	uint			start;
> +	uint			end;
>  	int			first_bit;
>  	int			last_bit;
> -	int			next_bit;
>  	uint			nbits;
>  
>  	/* copy the flags across from the base format item */
> @@ -234,16 +230,6 @@ xfs_buf_item_format_segment(
>  	 * memory structure.
>  	 */
>  	base_size = xfs_buf_log_format_size(blfp);
> -
> -	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> -	if (!(bip->bli_flags & XFS_BLI_STALE) && first_bit == -1) {
> -		/*
> -		 * If the map is not be dirty in the transaction, mark
> -		 * the size as zero and do not advance the vector pointer.
> -		 */
> -		return;
> -	}
> -
>  	blfp = xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BFORMAT, blfp, base_size);
>  	blfp->blf_size = 1;
>  
> @@ -258,46 +244,40 @@ xfs_buf_item_format_segment(
>  		return;
>  	}
>  
> +	blfp->blf_size++;
>  
>  	/*
> -	 * Fill in an iovec for each set of contiguous chunks.
> +	 * Now we need to set the bits in the bitmap and set up the iovecs
> +	 * appropriately. We know there is a contiguous range in this buffer
> +	 * than needs to be set, so find the first bit, the last bit, and
> +	 * go from there.
>  	 */
> -	last_bit = first_bit;
> -	nbits = 1;
> -	for (;;) {
> -		/*
> -		 * This takes the bit number to start looking from and
> -		 * returns the next set bit from there.  It returns -1
> -		 * if there are no more bits set or the start bit is
> -		 * beyond the end of the bitmap.
> -		 */
> -		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> -					(uint)last_bit + 1);
> -		/*
> -		 * If we run out of bits fill in the last iovec and get out of
> -		 * the loop.  Else if we start a new set of bits then fill in
> -		 * the iovec for the series we were looking at and start
> -		 * counting the bits in the new one.  Else we're still in the
> -		 * same set of bits so just keep counting and scanning.
> -		 */
> -		if (next_bit == -1) {
> -			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> -						first_bit, nbits);
> -			blfp->blf_size++;
> -			break;
> -		} else if (next_bit != last_bit + 1 ||
> -		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
> -			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> -						first_bit, nbits);
> -			blfp->blf_size++;
> -			first_bit = next_bit;
> -			last_bit = next_bit;
> -			nbits = 1;
> -		} else {
> -			last_bit++;
> -			nbits++;
> -		}
> -	}
> +	start = 0;
> +	if (offset < rp->first)
> +		start = rp->first - offset;
> +	end = length - 1;
> +	if (offset + length > rp->last)
> +		end = rp->last - offset - 1;
> +
> +	start &= ~((1 << XFS_BLF_SHIFT) - 1);
> +	first_bit = start >> XFS_BLF_SHIFT;
> +	last_bit = end >> XFS_BLF_SHIFT;
> +	nbits = last_bit - first_bit + 1;
> +	bitmap_set((unsigned long *)blfp->blf_data_map, first_bit, nbits);
> +
> +	ASSERT(end <= length);
> +	ASSERT(start <= length);
> +	ASSERT(length >= nbits * XFS_BLF_CHUNK);
> +	/*
> +	 * Copy needs to be done a buffer page at a time as we can be logging
> +	 * unmapped buffers. hence we have to use xfs_buf_iomove() rather than a
> +	 * straight memcpy here.
> +	 */
> +	offset += first_bit * XFS_BLF_CHUNK;
> +	length = nbits * XFS_BLF_CHUNK;
> +	buf = xlog_prepare_iovec(lv, vecp, XLOG_REG_TYPE_BCHUNK);
> +	xfs_buf_iomove(bp, offset, length, buf, XBRW_READ);
> +	xlog_finish_iovec(lv, *vecp, length);
>  }
>  
>  /*
> @@ -314,8 +294,8 @@ xfs_buf_item_format(
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
>  	struct xfs_buf		*bp = bip->bli_buf;
>  	struct xfs_log_iovec	*vecp = NULL;
> -	uint			offset = 0;
> -	int			i;
> +	uint			offset;
> +	int			i, j;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  	ASSERT((bip->bli_flags & XFS_BLI_LOGGED) ||
> @@ -326,7 +306,6 @@ xfs_buf_item_format(
>  	ASSERT(!(bip->bli_flags & XFS_BLI_ORDERED) ||
>  	       (bip->bli_flags & XFS_BLI_STALE));
>  
> -
>  	/*
>  	 * If it is an inode buffer, transfer the in-memory state to the
>  	 * format flags and clear the in-memory state.
> @@ -349,10 +328,36 @@ xfs_buf_item_format(
>  		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
>  	}
>  
> -	for (i = 0; i < bip->bli_format_count; i++) {
> -		xfs_buf_item_format_segment(bip, lv, &vecp, offset,
> -					    &bip->bli_formats[i]);
> -		offset += BBTOB(bp->b_maps[i].bm_len);
> +	for (i = 0, offset = 0;
> +	     i < bip->bli_format_count;
> +	     i++, offset += BBTOB(bp->b_maps[i].bm_len)) {
> +
> +		/* stale regions cover the entire segment */
> +		if (bip->bli_flags & XFS_BLI_STALE) {
> +			xfs_buf_item_format_segment(bip, lv, &vecp, NULL, offset,
> +						    BBTOB(bp->b_maps[i].bm_len),
> +						    &bip->bli_formats[i]);
> +			continue;
> +		}
> +
> +		/* only format dirty ranges over the current segment */
> +		for (j = 0; j < bip->bli_ranges; j++) {
> +			struct xfs_bli_range *rp = &bip->bli_range[j];
> +
> +			/* range ends before segment start, check next range */
> +			if (rp->last < offset)
> +				continue;
> +
> +			/* range beyond segment end, check next segment */
> +			if (rp->first > offset + BBTOB(bp->b_maps[i].bm_len))
> +				break;
> +
> +			/* dirty range overlaps segment, need headers */
> +			xfs_buf_item_format_segment(bip, lv, &vecp, rp, offset,
> +						    BBTOB(bp->b_maps[i].bm_len),
> +						    &bip->bli_formats[i]);
> +
> +		}
>  	}
>  
>  	/*
> @@ -737,6 +742,9 @@ xfs_buf_item_init(
>  	int			error;
>  	int			i;
>  
> +	for (i = 0; i < XFS_BLI_RANGES; i++)
> +		bip->bli_range[i].first = UINT_MAX;
> +
>  	/*
>  	 * Check to see if there is already a buf log item for
>  	 * this buffer. If we do already have one, there is
> @@ -788,133 +796,136 @@ xfs_buf_item_init(
>  
>  /*
>   * Mark bytes first through last inclusive as dirty in the buf
> - * item's bitmap.
> + * record dirty regions on the buffer.
>   */
> -static void
> -xfs_buf_item_log_segment(
> +void
> +xfs_buf_item_log(
> +	struct xfs_buf_log_item	*bip,
>  	uint			first,
> -	uint			last,
> -	uint			*map)
> +	uint			last)
>  {
> -	uint		first_bit;
> -	uint		last_bit;
> -	uint		bits_to_set;
> -	uint		bits_set;
> -	uint		word_num;
> -	uint		*wordp;
> -	uint		bit;
> -	uint		end_bit;
> -	uint		mask;
> +	struct xfs_bli_range	*rp = NULL;
> +	int			i;
> +	ASSERT(last != 0);
> +	ASSERT(first <= last);
> +	ASSERT(last < BBTOB(bip->bli_buf->b_length));
> +
> +	/* simple case - first range being stored */
> +	if (!bip->bli_ranges) {
> +		bip->bli_ranges = 1;
> +		bip->bli_range[0].first = rounddown(first, XFS_BLF_CHUNK);
> +		bip->bli_range[0].last = roundup(last, XFS_BLF_CHUNK);
> +		ASSERT(bip->bli_range[0].last != 0);
> +		ASSERT(bip->bli_range[0].first <= bip->bli_range[0].last);
> +		return;
> +	}
>  
> -	/*
> -	 * Convert byte offsets to bit numbers.
> -	 */
> -	first_bit = first >> XFS_BLF_SHIFT;
> -	last_bit = last >> XFS_BLF_SHIFT;
> +	/* 2nd case: search for overlaps and extend */
> +	for (i = 0; i < bip->bli_ranges; i++) {
> +		rp = &bip->bli_range[i];
>  
> -	/*
> -	 * Calculate the total number of bits to be set.
> -	 */
> -	bits_to_set = last_bit - first_bit + 1;
> +		/* wholly within an existing dirty range, we're done */
> +		if (first >= rp->first && last <= rp->last)
> +			return;
> +		/* no overlap, continue */
> +		if (first > rp->last || last < rp->first)
> +			continue;
>  
> -	/*
> -	 * Get a pointer to the first word in the bitmap
> -	 * to set a bit in.
> -	 */
> -	word_num = first_bit >> BIT_TO_WORD_SHIFT;
> -	wordp = &map[word_num];
> +		/* left edge overlap, extend */
> +		if (first < rp->first)
> +			rp->first = rounddown(first, XFS_BLF_CHUNK);
>  
> -	/*
> -	 * Calculate the starting bit in the first word.
> -	 */
> -	bit = first_bit & (uint)(NBWORD - 1);
> +		/* right edge overlap, extend */
> +		if (last > rp->last)
> +			rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
>  
> -	/*
> -	 * First set any bits in the first word of our range.
> -	 * If it starts at bit 0 of the word, it will be
> -	 * set below rather than here.  That is what the variable
> -	 * bit tells us. The variable bits_set tracks the number
> -	 * of bits that have been set so far.  End_bit is the number
> -	 * of the last bit to be set in this word plus one.
> -	 */
> -	if (bit) {
> -		end_bit = MIN(bit + bits_to_set, (uint)NBWORD);
> -		mask = ((1U << (end_bit - bit)) - 1) << bit;
> -		*wordp |= mask;
> -		wordp++;
> -		bits_set = end_bit - bit;
> -	} else {
> -		bits_set = 0;
> +		goto merge;
>  	}
>  
> -	/*
> -	 * Now set bits a whole word at a time that are between
> -	 * first_bit and last_bit.
> -	 */
> -	while ((bits_to_set - bits_set) >= NBWORD) {
> -		*wordp |= 0xffffffff;
> -		bits_set += NBWORD;
> -		wordp++;
> -	}
> +	/* 3rd case: not found, insert or extend */
> +	ASSERT(i == bip->bli_ranges);
>  
>  	/*
>  	 * Finally, set any bits left to be set in one last partial word.
> +	 * Case 3a: Extend last slot.
> +	 *
> +	 * If the range is beyond the last slot, extend the last slot to
> +	 * cover it. This treated the same as if an overlap existed with
> +	 * the last range.
>  	 */
> -	end_bit = bits_to_set - bits_set;
> -	if (end_bit) {
> -		mask = (1U << end_bit) - 1;
> -		*wordp |= mask;
> +	if (i == XFS_BLI_RANGES) {
> +		ASSERT(bip->bli_ranges == XFS_BLI_RANGES);
> +		rp = &bip->bli_range[XFS_BLI_RANGES - 1];
> +
> +		if (first < rp->first)
> +			rp->first = rounddown(first, XFS_BLF_CHUNK);
> +		if (last > rp->last)
> +			rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
> +		goto merge;
>  	}
> -}
>  
> -/*
> - * Mark bytes first through last inclusive as dirty in the buf
> - * item's bitmap.
> - */
> -void
> -xfs_buf_item_log(
> -	struct xfs_buf_log_item	*bip,
> -	uint			first,
> -	uint			last)
> -{
> -	int			i;
> -	uint			start;
> -	uint			end;
> -	struct xfs_buf		*bp = bip->bli_buf;
> +	/* Case 3b: insert new range.
> +	 *
> +	 * Find the insertion point for the new range, then make a hole
> +	 * and insert the new range.
> +	 */
> +	for (i = 0; i < bip->bli_ranges; i++) {
> +		rp = &bip->bli_range[i];
>  
> +		/* order ranges by ascending offset */
> +		if (last < rp->first)
> +			break;
> +	}
> +	/* shift down and insert */
> +	ASSERT(i < XFS_BLI_RANGES);
> +	rp = &bip->bli_range[i];
> +	if (i < XFS_BLI_RANGES - 1)
> +		memmove(rp + 1, rp, sizeof(*rp) * (bip->bli_ranges - i));
> +	bip->bli_ranges++;
> +	rp->first = rounddown(first, XFS_BLF_CHUNK);
> +	rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
> +
> +merge:
>  	/*
> -	 * walk each buffer segment and mark them dirty appropriately.
> +	 * Check for overlaping ranges and merge them. If there is only one
> +	 * range, there is nothing to merge so bail early.
>  	 */
> -	start = 0;
> -	for (i = 0; i < bip->bli_format_count; i++) {
> -		if (start > last)
> -			break;
> -		end = start + BBTOB(bp->b_maps[i].bm_len) - 1;
> +	if (bip->bli_ranges == 1)
> +		return;
> +
> +	for (i = 0; i < bip->bli_ranges - 1; i++) {
> +		struct xfs_bli_range *rp_next;
> +
> +		rp = &bip->bli_range[i];
> +		rp_next = &bip->bli_range[i + 1];
>  
> -		/* skip to the map that includes the first byte to log */
> -		if (first > end) {
> -			start += BBTOB(bp->b_maps[i].bm_len);
> +
> +check_merge:
> +		ASSERT(rp->last != 0);
> +		ASSERT(rp->first <= rp->last);
> +
> +		/* no overlap or adjacent, move on */
> +		if (rp->last < rp_next->first - 1)
>  			continue;
> -		}
>  
>  		/*
> -		 * Trim the range to this segment and mark it in the bitmap.
> -		 * Note that we must convert buffer offsets to segment relative
> -		 * offsets (e.g., the first byte of each segment is byte 0 of
> -		 * that segment).
> +		 * overlap: select lowest first, highest last, remove the merged
> +		 * range (rp_next) and then go back and check the next range for
> +		 * whether it can be merged (e.g. we have 4 separate ranges,
> +		 * then something logs the buffer entirely. This merges all
> +		 * ranges into one).
>  		 */
> -		if (first < start)
> -			first = start;
> -		if (end > last)
> -			end = last;
> -		xfs_buf_item_log_segment(first - start, end - start,
> -					 &bip->bli_formats[i].blf_data_map[0]);
> -
> -		start += BBTOB(bp->b_maps[i].bm_len);
> +		rp->first = min(rp->first, rp_next->first);
> +		rp->last = max(rp->last, rp_next->last);
> +		if (i + 2 < bip->bli_ranges)
> +			memmove(rp_next, rp_next + 1, sizeof(*rp) *
> +						(bip->bli_ranges - i - 2));
> +		bip->bli_ranges--;
> +		if (i < bip->bli_ranges - 1)
> +			goto check_merge;
>  	}
>  }
>  
> -
>  /*
>   * Return true if the buffer has any ranges logged/dirtied by a transaction,
>   * false otherwise.
> @@ -923,15 +934,7 @@ bool
>  xfs_buf_item_dirty_format(
>  	struct xfs_buf_log_item	*bip)
>  {
> -	int			i;
> -
> -	for (i = 0; i < bip->bli_format_count; i++) {
> -		if (!xfs_bitmap_empty(bip->bli_formats[i].blf_data_map,
> -			     bip->bli_formats[i].blf_map_size))
> -			return true;
> -	}
> -
> -	return false;
> +	return bip->bli_ranges > 0;
>  }
>  
>  STATIC void
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index 643f53dcfe51..9b278c3a2db9 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -57,6 +57,25 @@ struct xfs_buf_log_item {
>  	unsigned int		bli_recur;	/* lock recursion count */
>  	atomic_t		bli_refcount;	/* cnt of tp refs */
>  	int			bli_format_count;	/* count of headers */
> +
> +	/*
> +	 * logging ranges. Keep a small number of distinct ranges rather than a
> +	 * bitmap which is expensive to maintain.
> +	 * 4 separate ranges s probably optimal so that we

"...ranges is probably..." ?

Mostly looks ok, but whew. :)

--D

> +	 * can log separate header, tail and content changes (e.g. for dir
> +	 * structures) without capturing the entire buffer unnecessarily for
> +	 * isolated changes.
> +	 *
> +	 * Note: ranges are 32 bit values because we have to support an end
> +	 * range value of 0x10000....
> +	 */
> +#define XFS_BLI_RANGES	4
> +	struct xfs_bli_range {
> +		uint32_t	first;
> +		uint32_t	last;
> +	}			bli_range[XFS_BLI_RANGES];
> +	int			bli_ranges;
> +
>  	struct xfs_buf_log_format *bli_formats;	/* array of in-log header ptrs */
>  	struct xfs_buf_log_format __bli_format;	/* embedded in-log header */
>  };
> -- 
> 2.15.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH] xfs: byte range buffer dirty region tracking
  2018-02-01  5:11 ` Darrick J. Wong
@ 2018-02-01  8:14   ` Dave Chinner
  2018-02-01 20:35     ` Darrick J. Wong
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2018-02-01  8:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jan 31, 2018 at 09:11:28PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 01, 2018 at 12:05:14PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > One of the biggest performance problems with large directory block
> > sizes is the CPU overhead in maintaining the buffer log item direty
> > region bitmap.  The bit manipulations and buffer region mapping
> > calls are right at the top of the profiles when running tests on 64k
> > directory buffers:
.....
> > ---
> >  fs/xfs/xfs_buf.c      |   2 +
> >  fs/xfs/xfs_buf_item.c | 431 +++++++++++++++++++++++++-------------------------
> >  fs/xfs/xfs_buf_item.h |  19 +++
> >  3 files changed, 238 insertions(+), 214 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index d1da2ee9e6db..7621fabeb505 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1583,6 +1583,8 @@ xfs_buf_iomove(
> >  		page = bp->b_pages[page_index];
> >  		csize = min_t(size_t, PAGE_SIZE - page_offset,
> >  				      BBTOB(bp->b_io_length) - boff);
> > +		if (boff + csize > bend)
> > +			csize = bend - boff;
> 
> How often does csize exceed bend?

/me checks notes when the patch was written a couple of years ago

Rarely. I didn't record the exact cause because it was a memory
corruption bug that showed up long after the cause was gone.
Reading between the lines, I think it was a case where bsize was a
single chunk (128 bytes), boff was 256 (third chunk in the buffer),
b_io_length was 512 bytes and the page offset was ~512 bytes.

That means csize was coming out at 256 bytes, but we only wanted 128
bytes to be copied. In most cases this didn't cause a problem
because there was more space in the log iovec buffer being copied
into, but occasionally it would be the last copy into the
logvec buffer and that would overrun the user buffer and corrupt
memory.

Essentially we are trying to copy from boff to bend, but there's
nothing in the loop to clamp the copy size to bend, and that's
what this check does. I can separate it out into another patch if you
want - I'd completely forgotten this was in the patch because I've
been running this patch in my tree for a long time now without
really looking at it...
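
FWIW, here's a standalone sketch of that arithmetic with numbers
reconstructed from the description above (illustrative only, not the
kernel code):

#include <stdio.h>

int
main(void)
{
	unsigned int boff = 256;	/* third 128 byte chunk in the buffer */
	unsigned int bsize = 128;	/* one chunk to copy */
	unsigned int bend = boff + bsize;
	unsigned int page_size = 4096;
	unsigned int page_offset = 512;
	unsigned int io_length = 512;	/* BBTOB(bp->b_io_length) */
	unsigned int csize;

	/* what the existing min_t() calculation produces */
	csize = page_size - page_offset;
	if (io_length - boff < csize)
		csize = io_length - boff;	/* 256 bytes - too much */

	/* the added clamp to the end of the requested range */
	if (boff + csize > bend)
		csize = bend - boff;		/* 128 bytes */

	printf("csize = %u\n", csize);
	return 0;
}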

[...]


> > @@ -136,7 +98,9 @@ xfs_buf_item_size(
> >  	int			*nbytes)
> >  {
> >  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
> > -	int			i;
> > +	struct xfs_buf	*bp = bip->bli_buf;
> 
> Indentation before '*bp'...

Ah, missed that on conflict resolution from the buf_log_item
typedef removal...

> > -	 * written.
> > +	 * buffers we need to track which segment the dirty ranges correspond
> > +	 * to, and when we move from one segment to the next increment the
> > +	 * vector count for the extra buf log format structure that will need to
> > +	 * be written.
> >  	 */
> > -	for (i = 0; i < bip->bli_format_count; i++) {
> > -		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> > -					  nvecs, nbytes);
> > +	ASSERT(bip->bli_range[0].last != 0);
> > +	if (bip->bli_range[0].last == 0) {
> > +		/* clean! */
> > +		ASSERT(bip->bli_range[0].first == 0);
> 
> Hm, so given that the firsts are initialized to UINT_MAX, this only
> happens if the first (only?) range we log is ... (0, 0) ?

Yeah, basically it catches code that should not be logging buffers
because there is no dirty range in the buffer.

> Mildly confused about what these asserts are going after, since the
> first one implies that this shouldn't happen anyway.

If first is after last, then we've really screwed up because we've
got a dirty buffer with an invalid range. I can't recall seeing
either of these asserts fire, but we still need the check for clean
buffer ranges/screwups in production code. Maybe there's a better
way to do this?

> >  STATIC void
> > diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> > index 643f53dcfe51..9b278c3a2db9 100644
> > --- a/fs/xfs/xfs_buf_item.h
> > +++ b/fs/xfs/xfs_buf_item.h
> > @@ -57,6 +57,25 @@ struct xfs_buf_log_item {
> >  	unsigned int		bli_recur;	/* lock recursion count */
> >  	atomic_t		bli_refcount;	/* cnt of tp refs */
> >  	int			bli_format_count;	/* count of headers */
> > +
> > +	/*
> > +	 * logging ranges. Keep a small number of distinct ranges rather than a
> > +	 * bitmap which is expensive to maintain.
> > +	 * 4 separate ranges s probably optimal so that we
> 
> "...ranges is probably..." ?
> 
> Mostly looks ok, but whew. :)

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] xfs: byte range buffer dirty region tracking
  2018-02-01  8:14   ` Dave Chinner
@ 2018-02-01 20:35     ` Darrick J. Wong
  2018-02-01 23:16       ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2018-02-01 20:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Feb 01, 2018 at 07:14:52PM +1100, Dave Chinner wrote:
> On Wed, Jan 31, 2018 at 09:11:28PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 01, 2018 at 12:05:14PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > One of the biggest performance problems with large directory block
> > > sizes is the CPU overhead in maintaining the buffer log item direty
> > > region bitmap.  The bit manipulations and buffer region mapping
> > > calls are right at the top of the profiles when running tests on 64k
> > > directory buffers:
> .....
> > > ---
> > >  fs/xfs/xfs_buf.c      |   2 +
> > >  fs/xfs/xfs_buf_item.c | 431 +++++++++++++++++++++++++-------------------------
> > >  fs/xfs/xfs_buf_item.h |  19 +++
> > >  3 files changed, 238 insertions(+), 214 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index d1da2ee9e6db..7621fabeb505 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -1583,6 +1583,8 @@ xfs_buf_iomove(
> > >  		page = bp->b_pages[page_index];
> > >  		csize = min_t(size_t, PAGE_SIZE - page_offset,
> > >  				      BBTOB(bp->b_io_length) - boff);
> > > +		if (boff + csize > bend)
> > > +			csize = bend - boff;
> > 
> > How often does csize exceed bend?
> 
> /me checks notes when the patch was written a couple of years ago
> 
> Rarely. I didn't record the exact cause because it was a memory
> corruption bug that showed up long after the cause was gone.
> Reading between the lines, I think was a case where bsize was a
> single chunk (128 bytes), boff was 256 (third chunk in the buffer)
> b_io_length was 512 bytes and a page offset of ~512 bytes.
> 
> That means csize was coming out at 256 bytes, but we only wanted 128
> bytes to be copied. In most cases this didn't cause a problem
> because there was more space in the log iovec buffer being copied
> into, but occasionally it would be the last copy into the
> logvec buffer and that would overrun the user buffer and corrupt
> memory.
> 
> Essentially we are trying to copy from boff to bend, there's
> nothing in the loop to clamp the copy size to bend, and that's
> what this is doing. I can separate it out into another patch if you
> want - I'd completely forgotten this was in the patch because I've
> been running this patch in my tree for a long time now without
> really looking at it...

I don't know if this needs to be a separate patch, but it seems like the
upper levels shouldn't be sending us overlong lengths?  So either we
need to go find the ones that do and fix them to dtrt, possibly leaving
an assert here for "hey someone screwed up but we're fixing it"
analysis.

> 
> [...]
> 
> 
> > > @@ -136,7 +98,9 @@ xfs_buf_item_size(
> > >  	int			*nbytes)
> > >  {
> > >  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
> > > -	int			i;
> > > +	struct xfs_buf	*bp = bip->bli_buf;
> > 
> > Indentation before '*bp'...
> 
> Ah, missed that on conflict resolution from the buf_log_item
> typedef removal...
> 
> > > -	 * written.
> > > +	 * buffers we need to track which segment the dirty ranges correspond
> > > +	 * to, and when we move from one segment to the next increment the
> > > +	 * vector count for the extra buf log format structure that will need to
> > > +	 * be written.
> > >  	 */
> > > -	for (i = 0; i < bip->bli_format_count; i++) {
> > > -		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> > > -					  nvecs, nbytes);
> > > +	ASSERT(bip->bli_range[0].last != 0);
> > > +	if (bip->bli_range[0].last == 0) {
> > > +		/* clean! */
> > > +		ASSERT(bip->bli_range[0].first == 0);
> > 
> > Hm, so given that the firsts are initialized to UINT_MAX, this only
> > happens if the first (only?) range we log is ... (0, 0) ?
> 
> Yeah, basically it catches code that should not be logging buffers
> because there is no dirty range in the buffer.
> 
> > Mildly confused about what these asserts are going after, since the
> > first one implies that this shouldn't happen anyway.
> 
> If first is after last, then we've really screwed up because we've
> got a dirty buffer with an invalid range. I can't recall seeing
> either of these asserts fire, but we still need the check for clean
> buffer ranges/screwups in production code. Maybe there's a better
> way to do this?

I only came up with:

/*
 * If the first bli_range has a last of 0, we've been fed a clean
 * buffer.  This shouldn't happen but we'll be paranoid and check
 * anyway.
 */
if (bip->bli_range[0].last == 0) {
	ASSERT(0);
	ASSERT(bip->bli_range[0].first == 0);
	return;
}

> 
> > >  STATIC void
> > > diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> > > index 643f53dcfe51..9b278c3a2db9 100644
> > > --- a/fs/xfs/xfs_buf_item.h
> > > +++ b/fs/xfs/xfs_buf_item.h
> > > @@ -57,6 +57,25 @@ struct xfs_buf_log_item {
> > >  	unsigned int		bli_recur;	/* lock recursion count */
> > >  	atomic_t		bli_refcount;	/* cnt of tp refs */
> > >  	int			bli_format_count;	/* count of headers */
> > > +
> > > +	/*
> > > +	 * logging ranges. Keep a small number of distinct ranges rather than a
> > > +	 * bitmap which is expensive to maintain.
> > > +	 * 4 separate ranges s probably optimal so that we
> > 
> > "...ranges is probably..." ?
> > 
> > Mostly looks ok, but whew. :)
> 
> Thanks!

FWIW I also ran straight into this when I applied it for giggles and ran
xfstests -g quick (generic/001 blew up):

[   31.909228] ================================================================================
[   31.911258] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
[   31.912375] IP: xfs_buf_item_init+0x33/0x350 [xfs]
[   31.913059] PGD 78f91067 P4D 78f91067 PUD 77d4a067 PMD 0 
[   31.913812] Oops: 0002 [#1] PREEMPT SMP
[   31.914363] Dumping ftrace buffer:
[   31.914852]    (ftrace buffer empty)
[   31.915361] Modules linked in: xfs libcrc32c dax_pmem device_dax nd_pmem sch_fq_codel af_packet
[   31.916529] CPU: 3 PID: 1269 Comm: cp Not tainted 4.15.0-rc7-djw #13
[   31.917600] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1djwong0 04/01/2014
[   31.919162] RIP: 0010:xfs_buf_item_init+0x33/0x350 [xfs]
[   31.919948] RSP: 0018:ffffc900008f37b8 EFLAGS: 00010296
[   31.920763] RAX: ffff88003c3dd180 RBX: 0000000000000000 RCX: 0000000000000001
[   31.921886] RDX: 0000000080000001 RSI: ffff88003c3ddb18 RDI: 00000000ffffffff
[   31.922978] RBP: ffff880079364000 R08: 0000000000000004 R09: 0000000000000000
[   31.924080] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880069311e40
[   31.925168] R13: 0000000000000001 R14: ffffc900008f3918 R15: ffffc900008f3888
[   31.926153] FS:  00007f0c8d4dc800(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
[   31.927116] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   31.927713] CR2: 00000000000000a0 CR3: 00000000692ae000 CR4: 00000000000006e0
[   31.928443] Call Trace:
[   31.928786]  _xfs_trans_bjoin+0x25/0xa0 [xfs]
[   31.929290]  xfs_trans_read_buf_map+0x2a1/0x9c0 [xfs]
[   31.929849]  xfs_read_agi+0xda/0x3b0 [xfs]
[   31.930319]  xfs_ialloc_read_agi+0x51/0x310 [xfs]
[   31.930856]  xfs_ialloc_pagi_init+0x20/0x50 [xfs]
[   31.931489]  xfs_ialloc_ag_select+0x126/0x2d0 [xfs]
[   31.932034]  xfs_dialloc+0x7f/0x360 [xfs]
[   31.932498]  xfs_ialloc+0x64/0x850 [xfs]
[   31.932966]  xfs_dir_ialloc+0x67/0x320 [xfs]
[   31.933458]  xfs_create+0x646/0xcb0 [xfs]
[   31.933931]  xfs_generic_create+0x20e/0x340 [xfs]
[   31.934435]  lookup_open+0x3ed/0x680
[   31.934840]  path_openat+0x428/0xa90
[   31.935307]  ? __might_fault+0x36/0x80
[   31.935736]  do_filp_open+0x8a/0xf0
[   31.936121]  ? __alloc_fd+0xe7/0x200
[   31.936509]  ? do_sys_open+0x11c/0x1f0
[   31.936926]  do_sys_open+0x11c/0x1f0
[   31.937318]  entry_SYSCALL_64_fastpath+0x1f/0x96
[   31.937803] RIP: 0033:0x7f0c8c9c4040
[   31.938189] RSP: 002b:00007fff1e3d64c8 EFLAGS: 00000246
[   31.938194] Code: 56 41 55 41 54 49 89 fc 55 48 89 f5 53 48 83 ec 18 48 85 ff 0f 84 dd 01 00 00 49 8b 9c 24 20 02 00 00 48 85 db 0f 84 b9 01 00 00 <c7> 83 a0 00 00 00 ff ff ff ff 31 c0 48 85 db c7 83 a8 00 00 00 
[   31.940783] RIP: xfs_buf_item_init+0x33/0x350 [xfs] RSP: ffffc900008f37b8
[   31.941490] CR2: 00000000000000a0
[   31.941958] ---[ end trace 14a6b74cb284bb21 ]---

--D

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] xfs: byte range buffer dirty region tracking
  2018-02-01 20:35     ` Darrick J. Wong
@ 2018-02-01 23:16       ` Dave Chinner
  2018-02-01 23:22         ` Darrick J. Wong
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2018-02-01 23:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Feb 01, 2018 at 12:35:26PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 01, 2018 at 07:14:52PM +1100, Dave Chinner wrote:
> > On Wed, Jan 31, 2018 at 09:11:28PM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 01, 2018 at 12:05:14PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > One of the biggest performance problems with large directory block
> > > > sizes is the CPU overhead in maintaining the buffer log item direty
> > > > region bitmap.  The bit manipulations and buffer region mapping
> > > > calls are right at the top of the profiles when running tests on 64k
> > > > directory buffers:
> > .....
> > > > ---
> > > >  fs/xfs/xfs_buf.c      |   2 +
> > > >  fs/xfs/xfs_buf_item.c | 431 +++++++++++++++++++++++++-------------------------
> > > >  fs/xfs/xfs_buf_item.h |  19 +++
> > > >  3 files changed, 238 insertions(+), 214 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index d1da2ee9e6db..7621fabeb505 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -1583,6 +1583,8 @@ xfs_buf_iomove(
> > > >  		page = bp->b_pages[page_index];
> > > >  		csize = min_t(size_t, PAGE_SIZE - page_offset,
> > > >  				      BBTOB(bp->b_io_length) - boff);
> > > > +		if (boff + csize > bend)
> > > > +			csize = bend - boff;
> > > 
> > > How often does csize exceed bend?
> > 
> > /me checks notes when the patch was written a couple of years ago
> > 
> > Rarely. I didn't record the exact cause because it was a memory
> > corruption bug that showed up long after the cause was gone.
> > Reading between the lines, I think was a case where bsize was a
> > single chunk (128 bytes), boff was 256 (third chunk in the buffer)
> > b_io_length was 512 bytes and a page offset of ~512 bytes.
> > 
> > That means csize was coming out at 256 bytes, but we only wanted 128
> > bytes to be copied. In most cases this didn't cause a problem
> > because there was more space in the log iovec buffer being copied
> > into, but occasionally it would be the last copy into the
> > logvec buffer and that would overrun the user buffer and corrupt
> > memory.
> > 
> > Essentially we are trying to copy from boff to bend, there's
> > nothing in the loop to clamp the copy size to bend, and that's
> > what this is doing. I can separate it out into another patch if you
> > want - I'd completely forgotten this was in the patch because I've
> > been running this patch in my tree for a long time now without
> > really looking at it...
> 
> I don't know if this needs to be a separate patch, but it seems like the
> upper levels shouldn't be sending us overlong lengths?  So either we
> need to go find the ones that do and fix them to dtrt, possibly leaving
> an assert here for "hey someone screwed up but we're fixing it"
> analysis.

It was probably caused by a bug in the original range->bitmap
conversion code I'd written, not by any of the external code. I'll
add an assert into the code, but also leave the clamping so that
production systems don't go bad if there's some other bug in the
code that triggers it.
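
To put rough numbers on the case recalled above (bend is inferred from
"we only wanted 128 bytes", so treat these as illustrative rather than
exact):

	boff        = 256;	/* third 128 byte chunk in the buffer */
	bend        = 384;	/* caller only wants one chunk copied */
	io_length   = 512;	/* BBTOB(bp->b_io_length) */
	page_offset = 512;

	csize = min(PAGE_SIZE - page_offset, io_length - boff);	/* 256 */
	if (boff + csize > bend)
		csize = bend - boff;		/* clamped to the 128 we want */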

> > > > +	ASSERT(bip->bli_range[0].last != 0);
> > > > +	if (bip->bli_range[0].last == 0) {
> > > > +		/* clean! */
> > > > +		ASSERT(bip->bli_range[0].first == 0);
> > > 
> > > Hm, so given that the firsts are initialized to UINT_MAX, this only
> > > happens if the first (only?) range we log is ... (0, 0) ?
> > 
> > Yeah, basically it catches code that should not be logging buffers
> > because there is no dirty range in the buffer.
> > 
> > > Mildly confused about what these asserts are going after, since the
> > > first one implies that this shouldn't happen anyway.
> > 
> > If first is after last, then we've really screwed up because we've
> > got a dirty buffer with an invalid range. I can't recall seeing
> > either of these asserts fire, but we still need the check for clean
> > buffer ranges/ screwups in production code. maybe there's a better
> > way to do this?
> 
> I only came up with:
> 
> /*
>  * If the first bli_range has a last of 0, we've been fed a clean
>  * buffer.  This shouldn't happen but we'll be paranoid and check
>  * anyway.
>  */
> if (bip->bli_range[0].last == 0) {
> 	ASSERT(0);
> 	ASSERT(bip->bli_range[0].first == 0);
> 	return;
> }

Yup, that's a bit cleaner, I'll change it over.

> FWIW I also ran straight into this when I applied it for giggles and ran
> xfstests -g quick (generic/001 blew up):

I must have screwed up the forward port worse than usual - the
conflicts with the xfs_buf_log_item typedef removal were pretty
extensive.

> [   31.909228] ================================================================================
> [   31.911258] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
> [   31.912375] IP: xfs_buf_item_init+0x33/0x350 [xfs]

Hmmmm - I'm seeing that on my subvol smoke test script but not
elsewhere. I've been looking through the subvol code to try to find
this, maybe it's not the subvol code.  What mkfs parameters were
you using?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] xfs: byte range buffer dirty region tracking
  2018-02-01 23:16       ` Dave Chinner
@ 2018-02-01 23:22         ` Darrick J. Wong
  2018-02-01 23:55           ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2018-02-01 23:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Feb 02, 2018 at 10:16:47AM +1100, Dave Chinner wrote:
> On Thu, Feb 01, 2018 at 12:35:26PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 01, 2018 at 07:14:52PM +1100, Dave Chinner wrote:
> > > On Wed, Jan 31, 2018 at 09:11:28PM -0800, Darrick J. Wong wrote:
> > > > On Thu, Feb 01, 2018 at 12:05:14PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > One of the biggest performance problems with large directory block
> > > > > sizes is the CPU overhead in maintaining the buffer log item direty
> > > > > region bitmap.  The bit manipulations and buffer region mapping
> > > > > calls are right at the top of the profiles when running tests on 64k
> > > > > directory buffers:
> > > .....
> > > > > ---
> > > > >  fs/xfs/xfs_buf.c      |   2 +
> > > > >  fs/xfs/xfs_buf_item.c | 431 +++++++++++++++++++++++++-------------------------
> > > > >  fs/xfs/xfs_buf_item.h |  19 +++
> > > > >  3 files changed, 238 insertions(+), 214 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > index d1da2ee9e6db..7621fabeb505 100644
> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -1583,6 +1583,8 @@ xfs_buf_iomove(
> > > > >  		page = bp->b_pages[page_index];
> > > > >  		csize = min_t(size_t, PAGE_SIZE - page_offset,
> > > > >  				      BBTOB(bp->b_io_length) - boff);
> > > > > +		if (boff + csize > bend)
> > > > > +			csize = bend - boff;
> > > > 
> > > > How often does csize exceed bend?
> > > 
> > > /me checks notes when the patch was written a couple of years ago
> > > 
> > > Rarely. I didn't record the exact cause because it was a memory
> > > corruption bug that showed up long after the cause was gone.
> > > Reading between the lines, I think was a case where bsize was a
> > > single chunk (128 bytes), boff was 256 (third chunk in the buffer)
> > > b_io_length was 512 bytes and a page offset of ~512 bytes.
> > > 
> > > That means csize was coming out at 256 bytes, but we only wanted 128
> > > bytes to be copied. In most cases this didn't cause a problem
> > > because there was more space in the log iovec buffer being copied
> > > into, but occasionally it would be the last copy into the
> > > logvec buffer and that would overrun the user buffer and corrupt
> > > memory.
> > > 
> > > Essentially we are trying to copy from boff to bend, there's
> > > nothing in the loop to clamp the copy size to bend, and that's
> > > what this is doing. I can separate it out into another patch if you
> > > want - I'd completely forgotten this was in the patch because I've
> > > been running this patch in my tree for a long time now without
> > > really looking at it...
> > 
> > I don't know if this needs to be a separate patch, but it seems like the
> > upper levels shouldn't be sending us overlong lengths?  So either we
> > need to go find the ones that do and fix them to dtrt, possibly leaving
> > an assert here for "hey someone screwed up but we're fixing it"
> > analysis.
> 
> It was probably caused by a bug in the original range->bitmap
> conversion code I'd written, not by any of the external code. I'll
> add an assert into the code, but also leave the clamping so that
> production systems don't go bad if there's some other bug in the
> code that triggers it.
> 
> > > > > +	ASSERT(bip->bli_range[0].last != 0);
> > > > > +	if (bip->bli_range[0].last == 0) {
> > > > > +		/* clean! */
> > > > > +		ASSERT(bip->bli_range[0].first == 0);
> > > > 
> > > > Hm, so given that the firsts are initialized to UINT_MAX, this only
> > > > happens if the first (only?) range we log is ... (0, 0) ?
> > > 
> > > Yeah, basically it catches code that should not be logging buffers
> > > because there is no dirty range in the buffer.
> > > 
> > > > Mildly confused about what these asserts are going after, since the
> > > > first one implies that this shouldn't happen anyway.
> > > 
> > > If first is after last, then we've really screwed up because we've
> > > got a dirty buffer with an invalid range. I can't recall seeing
> > > either of these asserts fire, but we still need the check for clean
> > buffer ranges/screwups in production code. Maybe there's a better
> > > way to do this?
> > 
> > I only came up with:
> > 
> > /*
> >  * If the first bli_range has a last of 0, we've been fed a clean
> >  * buffer.  This shouldn't happen but we'll be paranoid and check
> >  * anyway.
> >  */
> > if (bip->bli_range[0].last == 0) {
> > 	ASSERT(0);
> > 	ASSERT(bip->bli_range[0].first == 0);
> > 	return;
> > }
> 
> Yup, that's a bit cleaner, I'll change it over.
> 
> > FWIW I also ran straight into this when I applied it for giggles and ran
> > xfstests -g quick (generic/001 blew up):
> 
> I must have screwed up the forward port worse than usual - the
> conflicts with the xfs_buf_log_item typedef removal were pretty
> extensive.

Ah, sorry about that.  I'd thought it was just the xfs_buf rename. :/

> > [   31.909228] ================================================================================
> > [   31.911258] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
> > [   31.912375] IP: xfs_buf_item_init+0x33/0x350 [xfs]
> 
> Hmmmm - I'm seeing that on my subvol smoke test script but not
> elsewhere. I've been looking through the subvol code to try to find
> this, maybe it's not the subvol code.  What mkfs parameters were
> you using?

mkfs.xfs -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem0

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] xfs: byte range buffer dirty region tracking
  2018-02-01 23:22         ` Darrick J. Wong
@ 2018-02-01 23:55           ` Dave Chinner
  2018-02-02 10:56             ` Brian Foster
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2018-02-01 23:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Feb 01, 2018 at 03:22:58PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 02, 2018 at 10:16:47AM +1100, Dave Chinner wrote:
> > On Thu, Feb 01, 2018 at 12:35:26PM -0800, Darrick J. Wong wrote:
> > > FWIW I also ran straight into this when I applied it for giggles and ran
> > > xfstests -g quick (generic/001 blew up):
> > 
> > I must have screwed up the forward port worse than usual - the
> > conflicts with the xfs_buf_log_item typedef removal were pretty
> > extensive.
> 
> Ah, sorry about that.  I'd thought it was just the xfs_buf rename. :/

Not your fault at all, Darrick!

I only complained about the xfs_buf typedef because it would cause
merge problems for ~80% of the patches in my current dev tree. This
was the only patch that the xfs_buf_log_item typedef removal
affected - more were affected by the trivial b_fspriv to b_log_item
changeover - and I figured that pain was worth it to get rid of
another typedef....

> > > [   31.909228] ================================================================================
> > > [   31.911258] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
> > > [   31.912375] IP: xfs_buf_item_init+0x33/0x350 [xfs]
> > 
> > Hmmmm - I'm seeing that on my subvol smoke test script but not
> > elsewhere. I've been looking through the subvol code to try to find
> > this, maybe it's not the subvol code.  What mkfs parameters where
> > you using?
> 
> mkfs.xfs -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem0

OK, nothing unusual, though I haven't been using sparse=1 recently.
I'll get onto it....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] xfs: byte range buffer dirty region tracking
  2018-02-01 23:55           ` Dave Chinner
@ 2018-02-02 10:56             ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2018-02-02 10:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 02, 2018 at 10:55:48AM +1100, Dave Chinner wrote:
> On Thu, Feb 01, 2018 at 03:22:58PM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 02, 2018 at 10:16:47AM +1100, Dave Chinner wrote:
> > > On Thu, Feb 01, 2018 at 12:35:26PM -0800, Darrick J. Wong wrote:
> > > > FWIW I also ran straight into this when I applied it for giggles and ran
> > > > xfstests -g quick (generic/001 blew up):
> > > 
> > > I must have screwed up the forward port worse than usual - the
> > > conflicts with the xfs_buf_log_item typedef removal were pretty
> > > extensive.
> > 
> > Ah, sorry about that.  I'd thought it was just the xfs_buf rename. :/
> 
> Not your fault at all, Darrick!
> 
> I only complained about the xfs_buf typedef because it would cause
> merge problems for ~80% of the patches in my current dev tree. This
> was the only patch that the xfs_buf_log_item typedef removal
> affected - more were affected by the trivial b_fspriv to b_log_item
> changeover - and I figured that pain was worth it to get rid of
> another typedef....
> 
> > > > [   31.909228] ================================================================================
> > > > [   31.911258] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
> > > > [   31.912375] IP: xfs_buf_item_init+0x33/0x350 [xfs]
> > > 
> > > Hmmmm - I'm seeing that on my subvol smoke test script but not
> > > elsewhere. I've been looking through the subvol code to try to find
> > this, maybe it's not the subvol code.  What mkfs parameters were
> > > you using?
> > 
> > mkfs.xfs -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem0
> 
> OK, nothing unusual, though I haven't been using sparse=1 recently.
> I'll get onto it....
> 

See xfs_buf_item_init() with this patch applied to for-next. bip can
still be NULL when we attempt to initialize the bip->bli_range fields,
because it isn't allocated until a few lines further down...
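
In other words, the for-next forward port ends up with roughly this
ordering in xfs_buf_item_init() (a sketch of the problem as described,
not the exact lines):

	/* bip is still NULL here ... */
	for (i = 0; i < XFS_BLI_RANGES; i++)
		bip->bli_range[i].first = UINT_MAX;	/* NULL pointer deref */
	...
	/* ... because it is only allocated further down */
	bip = kmem_zone_zalloc(xfs_buf_item_zone, KM_SLEEP);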

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-01  1:05 [PATCH] xfs: byte range buffer dirty region tracking Dave Chinner
  2018-02-01  5:11 ` Darrick J. Wong
@ 2018-02-05  0:34 ` Dave Chinner
  2018-02-06 16:21   ` Brian Foster
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2018-02-05  0:34 UTC (permalink / raw)
  To: linux-xfs


From: Dave Chinner <dchinner@redhat.com>

One of the biggest performance problems with large directory block
sizes is the CPU overhead in maintaining the buffer log item dirty
region bitmap.  The bit manipulations and buffer region mapping
calls are right at the top of the profiles when running tests on 64k
directory buffers:

  14.65%  [kernel]             [k] memcpy
   8.57%  [kernel]             [k] xfs_next_bit
   4.96%  [kernel]             [k] xfs_buf_item_format
   4.83%  [kernel]             [k] xfs_buf_item_size_segment.isra.4
   4.44%  [kernel]             [k] xfs_buf_offset



The memcpy is the copying of the dirty regions into the log vec
array, but almost twice as much CPU time is spent working out what
needs to be copied and where it needs to be copied from. As a
result, on a debug kernel running a parallel fsmark file create
workload we see performance like this on a 64k block size directory:

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     175994.3         13120040
     0      3200000            0     167829.7         14089107
     0      4800000            0     159274.7         15217029
     0      6400000            0     150097.3         16543647
....

In contrast, a 4k directory block size returns create rates around
310,000 files/s - almost 3x faster for the same CPU burn.

This patch switches the dirty range tracking to just the first and
last modified bytes in 4 separate regions on the buffer. This only
gets converted to a bitmap when the item is formatted into the CIL
log vector array.  Hence the profile of the relevant formatting
functions now looks like:

  22.21%  [kernel]  [k] memcpy
   0.51%  [kernel]  [k] xfs_buf_item_init
   0.49%  [kernel]  [k] xfs_buf_item_unlock
   0.39%  [kernel]  [k] xfs_buf_item_size
   0.29%  [kernel]  [k] xfs_buf_item_format
   0.20%  [kernel]  [k] xfs_buf_item_log
   0.14%  [kernel]  [k] xfs_buf_item_committing 

And the performance is:

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     224963.5         12631894
     0      3200000            0     226142.4         12608851
     0      4800000            0     237453.1         12509915
     0      6400000            0     233356.8         12939907

Substantially higher.
The memcpy time is higher because that's where we spend most of
the CPU we saved - in the buffer formatting routine:

....
       __xfs_trans_commit
        xfs_log_commit_cil
        xfs_buf_item_format
        xfs_buf_item_format_segment
        xfs_buf_iomove
        memcpy

Hence we can see that there is major reduction in buffer formatting
overhead that translates to improved performance.

The current implementation tracks, at most, four dirty regions per
buffer.  The nature of directory operations results in almost every
operation modifying a header in the buffer, a tail section in the
buffer and then some number of bytes/regions in the middle of the
buffer.

If we just track a single region, it will almost always cover the
entire directory buffer as we do updates to both the head and tail
of most directory buffers.  That's a fairly large cost in terms of
log space and CPU overhead for random individual operations.
Similarly, increasing the number of regions to 8 (from 4) reduces
performance by 5-10%, so the gains from tracking multiple regions
tail off very quickly.

We also have to consider non-directory buffer modification patterns.
Freespace, inode and extent btrees are the other major types of
buffers that get logged, but they also have modification patterns
that lend themselves well to a small number of ranges for dirty
tracking. That is, each btree block is kept compact, so when we
insert or remove a record or pointer we shift the higher
records/ptrs up or down as a block, and then log the lot of them.
And they also often have a header that is dirtied with each
insert/delete, so there are usually only one or two dirty
ranges in a btree block.

The only metadata type that really seems to benefit from fine
grained dirty range logging is the inode buffers. Specifically, for
v4 superblocks the create transaction only dirties the regions of
the inode core, so for 256 byte inodes it only dirties every
alternate bitmap segment.  Dirty range tracking will double the
required log bandwidth of inode buffers during create (roughly a
25% increase on a 4k directory block size filesystem). Typically
this won't result in a noticeable performance differential (except
in inode creation
benchmarks) on typical systems because the log is generally far from
being bandwidth bound.

For v5 filesystems, even this isn't an issue because the initialised
inode buffers are XFS_BLI_ORDERED buffers and so their contents
aren't logged.

The same problem happens with unlinks due to the unlinked list being
logged via the inode buffer. Again this results in an increase
in log bandwidth on both v4 and v5 filesystems, but there isn't any
performance differential that occurs because, again, the log isn't
bandwidth bound. As it is, there is an existing plan of improvement
to the unlinked list logging (moving the unlinked list logging into
the inode core transaction) and hence that will avoid any extra
overhead here as well.

Hence the overall CPU reduction benefits of minimal dirty range
tracking versus fine grained dirty bit tracking are going to be a
net win for performance and throughput on current (v5) format
filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Version 2:
- fixed up bugs due to mismerges on conflict resolution with current
  for-next tree
- cleaned up assert/comments in xfs_buf_item_size around empty
  buffer logging
- rediscovered the cause and reworked the fix in xfs_buf_iomove().
  This fix is needed because xfs_buf_iomove doesn't currently work
  with partial buffer regions. All current callers pass full buffer
  lengths, so this patch exposes a latent bug in that code and hence
  needs fixing to work correctly with this change.

 fs/xfs/xfs_buf.c      |  23 ++-
 fs/xfs/xfs_buf_item.c | 437 ++++++++++++++++++++++++++------------------------
 fs/xfs/xfs_buf_item.h |  19 +++
 3 files changed, 261 insertions(+), 218 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index d1da2ee9e6db..88587b33dd15 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1561,7 +1561,11 @@ xfs_buf_offset(
 }
 
 /*
- *	Move data into or out of a buffer.
+ * Move data into or out of a buffer.
+ *
+ * Care must be taken to ensure that neither source or destination
+ * data buffers are overrun as the user can request partial buffer range
+ * operations.
  */
 void
 xfs_buf_iomove(
@@ -1581,10 +1585,21 @@ xfs_buf_iomove(
 		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
 		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
 		page = bp->b_pages[page_index];
-		csize = min_t(size_t, PAGE_SIZE - page_offset,
-				      BBTOB(bp->b_io_length) - boff);
 
-		ASSERT((csize + page_offset) <= PAGE_SIZE);
+		/*
+		 * We cannot copy past the end of the current page,
+		 * the end of the buffer IO region, or the end of the
+		 * copy region requested by the user.
+		 */
+		csize = PAGE_SIZE - page_offset;
+		if (csize > BBTOB(bp->b_io_length) - boff)
+			csize = BBTOB(bp->b_io_length) - boff;
+		if (boff + csize > bend)
+			csize = bend - boff;
+		if (csize <= 0)
+			break;
+
+		ASSERT(csize + page_offset <= PAGE_SIZE);
 
 		switch (mode) {
 		case XBRW_ZERO:
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 270ddb4d2313..bc6514a08760 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -66,50 +66,12 @@ xfs_buf_item_size_segment(
 	int				*nvecs,
 	int				*nbytes)
 {
-	struct xfs_buf			*bp = bip->bli_buf;
-	int				next_bit;
-	int				last_bit;
-
-	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
-	if (last_bit == -1)
-		return;
-
 	/*
 	 * initial count for a dirty buffer is 2 vectors - the format structure
-	 * and the first dirty region.
+	 * and the dirty region. Dirty region is accounted for separately.
 	 */
 	*nvecs += 2;
-	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
-
-	while (last_bit != -1) {
-		/*
-		 * This takes the bit number to start looking from and
-		 * returns the next set bit from there.  It returns -1
-		 * if there are no more bits set or the start bit is
-		 * beyond the end of the bitmap.
-		 */
-		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
-					last_bit + 1);
-		/*
-		 * If we run out of bits, leave the loop,
-		 * else if we find a new set of bits bump the number of vecs,
-		 * else keep scanning the current set of bits.
-		 */
-		if (next_bit == -1) {
-			break;
-		} else if (next_bit != last_bit + 1) {
-			last_bit = next_bit;
-			(*nvecs)++;
-		} else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) !=
-			   (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) +
-			    XFS_BLF_CHUNK)) {
-			last_bit = next_bit;
-			(*nvecs)++;
-		} else {
-			last_bit++;
-		}
-		*nbytes += XFS_BLF_CHUNK;
-	}
+	*nbytes += xfs_buf_log_format_size(blfp);
 }
 
 /*
@@ -136,7 +98,9 @@ xfs_buf_item_size(
 	int			*nbytes)
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
-	int			i;
+	struct xfs_buf		*bp = bip->bli_buf;
+	uint			offset;
+	int			i, j;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
@@ -155,6 +119,7 @@ xfs_buf_item_size(
 	}
 
 	ASSERT(bip->bli_flags & XFS_BLI_LOGGED);
+	ASSERT(bip->bli_flags & XFS_BLI_DIRTY);
 
 	if (bip->bli_flags & XFS_BLI_ORDERED) {
 		/*
@@ -167,19 +132,53 @@ xfs_buf_item_size(
 		return;
 	}
 
+
+	/*
+	 * If the last byte of the first range is zero, then we've been fed a
+	 * clean buffer with a XFS_BLI_DIRTY flag set. This should never happen,
+	 * but be paranoid and catch it. If it does happen, then first should be
+	 * zero, too.
+	 */
+	if (bip->bli_range[0].last == 0) {
+		ASSERT(0);
+		ASSERT(bip->bli_range[0].first == 0);
+		return;
+	}
+
 	/*
 	 * the vector count is based on the number of buffer vectors we have
-	 * dirty bits in. This will only be greater than one when we have a
+	 * dirty ranges in. This will only be greater than one when we have a
 	 * compound buffer with more than one segment dirty. Hence for compound
-	 * buffers we need to track which segment the dirty bits correspond to,
-	 * and when we move from one segment to the next increment the vector
-	 * count for the extra buf log format structure that will need to be
-	 * written.
+	 * buffers we need to track which segment the dirty ranges correspond
+	 * to, and when we move from one segment to the next increment the
+	 * vector count for the extra buf log format structure that will need to
+	 * be written.
 	 */
-	for (i = 0; i < bip->bli_format_count; i++) {
-		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
-					  nvecs, nbytes);
+	for (i = 0, offset = 0;
+	     i < bip->bli_format_count;
+	     i++, offset += BBTOB(bp->b_maps[i].bm_len)) {
+		/* Only format dirty regions */
+		for (j = 0; j < bip->bli_ranges; j++) {
+			struct xfs_bli_range *rp = &bip->bli_range[j];
+
+			/* range ends before segment start, check next range */
+			if (rp->last < offset)
+				continue;
+
+			/* range beyond segment end, check next segment */
+			if (rp->first > offset + BBTOB(bp->b_maps[i].bm_len))
+				break;
+
+			/* dirty range overlaps segment, need headers */
+			xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
+						  nvecs, nbytes);
+		}
 	}
+
+	for (j = 0; j < bip->bli_ranges; j++)
+		*nbytes += bip->bli_range[j].last - bip->bli_range[j].first;
+
+
 	trace_xfs_buf_item_size(bip);
 }
 
@@ -192,7 +191,6 @@ xfs_buf_item_copy_iovec(
 	int			first_bit,
 	uint			nbits)
 {
-	offset += first_bit * XFS_BLF_CHUNK;
 	xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BCHUNK,
 			xfs_buf_offset(bp, offset),
 			nbits * XFS_BLF_CHUNK);
@@ -215,14 +213,18 @@ xfs_buf_item_format_segment(
 	struct xfs_buf_log_item	*bip,
 	struct xfs_log_vec	*lv,
 	struct xfs_log_iovec	**vecp,
+	struct xfs_bli_range	*rp,
 	uint			offset,
+	uint			length,
 	struct xfs_buf_log_format *blfp)
 {
 	struct xfs_buf		*bp = bip->bli_buf;
+	char			*buf;
 	uint			base_size;
+	uint			start;
+	uint			end;
 	int			first_bit;
 	int			last_bit;
-	int			next_bit;
 	uint			nbits;
 
 	/* copy the flags across from the base format item */
@@ -234,16 +236,6 @@ xfs_buf_item_format_segment(
 	 * memory structure.
 	 */
 	base_size = xfs_buf_log_format_size(blfp);
-
-	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
-	if (!(bip->bli_flags & XFS_BLI_STALE) && first_bit == -1) {
-		/*
-		 * If the map is not be dirty in the transaction, mark
-		 * the size as zero and do not advance the vector pointer.
-		 */
-		return;
-	}
-
 	blfp = xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BFORMAT, blfp, base_size);
 	blfp->blf_size = 1;
 
@@ -258,46 +250,40 @@ xfs_buf_item_format_segment(
 		return;
 	}
 
+	blfp->blf_size++;
 
 	/*
-	 * Fill in an iovec for each set of contiguous chunks.
+	 * Now we need to set the bits in the bitmap and set up the iovecs
+	 * appropriately. We know there is a contiguous range in this buffer
+	 * that needs to be set, so find the first bit, the last bit, and
+	 * go from there.
 	 */
-	last_bit = first_bit;
-	nbits = 1;
-	for (;;) {
-		/*
-		 * This takes the bit number to start looking from and
-		 * returns the next set bit from there.  It returns -1
-		 * if there are no more bits set or the start bit is
-		 * beyond the end of the bitmap.
-		 */
-		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
-					(uint)last_bit + 1);
-		/*
-		 * If we run out of bits fill in the last iovec and get out of
-		 * the loop.  Else if we start a new set of bits then fill in
-		 * the iovec for the series we were looking at and start
-		 * counting the bits in the new one.  Else we're still in the
-		 * same set of bits so just keep counting and scanning.
-		 */
-		if (next_bit == -1) {
-			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
-						first_bit, nbits);
-			blfp->blf_size++;
-			break;
-		} else if (next_bit != last_bit + 1 ||
-		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
-			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
-						first_bit, nbits);
-			blfp->blf_size++;
-			first_bit = next_bit;
-			last_bit = next_bit;
-			nbits = 1;
-		} else {
-			last_bit++;
-			nbits++;
-		}
-	}
+	start = 0;
+	if (offset < rp->first)
+		start = rp->first - offset;
+	end = length - 1;
+	if (offset + length > rp->last)
+		end = rp->last - offset - 1;
+
+	start &= ~((1 << XFS_BLF_SHIFT) - 1);
+	first_bit = start >> XFS_BLF_SHIFT;
+	last_bit = end >> XFS_BLF_SHIFT;
+	nbits = last_bit - first_bit + 1;
+	bitmap_set((unsigned long *)blfp->blf_data_map, first_bit, nbits);
+
+	ASSERT(end <= length);
+	ASSERT(start <= length);
+	ASSERT(length >= nbits * XFS_BLF_CHUNK);
+	/*
+	 * Copy needs to be done a buffer page at a time as we can be logging
+	 * unmapped buffers. Hence we have to use xfs_buf_iomove() rather than a
+	 * straight memcpy here.
+	 */
+	offset += first_bit * XFS_BLF_CHUNK;
+	length = nbits * XFS_BLF_CHUNK;
+	buf = xlog_prepare_iovec(lv, vecp, XLOG_REG_TYPE_BCHUNK);
+	xfs_buf_iomove(bp, offset, length, buf, XBRW_READ);
+	xlog_finish_iovec(lv, *vecp, length);
 }
 
 /*
@@ -314,8 +300,8 @@ xfs_buf_item_format(
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
 	struct xfs_buf		*bp = bip->bli_buf;
 	struct xfs_log_iovec	*vecp = NULL;
-	uint			offset = 0;
-	int			i;
+	uint			offset;
+	int			i, j;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	ASSERT((bip->bli_flags & XFS_BLI_LOGGED) ||
@@ -326,7 +312,6 @@ xfs_buf_item_format(
 	ASSERT(!(bip->bli_flags & XFS_BLI_ORDERED) ||
 	       (bip->bli_flags & XFS_BLI_STALE));
 
-
 	/*
 	 * If it is an inode buffer, transfer the in-memory state to the
 	 * format flags and clear the in-memory state.
@@ -349,10 +334,36 @@ xfs_buf_item_format(
 		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
 	}
 
-	for (i = 0; i < bip->bli_format_count; i++) {
-		xfs_buf_item_format_segment(bip, lv, &vecp, offset,
-					    &bip->bli_formats[i]);
-		offset += BBTOB(bp->b_maps[i].bm_len);
+	for (i = 0, offset = 0;
+	     i < bip->bli_format_count;
+	     i++, offset += BBTOB(bp->b_maps[i].bm_len)) {
+
+		/* stale regions cover the entire segment */
+		if (bip->bli_flags & XFS_BLI_STALE) {
+			xfs_buf_item_format_segment(bip, lv, &vecp, NULL, offset,
+						    BBTOB(bp->b_maps[i].bm_len),
+						    &bip->bli_formats[i]);
+			continue;
+		}
+
+		/* only format dirty ranges over the current segment */
+		for (j = 0; j < bip->bli_ranges; j++) {
+			struct xfs_bli_range *rp = &bip->bli_range[j];
+
+			/* range ends before segment start, check next range */
+			if (rp->last < offset)
+				continue;
+
+			/* range beyond segment end, check next segment */
+			if (rp->first > offset + BBTOB(bp->b_maps[i].bm_len))
+				break;
+
+			/* dirty range overlaps segment, need headers */
+			xfs_buf_item_format_segment(bip, lv, &vecp, rp, offset,
+						    BBTOB(bp->b_maps[i].bm_len),
+						    &bip->bli_formats[i]);
+
+		}
 	}
 
 	/*
@@ -751,6 +762,9 @@ xfs_buf_item_init(
 	bip = kmem_zone_zalloc(xfs_buf_item_zone, KM_SLEEP);
 	xfs_log_item_init(mp, &bip->bli_item, XFS_LI_BUF, &xfs_buf_item_ops);
 	bip->bli_buf = bp;
+	for (i = 0; i < XFS_BLI_RANGES; i++)
+		bip->bli_range[i].first = UINT_MAX;
+
 
 	/*
 	 * chunks is the number of XFS_BLF_CHUNK size pieces the buffer
@@ -788,133 +802,136 @@ xfs_buf_item_init(
 
 /*
  * Mark bytes first through last inclusive as dirty in the buf
- * item's bitmap.
+ * record dirty regions on the buffer.
  */
-static void
-xfs_buf_item_log_segment(
+void
+xfs_buf_item_log(
+	struct xfs_buf_log_item	*bip,
 	uint			first,
-	uint			last,
-	uint			*map)
+	uint			last)
 {
-	uint		first_bit;
-	uint		last_bit;
-	uint		bits_to_set;
-	uint		bits_set;
-	uint		word_num;
-	uint		*wordp;
-	uint		bit;
-	uint		end_bit;
-	uint		mask;
+	struct xfs_bli_range	*rp = NULL;
+	int			i;
+	ASSERT(last != 0);
+	ASSERT(first <= last);
+	ASSERT(last < BBTOB(bip->bli_buf->b_length));
+
+	/* simple case - first range being stored */
+	if (!bip->bli_ranges) {
+		bip->bli_ranges = 1;
+		bip->bli_range[0].first = rounddown(first, XFS_BLF_CHUNK);
+		bip->bli_range[0].last = roundup(last, XFS_BLF_CHUNK);
+		ASSERT(bip->bli_range[0].last != 0);
+		ASSERT(bip->bli_range[0].first <= bip->bli_range[0].last);
+		return;
+	}
 
-	/*
-	 * Convert byte offsets to bit numbers.
-	 */
-	first_bit = first >> XFS_BLF_SHIFT;
-	last_bit = last >> XFS_BLF_SHIFT;
+	/* 2nd case: search for overlaps and extend */
+	for (i = 0; i < bip->bli_ranges; i++) {
+		rp = &bip->bli_range[i];
 
-	/*
-	 * Calculate the total number of bits to be set.
-	 */
-	bits_to_set = last_bit - first_bit + 1;
+		/* wholly within an existing dirty range, we're done */
+		if (first >= rp->first && last <= rp->last)
+			return;
+		/* no overlap, continue */
+		if (first > rp->last || last < rp->first)
+			continue;
 
-	/*
-	 * Get a pointer to the first word in the bitmap
-	 * to set a bit in.
-	 */
-	word_num = first_bit >> BIT_TO_WORD_SHIFT;
-	wordp = &map[word_num];
+		/* left edge overlap, extend */
+		if (first < rp->first)
+			rp->first = rounddown(first, XFS_BLF_CHUNK);
 
-	/*
-	 * Calculate the starting bit in the first word.
-	 */
-	bit = first_bit & (uint)(NBWORD - 1);
+		/* right edge overlap, extend */
+		if (last > rp->last)
+			rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
 
-	/*
-	 * First set any bits in the first word of our range.
-	 * If it starts at bit 0 of the word, it will be
-	 * set below rather than here.  That is what the variable
-	 * bit tells us. The variable bits_set tracks the number
-	 * of bits that have been set so far.  End_bit is the number
-	 * of the last bit to be set in this word plus one.
-	 */
-	if (bit) {
-		end_bit = MIN(bit + bits_to_set, (uint)NBWORD);
-		mask = ((1U << (end_bit - bit)) - 1) << bit;
-		*wordp |= mask;
-		wordp++;
-		bits_set = end_bit - bit;
-	} else {
-		bits_set = 0;
+		goto merge;
 	}
 
-	/*
-	 * Now set bits a whole word at a time that are between
-	 * first_bit and last_bit.
-	 */
-	while ((bits_to_set - bits_set) >= NBWORD) {
-		*wordp |= 0xffffffff;
-		bits_set += NBWORD;
-		wordp++;
-	}
+	/* 3rd case: not found, insert or extend */
+	ASSERT(i == bip->bli_ranges);
 
 	/*
 	 * Finally, set any bits left to be set in one last partial word.
+	 * Case 3a: Extend last slot.
+	 *
+	 * If the range is beyond the last slot, extend the last slot to
+	 * cover it. This is treated the same as if an overlap existed with
+	 * the last range.
 	 */
-	end_bit = bits_to_set - bits_set;
-	if (end_bit) {
-		mask = (1U << end_bit) - 1;
-		*wordp |= mask;
+	if (i == XFS_BLI_RANGES) {
+		ASSERT(bip->bli_ranges == XFS_BLI_RANGES);
+		rp = &bip->bli_range[XFS_BLI_RANGES - 1];
+
+		if (first < rp->first)
+			rp->first = rounddown(first, XFS_BLF_CHUNK);
+		if (last > rp->last)
+			rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
+		goto merge;
 	}
-}
 
-/*
- * Mark bytes first through last inclusive as dirty in the buf
- * item's bitmap.
- */
-void
-xfs_buf_item_log(
-	struct xfs_buf_log_item	*bip,
-	uint			first,
-	uint			last)
-{
-	int			i;
-	uint			start;
-	uint			end;
-	struct xfs_buf		*bp = bip->bli_buf;
+	/* Case 3b: insert new range.
+	 *
+	 * Find the insertion point for the new range, then make a hole
+	 * and insert the new range.
+	 */
+	for (i = 0; i < bip->bli_ranges; i++) {
+		rp = &bip->bli_range[i];
 
+		/* order ranges by ascending offset */
+		if (last < rp->first)
+			break;
+	}
+	/* shift down and insert */
+	ASSERT(i < XFS_BLI_RANGES);
+	rp = &bip->bli_range[i];
+	if (i < XFS_BLI_RANGES - 1)
+		memmove(rp + 1, rp, sizeof(*rp) * (bip->bli_ranges - i));
+	bip->bli_ranges++;
+	rp->first = rounddown(first, XFS_BLF_CHUNK);
+	rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
+
+merge:
 	/*
-	 * walk each buffer segment and mark them dirty appropriately.
+	 * Check for overlapping ranges and merge them. If there is only one
+	 * range, there is nothing to merge so bail early.
 	 */
-	start = 0;
-	for (i = 0; i < bip->bli_format_count; i++) {
-		if (start > last)
-			break;
-		end = start + BBTOB(bp->b_maps[i].bm_len) - 1;
+	if (bip->bli_ranges == 1)
+		return;
+
+	for (i = 0; i < bip->bli_ranges - 1; i++) {
+		struct xfs_bli_range *rp_next;
+
+		rp = &bip->bli_range[i];
+		rp_next = &bip->bli_range[i + 1];
 
-		/* skip to the map that includes the first byte to log */
-		if (first > end) {
-			start += BBTOB(bp->b_maps[i].bm_len);
+
+check_merge:
+		ASSERT(rp->last != 0);
+		ASSERT(rp->first <= rp->last);
+
+		/* no overlap or adjacent, move on */
+		if (rp->last < rp_next->first - 1)
 			continue;
-		}
 
 		/*
-		 * Trim the range to this segment and mark it in the bitmap.
-		 * Note that we must convert buffer offsets to segment relative
-		 * offsets (e.g., the first byte of each segment is byte 0 of
-		 * that segment).
+		 * overlap: select lowest first, highest last, remove the merged
+		 * range (rp_next) and then go back and check the next range for
+		 * whether it can be merged (e.g. we have 4 separate ranges,
+		 * then something logs the buffer entirely. This merges all
+		 * ranges into one).
 		 */
-		if (first < start)
-			first = start;
-		if (end > last)
-			end = last;
-		xfs_buf_item_log_segment(first - start, end - start,
-					 &bip->bli_formats[i].blf_data_map[0]);
-
-		start += BBTOB(bp->b_maps[i].bm_len);
+		rp->first = min(rp->first, rp_next->first);
+		rp->last = max(rp->last, rp_next->last);
+		if (i + 2 < bip->bli_ranges)
+			memmove(rp_next, rp_next + 1, sizeof(*rp) *
+						(bip->bli_ranges - i - 2));
+		bip->bli_ranges--;
+		if (i < bip->bli_ranges - 1)
+			goto check_merge;
 	}
 }
 
-
 /*
  * Return true if the buffer has any ranges logged/dirtied by a transaction,
  * false otherwise.
@@ -923,15 +940,7 @@ bool
 xfs_buf_item_dirty_format(
 	struct xfs_buf_log_item	*bip)
 {
-	int			i;
-
-	for (i = 0; i < bip->bli_format_count; i++) {
-		if (!xfs_bitmap_empty(bip->bli_formats[i].blf_data_map,
-			     bip->bli_formats[i].blf_map_size))
-			return true;
-	}
-
-	return false;
+	return bip->bli_ranges > 0;
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 643f53dcfe51..9b278c3a2db9 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -57,6 +57,25 @@ struct xfs_buf_log_item {
 	unsigned int		bli_recur;	/* lock recursion count */
 	atomic_t		bli_refcount;	/* cnt of tp refs */
 	int			bli_format_count;	/* count of headers */
+
+	/*
+	 * logging ranges. Keep a small number of distinct ranges rather than a
+	 * bitmap which is expensive to maintain.
+	 * 4 separate ranges is probably optimal so that we
+	 * can log separate header, tail and content changes (e.g. for dir
+	 * structures) without capturing the entire buffer unnecessarily for
+	 * isolated changes.
+	 *
+	 * Note: ranges are 32 bit values because we have to support an end
+	 * range value of 0x10000....
+	 */
+#define XFS_BLI_RANGES	4
+	struct xfs_bli_range {
+		uint32_t	first;
+		uint32_t	last;
+	}			bli_range[XFS_BLI_RANGES];
+	int			bli_ranges;
+
 	struct xfs_buf_log_format *bli_formats;	/* array of in-log header ptrs */
 	struct xfs_buf_log_format __bli_format;	/* embedded in-log header */
 };

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-05  0:34 ` [PATCH v2] " Dave Chinner
@ 2018-02-06 16:21   ` Brian Foster
  2018-02-12  2:41     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2018-02-06 16:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Feb 05, 2018 at 11:34:15AM +1100, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> One of the biggest performance problems with large directory block
> > sizes is the CPU overhead in maintaining the buffer log item dirty
> region bitmap.  The bit manipulations and buffer region mapping
> calls are right at the top of the profiles when running tests on 64k
> directory buffers:
> 
...
> 
>  fs/xfs/xfs_buf.c      |  23 ++-
>  fs/xfs/xfs_buf_item.c | 437 ++++++++++++++++++++++++++------------------------
>  fs/xfs/xfs_buf_item.h |  19 +++
>  3 files changed, 261 insertions(+), 218 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 270ddb4d2313..bc6514a08760 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -66,50 +66,12 @@ xfs_buf_item_size_segment(
>  	int				*nvecs,
>  	int				*nbytes)
>  {
> -	struct xfs_buf			*bp = bip->bli_buf;
> -	int				next_bit;
> -	int				last_bit;
> -
> -	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> -	if (last_bit == -1)
> -		return;
> -
>  	/*
>  	 * initial count for a dirty buffer is 2 vectors - the format structure
> -	 * and the first dirty region.
> +	 * and the dirty region. Dirty region is accounted for separately.
>  	 */
>  	*nvecs += 2;
> -	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
> -
> -	while (last_bit != -1) {
> -		/*
> -		 * This takes the bit number to start looking from and
> -		 * returns the next set bit from there.  It returns -1
> -		 * if there are no more bits set or the start bit is
> -		 * beyond the end of the bitmap.
> -		 */
> -		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> -					last_bit + 1);
> -		/*
> -		 * If we run out of bits, leave the loop,
> -		 * else if we find a new set of bits bump the number of vecs,
> -		 * else keep scanning the current set of bits.
> -		 */
> -		if (next_bit == -1) {
> -			break;
> -		} else if (next_bit != last_bit + 1) {
> -			last_bit = next_bit;
> -			(*nvecs)++;
> -		} else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) !=
> -			   (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) +
> -			    XFS_BLF_CHUNK)) {
> -			last_bit = next_bit;
> -			(*nvecs)++;
> -		} else {
> -			last_bit++;
> -		}
> -		*nbytes += XFS_BLF_CHUNK;
> -	}
> +	*nbytes += xfs_buf_log_format_size(blfp);

This function has been reduced such that the comment at the top could
probably use updating. In fact, we may be able to kill it entirely if
it's not used anywhere else..?
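
For reference, all that is left of the function body after this patch
is effectively:

	*nvecs += 2;		/* format header + dirty region */
	*nbytes += xfs_buf_log_format_size(blfp);

so folding those two lines into the caller would be simple enough if
it does get killed.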

>  }
>  
>  /*
> @@ -136,7 +98,9 @@ xfs_buf_item_size(
>  	int			*nbytes)
>  {
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
> -	int			i;
> +	struct xfs_buf		*bp = bip->bli_buf;
> +	uint			offset;
> +	int			i, j;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  	if (bip->bli_flags & XFS_BLI_STALE) {
> @@ -155,6 +119,7 @@ xfs_buf_item_size(
>  	}
>  
>  	ASSERT(bip->bli_flags & XFS_BLI_LOGGED);
> +	ASSERT(bip->bli_flags & XFS_BLI_DIRTY);
>  
>  	if (bip->bli_flags & XFS_BLI_ORDERED) {
>  		/*
> @@ -167,19 +132,53 @@ xfs_buf_item_size(
>  		return;
>  	}
>  
> +
> +	/*
> +	 * If the last byte of the first range is zero, then we've been fed a
> +	 * clean buffer with a XFS_BLI_DIRTY flag set. This should never happen,
> +	 * but be paranoid and catch it. If it does happen, then first should be
> +	 * zero, too.
> +	 */
> +	if (bip->bli_range[0].last == 0) {
> +		ASSERT(0);
> +		ASSERT(bip->bli_range[0].first == 0);
> +		return;
> +	}

Isn't first == last == 0 a valid, inclusive range? Perhaps this makes
sense since by this point the range should be rounded out to the chunk
size, but more on that below..

> +
>  	/*
>  	 * the vector count is based on the number of buffer vectors we have
> -	 * dirty bits in. This will only be greater than one when we have a
> +	 * dirty ranges in. This will only be greater than one when we have a
>  	 * compound buffer with more than one segment dirty. Hence for compound
> -	 * buffers we need to track which segment the dirty bits correspond to,
> -	 * and when we move from one segment to the next increment the vector
> -	 * count for the extra buf log format structure that will need to be
> -	 * written.
> +	 * buffers we need to track which segment the dirty ranges correspond
> +	 * to, and when we move from one segment to the next increment the
> +	 * vector count for the extra buf log format structure that will need to
> +	 * be written.
>  	 */
> -	for (i = 0; i < bip->bli_format_count; i++) {
> -		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> -					  nvecs, nbytes);
> +	for (i = 0, offset = 0;
> +	     i < bip->bli_format_count;
> +	     i++, offset += BBTOB(bp->b_maps[i].bm_len)) {
> +		/* Only format dirty regions */
> +		for (j = 0; j < bip->bli_ranges; j++) {
> +			struct xfs_bli_range *rp = &bip->bli_range[j];
> +
> +			/* range ends before segment start, check next range */
> +			if (rp->last < offset)
> +				continue;
> +
> +			/* range beyond segment end, check next segment */
> +			if (rp->first > offset + BBTOB(bp->b_maps[i].bm_len))
> +				break;
> +
> +			/* dirty range overlaps segment, need headers */
> +			xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> +						  nvecs, nbytes);
> +		}
>  	}
> +
> +	for (j = 0; j < bip->bli_ranges; j++)
> +		*nbytes += bip->bli_range[j].last - bip->bli_range[j].first;
> +
> +
>  	trace_xfs_buf_item_size(bip);
>  }
>  
> @@ -192,7 +191,6 @@ xfs_buf_item_copy_iovec(
>  	int			first_bit,
>  	uint			nbits)
>  {
> -	offset += first_bit * XFS_BLF_CHUNK;
>  	xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BCHUNK,
>  			xfs_buf_offset(bp, offset),
>  			nbits * XFS_BLF_CHUNK);
> @@ -215,14 +213,18 @@ xfs_buf_item_format_segment(
>  	struct xfs_buf_log_item	*bip,
>  	struct xfs_log_vec	*lv,
>  	struct xfs_log_iovec	**vecp,
> +	struct xfs_bli_range	*rp,
>  	uint			offset,
> +	uint			length,
>  	struct xfs_buf_log_format *blfp)
>  {
>  	struct xfs_buf		*bp = bip->bli_buf;
> +	char			*buf;
>  	uint			base_size;
> +	uint			start;
> +	uint			end;
>  	int			first_bit;
>  	int			last_bit;
> -	int			next_bit;
>  	uint			nbits;
>  
>  	/* copy the flags across from the base format item */
> @@ -234,16 +236,6 @@ xfs_buf_item_format_segment(
>  	 * memory structure.
>  	 */
>  	base_size = xfs_buf_log_format_size(blfp);
> -
> -	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> -	if (!(bip->bli_flags & XFS_BLI_STALE) && first_bit == -1) {
> -		/*
> -		 * If the map is not be dirty in the transaction, mark
> -		 * the size as zero and do not advance the vector pointer.
> -		 */
> -		return;
> -	}
> -
>  	blfp = xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BFORMAT, blfp, base_size);
>  	blfp->blf_size = 1;

Perhaps we should set the datamap bits before we format out the blf? ;)

>  
> @@ -258,46 +250,40 @@ xfs_buf_item_format_segment(
>  		return;
>  	}
>  
> +	blfp->blf_size++;
>  
>  	/*
> -	 * Fill in an iovec for each set of contiguous chunks.
> +	 * Now we need to set the bits in the bitmap and set up the iovecs
> +	 * appropriately. We know there is a contiguous range in this buffer
> +	 * that needs to be set, so find the first bit, the last bit, and
> +	 * go from there.
>  	 */
> -	last_bit = first_bit;
> -	nbits = 1;
> -	for (;;) {
> -		/*
> -		 * This takes the bit number to start looking from and
> -		 * returns the next set bit from there.  It returns -1
> -		 * if there are no more bits set or the start bit is
> -		 * beyond the end of the bitmap.
> -		 */
> -		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> -					(uint)last_bit + 1);
> -		/*
> -		 * If we run out of bits fill in the last iovec and get out of
> -		 * the loop.  Else if we start a new set of bits then fill in
> -		 * the iovec for the series we were looking at and start
> -		 * counting the bits in the new one.  Else we're still in the
> -		 * same set of bits so just keep counting and scanning.
> -		 */
> -		if (next_bit == -1) {
> -			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> -						first_bit, nbits);
> -			blfp->blf_size++;
> -			break;
> -		} else if (next_bit != last_bit + 1 ||
> -		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {

FYI... this kills the only callers of xfs_buf_item_copy_iovec() and
xfs_buf_item_straddle() so they should probably be removed.

> -			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> -						first_bit, nbits);
> -			blfp->blf_size++;
> -			first_bit = next_bit;
> -			last_bit = next_bit;
> -			nbits = 1;
> -		} else {
> -			last_bit++;
> -			nbits++;
> -		}
> -	}
> +	start = 0;
> +	if (offset < rp->first)
> +		start = rp->first - offset;
> +	end = length - 1;
> +	if (offset + length > rp->last)
> +		end = rp->last - offset - 1;
> +

FWIW, it took me a second to identify what was going on here. It might
be useful to incorporate that we're calculating the relative byte
offsets in order to convert into the bitmap in the new comment above.
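
For example (illustrative numbers only, assuming the existing 128 byte
chunks, i.e. XFS_BLF_SHIFT == 7): a dirty range starting at buffer byte
33024 that falls in a segment mapped at offset 32768 works out as

	start     = rp->first - offset     = 33024 - 32768 = 256
	first_bit = start >> XFS_BLF_SHIFT = 256 >> 7       = 2

i.e. the absolute buffer offset is first made segment-relative and only
then converted into chunk bits.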

Also, I could be lost in the maze a bit here, but why the '- 1' in the
end calculation above? Isn't rp->last inclusive?

> +	start &= ~((1 << XFS_BLF_SHIFT) - 1);
> +	first_bit = start >> XFS_BLF_SHIFT;

Why the mask if we're going to right shift anyways?

> +	last_bit = end >> XFS_BLF_SHIFT;
> +	nbits = last_bit - first_bit + 1;
> +	bitmap_set((unsigned long *)blfp->blf_data_map, first_bit, nbits);
> +
> +	ASSERT(end <= length);
> +	ASSERT(start <= length);
> +	ASSERT(length >= nbits * XFS_BLF_CHUNK);
> +	/*
> +	 * Copy needs to be done a buffer page at a time as we can be logging
> +	 * unmapped buffers. Hence we have to use xfs_buf_iomove() rather than a
> +	 * straight memcpy here.
> +	 */
> +	offset += first_bit * XFS_BLF_CHUNK;
> +	length = nbits * XFS_BLF_CHUNK;
> +	buf = xlog_prepare_iovec(lv, vecp, XLOG_REG_TYPE_BCHUNK);
> +	xfs_buf_iomove(bp, offset, length, buf, XBRW_READ);
> +	xlog_finish_iovec(lv, *vecp, length);
>  }
>  
>  /*
> @@ -314,8 +300,8 @@ xfs_buf_item_format(
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
>  	struct xfs_buf		*bp = bip->bli_buf;
>  	struct xfs_log_iovec	*vecp = NULL;
> -	uint			offset = 0;
> -	int			i;
> +	uint			offset;
> +	int			i, j;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  	ASSERT((bip->bli_flags & XFS_BLI_LOGGED) ||
> @@ -326,7 +312,6 @@ xfs_buf_item_format(
>  	ASSERT(!(bip->bli_flags & XFS_BLI_ORDERED) ||
>  	       (bip->bli_flags & XFS_BLI_STALE));
>  
> -
>  	/*
>  	 * If it is an inode buffer, transfer the in-memory state to the
>  	 * format flags and clear the in-memory state.
> @@ -349,10 +334,36 @@ xfs_buf_item_format(
>  		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
>  	}
>  
> -	for (i = 0; i < bip->bli_format_count; i++) {
> -		xfs_buf_item_format_segment(bip, lv, &vecp, offset,
> -					    &bip->bli_formats[i]);
> -		offset += BBTOB(bp->b_maps[i].bm_len);
> +	for (i = 0, offset = 0;
> +	     i < bip->bli_format_count;
> +	     i++, offset += BBTOB(bp->b_maps[i].bm_len)) {
> +
> +		/* stale regions cover the entire segment */

Something like "stale regions are fixed size" seems more accurate, since
we aren't actually logging any range(s).. Hm?

> +		if (bip->bli_flags & XFS_BLI_STALE) {
> +			xfs_buf_item_format_segment(bip, lv, &vecp, NULL, offset,
> +						    BBTOB(bp->b_maps[i].bm_len),
> +						    &bip->bli_formats[i]);
> +			continue;
> +		}
> +
> +		/* only format dirty ranges over the current segment */
> +		for (j = 0; j < bip->bli_ranges; j++) {
> +			struct xfs_bli_range *rp = &bip->bli_range[j];
> +
> +			/* range ends before segment start, check next range */
> +			if (rp->last < offset)
> +				continue;
> +
> +			/* range beyond segment end, check next segment */
> +			if (rp->first > offset + BBTOB(bp->b_maps[i].bm_len))
> +				break;
> +
> +			/* dirty range overlaps segment, need headers */
> +			xfs_buf_item_format_segment(bip, lv, &vecp, rp, offset,
> +						    BBTOB(bp->b_maps[i].bm_len),
> +						    &bip->bli_formats[i]);
> +
> +		}
>  	}
>  
>  	/*
> @@ -751,6 +762,9 @@ xfs_buf_item_init(
>  	bip = kmem_zone_zalloc(xfs_buf_item_zone, KM_SLEEP);
>  	xfs_log_item_init(mp, &bip->bli_item, XFS_LI_BUF, &xfs_buf_item_ops);
>  	bip->bli_buf = bp;
> +	for (i = 0; i < XFS_BLI_RANGES; i++)
> +		bip->bli_range[i].first = UINT_MAX;
> +
>  
>  	/*
>  	 * chunks is the number of XFS_BLF_CHUNK size pieces the buffer
> @@ -788,133 +802,136 @@ xfs_buf_item_init(
>  
>  /*
>   * Mark bytes first through last inclusive as dirty in the buf
> - * item's bitmap.
> + * item by recording dirty regions on the buffer.
>   */
> -static void
> -xfs_buf_item_log_segment(
> +void
> +xfs_buf_item_log(
> +	struct xfs_buf_log_item	*bip,
>  	uint			first,
> -	uint			last,
> -	uint			*map)
> +	uint			last)
>  {
> -	uint		first_bit;
> -	uint		last_bit;
> -	uint		bits_to_set;
> -	uint		bits_set;
> -	uint		word_num;
> -	uint		*wordp;
> -	uint		bit;
> -	uint		end_bit;
> -	uint		mask;
> +	struct xfs_bli_range	*rp = NULL;
> +	int			i;
> +	ASSERT(last != 0);

The current code looks like it implicitly handles this case. Asserts
aside, it looks like this code could essentially add the range, fail to
size it correctly (due to the earlier check in the _size() path), but
then continue to log it based on the existing xfs_buf_item_log_segment()
code that has been shifted over to xfs_buf_item_format_segment().

The interface requests an inclusive range, so perhaps we should just
check for last == 0 (assuming first == 0) and bump last so the roundup
and all subsequent code continues to behave exactly as it does today.
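
Untested sketch of the check I mean, at the top of xfs_buf_item_log()
and using the patch's own names:

	if (last == 0) {
		ASSERT(first == 0);
		/* byte 0 was logged; round up to cover the first chunk */
		last = XFS_BLF_CHUNK - 1;
	}

That keeps the roundup/rounddown below doing exactly what the bitmap
code does for this input today.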

> +	ASSERT(first <= last);
> +	ASSERT(last < BBTOB(bip->bli_buf->b_length));
> +
> +	/* simple case - first range being stored */
> +	if (!bip->bli_ranges) {
> +		bip->bli_ranges = 1;
> +		bip->bli_range[0].first = rounddown(first, XFS_BLF_CHUNK);
> +		bip->bli_range[0].last = roundup(last, XFS_BLF_CHUNK);
> +		ASSERT(bip->bli_range[0].last != 0);
> +		ASSERT(bip->bli_range[0].first <= bip->bli_range[0].last);
> +		return;
> +	}
>  
> -	/*
> -	 * Convert byte offsets to bit numbers.
> -	 */
> -	first_bit = first >> XFS_BLF_SHIFT;
> -	last_bit = last >> XFS_BLF_SHIFT;
> +	/* 2nd case: search for overlaps and extend */
> +	for (i = 0; i < bip->bli_ranges; i++) {
> +		rp = &bip->bli_range[i];
>  
> -	/*
> -	 * Calculate the total number of bits to be set.
> -	 */
> -	bits_to_set = last_bit - first_bit + 1;
> +		/* wholly within an existing dirty range, we're done */
> +		if (first >= rp->first && last <= rp->last)
> +			return;
> +		/* no overlap, continue */
> +		if (first > rp->last || last < rp->first)
> +			continue;
>  
> -	/*
> -	 * Get a pointer to the first word in the bitmap
> -	 * to set a bit in.
> -	 */
> -	word_num = first_bit >> BIT_TO_WORD_SHIFT;
> -	wordp = &map[word_num];
> +		/* left edge overlap, extend */
> +		if (first < rp->first)
> +			rp->first = rounddown(first, XFS_BLF_CHUNK);
>  
> -	/*
> -	 * Calculate the starting bit in the first word.
> -	 */
> -	bit = first_bit & (uint)(NBWORD - 1);
> +		/* right edge overlap, extend */
> +		if (last > rp->last)
> +			rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
>  
> -	/*
> -	 * First set any bits in the first word of our range.
> -	 * If it starts at bit 0 of the word, it will be
> -	 * set below rather than here.  That is what the variable
> -	 * bit tells us. The variable bits_set tracks the number
> -	 * of bits that have been set so far.  End_bit is the number
> -	 * of the last bit to be set in this word plus one.
> -	 */
> -	if (bit) {
> -		end_bit = MIN(bit + bits_to_set, (uint)NBWORD);
> -		mask = ((1U << (end_bit - bit)) - 1) << bit;
> -		*wordp |= mask;
> -		wordp++;
> -		bits_set = end_bit - bit;
> -	} else {
> -		bits_set = 0;
> +		goto merge;
>  	}
>  
> -	/*
> -	 * Now set bits a whole word at a time that are between
> -	 * first_bit and last_bit.
> -	 */
> -	while ((bits_to_set - bits_set) >= NBWORD) {
> -		*wordp |= 0xffffffff;
> -		bits_set += NBWORD;
> -		wordp++;
> -	}
> +	/* 3rd case: not found, insert or extend */
> +	ASSERT(i == bip->bli_ranges);
>  
>  	/*
>  	 * Finally, set any bits left to be set in one last partial word.
> +	 * Case 3a: Extend last slot.
> +	 *
> +	 * If the range is beyond the last slot, extend the last slot to
> +	 * cover it. This is treated the same as if an overlap existed with
> +	 * the last range.
>  	 */
> -	end_bit = bits_to_set - bits_set;
> -	if (end_bit) {
> -		mask = (1U << end_bit) - 1;
> -		*wordp |= mask;
> +	if (i == XFS_BLI_RANGES) {
> +		ASSERT(bip->bli_ranges == XFS_BLI_RANGES);
> +		rp = &bip->bli_range[XFS_BLI_RANGES - 1];
> +
> +		if (first < rp->first)
> +			rp->first = rounddown(first, XFS_BLF_CHUNK);
> +		if (last > rp->last)
> +			rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
> +		goto merge;
>  	}

If I read this right, a 5th range arbitrarily extends the last range,
regardless of where that range sits in the buffer. For example, if we've
logged 4 small (128 byte), non-overlapping ranges within [4k-64k], then
say we log 0-128, we end up logging the entire 64k buffer.

It would be nice to be a little smarter here. A couple options could be
to merge with the first buffer that starts after the new range rather
than just using the last, or perhaps implementing a mechanism to
condense non-overlapping ranges to free a slot for a new range if doing
so would reduce the overall footprint.

Note that the latter sounded like overkill when I first thought of it,
but I think it may be possible to enhance the existing merge algorithm
you've already included into something that could merge non-adjacent
ranges based on an optional "weight" parameter that describes the
minimum distance between the new range and the closest existing range.
With something of that nature factored into a separate helper, it may
not be that difficult to make a decision on whether to condense, merge
or pick an existing range to extend. Worth a thought, at least...
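
As a rough, untested sketch of the kind of helper I'm imagining (the
function name is made up; the struct is the patch's xfs_bli_range), it
could pick whichever existing range costs the fewest extra bytes to
extend over the new range:

	static struct xfs_bli_range *
	xfs_buf_item_cheapest_range(
		struct xfs_buf_log_item	*bip,
		uint			first,
		uint			last)
	{
		struct xfs_bli_range	*best = NULL;
		uint32_t		best_cost = UINT_MAX;
		int			i;

		for (i = 0; i < bip->bli_ranges; i++) {
			struct xfs_bli_range	*rp = &bip->bli_range[i];
			uint32_t		cost = 0;

			/* extra bytes dragged into the log if we extend rp */
			if (first < rp->first)
				cost += rp->first - first;
			if (last > rp->last)
				cost += last - rp->last;
			if (cost < best_cost) {
				best_cost = cost;
				best = rp;
			}
		}
		return best;
	}

The same cost could also be compared against the gap between two
existing ranges to decide whether condensing them frees a slot more
cheaply than extending one of them.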

Brian

> -}
>  
> -/*
> - * Mark bytes first through last inclusive as dirty in the buf
> - * item's bitmap.
> - */
> -void
> -xfs_buf_item_log(
> -	struct xfs_buf_log_item	*bip,
> -	uint			first,
> -	uint			last)
> -{
> -	int			i;
> -	uint			start;
> -	uint			end;
> -	struct xfs_buf		*bp = bip->bli_buf;
> +	/* Case 3b: insert new range.
> +	 *
> +	 * Find the insertion point for the new range, then make a hole
> +	 * and insert the new range.
> +	 */
> +	for (i = 0; i < bip->bli_ranges; i++) {
> +		rp = &bip->bli_range[i];
>  
> +		/* order ranges by ascending offset */
> +		if (last < rp->first)
> +			break;
> +	}
> +	/* shift down and insert */
> +	ASSERT(i < XFS_BLI_RANGES);
> +	rp = &bip->bli_range[i];
> +	if (i < XFS_BLI_RANGES - 1)
> +		memmove(rp + 1, rp, sizeof(*rp) * (bip->bli_ranges - i));
> +	bip->bli_ranges++;
> +	rp->first = rounddown(first, XFS_BLF_CHUNK);
> +	rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
> +
> +merge:
>  	/*
> -	 * walk each buffer segment and mark them dirty appropriately.
> +	 * Check for overlapping ranges and merge them. If there is only one
> +	 * range, there is nothing to merge so bail early.
>  	 */
> -	start = 0;
> -	for (i = 0; i < bip->bli_format_count; i++) {
> -		if (start > last)
> -			break;
> -		end = start + BBTOB(bp->b_maps[i].bm_len) - 1;
> +	if (bip->bli_ranges == 1)
> +		return;
> +
> +	for (i = 0; i < bip->bli_ranges - 1; i++) {
> +		struct xfs_bli_range *rp_next;
> +
> +		rp = &bip->bli_range[i];
> +		rp_next = &bip->bli_range[i + 1];
>  
> -		/* skip to the map that includes the first byte to log */
> -		if (first > end) {
> -			start += BBTOB(bp->b_maps[i].bm_len);
> +
> +check_merge:
> +		ASSERT(rp->last != 0);
> +		ASSERT(rp->first <= rp->last);
> +
> +		/* no overlap or adjacent, move on */
> +		if (rp->last < rp_next->first - 1)
>  			continue;
> -		}
>  
>  		/*
> -		 * Trim the range to this segment and mark it in the bitmap.
> -		 * Note that we must convert buffer offsets to segment relative
> -		 * offsets (e.g., the first byte of each segment is byte 0 of
> -		 * that segment).
> +		 * overlap: select lowest first, highest last, remove the merged
> +		 * range (rp_next) and then go back and check the next range for
> +		 * whether it can be merged (e.g. we have 4 separate ranges,
> +		 * then something logs the buffer entirely. This merges all
> +		 * ranges into one).
>  		 */
> -		if (first < start)
> -			first = start;
> -		if (end > last)
> -			end = last;
> -		xfs_buf_item_log_segment(first - start, end - start,
> -					 &bip->bli_formats[i].blf_data_map[0]);
> -
> -		start += BBTOB(bp->b_maps[i].bm_len);
> +		rp->first = min(rp->first, rp_next->first);
> +		rp->last = max(rp->last, rp_next->last);
> +		if (i + 2 < bip->bli_ranges)
> +			memmove(rp_next, rp_next + 1, sizeof(*rp) *
> +						(bip->bli_ranges - i - 2));
> +		bip->bli_ranges--;
> +		if (i < bip->bli_ranges - 1)
> +			goto check_merge;
>  	}
>  }
>  
> -
>  /*
>   * Return true if the buffer has any ranges logged/dirtied by a transaction,
>   * false otherwise.
> @@ -923,15 +940,7 @@ bool
>  xfs_buf_item_dirty_format(
>  	struct xfs_buf_log_item	*bip)
>  {
> -	int			i;
> -
> -	for (i = 0; i < bip->bli_format_count; i++) {
> -		if (!xfs_bitmap_empty(bip->bli_formats[i].blf_data_map,
> -			     bip->bli_formats[i].blf_map_size))
> -			return true;
> -	}
> -
> -	return false;
> +	return bip->bli_ranges > 0;
>  }
>  
>  STATIC void
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index 643f53dcfe51..9b278c3a2db9 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -57,6 +57,25 @@ struct xfs_buf_log_item {
>  	unsigned int		bli_recur;	/* lock recursion count */
>  	atomic_t		bli_refcount;	/* cnt of tp refs */
>  	int			bli_format_count;	/* count of headers */
> +
> +	/*
> +	 * logging ranges. Keep a small number of distinct ranges rather than a
> +	 * bitmap which is expensive to maintain.
> +	 * 4 separate ranges is probably optimal so that we
> +	 * can log separate header, tail and content changes (e.g. for dir
> +	 * structures) without capturing the entire buffer unnecessarily for
> +	 * isolated changes.
> +	 *
> +	 * Note: ranges are 32 bit values because we have to support an end
> +	 * range value of 0x10000....
> +	 */
> +#define XFS_BLI_RANGES	4
> +	struct xfs_bli_range {
> +		uint32_t	first;
> +		uint32_t	last;
> +	}			bli_range[XFS_BLI_RANGES];
> +	int			bli_ranges;
> +
>  	struct xfs_buf_log_format *bli_formats;	/* array of in-log header ptrs */
>  	struct xfs_buf_log_format __bli_format;	/* embedded in-log header */
>  };
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-06 16:21   ` Brian Foster
@ 2018-02-12  2:41     ` Dave Chinner
  2018-02-12 14:26       ` Brian Foster
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2018-02-12  2:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Feb 06, 2018 at 11:21:41AM -0500, Brian Foster wrote:
> On Mon, Feb 05, 2018 at 11:34:15AM +1100, Dave Chinner wrote:
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > One of the biggest performance problems with large directory block
> > sizes is the CPU overhead in maintaining the buffer log item direty
> > region bitmap.  The bit manipulations and buffer region mapping
> > calls are right at the top of the profiles when running tests on 64k
> > directory buffers:
> > 
> ...
> > 
> >  fs/xfs/xfs_buf.c      |  23 ++-
> >  fs/xfs/xfs_buf_item.c | 437 ++++++++++++++++++++++++++------------------------
> >  fs/xfs/xfs_buf_item.h |  19 +++
> >  3 files changed, 261 insertions(+), 218 deletions(-)
> > 
> ...
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index 270ddb4d2313..bc6514a08760 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -66,50 +66,12 @@ xfs_buf_item_size_segment(
> >  	int				*nvecs,
> >  	int				*nbytes)
> >  {
> > -	struct xfs_buf			*bp = bip->bli_buf;
> > -	int				next_bit;
> > -	int				last_bit;
> > -
> > -	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> > -	if (last_bit == -1)
> > -		return;
> > -
> >  	/*
> >  	 * initial count for a dirty buffer is 2 vectors - the format structure
> > -	 * and the first dirty region.
> > +	 * and the dirty region. Dirty region is accounted for separately.
> >  	 */
> >  	*nvecs += 2;
> > -	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
> > -
> > -	while (last_bit != -1) {
> > -		/*
> > -		 * This takes the bit number to start looking from and
> > -		 * returns the next set bit from there.  It returns -1
> > -		 * if there are no more bits set or the start bit is
> > -		 * beyond the end of the bitmap.
> > -		 */
> > -		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> > -					last_bit + 1);
> > -		/*
> > -		 * If we run out of bits, leave the loop,
> > -		 * else if we find a new set of bits bump the number of vecs,
> > -		 * else keep scanning the current set of bits.
> > -		 */
> > -		if (next_bit == -1) {
> > -			break;
> > -		} else if (next_bit != last_bit + 1) {
> > -			last_bit = next_bit;
> > -			(*nvecs)++;
> > -		} else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) !=
> > -			   (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) +
> > -			    XFS_BLF_CHUNK)) {
> > -			last_bit = next_bit;
> > -			(*nvecs)++;
> > -		} else {
> > -			last_bit++;
> > -		}
> > -		*nbytes += XFS_BLF_CHUNK;
> > -	}
> > +	*nbytes += xfs_buf_log_format_size(blfp);
> 
> This function has been reduced such that the comment at the top could
> probably use updating. In fact, we may be able to kill it entirely if
> it's not used anywhere else..?

Killed.

> > @@ -167,19 +132,53 @@ xfs_buf_item_size(
> >  		return;
> >  	}
> >  
> > +
> > +	/*
> > +	 * If the last byte of the first range is zero, then we've been fed a
> > +	 * clean buffer with a XFS_BLI_DIRTY flag set. This should never happen,
> > +	 * but be paranoid and catch it. If it does happen, then first should be
> > +	 * zero, too.
> > +	 */
> > +	if (bip->bli_range[0].last == 0) {
> > +		ASSERT(0);
> > +		ASSERT(bip->bli_range[0].first == 0);
> > +		return;
> > +	}
> 
> Isn't first == last == 0 a valid, inclusive range?

It indicates someone logged a clean buffer, and we shouldn't ever
get to this point if that has happened. Someone made a programming
mistake, so better we catch it here than silently log an empty range
and hope that log recovery does the right thing...

> > @@ -234,16 +236,6 @@ xfs_buf_item_format_segment(
> >  	 * memory structure.
> >  	 */
> >  	base_size = xfs_buf_log_format_size(blfp);
> > -
> > -	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> > -	if (!(bip->bli_flags & XFS_BLI_STALE) && first_bit == -1) {
> > -		/*
> > -		 * If the map is not be dirty in the transaction, mark
> > -		 * the size as zero and do not advance the vector pointer.
> > -		 */
> > -		return;
> > -	}
> > -
> >  	blfp = xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BFORMAT, blfp, base_size);
> >  	blfp->blf_size = 1;
> 
> Perhaps we should set the datamap bits before we format out the blf? ;)

No. This code is slightly tricksy, clearly, but that's not something
this patch has changed.

That is: the blfp we pass in is the source data from the incore buf
log format structure attached to the buf log item. xlog_copy_iovec()
returns a pointer to the log iovec data buffer we copied the source
into. IOWS, after the call to xlog_copy_iovec() blfp points to the
structure that will be written to the log, not the in-memory
structure, and we modify that repeatedly as we add new iovecs
containing the buffer data being logged.

I'll clean it up to name the incoming blfp to "src_blfp" so it's
clear we are operating on different structures here.
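
i.e. something along these lines (sketch only, src_blfp just being the
name I intend to use for the incoming in-core structure):

	/* src_blfp is the in-core format attached to the buf log item */
	blfp = xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BFORMAT, src_blfp,
			       base_size);
	/* from here on, blfp points at the copy that goes to the log */
	blfp->blf_size = 1;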

> 
> >  
> > @@ -258,46 +250,40 @@ xfs_buf_item_format_segment(
> >  		return;
> >  	}
> >  
> > +	blfp->blf_size++;
> >  
> >  	/*
> > -	 * Fill in an iovec for each set of contiguous chunks.
> > +	 * Now we need to set the bits in the bitmap and set up the iovecs
> > +	 * appropriately. We know there is a contiguous range in this buffer
> > +	 * that needs to be set, so find the first bit, the last bit, and
> > +	 * go from there.
> >  	 */
> > -	last_bit = first_bit;
> > -	nbits = 1;
> > -	for (;;) {
> > -		/*
> > -		 * This takes the bit number to start looking from and
> > -		 * returns the next set bit from there.  It returns -1
> > -		 * if there are no more bits set or the start bit is
> > -		 * beyond the end of the bitmap.
> > -		 */
> > -		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> > -					(uint)last_bit + 1);
> > -		/*
> > -		 * If we run out of bits fill in the last iovec and get out of
> > -		 * the loop.  Else if we start a new set of bits then fill in
> > -		 * the iovec for the series we were looking at and start
> > -		 * counting the bits in the new one.  Else we're still in the
> > -		 * same set of bits so just keep counting and scanning.
> > -		 */
> > -		if (next_bit == -1) {
> > -			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> > -						first_bit, nbits);
> > -			blfp->blf_size++;
> > -			break;
> > -		} else if (next_bit != last_bit + 1 ||
> > -		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
> 
> FYI... this kills the only callers of xfs_buf_item_copy_iovec() and
> xfs_buf_item_straddle() so they should probably be removed.

Good catch. Removed.

> 
> > -			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> > -						first_bit, nbits);
> > -			blfp->blf_size++;
> > -			first_bit = next_bit;
> > -			last_bit = next_bit;
> > -			nbits = 1;
> > -		} else {
> > -			last_bit++;
> > -			nbits++;
> > -		}
> > -	}
> > +	start = 0;
> > +	if (offset < rp->first)
> > +		start = rp->first - offset;
> > +	end = length - 1;
> > +	if (offset + length > rp->last)
> > +		end = rp->last - offset - 1;
> > +
> 
> FWIW, it took me a second to identify what was going on here. It might
> be useful to incorporate that we're calculating the relative byte
> offsets in order to convert into the bitmap in the new comment above.

Done.

> 
> Also, I could be lost in the maze a bit here, but why the '- 1' in the
> end calculation above? Isn't rp->last inclusive?

It was compensating for a bug in the initial setting of rp->last, in the
case where the first range is being set up (which didn't have a "- 1").
I've fixed the initial bug and removed the -1s from here.

> > +	start &= ~((1 << XFS_BLF_SHIFT) - 1);
> > +	first_bit = start >> XFS_BLF_SHIFT;
> 
> Why the mask if we're going to right shift anyways?

Left over debug stuff, I think. Removed.

> > @@ -349,10 +334,36 @@ xfs_buf_item_format(
> >  		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
> >  	}
> >  
> > -	for (i = 0; i < bip->bli_format_count; i++) {
> > -		xfs_buf_item_format_segment(bip, lv, &vecp, offset,
> > -					    &bip->bli_formats[i]);
> > -		offset += BBTOB(bp->b_maps[i].bm_len);
> > +	for (i = 0, offset = 0;
> > +	     i < bip->bli_format_count;
> > +	     i++, offset += BBTOB(bp->b_maps[i].bm_len)) {
> > +
> > +		/* stale regions cover the entire segment */
> 
> Something like "stale regions are fixed size" seems more accurate, since
> we aren't actually logging any range(s).. Hm?

They aren't fixed size, either :P

	/*
	 * Stale buffers do not have any data logged with them, so
	 * shortcut the dirty range checks and just emit a segment
	 * header.
	 */


> > -static void
> > -xfs_buf_item_log_segment(
> > +void
> > +xfs_buf_item_log(
> > +	struct xfs_buf_log_item	*bip,
> >  	uint			first,
> > -	uint			last,
> > -	uint			*map)
> > +	uint			last)
> >  {
> > -	uint		first_bit;
> > -	uint		last_bit;
> > -	uint		bits_to_set;
> > -	uint		bits_set;
> > -	uint		word_num;
> > -	uint		*wordp;
> > -	uint		bit;
> > -	uint		end_bit;
> > -	uint		mask;
> > +	struct xfs_bli_range	*rp = NULL;
> > +	int			i;
> > +	ASSERT(last != 0);
> 
> The current code looks like it implicitly handles this case.

Yes, it may, but logging a zero size dirty region is simply wrong.

> Asserts
> aside, it looks like this code could essentially add the range, fail to
> size it correctly (due to the earlier check in the _size() path), but
> then continue to log it based on the existing xfs_buf_item_log_segment()
> code that has been shifted over to xfs_buf_item_format_segment().
> 
> The interface requests an inclusive range, so perhaps we should just
> check for last == 0 (assuming first == 0) and bump last so the roundup
> and all subsequent code continues to behave exactly as it does today.

No code today passes last == 0 into this function. I want to make
sure it stays that way, because it's indicative of a bug in the code
that is calling xfs_buf_item_log().

> >  	/*
> >  	 * Finally, set any bits left to be set in one last partial word.
> > +	 * Case 3a: Extend last slot.
> > +	 *
> > +	 * If the range is beyond the last slot, extend the last slot to
> > +	 * cover it. This is treated the same as if an overlap existed with
> > +	 * the last range.
> >  	 */
> > -	end_bit = bits_to_set - bits_set;
> > -	if (end_bit) {
> > -		mask = (1U << end_bit) - 1;
> > -		*wordp |= mask;
> > +	if (i == XFS_BLI_RANGES) {
> > +		ASSERT(bip->bli_ranges == XFS_BLI_RANGES);
> > +		rp = &bip->bli_range[XFS_BLI_RANGES - 1];
> > +
> > +		if (first < rp->first)
> > +			rp->first = rounddown(first, XFS_BLF_CHUNK);
> > +		if (last > rp->last)
> > +			rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
> > +		goto merge;
> >  	}
> 
> If I read this right, a 5th range arbitrarily extends the last range,
> regardless of where that range sits in the buffer. For example, if we've
> logged 4 small (128 byte), non-overlapping ranges within [4k-64k], then
> say we log 0-128, we end up logging the entire 64k buffer.

Yup, I just did that for simplicity and convenience.

> It would be nice to be a little smarter here. A couple options could be
> to merge with the first buffer that starts after the new range rather
> than just using the last, or perhaps implementing a mechanism to
> condense non-overlapping ranges to free a slot for a new range if doing
> so would reduce the overall footprint.
> 
> Note that the latter sounded like overkill when I first thought of it,
> but I think it may be possible to enhance the existing merge algorithm
> you've already included into something that could merge non-adjacent
> ranges based on an optional "weight" parameter that describes the
> minimum distance between the new range and the closest existing range.
> With something of that nature factored into a separate helper, it may
> not be that difficult to make a decision on whether to condense, merge
> or pick an existing range to extend. Worth a thought, at least...

Pretty simple to do that, now that I look at it. We don't even need
a weight calculation because the ranges are ordered and we can
easily detect when the incoming range falls between two entries.
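
Roughly (untested, ignoring the ends of the array for brevity), with i
being the first slot that starts after the incoming range:

	struct xfs_bli_range	*prev = &bip->bli_range[i - 1];
	struct xfs_bli_range	*next = &bip->bli_range[i];

	/* extend whichever existing range is nearer the incoming range */
	if (first - prev->last <= next->first - last)
		prev->last = roundup(last, XFS_BLF_CHUNK);
	else
		next->first = rounddown(first, XFS_BLF_CHUNK);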

I'll post an updated version once I've tested it.

Thanks for looking at this, Brian!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-12  2:41     ` Dave Chinner
@ 2018-02-12 14:26       ` Brian Foster
  2018-02-12 21:18         ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2018-02-12 14:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Feb 12, 2018 at 01:41:38PM +1100, Dave Chinner wrote:
> On Tue, Feb 06, 2018 at 11:21:41AM -0500, Brian Foster wrote:
> > On Mon, Feb 05, 2018 at 11:34:15AM +1100, Dave Chinner wrote:
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > One of the biggest performance problems with large directory block
> > > sizes is the CPU overhead in maintaining the buffer log item direty
> > > region bitmap.  The bit manipulations and buffer region mapping
> > > calls are right at the top of the profiles when running tests on 64k
> > > directory buffers:
> > > 
> > ...
> > > 
> > >  fs/xfs/xfs_buf.c      |  23 ++-
> > >  fs/xfs/xfs_buf_item.c | 437 ++++++++++++++++++++++++++------------------------
> > >  fs/xfs/xfs_buf_item.h |  19 +++
> > >  3 files changed, 261 insertions(+), 218 deletions(-)
> > > 
> > ...
> > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > > index 270ddb4d2313..bc6514a08760 100644
> > > --- a/fs/xfs/xfs_buf_item.c
> > > +++ b/fs/xfs/xfs_buf_item.c
...
> > > @@ -167,19 +132,53 @@ xfs_buf_item_size(
> > >  		return;
> > >  	}
> > >  
> > > +
> > > +	/*
> > > +	 * If the last byte of the first range is zero, then we've been fed a
> > > +	 * clean buffer with a XFS_BLI_DIRTY flag set. This should never happen,
> > > +	 * but be paranoid and catch it. If it does happen, then first should be
> > > +	 * zero, too.
> > > +	 */
> > > +	if (bip->bli_range[0].last == 0) {
> > > +		ASSERT(0);
> > > +		ASSERT(bip->bli_range[0].first == 0);
> > > +		return;
> > > +	}
> > 
> > Isn't first == last == 0 a valid, inclusive range?
> 
> It indicates someone logged a clean buffer, and we shouldn't ever
> get to this point if that has happened. Someone made a programming
> mistake, so better we catch it here than silently log an empty range
> and hope that log recovery does the right thing...
> 

The assert seems reasonable here... but I'm not sure a caller
programming mistake should reflect here unless we went to byte-aligned
ranges (rather than 128-byte or whatever it is currently). I'm also not
convinced skipping out here is the right thing to do, but more on this
in xfs_buf_item_log()...

> > > @@ -234,16 +236,6 @@ xfs_buf_item_format_segment(
> > >  	 * memory structure.
> > >  	 */
> > >  	base_size = xfs_buf_log_format_size(blfp);
> > > -
> > > -	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> > > -	if (!(bip->bli_flags & XFS_BLI_STALE) && first_bit == -1) {
> > > -		/*
> > > -		 * If the map is not be dirty in the transaction, mark
> > > -		 * the size as zero and do not advance the vector pointer.
> > > -		 */
> > > -		return;
> > > -	}
> > > -
> > >  	blfp = xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_BFORMAT, blfp, base_size);
> > >  	blfp->blf_size = 1;
> > 
> > Perhaps we should set the datamap bits before we format out the blf? ;)
> 
> No. This code is slightly tricksy, clearly, but that's not something
> this patch has changed.
> 
> That is: the blfp we pass in is the source data from the incore buf
> log format structure attached to the buf log item. xlog_copy_iovec()
> returns a pointer to the log iovec data buffer we copied the source
> into. IOWS, after the call to xlog_copy_iovec() blfp points to the
> structure that will be written to the log, not the in-memory
> structure, and we modify that repeatedly as we add new iovecs
> containing the buffer data being logged.
> 

Ah, I see. I guess I missed the reassignment of blfp.

> I'll clean it up to name the incoming blfp to "src_blfp" so it's
> clear we are operating on different structures here.
> 

Yep, thanks.

> > 
> > >  
...
> > > @@ -349,10 +334,36 @@ xfs_buf_item_format(
> > >  		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
> > >  	}
> > >  
> > > -	for (i = 0; i < bip->bli_format_count; i++) {
> > > -		xfs_buf_item_format_segment(bip, lv, &vecp, offset,
> > > -					    &bip->bli_formats[i]);
> > > -		offset += BBTOB(bp->b_maps[i].bm_len);
> > > +	for (i = 0, offset = 0;
> > > +	     i < bip->bli_format_count;
> > > +	     i++, offset += BBTOB(bp->b_maps[i].bm_len)) {
> > > +
> > > +		/* stale regions cover the entire segment */
> > 
> > Something like "stale regions are fixed size" seems more accurate, since
> > we aren't actually logging any range(s).. Hm?
> 
> They aren't fixed size, either :P
> 
> 	/*
> 	 * Stale buffers do not have any data logged with them, so
> 	 * shortcut the dirty range checks and just emit a segment
> 	 * header.
> 	 */
> 

Sounds fine.

> 
> > > -static void
> > > -xfs_buf_item_log_segment(
> > > +void
> > > +xfs_buf_item_log(
> > > +	struct xfs_buf_log_item	*bip,
> > >  	uint			first,
> > > -	uint			last,
> > > -	uint			*map)
> > > +	uint			last)
> > >  {
> > > -	uint		first_bit;
> > > -	uint		last_bit;
> > > -	uint		bits_to_set;
> > > -	uint		bits_set;
> > > -	uint		word_num;
> > > -	uint		*wordp;
> > > -	uint		bit;
> > > -	uint		end_bit;
> > > -	uint		mask;
> > > +	struct xfs_bli_range	*rp = NULL;
> > > +	int			i;
> > > +	ASSERT(last != 0);
> > 
> > The current code looks like it implicitly handles this case.
> 
> Yes, it may, but logging a zero size dirty region is simply wrong.
> 

That's not the case I'm referring to. If the range is inclusive, how
would you propose to log the first byte of a buffer? I know we probably
don't have this situation and may never, but afaict the current code
handles it as you'd expect (which is to say it should behave as if we
logged any other single byte in the first chunk of the buffer).

> > Asserts
> > aside, it looks like this code could essentially add the range, fail to
> > size it correctly (due to the earlier check in the _size() path), but
> > then continue to log it based on the existing xfs_buf_item_log_segment()
> > code that has been shifted over to xfs_buf_item_format_segment().
> > 
> > The interface requests an inclusive range, so perhaps we should just
> > check for last == 0 (assuming first == 0) and bump last so the roundup
> > and all subsequent code continues to behave exactly as it does today.
> 
> No code today passes last == 0 into this function. I want to make
> sure it stays that way, because it's indicative of a bug in the code
> that is calling xfs_buf_item_log().
> 

How is that a bug? Looking at the current code, the case of first ==
last == 0 sets the first bit in the bitmap to log the first 128 byte
region, as expected. Strange as it may be, this results in correctly
sizing/formatting the first chunk of the buffer.

With this patch, we'd throw an assert and potentially add a first ==
last == 0 range to the bli. This leads to the subsequent assert
referenced earlier in xfs_buf_item_size(), which also now returns
without including the size of the range along with skipping the
remaining ranges in the bli. Because we've shifted the old bitmap
logging code over to the format side, it looks like the format code
would still copy everything as it does today (modulo the -1 being
removed from the end calculation, perhaps), however. :/ So it seems to
me this breaks a technically valid case in weird/subtle ways. For
example, why assert about last == 0, but then go on to add the range
anyways, explicitly not size it correctly, but then format it as if
nothing is wrong? If it were really wrong/invalid (which I don't think
it is), why not put the check in the log side and skip adding the range
rather than add it, skip sizing it, and then format it.

Despite seeing no major need for it, I'm not going to haggle over the
asserts (if we wanted to leave a last == 0 assert in the logging side to
catch unexpected callers, for example). But I see no reason why the rest
of the code shouldn't essentially function as it does today: round up
the last byte to properly cover the first chunk of the buffer and log it
appropriately.

I agree that the last == 0 case is bogus by the time we hit
xfs_buf_item_size(), but AFAICT that is a bug conceptually created by
this patch and all the early return does there is facilitate potential
buffer overrun at format time. It's fine to assert, but IMO something
needs to change there to at least be consistent between the sizing and
formatting of the bli (and if we do end up skipping formatting a buffer
that might have other valid ranges, I'm thinking something louder than
an assert might be appropriate).

Brian

> > >  	/*
> > >  	 * Finally, set any bits left to be set in one last partial word.
> > > +	 * Case 3a: Extend last slot.
> > > +	 *
> > > +	 * If the range is beyond the last slot, extend the last slot to
> > > +	 * cover it. This is treated the same as if an overlap existed with
> > > +	 * the last range.
> > >  	 */
> > > -	end_bit = bits_to_set - bits_set;
> > > -	if (end_bit) {
> > > -		mask = (1U << end_bit) - 1;
> > > -		*wordp |= mask;
> > > +	if (i == XFS_BLI_RANGES) {
> > > +		ASSERT(bip->bli_ranges == XFS_BLI_RANGES);
> > > +		rp = &bip->bli_range[XFS_BLI_RANGES - 1];
> > > +
> > > +		if (first < rp->first)
> > > +			rp->first = rounddown(first, XFS_BLF_CHUNK);
> > > +		if (last > rp->last)
> > > +			rp->last = roundup(last, XFS_BLF_CHUNK) - 1;
> > > +		goto merge;
> > >  	}
> > 
> > If I read this right, a 5th range arbitrarily extends the last range,
> > regardless of where that range sits in the buffer. For example, if we've
> > logged 4 small (128 byte), non-overlapping ranges within [4k-64k], then
> > say we log 0-128, we end up logging the entire 64k buffer.
> 
> Yup, I just did that for simplicity and convenience.
> 
> > It would be nice to be a little smarter here. A couple options could be
> > to merge with the first buffer that starts after the new range rather
> > than just using the last, or perhaps implementing a mechanism to
> > condense non-overlapping ranges to free a slot for a new range if doing
> > so would reduce the overall footprint.
> > 
> > Note that the latter sounded like overkill when I first thought of it,
> > but I think it may be possible to enhance the existing merge algorithm
> > you've already included into something that could merge non-adjacent
> > ranges based on an optional "weight" parameter that describes the
> > minimum distance between the new range and the closest existing range.
> > With something of that nature factored into a separate helper, it may
> > not be that difficult to make a decision on whether to condense, merge
> > or pick an existing range to extend. Worth a thought, at least...
> 
> Pretty simple to do that, now that I look at it. We don't even need
> a weight calculation because the ranges are ordered and we can
> easily detect when the incoming range falls between two entries.
> 
> I'll post an updated version once I've tested it.
> 
> Thanks for looking at this, Brian!
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-12 14:26       ` Brian Foster
@ 2018-02-12 21:18         ` Dave Chinner
  2018-02-13 13:15           ` Brian Foster
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2018-02-12 21:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Feb 12, 2018 at 09:26:19AM -0500, Brian Foster wrote:
> On Mon, Feb 12, 2018 at 01:41:38PM +1100, Dave Chinner wrote:
> > On Tue, Feb 06, 2018 at 11:21:41AM -0500, Brian Foster wrote:
> > > On Mon, Feb 05, 2018 at 11:34:15AM +1100, Dave Chinner wrote:
> > > > -static void
> > > > -xfs_buf_item_log_segment(
> > > > +void
> > > > +xfs_buf_item_log(
> > > > +	struct xfs_buf_log_item	*bip,
> > > >  	uint			first,
> > > > -	uint			last,
> > > > -	uint			*map)
> > > > +	uint			last)
> > > >  {
> > > > -	uint		first_bit;
> > > > -	uint		last_bit;
> > > > -	uint		bits_to_set;
> > > > -	uint		bits_set;
> > > > -	uint		word_num;
> > > > -	uint		*wordp;
> > > > -	uint		bit;
> > > > -	uint		end_bit;
> > > > -	uint		mask;
> > > > +	struct xfs_bli_range	*rp = NULL;
> > > > +	int			i;
> > > > +	ASSERT(last != 0);
> > > 
> > > The current code looks like it implicitly handles this case.
> > 
> > Yes, it may, but logging a zero size dirty region is simply wrong.
> > 
> 
> That's not the case I'm referring to. If the range is inclusive, how
> would you propose to log the first byte of a buffer?

We don't. No structure on disk has a single byte that needs to be
logged individually as its first member. Hence we don't ever do
this.

If we ever happen to screw up an on-disk structure such that it
doesn't have a 4 byte magic number and a chunk of self describing
metadata as its first 20-30 bytes in a buffer and we try to log
just the first byte, then these asserts will fire to tell us that
we've screwed up a new on-disk structure....

> I know we probably
> don't have this situation and may never, but afaict the current code
> handles it as you'd expect (which is to say it should behave as if we
> logged any other single byte in the first chunk of the buffer).

Just because the code handles the case, it doesn't mean it's a valid
thing to be asking the code to do....

> > > Asserts
> > > aside, it looks like this code could essentially add the range, fail to
> > > size it correctly (due to the earlier check in the _size() path), but
> > > then continue to log it based on the existing xfs_buf_item_log_segment()
> > > code that has been shifted over to xfs_buf_item_format_segment().
> > > 
> > > The interface requests an inclusive range, so perhaps we should just
> > > check for last == 0 (assuming first == 0) and bump last so the roundup
> > > and all subsequent code continues to behave exactly as it does today.
> > 
> > No code today passes last == 0 into this function. I want to make
> > sure it stays that way, because it's indicative of a bug in the code
> > that is calling xfs_buf_item_log().
> > 
> 
> How is that a bug? Looking at the current code, the case of first ==
> last == 0 sets the first bit in the bitmap to log the first 128 byte
> region, as expected. Strange as it may be, this results in correctly
> sizing/formatting the first chunk of the buffer.

Yes, but that doesn't mean the caller is correct.

> With this patch, we'd throw an assert and potentially add a first ==
> last == 0 range to the bli. This leads to the subsequent assert
> referenced earlier in xfs_buf_item_size(), which also now returns
> without including the size of the range along with skipping the
> remaining ranges in the bli. Because we've shifted the old bitmap
> logging code over to the format side, it looks like the format code
> would still copy everything as it does today (modulo the -1 being
> removed from the end calculation, perhaps), however.

Yes, and that's required by the relogging algorithm - we have to
copy all the older tracked dirty regions when we write an active log
item into the log multiple times. That's required so we can move
objects forward in the AIL.

> :/ So it seems to
> me this breaks a technically valid case in weird/subtle ways. For
> example, why assert about last == 0, but then go on to add the range
> anyways, explicitly not size it correctly, but then format it as if
> nothing is wrong? If it were really wrong/invalid (which I don't think
> it is), why not put the check in the log side and skip adding the range
> rather than add it, skip sizing it, and then format it.

So what you're really concerned about is that I put asserts into the
code to catch broken development code, but then allow production
systems through without caring whether it works correctly because
that boundary condition will never occur during runtime on
production systems?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-12 21:18         ` Dave Chinner
@ 2018-02-13 13:15           ` Brian Foster
  2018-02-13 22:02             ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2018-02-13 13:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 13, 2018 at 08:18:24AM +1100, Dave Chinner wrote:
> On Mon, Feb 12, 2018 at 09:26:19AM -0500, Brian Foster wrote:
> > On Mon, Feb 12, 2018 at 01:41:38PM +1100, Dave Chinner wrote:
> > > On Tue, Feb 06, 2018 at 11:21:41AM -0500, Brian Foster wrote:
> > > > On Mon, Feb 05, 2018 at 11:34:15AM +1100, Dave Chinner wrote:
> > > > > -static void
> > > > > -xfs_buf_item_log_segment(
> > > > > +void
> > > > > +xfs_buf_item_log(
> > > > > +	struct xfs_buf_log_item	*bip,
> > > > >  	uint			first,
> > > > > -	uint			last,
> > > > > -	uint			*map)
> > > > > +	uint			last)
> > > > >  {
> > > > > -	uint		first_bit;
> > > > > -	uint		last_bit;
> > > > > -	uint		bits_to_set;
> > > > > -	uint		bits_set;
> > > > > -	uint		word_num;
> > > > > -	uint		*wordp;
> > > > > -	uint		bit;
> > > > > -	uint		end_bit;
> > > > > -	uint		mask;
> > > > > +	struct xfs_bli_range	*rp = NULL;
> > > > > +	int			i;
> > > > > +	ASSERT(last != 0);
> > > > 
> > > > The current code looks like it implicitly handles this case.
> > > 
> > > Yes, it may, but logging a zero size dirty region is simply wrong.
> > > 
> > 
> > That's not the case I'm referring to. If the range is inclusive, how
> > would you propose to log the first byte of a buffer?
> 
> We don't. No structure on disk has a single byte that needs to be
> logged individually as its first member. Hence we don't ever do
> this.
> 

Drumroll please...

> If we ever happen to screw up an on-disk structure such that it
> doesn't have a 4 byte magic number and a chunk of self describing
> metadata as its first 20-30 bytes in a buffer and we try to log
> just the first byte, then these asserts will fire to tell us that
> we've screwed up a new on-disk structure....
> 

For one, the asserts you've added don't sufficiently cover verification
of a 4 byte magic at the top of every logged buffer. By this logic, we'd
be just as broken by attempting to log just the second or third byte
(or some other non-4 byte permutation) rather than the first. The
asserts won't catch that, though the logging infrastructure will do the
right thing in terms of ensuring the associated range is logged.

Beyond that, this simply isn't the responsibility of this code.

...
> > :/ So it seems to
> > me this breaks a technically valid case in weird/subtle ways. For
> > example, why assert about last == 0, but then go on to add the range
> > anyways, explicitly not size it correctly, but then format it as if
> > nothing is wrong? If it were really wrong/invalid (which I don't think
> > it is), why not put the check in the log side and skip adding the range
> > rather than add it, skip sizing it, and then format it.
> 
> So what you're really concerned about is that I put asserts into the
> code to catch broken development code, but then allow production
> systems through without caring whether it works correctly because
> that boundary condition will never occur during runtime on
> production systems?
> 

No. As already mentioned in my previous mail, I care little about the
asserts. Asserts can easily be removed if they turn out to be bogus.
Wrong asserts tend to have little negative effect on production users
because along with only affecting debug kernels, they'd have to be
fairly rare to slip through our testing. So I'm perfectly _happy_ to be
cautious with regard to asserts.

What I care much more about is not leaving latent bugs around in the
code. IMO, there is very rarely good enough justification to knowingly
commit buggy/fragile code to the kernel, regardless of whether
associated problems can be triggered or not in the broader codebase. The
tradeoff is a few minutes of due diligence now to save somebody else
potentially hours of pain associated with debugging shit/broken code
down the road, particularly in subsystems that are highly complex as it
is. So for future reference: having spent a fair amount of time doing
the latter, it's highly unlikely I'll offer an r-b on any patch that
does such a thing based on the argument offered here.

... having said all that and having already wasted more time on this
than it would have taken for you to just fix the patch, I'll end my rant
with this splat[1]. It demonstrates the "boundary condition" that "will
never occur during runtime on production systems" (production system
level output included for extra fun ;P). Note that this is a real
incantation of a problem (not something I've modified the code to
manufacture) and something that the current code handles correctly. So
the entire argument above is not only dubious, but also incorrect.

Brian

[1]

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: xlog_cil_push+0x184/0x400 [xfs]
PGD 0 P4D 0 
Oops: 0000 [#1] SMP
Modules linked in: xfs(O) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcsp snd_pcm joydev virtio_balloon snd_timer snd e1000 soundcore i2c_piix4 virtio_scsi virtio_console virtio_blk serio_raw qxl sym53c8xx drm_kms_helper ttm scsi_transport_spi virtio_pci virtio_ring virtio ata_generic pata_acpi drm [last unloaded: xfs]
CPU: 2 PID: 162 Comm: kworker/2:2 Tainted: G           O     4.15.0-rc7+ #95
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: xfs-cil/dm-3 xlog_cil_push_work [xfs]
RIP: 0010:xlog_cil_push+0x184/0x400 [xfs]
RSP: 0018:ffffa79d40f93d90 EFLAGS: 00010286
RAX: ffff8b5e55a15110 RBX: ffff8b5e57776900 RCX: dead000000000200
RDX: ffff8b5e4b6bb9d0 RSI: ffff8b5e4b6bb9d0 RDI: ffff8b5e55a15110
RBP: ffff8b5e5317b240 R08: 000000940b01765e R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b5e4af13000
R13: 0000000000000000 R14: ffff8b5e54e2f400 R15: 000000000000000e
FS:  0000000000000000(0000) GS:ffff8b5e5b200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000010b6a5000 CR4: 00000000000006e0
Call Trace:
 ? lock_acquire+0x9f/0x1f0
 process_one_work+0x23e/0x680
 worker_thread+0x35/0x380
 ? process_one_work+0x680/0x680
 kthread+0x11a/0x130
 ? kthread_create_worker_on_cpu+0x70/0x70
 ret_from_fork+0x24/0x30
Code: 31 ff e8 90 6e f9 fa 49 8b 46 08 48 39 44 24 10 74 64 45 31 ed eb 1f 49 89 45 00 4c 8b 6a 10 48 c7 42 10 00 00 00 00 49 8b 46 08 <45> 03 7d 08 48 39 44 24 10 74 40 49 8b 56 08 48 89 d7 48 89 54 
RIP: xlog_cil_push+0x184/0x400 [xfs] RSP: ffffa79d40f93d90
CR2: 0000000000000008
---[ end trace ead0e466843c4bf9 ]---
BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
in_atomic(): 0, irqs_disabled(): 1, pid: 162, name: kworker/2:2
INFO: lockdep is turned off.
irq event stamp: 162680
hardirqs last  enabled at (162679): [<00000000645f1f5e>] __slab_alloc+0x54/0x90
hardirqs last disabled at (162680): [<00000000e7a1d8b6>] error_entry+0x82/0xe0
softirqs last  enabled at (162664): [<0000000047a13c56>] process_one_work+0x23e/0x680
softirqs last disabled at (162660): [<00000000e6aa07a5>] neigh_periodic_work+0x2c/0x300
CPU: 2 PID: 162 Comm: kworker/2:2 Tainted: G      D    O     4.15.0-rc7+ #95
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: xfs-cil/dm-3 xlog_cil_push_work [xfs]
Call Trace:
 dump_stack+0x85/0xc5
 ___might_sleep+0x156/0x240
 exit_signals+0x2b/0x240
 do_exit+0xb3/0xd00
 ? process_one_work+0x680/0x680
 ? kthread+0x11a/0x130
 rewind_stack_do_exit+0x17/0x20

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-13 13:15           ` Brian Foster
@ 2018-02-13 22:02             ` Dave Chinner
  2018-02-14 13:09               ` Brian Foster
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2018-02-13 22:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Feb 13, 2018 at 08:15:26AM -0500, Brian Foster wrote:
> On Tue, Feb 13, 2018 at 08:18:24AM +1100, Dave Chinner wrote:
> > On Mon, Feb 12, 2018 at 09:26:19AM -0500, Brian Foster wrote:
> > > :/ So it seems to
> > > me this breaks a technically valid case in weird/subtle ways. For
> > > example, why assert about last == 0, but then go on to add the range
> > > anyways, explicitly not size it correctly, but then format it as if
> > > nothing is wrong? If it were really wrong/invalid (which I don't think
> > > it is), why not put the check in the log side and skip adding the range
> > > rather than add it, skip sizing it, and then format it.
> > 
> > So what you're really concerned about is that I put asserts into the
> > code to catch broken development code, but then allow production
> > systems through without caring whether it works correctly because
> > that boundary condition will never occur during runtime on
> > production systems?
> 
> No. As already mentioned in my previous mail, I care little about the
> asserts. Asserts can easily be removed if they turn out to be bogus.
> Wrong asserts tend to have little negative effect on production users
> because along with only affecting debug kernels, they'd have to be
> fairly rare to slip through our testing. So I'm perfectly _happy_ to be
> cautious with regard to asserts.
> 
> What I care much more about is not leaving latent bugs around in the
> code. IMO, there is very rarely good enough justification to knowingly
> commit buggy/fragile code to the kernel,

Hold on a minute!

I'm not asking anyone to commit buggy or fragile code. I've already
fixed the off-by-one problems you've pointed out, and all I was
trying to do was understand what you saw wrong with the asserts to
catch a "should never happen" condition so I could change it in a
way that you'd find acceptable.

There's no need to shout and rant at me....

> ... having said all that and having already wasted more time on this
> than it would have taken for you to just fix the patch, I'll end my rant
> with this splat[1]. It demonstrates the "boundary condition" that "will
> never occur during runtime on production systems" (production system
> level output included for extra fun ;P).

This is a pre-existing bug in xlog_cil_insert_format_items()
that my change has exposed:

                /* Skip items that do not have any vectors for writing */
		if (!shadow->lv_niovecs && !ordered)
			continue;

The code I added triggers this (niovecs == 0), and that now gives
us the case where we have a dirty log item descriptor
(XFS_LID_DIRTY) without a log vector attached to item->li_lv.
Then in xlog_cil_insert_items():

                /* Skip items which aren't dirty in this transaction. */
                if (!(lidp->lid_flags & XFS_LID_DIRTY))
                        continue;

                /*
                 * Only move the item if it isn't already at the tail. This is
                 * to prevent a transient list_empty() state when reinserting
                 * an item that is already the only item in the CIL.
                 */
                if (!list_is_last(&lip->li_cil, &cil->xc_cil))
                        list_move_tail(&lip->li_cil, &cil->xc_cil);


We put that "clean" log item on the CIL because XFS_LID_DIRTY is
set, and then when we push the CIL in xlog_cil_push(), we trip over
a dirty log item without a log vector when chaining log vectors to
pass to the log writing code here:

        while (!list_empty(&cil->xc_cil)) {
                struct xfs_log_item     *item;

                item = list_first_entry(&cil->xc_cil,
                                        struct xfs_log_item, li_cil);
                list_del_init(&item->li_cil);
                if (!ctx->lv_chain)
                        ctx->lv_chain = item->li_lv;
                else
                        lv->lv_next = item->li_lv;       <<<<<<<<<
 >>>>>>>>       lv = item->li_lv;
                item->li_lv = NULL;
                num_iovecs += lv->lv_niovecs;
        }

i.e. lv ends up null part way through the log item chain we are
processing and the next loop iteration fails.

IOWs, the bug isn't in the patch I wrote - it has uncovered a
latent bug added years ago for a condition that had never, ever been
exercised until now.

Brian, can you now give me all the details of what you were doing to
produce this and turn on CONFIG_XFS_DEBUG so that it catches the
zero length buffer that was logged when it happens?  That way I can
test a fix for this bug and that the buffer range logging exercises
this case properly...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-13 22:02             ` Dave Chinner
@ 2018-02-14 13:09               ` Brian Foster
  2018-02-14 16:49                 ` Darrick J. Wong
  2018-02-14 22:30                 ` Dave Chinner
  0 siblings, 2 replies; 21+ messages in thread
From: Brian Foster @ 2018-02-14 13:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Feb 14, 2018 at 09:02:20AM +1100, Dave Chinner wrote:
> On Tue, Feb 13, 2018 at 08:15:26AM -0500, Brian Foster wrote:
> > On Tue, Feb 13, 2018 at 08:18:24AM +1100, Dave Chinner wrote:
> > > On Mon, Feb 12, 2018 at 09:26:19AM -0500, Brian Foster wrote:
> > > > :/ So it seems to
> > > > me this breaks a technically valid case in weird/subtle ways. For
> > > > example, why assert about last == 0, but then go on to add the range
> > > > anyways, explicitly not size it correctly, but then format it as if
> > > > nothing is wrong? If it were really wrong/invalid (which I don't think
> > > > it is), why not put the check in the log side and skip adding the range
> > > > rather than add it, skip sizing it, and then format it.
> > > 
> > > So what you're really concerned about is that I put asserts into the
> > > code to catch broken development code, but then allow production
> > > systems through without caring whether it works correctly because
> > > that boundary condition will never occur during runtime on
> > > production systems?
> > 
> > No. As already mentioned in my previous mail, I care little about the
> > asserts. Asserts can easily be removed if they turn out to be bogus.
> > Wrong asserts tend to have little negative effect on production users
> > because along with only affecting debug kernels, they'd have to be
> > fairly rare to slip through our testing. So I'm perfectly _happy_ to be
> > cautious with regard to asserts.
> > 
> > What I care much more about is not leaving latent bugs around in the
> > code. IMO, there is very rarely good enough justification to knowingly
> > commit buggy/fragile code to the kernel,
> 
> Hold on a minute!
> 
> I'm not asking anyone to commit buggy or fragile code. I've already
> fixed the off-by-one problems you've pointed out, and all I was
> trying to do was understand what you saw wrong with the asserts to
> catch a "should never happen" condition so I could change it in a
> way that you'd find acceptable.
> 
> There's no need to shout and rant at me....
> 

I pointed out the first-byte logging case looked broken. Rather than
indicate you're fixing that one way or another (as you had done for the
other issues we've found), you argued that "nobody does that" or "should
never happen."

Sorry for the rant, and/or if I've misinterpreted your comment, but I
have no way to read that other than as an attempt to justify the
problem, particularly when, in comparison, every other issue was
clearly noted as one that would be fixed. Also note that I think
you've read my fragile code comment as harsher than intended (or I
didn't express my position clearly...). I'm simply trying to explain
why I'll likely not ack this patch unless that problem is fixed (e.g.,
because I consider it fragile, independent of the current behavior of
outside contexts).

I'll try to restate my position more distinctly/clearly... I consider
this patch broken so long as it doesn't handle the set of valid inputs
defined by xfs_trans_log_buf() correctly, as the current code does. I
expect any first <= last range to DTRT and ensure the associated chunk
of the buffer is logged.

> > ... having said all that and having already wasted more time on this
> > than it would have taken for you to just fix the patch, I'll end my rant
> > with this splat[1]. It demonstrates the "boundary condition" that "will
> > never occur during runtime on production systems" (production system
> > level output included for extra fun ;P).
> 
> This is a pre-existing bug in xlog_cil_insert_format_items()
> that my change has exposed:
> 
>                 /* Skip items that do not have any vectors for writing */
> 		if (!shadow->lv_niovecs && !ordered)
> 			continue;
> 
> The code I added triggers this (niovecs == 0), and that now gives
> us the case where we have a dirty log item descriptor
> (XFS_LID_DIRTY) without a log vector attached to item->li_lv.
> Then in xlog_cil_insert_items():
> 
>                 /* Skip items which aren't dirty in this transaction. */
>                 if (!(lidp->lid_flags & XFS_LID_DIRTY))
>                         continue;
> 
>                 /*
>                  * Only move the item if it isn't already at the tail. This is
>                  * to prevent a transient list_empty() state when reinserting
>                  * an item that is already the only item in the CIL.
>                  */
>                 if (!list_is_last(&lip->li_cil, &cil->xc_cil))
>                         list_move_tail(&lip->li_cil, &cil->xc_cil);
> 
> 
> We put that "clean" log item on the CIL because XFS_LID_DIRTY is
> set, and then when we push the CIL in xlog_cil_push(), we trip over
> a dirty log item without a log vector when chaining log vectors to
> pass to the log writing code here:
> 
>         while (!list_empty(&cil->xc_cil)) {
>                 struct xfs_log_item     *item;
> 
>                 item = list_first_entry(&cil->xc_cil,
>                                         struct xfs_log_item, li_cil);
>                 list_del_init(&item->li_cil);
>                 if (!ctx->lv_chain)
>                         ctx->lv_chain = item->li_lv;
>                 else
>                         lv->lv_next = item->li_lv;       <<<<<<<<<
>  >>>>>>>>       lv = item->li_lv;
>                 item->li_lv = NULL;
>                 num_iovecs += lv->lv_niovecs;
>         }
> 
> i.e. lv ends up null part way through the log item chain we are
> processing and the next loop iteration fails.
> 

Ok, I hadn't traced through the actual crash. The above all sounds sane
to me...

> IOWs, the bug isn't in the patch I wrote - it has uncovered a
> latent bug added years ago for a condition that had never, ever been
> exercised until now.
> 

Note that I agree you're probably right that there is a bug worth fixing
in the CIL code to not crash in this case. The crash is just a symptom,
however. There's still a bug in this patch because the buffer needs to
be logged.

IOW, the purpose of this test is to demonstrate that the "should never
happen" case argued above "actually can happen."

> Brian, can you now give me all the details of what you were doing to
> produce this and turn on CONFIG_XFS_DEBUG so that it catches the
> zero length buffer that was logged when it happens?  That way I can
> test a fix for this bug and that the buffer range logging exercises
> this case properly...
> 

Yep. It's a specially crafted symlink creation on a small FSB, v4
filesystem with fragmented free space. We log symlink buffers on v4
filesystems without any header, so the buffer content is not dictated by
any internal fs metadata format. If the link target is large enough to
span multiple blocks and free space is fragmented such that those blocks
are discontiguous, we can end up logging (solely) the first byte of the
last buffer of the link target.

This is actually reproducible on demand so I'll just append a basic
recipe rather than collect the debug data and whatnot..

Brian

--- 8< ---

dev=<dev>
mnt=/mnt

# symlink target long enough to span multiple 512 byte blocks
sym=`for i in $(seq 0 512); do echo -n a; done`

mkfs.xfs -f -mcrc=0 -bsize=512 -dsize=25m $dev
mount $dev $mnt

# fill the fs, then punch holes so the remaining free space is fragmented
dd if=/dev/zero of=$mnt/spc1
~/xfstests-dev/src/punch-alternating $mnt/spc1
dd if=/dev/zero of=$mnt/spc2
xfs_io -c "fpunch 5m 25m" $mnt/spc2

# create symlinks whose targets span discontiguous blocks
for i in $(seq 0 2); do
        ln -s $sym $mnt/link.$i
        xfs_io -c fsync $mnt
done

umount $mnt

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-14 13:09               ` Brian Foster
@ 2018-02-14 16:49                 ` Darrick J. Wong
  2018-02-14 18:08                   ` Brian Foster
  2018-02-14 22:30                 ` Dave Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2018-02-14 16:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Wed, Feb 14, 2018 at 08:09:39AM -0500, Brian Foster wrote:
> On Wed, Feb 14, 2018 at 09:02:20AM +1100, Dave Chinner wrote:
> > On Tue, Feb 13, 2018 at 08:15:26AM -0500, Brian Foster wrote:
> > > On Tue, Feb 13, 2018 at 08:18:24AM +1100, Dave Chinner wrote:
> > > > On Mon, Feb 12, 2018 at 09:26:19AM -0500, Brian Foster wrote:
> > > > > :/ So it seems to
> > > > > me this breaks a technically valid case in weird/subtle ways. For
> > > > > example, why assert about last == 0, but then go on to add the range
> > > > > anyways, explicitly not size it correctly, but then format it as if
> > > > > nothing is wrong? If it were really wrong/invalid (which I don't think
> > > > > it is), why not put the check in the log side and skip adding the range
> > > > > rather than add it, skip sizing it, and then format it.
> > > > 
> > > > So what you're really concerned about is that I put asserts into the
> > > > code to catch broken development code, but then allow production
> > > > systems through without caring whether it works correctly because
> > > > that boundary condition will never occur during runtime on
> > > > production systems?
> > > 
> > > No. As already mentioned in my previous mail, I care little about the
> > > asserts. Asserts can easily be removed if they turn out to be bogus.
> > > Wrong asserts tend to have little negative effect on production users
> > > because along with only affecting debug kernels, they'd have to be
> > > fairly rare to slip through our testing. So I'm perfectly _happy_ to be
> > > cautious with regard to asserts.
> > > 
> > > What I care much more about is not leaving latent bugs around in the
> > > code. IMO, there is very rarely good enough justification to knowingly
> > > commit buggy/fragile code to the kernel,
> > 
> > Hold on a minute!
> > 
> > I'm not asking anyone to commit buggy or fragile code. I've already
> > fixed the off-by-one problems you've pointed out, and all I was
> > trying to do was understand what you saw wrong with the asserts to
> > catch a "should never happen" condition so I could change it in a
> > way that you'd find acceptable.
> > 
> > There's no need to shout and rant at me....
> > 
> 
> I pointed out the first-byte logging case looked broken. Rather than
> indicate you're fixing that one way or another (as you had done for the
> other issues we've found), you argued that "nobody does that" or "should
> never happen."
> 
> Sorry for the rant and/or whether I've misinterpreted your comment, but
> I have no way to read that other than an attempt to justify the problem,
> particularly when in comparison every other issue was clearly noted that
> it would be fixed. Also note that I think you've slightly misinterpreted
> my fragile code comment to be harsher than intended (or I didn't express
> my position clearly...). I'm simply trying to explain why I'll likely
> not ack this patch unless that problem is fixed (e.g., because I
> consider it fragile, independent of current behavior of outside
> contexts).
> 
> I'll try to restate my position more distinctly/clearly... I consider
> this patch broken so long as it doesn't handle the set of valid inputs
> defined by xfs_trans_log_buf() correctly, as the current code does. I
> expect any first <= last range to DTRT and ensure the associated chunk
> of the buffer is logged.
> 
> > > ... having said all that and having already wasted more time on this
> > > than it would have taken for you to just fix the patch, I'll end my rant
> > > with this splat[1]. It demonstrates the "boundary condition" that "will
> > > never occur during runtime on production systems" (production system
> > > level output included for extra fun ;P).
> > 
> > This is a pre-existing bug in xlog_cil_insert_format_items()
> > that my change has exposed:
> > 
> >                 /* Skip items that do not have any vectors for writing */
> > 		if (!shadow->lv_niovecs && !ordered)
> > 			continue;
> > 
> > The code I added triggers this (niovecs == 0), and that now gives
> > us the case where we have a dirty log item descriptor
> > (XFS_LID_DIRTY) without a log vector attached to item->li_lv.
> > Then in xlog_cil_insert_items():
> > 
> >                 /* Skip items which aren't dirty in this transaction. */
> >                 if (!(lidp->lid_flags & XFS_LID_DIRTY))
> >                         continue;
> > 
> >                 /*
> >                  * Only move the item if it isn't already at the tail. This is
> >                  * to prevent a transient list_empty() state when reinserting
> >                  * an item that is already the only item in the CIL.
> >                  */
> >                 if (!list_is_last(&lip->li_cil, &cil->xc_cil))
> >                         list_move_tail(&lip->li_cil, &cil->xc_cil);
> > 
> > 
> > We put that "clean" log item on the CIL because XFS_LID_DIRTY is
> > set, and then when we push the CIL in xlog_cil_push(), we trip over
> > a dirty log item without a log vector when chaining log vectors to
> > pass to the log writing code here:
> > 
> >         while (!list_empty(&cil->xc_cil)) {
> >                 struct xfs_log_item     *item;
> > 
> >                 item = list_first_entry(&cil->xc_cil,
> >                                         struct xfs_log_item, li_cil);
> >                 list_del_init(&item->li_cil);
> >                 if (!ctx->lv_chain)
> >                         ctx->lv_chain = item->li_lv;
> >                 else
> >                         lv->lv_next = item->li_lv;       <<<<<<<<<
> >  >>>>>>>>       lv = item->li_lv;
> >                 item->li_lv = NULL;
> >                 num_iovecs += lv->lv_niovecs;
> >         }
> > 
> > i.e. lv ends up null part way through the log item chain we are
> > processing and the next loop iteration fails.
> > 
> 
> Ok, I hadn't traced through the actual crash. The above all sounds sane
> to me...
> 
> > IOWs, the bug isn't in the patch I wrote - it has uncovered a
> > latent bug added years ago for a condition that had never, ever been
> > exercised until now.
> > 
> 
> Note that I agree you're probably right that there is a bug worth fixing
> in the CIL code to not crash in this case. The crash is just a symptom,
> however. There's still a bug in this patch because the buffer needs to
> be logged.
> 
> IOW, the purpose of this test is to demonstrate that the "should never
> happen" case argued above "actually can happen."
> 
> > Brian, can you now give me all the details of what you were doing to
> > produce this and turn on CONFIG_XFS_DEBUG so that it catches the
> > zero length buffer that was logged when it happens?  That way I can
> > test a fix for this bug and that the buffer range logging exercises
> > this case properly...
> > 
> 
> Yep. It's a specially crafted symlink creation on a small FSB, v4
> filesystem with fragmented free space. We log symlink buffers on v4
> filesystems without any header, so the buffer content is not dictated by
> any internal fs metadata format. If the link target is large enough to
> span multiple blocks and free space is fragmented such that those blocks
> are discontiguous, we can end up logging (solely) the first byte of the
> last buffer of the link target.
> 
> This is actually reproducible on demand so I'll just append a basic
> recipe rather than collect the debug data and whatnot..
> 
> Brian
> 
> --- 8< ---
> 
> dev=<dev>
> mnt=/mnt
> 
> sym=`for i in $(seq 0 512); do echo -n a; done`
> 
> mkfs.xfs -f -mcrc=0 -bsize=512 -dsize=25m $dev
> mount $dev $mnt
> 
> dd if=/dev/zero of=$mnt/spc1
> ~/xfstests-dev/src/punch-alternating $mnt/spc1
> dd if=/dev/zero of=$mnt/spc2
> xfs_io -c "fpunch 5m 25m" $mnt/spc2
> 
> for i in $(seq 0 2); do
>         ln -s $sym $mnt/link.$i
>         xfs_io -c fsync $mnt
> done
> 
> umount $mnt

Did one of the "fragment free space, do stuff" xfstests hit this?  If
not, would it be worth turning into a test?

--D

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-14 16:49                 ` Darrick J. Wong
@ 2018-02-14 18:08                   ` Brian Foster
  2018-02-14 22:05                     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2018-02-14 18:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Wed, Feb 14, 2018 at 08:49:12AM -0800, Darrick J. Wong wrote:
> On Wed, Feb 14, 2018 at 08:09:39AM -0500, Brian Foster wrote:
> > On Wed, Feb 14, 2018 at 09:02:20AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 13, 2018 at 08:15:26AM -0500, Brian Foster wrote:
> > > > On Tue, Feb 13, 2018 at 08:18:24AM +1100, Dave Chinner wrote:
> > > > > On Mon, Feb 12, 2018 at 09:26:19AM -0500, Brian Foster wrote:
> > > > > > :/ So it seems to
> > > > > > me this breaks a technically valid case in weird/subtle ways. For
> > > > > > example, why assert about last == 0, but then go on to add the range
> > > > > > anyways, explicitly not size it correctly, but then format it as if
> > > > > > nothing is wrong? If it were really wrong/invalid (which I don't think
> > > > > > it is), why not put the check in the log side and skip adding the range
> > > > > > rather than add it, skip sizing it, and then format it.
> > > > > 
> > > > > So what you're really concerned about is that I put asserts into the
> > > > > code to catch broken development code, but then allow production
> > > > > systems through without caring whether it works correctly because
> > > > > that boundary condition will never occur during runtime on
> > > > > production systems?
> > > > 
> > > > No. As already mentioned in my previous mail, I care little about the
> > > > asserts. Asserts can easily be removed if they turn out to be bogus.
> > > > Wrong asserts tend to have little negative effect on production users
> > > > because along with only affecting debug kernels, they'd have to be
> > > > fairly rare to slip through our testing. So I'm perfectly _happy_ to be
> > > > cautious with regard to asserts.
> > > > 
> > > > What I care much more about is not leaving latent bugs around in the
> > > > code. IMO, there is very rarely good enough justification to knowingly
> > > > commit buggy/fragile code to the kernel,
> > > 
> > > Hold on a minute!
> > > 
> > > I'm not asking anyone to commit buggy or fragile code. I've already
> > > fixed the off-by-one problems you've pointed out, and all I was
> > > trying to do was understand what you saw wrong with the asserts to
> > > catch a "should never happen" condition so I could change it in a
> > > way that you'd find acceptable.
> > > 
> > > There's no need to shout and rant at me....
> > > 
> > 
> > I pointed out the first-byte logging case looked broken. Rather than
> > indicate you're fixing that one way or another (as you had done for the
> > other issues we've found), you argued that "nobody does that" or "should
> > never happen."
> > 
> > Sorry for the rant and/or whether I've misinterpreted your comment, but
> > I have no way to read that other than an attempt to justify the problem,
> > particularly when in comparison every other issue was clearly noted that
> > it would be fixed. Also note that I think you've slightly misinterpreted
> > my fragile code comment to be harsher than intended (or I didn't express
> > my position clearly...). I'm simply trying to explain why I'll likely
> > not ack this patch unless that problem is fixed (e.g., because I
> > consider it fragile, independent of current behavior of outside
> > contexts).
> > 
> > I'll try to restate my position more distinctly/clearly... I consider
> > this patch broken so long as it doesn't handle the set of valid inputs
> > defined by xfs_trans_log_buf() correctly, as the current code does. I
> > expect any first <= last range to DTRT and ensure the associated chunk
> > of the buffer is logged.
> > 
> > > > ... having said all that and having already wasted more time on this
> > > > than it would have taken for you to just fix the patch, I'll end my rant
> > > > with this splat[1]. It demonstrates the "boundary condition" that "will
> > > > never occur during runtime on production systems" (production system
> > > > level output included for extra fun ;P).
> > > 
> > > This is a pre-existing bug in xlog_cil_insert_format_items()
> > > that my change has exposed:
> > > 
> > >                 /* Skip items that do not have any vectors for writing */
> > > 		if (!shadow->lv_niovecs && !ordered)
> > > 			continue;
> > > 
> > > The code I added triggers this (niovecs == 0), and that now gives
> > > us the case where we have a dirty log item descriptor
> > > (XFS_LID_DIRTY) without a log vector attached to item->li_lv.
> > > Then in xlog_cil_insert_items():
> > > 
> > >                 /* Skip items which aren't dirty in this transaction. */
> > >                 if (!(lidp->lid_flags & XFS_LID_DIRTY))
> > >                         continue;
> > > 
> > >                 /*
> > >                  * Only move the item if it isn't already at the tail. This is
> > >                  * to prevent a transient list_empty() state when reinserting
> > >                  * an item that is already the only item in the CIL.
> > >                  */
> > >                 if (!list_is_last(&lip->li_cil, &cil->xc_cil))
> > >                         list_move_tail(&lip->li_cil, &cil->xc_cil);
> > > 
> > > 
> > > We put that "clean" log item on the CIL because XFS_LID_DIRTY is
> > > set, and then when we push the CIL in xlog_cil_push(), we trip over
> > > a dirty log item without a log vector when chaining log vectors to
> > > pass to the log writing code here:
> > > 
> > >         while (!list_empty(&cil->xc_cil)) {
> > >                 struct xfs_log_item     *item;
> > > 
> > >                 item = list_first_entry(&cil->xc_cil,
> > >                                         struct xfs_log_item, li_cil);
> > >                 list_del_init(&item->li_cil);
> > >                 if (!ctx->lv_chain)
> > >                         ctx->lv_chain = item->li_lv;
> > >                 else
> > >                         lv->lv_next = item->li_lv;       <<<<<<<<<
> > >  >>>>>>>>       lv = item->li_lv;
> > >                 item->li_lv = NULL;
> > >                 num_iovecs += lv->lv_niovecs;
> > >         }
> > > 
> > > i.e. lv ends up null part way through the log item chain we are
> > > processing and the next loop iteration fails.
> > > 
> > 
> > Ok, I hadn't traced through the actual crash. The above all sounds sane
> > to me...
> > 
> > > IOWs, the bug isn't in the patch I wrote - it has uncovered a
> > > latent bug added years ago for a condition that had never, ever been
> > > exercised until now.
> > > 
> > 
> > Note that I agree you're probably right that there is a bug worth fixing
> > in the CIL code to not crash in this case. The crash is just a symptom,
> > however. There's still a bug in this patch because the buffer needs to
> > be logged.
> > 
> > IOW, the purpose of this test is to demonstrate that the "should never
> > happen" case argued above "actually can happen."
> > 
> > > Brian, can you now give me all the details of what you were doing to
> > > produce this and turn on CONFIG_XFS_DEBUG so that it catches the
> > > zero length buffer that was logged when it happens?  That way I can
> > > test a fix for this bug and that the buffer range logging exercises
> > > this case properly...
> > > 
> > 
> > Yep. It's a specially crafted symlink creation on a small FSB, v4
> > filesystem with fragmented free space. We log symlink buffers on v4
> > filesystems without any header, so the buffer content is not dictated by
> > any internal fs metadata format. If the link target is large enough to
> > span multiple blocks and free space is fragmented such that those blocks
> > are discontiguous, we can end up logging (solely) the first byte of the
> > last buffer of the link target.
> > 
> > This is actually reproducible on demand so I'll just append a basic
> > recipe rather than collect the debug data and whatnot..
> > 
> > Brian
> > 
> > --- 8< ---
> > 
> > dev=<dev>
> > mnt=/mnt
> > 
> > sym=`for i in $(seq 0 512); do echo -n a; done`
> > 
> > mkfs.xfs -f -mcrc=0 -bsize=512 -dsize=25m $dev
> > mount $dev $mnt
> > 
> > dd if=/dev/zero of=$mnt/spc1
> > ~/xfstests-dev/src/punch-alternating $mnt/spc1
> > dd if=/dev/zero of=$mnt/spc2
> > xfs_io -c "fpunch 5m 25m" $mnt/spc2
> > 
> > for i in $(seq 0 2); do
> >         ln -s $sym $mnt/link.$i
> >         xfs_io -c fsync $mnt
> > done
> > 
> > umount $mnt
> 
> Did one of the "fragment free space, do stuff" xfstests hit this?  If
> not, would it be worth turning into a test?
> 

This was just an experiment on this patch. I haven't run xfstests so I
can't say for sure whether some existing test would have caught it
(though I suspect Dave would have hit the problem by now, if so). I'm
not sure it's worth an independent test since it really just exercises a
bug in a patch that is still under development (as opposed to a
regression). This sequence won't trigger any problems that I'm aware of
on upstream XFS.

Brian

> --D
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-14 18:08                   ` Brian Foster
@ 2018-02-14 22:05                     ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2018-02-14 22:05 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Wed, Feb 14, 2018 at 01:08:07PM -0500, Brian Foster wrote:
> On Wed, Feb 14, 2018 at 08:49:12AM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 14, 2018 at 08:09:39AM -0500, Brian Foster wrote:
> > > On Wed, Feb 14, 2018 at 09:02:20AM +1100, Dave Chinner wrote:
> > > Yep. It's a specially crafted symlink creation on a small FSB, v4
> > > filesystem with fragmented free space. We log symlink buffers on v4
> > > filesystems without any header, so the buffer content is not dictated by
> > > any internal fs metadata format. If the link target is large enough to
> > > span multiple blocks and free space is fragmented such that those blocks
> > > are discontiguous, we can end up logging (solely) the first byte of the
> > > last buffer of the link target.

I'd completely forgotten about that whacky corner case in the v4
format. :(

> > > This is actually reproducible on demand so I'll just append a basic
> > > recipe rather than collect the debug data and whatnot..
> > > 
> > > Brian
> > > 
> > > --- 8< ---
> > > 
> > > dev=<dev>
> > > mnt=/mnt
> > > 
> > > sym=`for i in $(seq 0 512); do echo -n a; done`
> > > 
> > > mkfs.xfs -f -mcrc=0 -bsize=512 -dsize=25m $dev
> > > mount $dev $mnt
> > > 
> > > dd if=/dev/zero of=$mnt/spc1
> > > ~/xfstests-dev/src/punch-alternating $mnt/spc1
> > > dd if=/dev/zero of=$mnt/spc2
> > > xfs_io -c "fpunch 5m 25m" $mnt/spc2
> > > 
> > > for i in $(seq 0 2); do
> > >         ln -s $sym $mnt/link.$i
> > >         xfs_io -c fsync $mnt
> > > done
> > > 
> > > umount $mnt
> > 
> > Did one of the "fragment free space, do stuff" xfstests hit this?  If
> > not, would it be worth turning into a test?
> > 
> 
> This was just an experiment on this patch. I haven't run xfstests so I
> can't say for sure whether some existing test would have caught it
> (though I suspect Dave would have hit the problem by now, if so). I'm

Nope, a v4 512 byte block size filesystem is so far outside my
normal test config matrix it's not funny. In fact, I almost never
test on v4 filesystems anymore, and I rarely think of them when
developing new code as it's essentially a legacy format now....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-14 13:09               ` Brian Foster
  2018-02-14 16:49                 ` Darrick J. Wong
@ 2018-02-14 22:30                 ` Dave Chinner
  2018-02-15 13:42                   ` Brian Foster
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2018-02-14 22:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Feb 14, 2018 at 08:09:39AM -0500, Brian Foster wrote:
> On Wed, Feb 14, 2018 at 09:02:20AM +1100, Dave Chinner wrote:
> > On Tue, Feb 13, 2018 at 08:15:26AM -0500, Brian Foster wrote:
> > > On Tue, Feb 13, 2018 at 08:18:24AM +1100, Dave Chinner wrote:
> > > > On Mon, Feb 12, 2018 at 09:26:19AM -0500, Brian Foster wrote:
> > > > > :/ So it seems to
> > > > > me this breaks a technically valid case in weird/subtle ways. For
> > > > > example, why assert about last == 0, but then go on to add the range
> > > > > anyways, explicitly not size it correctly, but then format it as if
> > > > > nothing is wrong? If it were really wrong/invalid (which I don't think
> > > > > it is), why not put the check in the log side and skip adding the range
> > > > > rather than add it, skip sizing it, and then format it.
> > > > 
> > > > So what you're really concerned about is that I put asserts into the
> > > > code to catch broken development code, but then allow production
> > > > systems through without caring whether it works correctly because
> > > > that boundary condition will never occur during runtime on
> > > > production systems?
> > > 
> > > No. As already mentioned in my previous mail, I care little about the
> > > asserts. Asserts can easily be removed if they turn out to be bogus.
> > > Wrong asserts tend to have little negative effect on production users
> > > because along with only affecting debug kernels, they'd have to be
> > > fairly rare to slip through our testing. So I'm perfectly _happy_ to be
> > > cautious with regard to asserts.
> > > 
> > > What I care much more about is not leaving latent bugs around in the
> > > code. IMO, there is very rarely good enough justification to knowingly
> > > commit buggy/fragile code to the kernel,
> > 
> > Hold on a minute!
> > 
> > I'm not asking anyone to commit buggy or fragile code. I've already
> > fixed the off-by-one problems you've pointed out, and all I was
> > trying to do was understand what you saw wrong with the asserts to
> > catch a "should never happen" condition so I could change it in a
> > way that you'd find acceptable.
> > 
> > There's no need to shout and rant at me....
> > 
> 
> I pointed out the first-byte logging case looked broken. Rather than
> indicate you're fixing that one way or another (as you had done for the
> other issues we've found), you argued that "nobody does that" or "should
> never happen."

I was simply stating the reason I'd put the assert there.  Asserts
are used to document and runtime-check assumptions about the code
that follows, which is why I put them in place - I'd made an
assumption that holds true on v5 filesystems....

But if I don't explain the reason/logic behind the assumption
documented in the assert, then nobody is going to be able to point
out where the mistake or wrong assumption in my reasoning is.
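
As a purely hypothetical illustration of that pattern (this is not
the code from the patch), an assert plus a comment spelling out the
assumption is what lets a reviewer see where the reasoning is wrong:

        /*
         * Assumption: no caller ever logs a dirty range ending at
         * byte zero of the buffer.  (Wrong, as it turns out, for
         * fragmented v4 symlink buffers.)
         */
        ASSERT(last != 0);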

The way I read your comments was "the old code supported it, so the
new code must too" but they did not demonstrate any requirement for
the status quo to be maintained.  All you really needed to add was a
single sentence stating "fragmented v4 symlink buffers need to log a
1 byte range" and it would have been immediately clear (to
everyone!) where my assumptions had gone wrong....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] xfs: byte range buffer dirty region tracking
  2018-02-14 22:30                 ` Dave Chinner
@ 2018-02-15 13:42                   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2018-02-15 13:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Feb 15, 2018 at 09:30:27AM +1100, Dave Chinner wrote:
> On Wed, Feb 14, 2018 at 08:09:39AM -0500, Brian Foster wrote:
> > On Wed, Feb 14, 2018 at 09:02:20AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 13, 2018 at 08:15:26AM -0500, Brian Foster wrote:
> > > > On Tue, Feb 13, 2018 at 08:18:24AM +1100, Dave Chinner wrote:
> > > > > On Mon, Feb 12, 2018 at 09:26:19AM -0500, Brian Foster wrote:
> > > > > > :/ So it seems to
> > > > > > me this breaks a technically valid case in weird/subtle ways. For
> > > > > > example, why assert about last == 0, but then go on to add the range
> > > > > > anyways, explicitly not size it correctly, but then format it as if
> > > > > > nothing is wrong? If it were really wrong/invalid (which I don't think
> > > > > > it is), why not put the check in the log side and skip adding the range
> > > > > > rather than add it, skip sizing it, and then format it.
> > > > > 
> > > > > So what you're really concerned about is that I put asserts into the
> > > > > code to catch broken development code, but then allow production
> > > > > systems through without caring whether it works correctly because
> > > > > that boundary condition will never occur during runtime on
> > > > > production systems?
> > > > 
> > > > No. As already mentioned in my previous mail, I care little about the
> > > > asserts. Asserts can easily be removed if they turn out to be bogus.
> > > > Wrong asserts tend to have little negative effect on production users
> > > > because along with only affecting debug kernels, they'd have to be
> > > > fairly rare to slip through our testing. So I'm perfectly _happy_ to be
> > > > cautious with regard to asserts.
> > > > 
> > > > What I care much more about is not leaving latent bugs around in the
> > > > code. IMO, there is very rarely good enough justification to knowingly
> > > > commit buggy/fragile code to the kernel,
> > > 
> > > Hold on a minute!
> > > 
> > > I'm not asking anyone to commit buggy or fragile code. I've already
> > > fixed the off-by-one problems you've pointed out, and all I was
> > > trying to do was understand what you saw wrong with the asserts to
> > > catch a "should never happen" condition so I could change it in a
> > > way that you'd find acceptable.
> > > 
> > > There's no need to shout and rant at me....
> > > 
> > 
> > I pointed out the first-byte logging case looked broken. Rather than
> > indicate you're fixing that one way or another (as you had done for the
> > other issues we've found), you argued that "nobody does that" or "should
> > never happen."
> 
> I was simply stating the reason why I'd put the assert there.
> Asserts are used to document and runtime check assumptions about the code that
> follows, which is why I put them in place - I'd made an assumption
> that holds true on v5 filesystems....
> 
> But if I don't explain the reason/logic behind the assumption
> documented in the assert, then nobody is going to be able to point
> out where the mistake or wrong assumption in my reasoning is.
> 
> The way I read your comments was "the old code supported it, so the
> new code must too" but they did not demonstrate any requirement for
> the status quo to be maintained.  All you really needed to add was a
> single sentence stating "fragmented v4 symlink buffers need to log a
> 1 byte range" and it would have been immediately clear (to
> everyone!) where my assumptions had gone wrong....
> 

Just to close the loop on this (since I think we've cleared up our
mutual misunderstanding on IRC)...

I had no idea about the symlink buffer case until the point where I
reported the splat. My criterion for review was always the following
invariant for the implementation:

- xfs_trans_log_buf() receives an inclusive byte range in the form of
  [first, last] and logs the corresponding chunk of the buffer
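
Concretely, a sketch (not code from the patch; tp and bp stand for
whatever transaction and buffer the caller holds):

        /* logging the single byte at offset 0: first == last == 0 */
        xfs_trans_log_buf(tp, bp, 0, 0);

        /* logging an entire 512 byte buffer */
        xfs_trans_log_buf(tp, bp, 0, 511);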

Despite the asserts (which confused me, were then explained, and which
I eventually agreed were reasonable), it was also clear on initial
review that the implementation did not satisfy the invariant with
input parameters of [0, 0]. I considered this case not because of the
assert, but because the explicit skip of last == 0 in the ->iop_size()
handler set off alarm bells.

I only began to look for a real bug when it became apparent that if I
were lucky enough to find one, a demonstration would be a more
convenient means to convince you that the implementation still had a bug
that needed fixing. So somehow this all got mixed up into you thinking I
was still arguing about the asserts and me thinking you were arguing
about asserts because you saw the bug but didn't care to fix it.

TBH, when I first saw the symlink code I expected to be able to
reproduce a log recovery symlink corruption. I figured that since
->iop_size() skipped the last == 0 case, we'd fail to log the buffer
entirely and hilarity would ensue in the event of a well-timed shutdown.
Instead, it blew up in my face (due to the other bug that you've already
described) and so I posted the splat.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-02-15 13:42 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-01  1:05 [PATCH] xfs: byte range buffer dirty region tracking Dave Chinner
2018-02-01  5:11 ` Darrick J. Wong
2018-02-01  8:14   ` Dave Chinner
2018-02-01 20:35     ` Darrick J. Wong
2018-02-01 23:16       ` Dave Chinner
2018-02-01 23:22         ` Darrick J. Wong
2018-02-01 23:55           ` Dave Chinner
2018-02-02 10:56             ` Brian Foster
2018-02-05  0:34 ` [PATCH v2] " Dave Chinner
2018-02-06 16:21   ` Brian Foster
2018-02-12  2:41     ` Dave Chinner
2018-02-12 14:26       ` Brian Foster
2018-02-12 21:18         ` Dave Chinner
2018-02-13 13:15           ` Brian Foster
2018-02-13 22:02             ` Dave Chinner
2018-02-14 13:09               ` Brian Foster
2018-02-14 16:49                 ` Darrick J. Wong
2018-02-14 18:08                   ` Brian Foster
2018-02-14 22:05                     ` Dave Chinner
2018-02-14 22:30                 ` Dave Chinner
2018-02-15 13:42                   ` Brian Foster
