linux-xfs.vger.kernel.org archive mirror
* [PATCH 0/3] xfs: buffer log item optimisations
@ 2021-02-23  4:46 Dave Chinner
  2021-02-23  4:46 ` [PATCH 1/3] xfs: reduce buffer log item shadow allocations Dave Chinner
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Dave Chinner @ 2021-02-23  4:46 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

A couple of optimisations and a fix for a bug that I don't think we
can actually trigger.

The bug fix was that we weren't passing the segment offset into the
buffer log item sizing calculation, so we weren't correctly
calculating when a dirty region spanned discontiguous pages in the
buffer. I don't
think this ever mattered, because all buffers larger than a single
page are vmapped (so a contiguous virtual address range) and the
only direct mapped buffers we have are inode cluster buffers and
they never span discontiguous extents. Hence while the code is
clearly wrong, we never actually trigger the situation where it
results in an incorrect calculation. However, changing the way we
calculate the size of the dirty regions is difficult if we don't do
this calc the same way as the formatting code, so fix it.

The first optimisation is simply a mechanism to reduce the amount of
allocation and freeing overhead on the buffer item shadow buffer as
we increase the amount of the buffer that is dirty as we relog it.

The last optimisation is (finally) addressing the overhead of bitmap
based dirty tracking of the buffer log item. We walk a bit at a
time, calling xfs_buf_offset() at least twice for each bit in both
the size and the formatting code to see if the region crosses a
discontiguity in the buffer address space. This is expensive. The
log recovery code uses contiguous bit range detection to do the
same thing, so I've updated the logging code to operate on
contiguous bit ranges and only fall back to bit-by-bit checking in
the rare case that a contiguous dirty range spans an address space
discontiguity and hence has to be split into multiple regions to
copy it into the log.

This enables big performance improvements when using large directory
block sizes.
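
For illustration, here is a minimal userspace sketch of the
contiguous-run scan described above. next_bit() and contig_bits() are
simplified stand-ins for xfs_next_bit() and xfs_contig_bits(), and the
straddle/slow-path fallback is omitted:

/*
 * Walk a dirty bitmap one contiguous run at a time instead of one bit
 * at a time. Each run would become a single log iovec.
 */
#include <stdio.h>

#define MAP_BITS	64

/* find the next set bit at or after @start, or -1 if none */
static int next_bit(unsigned long long map, int start)
{
	for (int i = start; i < MAP_BITS; i++)
		if (map & (1ULL << i))
			return i;
	return -1;
}

/* count the contiguous set bits starting at @start */
static int contig_bits(unsigned long long map, int start)
{
	int n = 0;

	while (start + n < MAP_BITS && (map & (1ULL << (start + n))))
		n++;
	return n;
}

int main(void)
{
	unsigned long long map = 0xff0000f3ULL;	/* example dirty bitmap */
	int first = next_bit(map, 0);

	while (first != -1) {
		int nbits = contig_bits(map, first);

		printf("dirty run: bits %d-%d (%d chunks)\n",
		       first, first + nbits - 1, nbits);
		first = next_bit(map, first + nbits);
	}
	return 0;
}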

Cheers,

Dave.



* [PATCH 1/3] xfs: reduce buffer log item shadow allocations
  2021-02-23  4:46 [PATCH 0/3] xfs: buffer log item optimisations Dave Chinner
@ 2021-02-23  4:46 ` Dave Chinner
  2021-02-24 21:29   ` Darrick J. Wong
  2021-03-02 14:37   ` Brian Foster
  2021-02-23  4:46 ` [PATCH 2/3] xfs: xfs_buf_item_size_segment() needs to pass segment offset Dave Chinner
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2021-02-23  4:46 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we modify btrees repeatedly, we regularly increase the size of
the logged region by a single chunk at a time (per transaction
commit). This results in the CIL formatting code having to
reallocate the log vector buffer every time the buffer dirty region
grows. Hence over a typical 4kB btree buffer, we might grow the log
vector 4096/128 = 32x over a short period where we repeatedly add
or remove records to/from the buffer over a series of running
transactions. This means we are doing 32 memory allocations and frees
over this time during a performance critical path in the journal.

The amount of space tracked in the CIL for the object is calculated
during the ->iop_format() call for the buffer log item, but the
buffer memory allocated for it is calculated by the ->iop_size()
call. The size callout determines the size of the buffer, the format
call determines the space used in the buffer.

Hence we can oversize the buffer space required in the size
calculation without impacting the amount of space used and accounted
to the CIL for the changes being logged. This allows us to reduce
the number of allocations by rounding up the buffer size to allow
for future growth. This can save a substantial amount of CPU time in
this path:

-   46.52%     2.02%  [kernel]                  [k] xfs_log_commit_cil
   - 44.49% xfs_log_commit_cil
      - 30.78% _raw_spin_lock
         - 30.75% do_raw_spin_lock
              30.27% __pv_queued_spin_lock_slowpath

(oh, ouch!)
....
      - 1.05% kmem_alloc_large
         - 1.02% kmem_alloc
              0.94% __kmalloc

This overhead here is what this patch is aimed at. After:

      - 0.76% kmem_alloc_large
         - 0.75% kmem_alloc
              0.70% __kmalloc

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/xfs_buf_item.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 17960b1ce5ef..0628a65d9c55 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -142,6 +142,7 @@ xfs_buf_item_size(
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
 	int			i;
+	int			bytes;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
@@ -173,7 +174,7 @@ xfs_buf_item_size(
 	}
 
 	/*
-	 * the vector count is based on the number of buffer vectors we have
+	 * The vector count is based on the number of buffer vectors we have
 	 * dirty bits in. This will only be greater than one when we have a
 	 * compound buffer with more than one segment dirty. Hence for compound
 	 * buffers we need to track which segment the dirty bits correspond to,
@@ -181,10 +182,18 @@ xfs_buf_item_size(
 	 * count for the extra buf log format structure that will need to be
 	 * written.
 	 */
+	bytes = 0;
 	for (i = 0; i < bip->bli_format_count; i++) {
 		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
-					  nvecs, nbytes);
+					  nvecs, &bytes);
 	}
+
+	/*
+	 * Round up the buffer size required to minimise the number of memory
+	 * allocations that need to be done as this item grows when relogged by
+	 * repeated modifications.
+	 */
+	*nbytes = round_up(bytes, 512);
 	trace_xfs_buf_item_size(bip);
 }
 
-- 
2.28.0



* [PATCH 2/3] xfs: xfs_buf_item_size_segment() needs to pass segment offset
  2021-02-23  4:46 [PATCH 0/3] xfs: buffer log item optimisations Dave Chinner
  2021-02-23  4:46 ` [PATCH 1/3] xfs: reduce buffer log item shadow allocations Dave Chinner
@ 2021-02-23  4:46 ` Dave Chinner
  2021-02-24 21:34   ` Darrick J. Wong
  2021-03-02 14:37   ` Brian Foster
  2021-02-23  4:46 ` [PATCH 3/3] xfs: optimise xfs_buf_item_size/format for contiguous regions Dave Chinner
  2021-02-25  9:01 ` [PATCH 0/3] xfs: buffer log item optimisations Christoph Hellwig
  3 siblings, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2021-02-23  4:46 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Otherwise it doesn't correctly calculate the number of vectors
in a logged buffer that has a contiguous map that gets split into
multiple regions because the range spans discontiguous memory.

Probably never been hit in practice - we don't log contiguous ranges
on unmapped buffers (inode clusters).
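
As a rough illustration (values and helpers here are hypothetical, not
the kernel code), the dirty bitmap in each log format segment is
relative to the start of that segment, so the byte offset used for the
buffer address lookup has to include the lengths of all preceding
segments:

#include <stdio.h>

#define BBTOB(bbs)	((bbs) << 9)	/* 512 byte basic blocks to bytes */
#define XFS_BLF_SHIFT	7		/* 128 byte bitmap chunks */

int main(void)
{
	/* a compound buffer made of two 4kB segments (8 basic blocks each) */
	unsigned int bm_len[2] = { 8, 8 };
	unsigned int offset = 0;

	for (int i = 0; i < 2; i++) {
		int bit = 3;	/* some dirty bit within this segment */

		/*
		 * Without the segment offset, segment 1 is looked up at the
		 * same byte as segment 0 and the straddle check inspects the
		 * wrong part of the buffer.
		 */
		printf("segment %d, bit %d: without offset %u, with offset %u\n",
		       i, bit, bit << XFS_BLF_SHIFT,
		       offset + (bit << XFS_BLF_SHIFT));
		offset += BBTOB(bm_len[i]);
	}
	return 0;
}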

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 0628a65d9c55..91dc7d8c9739 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -55,6 +55,18 @@ xfs_buf_log_format_size(
 			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
 }
 
+static inline bool
+xfs_buf_item_straddle(
+	struct xfs_buf		*bp,
+	uint			offset,
+	int			next_bit,
+	int			last_bit)
+{
+	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
+		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
+		 XFS_BLF_CHUNK);
+}
+
 /*
  * Return the number of log iovecs and space needed to log the given buf log
  * item segment.
@@ -67,6 +79,7 @@ STATIC void
 xfs_buf_item_size_segment(
 	struct xfs_buf_log_item		*bip,
 	struct xfs_buf_log_format	*blfp,
+	uint				offset,
 	int				*nvecs,
 	int				*nbytes)
 {
@@ -101,12 +114,8 @@ xfs_buf_item_size_segment(
 		 */
 		if (next_bit == -1) {
 			break;
-		} else if (next_bit != last_bit + 1) {
-			last_bit = next_bit;
-			(*nvecs)++;
-		} else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) !=
-			   (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) +
-			    XFS_BLF_CHUNK)) {
+		} else if (next_bit != last_bit + 1 ||
+		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
 			last_bit = next_bit;
 			(*nvecs)++;
 		} else {
@@ -141,8 +150,10 @@ xfs_buf_item_size(
 	int			*nbytes)
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
+	struct xfs_buf		*bp = bip->bli_buf;
 	int			i;
 	int			bytes;
+	uint			offset = 0;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
@@ -184,8 +195,9 @@ xfs_buf_item_size(
 	 */
 	bytes = 0;
 	for (i = 0; i < bip->bli_format_count; i++) {
-		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
+		xfs_buf_item_size_segment(bip, &bip->bli_formats[i], offset,
 					  nvecs, &bytes);
+		offset += BBTOB(bp->b_maps[i].bm_len);
 	}
 
 	/*
@@ -212,18 +224,6 @@ xfs_buf_item_copy_iovec(
 			nbits * XFS_BLF_CHUNK);
 }
 
-static inline bool
-xfs_buf_item_straddle(
-	struct xfs_buf		*bp,
-	uint			offset,
-	int			next_bit,
-	int			last_bit)
-{
-	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
-		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
-		 XFS_BLF_CHUNK);
-}
-
 static void
 xfs_buf_item_format_segment(
 	struct xfs_buf_log_item	*bip,
-- 
2.28.0



* [PATCH 3/3] xfs: optimise xfs_buf_item_size/format for contiguous regions
  2021-02-23  4:46 [PATCH 0/3] xfs: buffer log item optimisations Dave Chinner
  2021-02-23  4:46 ` [PATCH 1/3] xfs: reduce buffer log item shadow allocations Dave Chinner
  2021-02-23  4:46 ` [PATCH 2/3] xfs: xfs_buf_item_size_segment() needs to pass segment offset Dave Chinner
@ 2021-02-23  4:46 ` Dave Chinner
  2021-02-24 21:39   ` Darrick J. Wong
  2021-03-02 14:38   ` Brian Foster
  2021-02-25  9:01 ` [PATCH 0/3] xfs: buffer log item optimisations Christoph Hellwig
  3 siblings, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2021-02-23  4:46 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We process the buf_log_item bitmap one set bit at a time with
xfs_next_bit() so we can detect if a region crosses a memcpy
discontinuity in the buffer data address. This has massive overhead
on large buffers (e.g. 64k directory blocks) because we do a lot of
unnecessary checks and xfs_buf_offset() calls.

For example, 16-way concurrent create workload on debug kernel
running CPU bound has this at the top of the profile at ~120k
create/s on 64kb directory block size:

  20.66%  [kernel]  [k] xfs_dir3_leaf_check_int
   7.10%  [kernel]  [k] memcpy
   6.22%  [kernel]  [k] xfs_next_bit
   3.55%  [kernel]  [k] xfs_buf_offset
   3.53%  [kernel]  [k] xfs_buf_item_format
   3.34%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   3.04%  [kernel]  [k] do_raw_spin_lock
   2.84%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
   2.31%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   1.36%  [kernel]  [k] xfs_log_commit_cil

(debug checks hurt large blocks)

The only buffers with discontinuities in the data address are
unmapped buffers, and they are only used for inode cluster buffers
and only for logging unlinked pointers. IOWs, it is -rare- that we
even need to detect a discontinuity in the buffer item formatting
code.

Optimise all this by using xfs_contig_bits() to find the size of
the contiguous regions, then test for a discontinuity inside each one. If
we find one, do the slow "bit at a time" method we do now. If we
don't, then just copy the entire contiguous range in one go.
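
As a standalone sketch of the straddle test the fast path relies on
(the page layout is contrived and buf_offset() is a stand-in for
xfs_buf_offset(); this is not the kernel implementation):

/*
 * A run of contiguous dirty chunks can be copied with a single memcpy
 * only if the chunk addresses are contiguous in memory.
 */
#include <stdbool.h>
#include <stdio.h>

#define CHUNK		128		/* XFS_BLF_CHUNK */
#define PAGE_SZ		4096

static char backing[3 * PAGE_SZ];
/* the buffer's two pages are deliberately not adjacent in memory */
static char * const pages[2] = { backing, backing + 2 * PAGE_SZ };

static char *buf_offset(unsigned int offset)
{
	return pages[offset / PAGE_SZ] + offset % PAGE_SZ;
}

static bool straddle(unsigned int offset, int first_bit, int nbits)
{
	char *first = buf_offset(offset + first_bit * CHUNK);
	char *last = buf_offset(offset + (first_bit + nbits) * CHUNK);

	return last - first != (long)nbits * CHUNK;
}

int main(void)
{
	/* run entirely inside the first page: one memcpy is fine */
	printf("bits 0-7:   straddle=%d\n", straddle(0, 0, 8));
	/* run crossing the page boundary: fall back to bit-at-a-time */
	printf("bits 28-35: straddle=%d\n", straddle(0, 28, 8));
	return 0;
}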

Profile now looks like:

  25.26%  [kernel]  [k] xfs_dir3_leaf_check_int
   9.25%  [kernel]  [k] memcpy
   5.01%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   2.84%  [kernel]  [k] do_raw_spin_lock
   2.22%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   1.88%  [kernel]  [k] xfs_buf_find
   1.53%  [kernel]  [k] memmove
   1.47%  [kernel]  [k] xfs_log_commit_cil
....
   0.34%  [kernel]  [k] xfs_buf_item_format
....
   0.21%  [kernel]  [k] xfs_buf_offset
....
   0.16%  [kernel]  [k] xfs_contig_bits
....
   0.13%  [kernel]  [k] xfs_buf_item_size_segment.isra.0

So the bit scanning overhead for the dirty region tracking of the
buffer log items is basically gone. Debug overhead hurts even more
now...

Perf comparison

		dir block	 creates		unlink
		size (kb)	time	rate		time

Original	 4		4m08s	220k		 5m13s
Original	64		7m21s	115k		13m25s
Patched		 4		3m59s	230k		 5m03s
Patched		64		6m23s	143k		12m33s


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 102 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 87 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 91dc7d8c9739..14d1fefcbf4c 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -59,12 +59,18 @@ static inline bool
 xfs_buf_item_straddle(
 	struct xfs_buf		*bp,
 	uint			offset,
-	int			next_bit,
-	int			last_bit)
+	int			first_bit,
+	int			nbits)
 {
-	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
-		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
-		 XFS_BLF_CHUNK);
+	void			*first, *last;
+
+	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
+	last = xfs_buf_offset(bp,
+			offset + ((first_bit + nbits) << XFS_BLF_SHIFT));
+
+	if (last - first != nbits * XFS_BLF_CHUNK)
+		return true;
+	return false;
 }
 
 /*
@@ -84,20 +90,51 @@ xfs_buf_item_size_segment(
 	int				*nbytes)
 {
 	struct xfs_buf			*bp = bip->bli_buf;
+	int				first_bit;
+	int				nbits;
 	int				next_bit;
 	int				last_bit;
 
-	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
-	if (last_bit == -1)
+	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
+	if (first_bit == -1)
 		return;
 
-	/*
-	 * initial count for a dirty buffer is 2 vectors - the format structure
-	 * and the first dirty region.
-	 */
-	*nvecs += 2;
-	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
+	(*nvecs)++;
+	*nbytes += xfs_buf_log_format_size(blfp);
+
+	do {
+		nbits = xfs_contig_bits(blfp->blf_data_map,
+					blfp->blf_map_size, first_bit);
+		ASSERT(nbits > 0);
+
+		/*
+		 * Straddling a page is rare because we don't log contiguous
+		 * chunks of unmapped buffers anywhere.
+		 */
+		if (nbits > 1 &&
+		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
+			goto slow_scan;
+
+		(*nvecs)++;
+		*nbytes += nbits * XFS_BLF_CHUNK;
+
+		/*
+		 * This takes the bit number to start looking from and
+		 * returns the next set bit from there.  It returns -1
+		 * if there are no more bits set or the start bit is
+		 * beyond the end of the bitmap.
+		 */
+		first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
+					(uint)first_bit + nbits + 1);
+	} while (first_bit != -1);
 
+	return;
+
+slow_scan:
+	/* Count the first bit we jumped out of the above loop from */
+	(*nvecs)++;
+	*nbytes += XFS_BLF_CHUNK;
+	last_bit = first_bit;
 	while (last_bit != -1) {
 		/*
 		 * This takes the bit number to start looking from and
@@ -115,11 +152,14 @@ xfs_buf_item_size_segment(
 		if (next_bit == -1) {
 			break;
 		} else if (next_bit != last_bit + 1 ||
-		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
+		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
 			last_bit = next_bit;
+			first_bit = next_bit;
 			(*nvecs)++;
+			nbits = 1;
 		} else {
 			last_bit++;
+			nbits++;
 		}
 		*nbytes += XFS_BLF_CHUNK;
 	}
@@ -276,6 +316,38 @@ xfs_buf_item_format_segment(
 	/*
 	 * Fill in an iovec for each set of contiguous chunks.
 	 */
+	do {
+		ASSERT(first_bit >= 0);
+		nbits = xfs_contig_bits(blfp->blf_data_map,
+					blfp->blf_map_size, first_bit);
+		ASSERT(nbits > 0);
+
+		/*
+		 * Straddling a page is rare because we don't log contiguous
+		 * chunks of unmapped buffers anywhere.
+		 */
+		if (nbits > 1 &&
+		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
+			goto slow_scan;
+
+		xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
+					first_bit, nbits);
+		blfp->blf_size++;
+
+		/*
+		 * This takes the bit number to start looking from and
+		 * returns the next set bit from there.  It returns -1
+		 * if there are no more bits set or the start bit is
+		 * beyond the end of the bitmap.
+		 */
+		first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
+					(uint)first_bit + nbits + 1);
+	} while (first_bit != -1);
+
+	return;
+
+slow_scan:
+	ASSERT(bp->b_addr == NULL);
 	last_bit = first_bit;
 	nbits = 1;
 	for (;;) {
@@ -300,7 +372,7 @@ xfs_buf_item_format_segment(
 			blfp->blf_size++;
 			break;
 		} else if (next_bit != last_bit + 1 ||
-		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
+		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
 			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
 						first_bit, nbits);
 			blfp->blf_size++;
-- 
2.28.0



* Re: [PATCH 1/3] xfs: reduce buffer log item shadow allocations
  2021-02-23  4:46 ` [PATCH 1/3] xfs: reduce buffer log item shadow allocations Dave Chinner
@ 2021-02-24 21:29   ` Darrick J. Wong
  2021-02-24 22:13     ` Dave Chinner
  2021-03-02 14:37   ` Brian Foster
  1 sibling, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2021-02-24 21:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 03:46:34PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we modify btrees repeatedly, we regularly increase the size of
> the logged region by a single chunk at a time (per transaction
> commit). This results in the CIL formatting code having to
> reallocate the log vector buffer every time the buffer dirty region
> grows. Hence over a typical 4kB btree buffer, we might grow the log
> vector 4096/128 = 32x over a short period where we repeatedly add
> or remove records to/from the buffer over a series of running
> transactions. This means we are doing 32 memory allocations and frees
> over this time during a performance critical path in the journal.
> 
> The amount of space tracked in the CIL for the object is calculated
> during the ->iop_format() call for the buffer log item, but the
> buffer memory allocated for it is calculated by the ->iop_size()
> call. The size callout determines the size of the buffer, the format
> call determines the space used in the buffer.
> 
> Hence we can oversize the buffer space required in the size
> calculation without impacting the amount of space used and accounted
> to the CIL for the changes being logged. This allows us to reduce
> the number of allocations by rounding up the buffer size to allow
> for future growth. This can save a substantial amount of CPU time in
> this path:
> 
> -   46.52%     2.02%  [kernel]                  [k] xfs_log_commit_cil
>    - 44.49% xfs_log_commit_cil
>       - 30.78% _raw_spin_lock
>          - 30.75% do_raw_spin_lock
>               30.27% __pv_queued_spin_lock_slowpath
> 
> (oh, ouch!)
> ....
>       - 1.05% kmem_alloc_large
>          - 1.02% kmem_alloc
>               0.94% __kmalloc
> 
> This overhead here is what this patch is aimed at. After:
> 
>       - 0.76% kmem_alloc_large
>          - 0.75% kmem_alloc
>               0.70% __kmalloc
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

Any particular reason for 512?  It looks like you simply picked an
arbitrary power of 2, but was there a particular target in mind? i.e.
we never need to realloc for the usual 4k filesystem?

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf_item.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 17960b1ce5ef..0628a65d9c55 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -142,6 +142,7 @@ xfs_buf_item_size(
>  {
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
>  	int			i;
> +	int			bytes;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  	if (bip->bli_flags & XFS_BLI_STALE) {
> @@ -173,7 +174,7 @@ xfs_buf_item_size(
>  	}
>  
>  	/*
> -	 * the vector count is based on the number of buffer vectors we have
> +	 * The vector count is based on the number of buffer vectors we have
>  	 * dirty bits in. This will only be greater than one when we have a
>  	 * compound buffer with more than one segment dirty. Hence for compound
>  	 * buffers we need to track which segment the dirty bits correspond to,
> @@ -181,10 +182,18 @@ xfs_buf_item_size(
>  	 * count for the extra buf log format structure that will need to be
>  	 * written.
>  	 */
> +	bytes = 0;
>  	for (i = 0; i < bip->bli_format_count; i++) {
>  		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> -					  nvecs, nbytes);
> +					  nvecs, &bytes);
>  	}
> +
> +	/*
> +	 * Round up the buffer size required to minimise the number of memory
> +	 * allocations that need to be done as this item grows when relogged by
> +	 * repeated modifications.
> +	 */
> +	*nbytes = round_up(bytes, 512);
>  	trace_xfs_buf_item_size(bip);
>  }
>  
> -- 
> 2.28.0
> 


* Re: [PATCH 2/3] xfs: xfs_buf_item_size_segment() needs to pass segment offset
  2021-02-23  4:46 ` [PATCH 2/3] xfs: xfs_buf_item_size_segment() needs to pass segment offset Dave Chinner
@ 2021-02-24 21:34   ` Darrick J. Wong
  2021-03-02 14:37   ` Brian Foster
  1 sibling, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2021-02-24 21:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 03:46:35PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Otherwise it doesn't correctly calculate the number of vectors
> in a logged buffer that has a contiguous map that gets split into
> multiple regions because the range spans discontiguous memory.
> 
> Probably never been hit in practice - we don't log contiguous ranges
> on unmapped buffers (inode clusters).
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Subtle.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf_item.c | 38 +++++++++++++++++++-------------------
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 0628a65d9c55..91dc7d8c9739 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -55,6 +55,18 @@ xfs_buf_log_format_size(
>  			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
>  }
>  
> +static inline bool
> +xfs_buf_item_straddle(
> +	struct xfs_buf		*bp,
> +	uint			offset,
> +	int			next_bit,
> +	int			last_bit)
> +{
> +	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
> +		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
> +		 XFS_BLF_CHUNK);
> +}
> +
>  /*
>   * Return the number of log iovecs and space needed to log the given buf log
>   * item segment.
> @@ -67,6 +79,7 @@ STATIC void
>  xfs_buf_item_size_segment(
>  	struct xfs_buf_log_item		*bip,
>  	struct xfs_buf_log_format	*blfp,
> +	uint				offset,
>  	int				*nvecs,
>  	int				*nbytes)
>  {
> @@ -101,12 +114,8 @@ xfs_buf_item_size_segment(
>  		 */
>  		if (next_bit == -1) {
>  			break;
> -		} else if (next_bit != last_bit + 1) {
> -			last_bit = next_bit;
> -			(*nvecs)++;
> -		} else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) !=
> -			   (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) +
> -			    XFS_BLF_CHUNK)) {
> +		} else if (next_bit != last_bit + 1 ||
> +		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
>  			last_bit = next_bit;
>  			(*nvecs)++;
>  		} else {
> @@ -141,8 +150,10 @@ xfs_buf_item_size(
>  	int			*nbytes)
>  {
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
> +	struct xfs_buf		*bp = bip->bli_buf;
>  	int			i;
>  	int			bytes;
> +	uint			offset = 0;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  	if (bip->bli_flags & XFS_BLI_STALE) {
> @@ -184,8 +195,9 @@ xfs_buf_item_size(
>  	 */
>  	bytes = 0;
>  	for (i = 0; i < bip->bli_format_count; i++) {
> -		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> +		xfs_buf_item_size_segment(bip, &bip->bli_formats[i], offset,
>  					  nvecs, &bytes);
> +		offset += BBTOB(bp->b_maps[i].bm_len);
>  	}
>  
>  	/*
> @@ -212,18 +224,6 @@ xfs_buf_item_copy_iovec(
>  			nbits * XFS_BLF_CHUNK);
>  }
>  
> -static inline bool
> -xfs_buf_item_straddle(
> -	struct xfs_buf		*bp,
> -	uint			offset,
> -	int			next_bit,
> -	int			last_bit)
> -{
> -	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
> -		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
> -		 XFS_BLF_CHUNK);
> -}
> -
>  static void
>  xfs_buf_item_format_segment(
>  	struct xfs_buf_log_item	*bip,
> -- 
> 2.28.0
> 


* Re: [PATCH 3/3] xfs: optimise xfs_buf_item_size/format for contiguous regions
  2021-02-23  4:46 ` [PATCH 3/3] xfs: optimise xfs_buf_item_size/format for contiguous regions Dave Chinner
@ 2021-02-24 21:39   ` Darrick J. Wong
  2021-03-02 14:38   ` Brian Foster
  1 sibling, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2021-02-24 21:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 03:46:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We process the buf_log_item bitmap one set bit at a time with
> xfs_next_bit() so we can detect if a region crosses a memcpy
> discontinuity in the buffer data address. This has massive overhead
> on large buffers (e.g. 64k directory blocks) because we do a lot of
> unnecessary checks and xfs_buf_offset() calls.
> 
> For example, 16-way concurrent create workload on debug kernel
> running CPU bound has this at the top of the profile at ~120k
> create/s on 64kb directory block size:
> 
>   20.66%  [kernel]  [k] xfs_dir3_leaf_check_int
>    7.10%  [kernel]  [k] memcpy
>    6.22%  [kernel]  [k] xfs_next_bit
>    3.55%  [kernel]  [k] xfs_buf_offset
>    3.53%  [kernel]  [k] xfs_buf_item_format
>    3.34%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    3.04%  [kernel]  [k] do_raw_spin_lock
>    2.84%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
>    2.31%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    1.36%  [kernel]  [k] xfs_log_commit_cil
> 
> (debug checks hurt large blocks)
> 
> The only buffers with discontinuities in the data address are
> unmapped buffers, and they are only used for inode cluster buffers
> and only for logging unlinked pointers. IOWs, it is -rare- that we
> even need to detect a discontinuity in the buffer item formatting
> code.
> 
> Optimise all this by using xfs_contig_bits() to find the size of
> the contiguous regions, then test for a discontinuity inside each one. If
> we find one, do the slow "bit at a time" method we do now. If we
> don't, then just copy the entire contiguous range in one go.
> 
> Profile now looks like:
> 
>   25.26%  [kernel]  [k] xfs_dir3_leaf_check_int
>    9.25%  [kernel]  [k] memcpy
>    5.01%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    2.84%  [kernel]  [k] do_raw_spin_lock
>    2.22%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    1.88%  [kernel]  [k] xfs_buf_find
>    1.53%  [kernel]  [k] memmove
>    1.47%  [kernel]  [k] xfs_log_commit_cil
> ....
>    0.34%  [kernel]  [k] xfs_buf_item_format
> ....
>    0.21%  [kernel]  [k] xfs_buf_offset
> ....
>    0.16%  [kernel]  [k] xfs_contig_bits
> ....
>    0.13%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
> 
> So the bit scanning overhead for the dirty region tracking of the
> buffer log items is basically gone. Debug overhead hurts even more
> now...
> 
> Perf comparison
> 
> 		dir block	 creates		unlink
> 		size (kb)	time	rate		time
> 
> Original	 4		4m08s	220k		 5m13s
> Original	64		7m21s	115k		13m25s
> Patched		 4		3m59s	230k		 5m03s
> Patched		64		6m23s	143k		12m33s
> 
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Seems straightforward enough...

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf_item.c | 102 +++++++++++++++++++++++++++++++++++-------
>  1 file changed, 87 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 91dc7d8c9739..14d1fefcbf4c 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -59,12 +59,18 @@ static inline bool
>  xfs_buf_item_straddle(
>  	struct xfs_buf		*bp,
>  	uint			offset,
> -	int			next_bit,
> -	int			last_bit)
> +	int			first_bit,
> +	int			nbits)
>  {
> -	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
> -		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
> -		 XFS_BLF_CHUNK);
> +	void			*first, *last;
> +
> +	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
> +	last = xfs_buf_offset(bp,
> +			offset + ((first_bit + nbits) << XFS_BLF_SHIFT));
> +
> +	if (last - first != nbits * XFS_BLF_CHUNK)
> +		return true;
> +	return false;
>  }
>  
>  /*
> @@ -84,20 +90,51 @@ xfs_buf_item_size_segment(
>  	int				*nbytes)
>  {
>  	struct xfs_buf			*bp = bip->bli_buf;
> +	int				first_bit;
> +	int				nbits;
>  	int				next_bit;
>  	int				last_bit;
>  
> -	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> -	if (last_bit == -1)
> +	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> +	if (first_bit == -1)
>  		return;
>  
> -	/*
> -	 * initial count for a dirty buffer is 2 vectors - the format structure
> -	 * and the first dirty region.
> -	 */
> -	*nvecs += 2;
> -	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
> +	(*nvecs)++;
> +	*nbytes += xfs_buf_log_format_size(blfp);
> +
> +	do {
> +		nbits = xfs_contig_bits(blfp->blf_data_map,
> +					blfp->blf_map_size, first_bit);
> +		ASSERT(nbits > 0);
> +
> +		/*
> +		 * Straddling a page is rare because we don't log contiguous
> +		 * chunks of unmapped buffers anywhere.
> +		 */
> +		if (nbits > 1 &&
> +		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
> +			goto slow_scan;
> +
> +		(*nvecs)++;
> +		*nbytes += nbits * XFS_BLF_CHUNK;
> +
> +		/*
> +		 * This takes the bit number to start looking from and
> +		 * returns the next set bit from there.  It returns -1
> +		 * if there are no more bits set or the start bit is
> +		 * beyond the end of the bitmap.
> +		 */
> +		first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> +					(uint)first_bit + nbits + 1);
> +	} while (first_bit != -1);
>  
> +	return;
> +
> +slow_scan:
> +	/* Count the first bit we jumped out of the above loop from */
> +	(*nvecs)++;
> +	*nbytes += XFS_BLF_CHUNK;
> +	last_bit = first_bit;
>  	while (last_bit != -1) {
>  		/*
>  		 * This takes the bit number to start looking from and
> @@ -115,11 +152,14 @@ xfs_buf_item_size_segment(
>  		if (next_bit == -1) {
>  			break;
>  		} else if (next_bit != last_bit + 1 ||
> -		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
> +		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
>  			last_bit = next_bit;
> +			first_bit = next_bit;
>  			(*nvecs)++;
> +			nbits = 1;
>  		} else {
>  			last_bit++;
> +			nbits++;
>  		}
>  		*nbytes += XFS_BLF_CHUNK;
>  	}
> @@ -276,6 +316,38 @@ xfs_buf_item_format_segment(
>  	/*
>  	 * Fill in an iovec for each set of contiguous chunks.
>  	 */
> +	do {
> +		ASSERT(first_bit >= 0);
> +		nbits = xfs_contig_bits(blfp->blf_data_map,
> +					blfp->blf_map_size, first_bit);
> +		ASSERT(nbits > 0);
> +
> +		/*
> +		 * Straddling a page is rare because we don't log contiguous
> +		 * chunks of unmapped buffers anywhere.
> +		 */
> +		if (nbits > 1 &&
> +		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
> +			goto slow_scan;
> +
> +		xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> +					first_bit, nbits);
> +		blfp->blf_size++;
> +
> +		/*
> +		 * This takes the bit number to start looking from and
> +		 * returns the next set bit from there.  It returns -1
> +		 * if there are no more bits set or the start bit is
> +		 * beyond the end of the bitmap.
> +		 */
> +		first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> +					(uint)first_bit + nbits + 1);
> +	} while (first_bit != -1);
> +
> +	return;
> +
> +slow_scan:
> +	ASSERT(bp->b_addr == NULL);
>  	last_bit = first_bit;
>  	nbits = 1;
>  	for (;;) {
> @@ -300,7 +372,7 @@ xfs_buf_item_format_segment(
>  			blfp->blf_size++;
>  			break;
>  		} else if (next_bit != last_bit + 1 ||
> -		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
> +		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
>  			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
>  						first_bit, nbits);
>  			blfp->blf_size++;
> -- 
> 2.28.0
> 


* Re: [PATCH 1/3] xfs: reduce buffer log item shadow allocations
  2021-02-24 21:29   ` Darrick J. Wong
@ 2021-02-24 22:13     ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2021-02-24 22:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 24, 2021 at 01:29:29PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 23, 2021 at 03:46:34PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When we modify btrees repeatedly, we regularly increase the size of
> > the logged region by a single chunk at a time (per transaction
> > commit). This results in the CIL formatting code having to
> > reallocate the log vector buffer every time the buffer dirty region
> > grows. Hence over a typical 4kB btree buffer, we might grow the log
> > vector 4096/128 = 32x over a short period where we repeatedly add
> > or remove records to/from the buffer over a series of running
> > transactions. This means we are doing 32 memory allocations and frees
> > over this time during a performance critical path in the journal.
> > 
> > The amount of space tracked in the CIL for the object is calculated
> > during the ->iop_format() call for the buffer log item, but the
> > buffer memory allocated for it is calculated by the ->iop_size()
> > call. The size callout determines the size of the buffer, the format
> > call determines the space used in the buffer.
> > 
> > Hence we can oversize the buffer space required in the size
> > calculation without impacting the amount of space used and accounted
> > to the CIL for the changes being logged. This allows us to reduce
> > the number of allocations by rounding up the buffer size to allow
> > for future growth. This can save a substantial amount of CPU time in
> > this path:
> > 
> > -   46.52%     2.02%  [kernel]                  [k] xfs_log_commit_cil
> >    - 44.49% xfs_log_commit_cil
> >       - 30.78% _raw_spin_lock
> >          - 30.75% do_raw_spin_lock
> >               30.27% __pv_queued_spin_lock_slowpath
> > 
> > (oh, ouch!)
> > ....
> >       - 1.05% kmem_alloc_large
> >          - 1.02% kmem_alloc
> >               0.94% __kmalloc
> > 
> > This overhead here is what this patch is aimed at. After:
> > 
> >       - 0.76% kmem_alloc_large
> >          - 0.75% kmem_alloc
> >               0.70% __kmalloc
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
> 
> Any particular reason for 512?  It looks like you simply picked an
> arbitrary power of 2, but was there a particular target in mind? i.e.
> we never need to realloc for the usual 4k filesystem?

It is based on the bitmap chunk size being 128 bytes and the
observation that random directory entry updates almost never require
more than 3-4 128 byte regions to be logged in the directory block.

The other observation is for per-ag btrees. When we are inserting
into a new btree block, we'll pack it from the front. Hence the
first few records land in the first 128 bytes so we log only 128
bytes, the next 8-16 records land in the second region so now we log
bytes. And so on.  If we are doing random updates, we only need to
reallocate once for every 4 random 128 byte regions that are dirtied
instead of for every single one.

Any larger than this and I noticed an increase in memory footprint
in my scalability workloads. Any less than this and I didn't really
see any significant benefit to CPU usage.
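
As a standalone back-of-the-envelope demo of what the rounding buys
for the sequential-growth case in the commit message (the numbers are
illustrative only; the real CIL shadow buffer logic is more involved):

#include <stdio.h>

#define CHUNK		128	/* XFS_BLF_CHUNK */
#define BUF_SIZE	4096
#define ROUNDING	512

static unsigned int round_up_to(unsigned int x, unsigned int y)
{
	return ((x + y - 1) / y) * y;
}

int main(void)
{
	unsigned int allocs_plain = 0, allocs_rounded = 0;
	unsigned int size_plain = 0, size_rounded = 0;

	/* dirty the buffer one 128 byte chunk at a time */
	for (unsigned int dirty = CHUNK; dirty <= BUF_SIZE; dirty += CHUNK) {
		if (dirty > size_plain) {		/* must reallocate */
			size_plain = dirty;
			allocs_plain++;
		}
		if (round_up_to(dirty, ROUNDING) > size_rounded) {
			size_rounded = round_up_to(dirty, ROUNDING);
			allocs_rounded++;
		}
	}
	/* prints 32 vs 8 for a 4kB buffer */
	printf("%u reallocations unrounded, %u rounded to %u bytes\n",
	       allocs_plain, allocs_rounded, ROUNDING);
	return 0;
}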

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/3] xfs: buffer log item optimisations
  2021-02-23  4:46 [PATCH 0/3] xfs: buffer log item optimisations Dave Chinner
                   ` (2 preceding siblings ...)
  2021-02-23  4:46 ` [PATCH 3/3] xfs: optimise xfs_buf_item_size/format for contiguous regions Dave Chinner
@ 2021-02-25  9:01 ` Christoph Hellwig
  3 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2021-02-25  9:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

The whole series looks good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 1/3] xfs: reduce buffer log item shadow allocations
  2021-02-23  4:46 ` [PATCH 1/3] xfs: reduce buffer log item shadow allocations Dave Chinner
  2021-02-24 21:29   ` Darrick J. Wong
@ 2021-03-02 14:37   ` Brian Foster
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Foster @ 2021-03-02 14:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 03:46:34PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we modify btrees repeatedly, we regularly increase the size of
> the logged region by a single chunk at a time (per transaction
> commit). This results in the CIL formatting code having to
> reallocate the log vector buffer every time the buffer dirty region
> grows. Hence over a typical 4kB btree buffer, we might grow the log
> vector 4096/128 = 32x over a short period where we repeatedly add
> or remove records to/from the buffer over a series of running
> transactions. This means we are doing 32 memory allocations and frees
> over this time during a performance critical path in the journal.
> 
> The amount of space tracked in the CIL for the object is calculated
> during the ->iop_format() call for the buffer log item, but the
> buffer memory allocated for it is calculated by the ->iop_size()
> call. The size callout determines the size of the buffer, the format
> call determines the space used in the buffer.
> 
> Hence we can oversize the buffer space required in the size
> calculation without impacting the amount of space used and accounted
> to the CIL for the changes being logged. This allows us to reduce
> the number of allocations by rounding up the buffer size to allow
> for future growth. This can save a substantial amount of CPU time in
> this path:
> 
> -   46.52%     2.02%  [kernel]                  [k] xfs_log_commit_cil
>    - 44.49% xfs_log_commit_cil
>       - 30.78% _raw_spin_lock
>          - 30.75% do_raw_spin_lock
>               30.27% __pv_queued_spin_lock_slowpath
> 
> (oh, ouch!)
> ....
>       - 1.05% kmem_alloc_large
>          - 1.02% kmem_alloc
>               0.94% __kmalloc
> 
> This overhead here is what this patch is aimed at. After:
> 
>       - 0.76% kmem_alloc_large
>          - 0.75% kmem_alloc
>               0.70% __kmalloc
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf_item.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 17960b1ce5ef..0628a65d9c55 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -142,6 +142,7 @@ xfs_buf_item_size(
>  {
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
>  	int			i;
> +	int			bytes;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  	if (bip->bli_flags & XFS_BLI_STALE) {
> @@ -173,7 +174,7 @@ xfs_buf_item_size(
>  	}
>  
>  	/*
> -	 * the vector count is based on the number of buffer vectors we have
> +	 * The vector count is based on the number of buffer vectors we have
>  	 * dirty bits in. This will only be greater than one when we have a
>  	 * compound buffer with more than one segment dirty. Hence for compound
>  	 * buffers we need to track which segment the dirty bits correspond to,
> @@ -181,10 +182,18 @@ xfs_buf_item_size(
>  	 * count for the extra buf log format structure that will need to be
>  	 * written.
>  	 */
> +	bytes = 0;
>  	for (i = 0; i < bip->bli_format_count; i++) {
>  		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> -					  nvecs, nbytes);
> +					  nvecs, &bytes);
>  	}
> +
> +	/*
> +	 * Round up the buffer size required to minimise the number of memory
> +	 * allocations that need to be done as this item grows when relogged by
> +	 * repeated modifications.
> +	 */
> +	*nbytes = round_up(bytes, 512);
>  	trace_xfs_buf_item_size(bip);
>  }
>  
> -- 
> 2.28.0
> 



* Re: [PATCH 2/3] xfs: xfs_buf_item_size_segment() needs to pass segment offset
  2021-02-23  4:46 ` [PATCH 2/3] xfs: xfs_buf_item_size_segment() needs to pass segment offset Dave Chinner
  2021-02-24 21:34   ` Darrick J. Wong
@ 2021-03-02 14:37   ` Brian Foster
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Foster @ 2021-03-02 14:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 03:46:35PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Otherwise it doesn't correctly calculate the number of vectors
> in a logged buffer that has a contiguous map that gets split into
> multiple regions because the range spans discontiguous memory.
> 
> Probably never been hit in practice - we don't log contiguous ranges
> on unmapped buffers (inode clusters).
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf_item.c | 38 +++++++++++++++++++-------------------
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 0628a65d9c55..91dc7d8c9739 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -55,6 +55,18 @@ xfs_buf_log_format_size(
>  			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
>  }
>  
> +static inline bool
> +xfs_buf_item_straddle(
> +	struct xfs_buf		*bp,
> +	uint			offset,
> +	int			next_bit,
> +	int			last_bit)
> +{
> +	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
> +		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
> +		 XFS_BLF_CHUNK);
> +}
> +
>  /*
>   * Return the number of log iovecs and space needed to log the given buf log
>   * item segment.
> @@ -67,6 +79,7 @@ STATIC void
>  xfs_buf_item_size_segment(
>  	struct xfs_buf_log_item		*bip,
>  	struct xfs_buf_log_format	*blfp,
> +	uint				offset,
>  	int				*nvecs,
>  	int				*nbytes)
>  {
> @@ -101,12 +114,8 @@ xfs_buf_item_size_segment(
>  		 */
>  		if (next_bit == -1) {
>  			break;
> -		} else if (next_bit != last_bit + 1) {
> -			last_bit = next_bit;
> -			(*nvecs)++;
> -		} else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) !=
> -			   (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) +
> -			    XFS_BLF_CHUNK)) {
> +		} else if (next_bit != last_bit + 1 ||
> +		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
>  			last_bit = next_bit;
>  			(*nvecs)++;
>  		} else {
> @@ -141,8 +150,10 @@ xfs_buf_item_size(
>  	int			*nbytes)
>  {
>  	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
> +	struct xfs_buf		*bp = bip->bli_buf;
>  	int			i;
>  	int			bytes;
> +	uint			offset = 0;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  	if (bip->bli_flags & XFS_BLI_STALE) {
> @@ -184,8 +195,9 @@ xfs_buf_item_size(
>  	 */
>  	bytes = 0;
>  	for (i = 0; i < bip->bli_format_count; i++) {
> -		xfs_buf_item_size_segment(bip, &bip->bli_formats[i],
> +		xfs_buf_item_size_segment(bip, &bip->bli_formats[i], offset,
>  					  nvecs, &bytes);
> +		offset += BBTOB(bp->b_maps[i].bm_len);
>  	}
>  
>  	/*
> @@ -212,18 +224,6 @@ xfs_buf_item_copy_iovec(
>  			nbits * XFS_BLF_CHUNK);
>  }
>  
> -static inline bool
> -xfs_buf_item_straddle(
> -	struct xfs_buf		*bp,
> -	uint			offset,
> -	int			next_bit,
> -	int			last_bit)
> -{
> -	return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) !=
> -		(xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) +
> -		 XFS_BLF_CHUNK);
> -}
> -
>  static void
>  xfs_buf_item_format_segment(
>  	struct xfs_buf_log_item	*bip,
> -- 
> 2.28.0
> 



* Re: [PATCH 3/3] xfs: optimise xfs_buf_item_size/format for contiguous regions
  2021-02-23  4:46 ` [PATCH 3/3] xfs: optimise xfs_buf_item_size/format for contiguous regions Dave Chinner
  2021-02-24 21:39   ` Darrick J. Wong
@ 2021-03-02 14:38   ` Brian Foster
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Foster @ 2021-03-02 14:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 03:46:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We process the buf_log_item bitmap one set bit at a time with
> xfs_next_bit() so we can detect if a region crosses a memcpy
> discontinuity in the buffer data address. This has massive overhead
> on large buffers (e.g. 64k directory blocks) because we do a lot of
> unnecessary checks and xfs_buf_offset() calls.
> 
> For example, 16-way concurrent create workload on debug kernel
> running CPU bound has this at the top of the profile at ~120k
> create/s on 64kb directory block size:
> 
>   20.66%  [kernel]  [k] xfs_dir3_leaf_check_int
>    7.10%  [kernel]  [k] memcpy
>    6.22%  [kernel]  [k] xfs_next_bit
>    3.55%  [kernel]  [k] xfs_buf_offset
>    3.53%  [kernel]  [k] xfs_buf_item_format
>    3.34%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    3.04%  [kernel]  [k] do_raw_spin_lock
>    2.84%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
>    2.31%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    1.36%  [kernel]  [k] xfs_log_commit_cil
> 
> (debug checks hurt large blocks)
> 
> The only buffers with discontinuities in the data address are
> unmapped buffers, and they are only used for inode cluster buffers
> and only for logging unlinked pointers. IOWs, it is -rare- that we
> even need to detect a discontinuity in the buffer item formatting
> code.
> 
> Optimise all this by using xfs_contig_bits() to find the size of
> the contiguous regions, then test for a discontinuity inside each one. If
> we find one, do the slow "bit at a time" method we do now. If we
> don't, then just copy the entire contiguous range in one go.
> 
> Profile now looks like:
> 
>   25.26%  [kernel]  [k] xfs_dir3_leaf_check_int
>    9.25%  [kernel]  [k] memcpy
>    5.01%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    2.84%  [kernel]  [k] do_raw_spin_lock
>    2.22%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    1.88%  [kernel]  [k] xfs_buf_find
>    1.53%  [kernel]  [k] memmove
>    1.47%  [kernel]  [k] xfs_log_commit_cil
> ....
>    0.34%  [kernel]  [k] xfs_buf_item_format
> ....
>    0.21%  [kernel]  [k] xfs_buf_offset
> ....
>    0.16%  [kernel]  [k] xfs_contig_bits
> ....
>    0.13%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
> 
> So the bit scanning overhead for the dirty region tracking of the
> buffer log items is basically gone. Debug overhead hurts even more
> now...
> 
> Perf comparison
> 
> 		dir block	 creates		unlink
> 		size (kb)	time	rate		time
> 
> Original	 4		4m08s	220k		 5m13s
> Original	64		7m21s	115k		13m25s
> Patched		 4		3m59s	230k		 5m03s
> Patched		64		6m23s	143k		12m33s
> 
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf_item.c | 102 +++++++++++++++++++++++++++++++++++-------
>  1 file changed, 87 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 91dc7d8c9739..14d1fefcbf4c 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
...
> @@ -84,20 +90,51 @@ xfs_buf_item_size_segment(
>  	int				*nbytes)
>  {
>  	struct xfs_buf			*bp = bip->bli_buf;
> +	int				first_bit;
> +	int				nbits;
>  	int				next_bit;
>  	int				last_bit;
>  
> -	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> -	if (last_bit == -1)
> +	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> +	if (first_bit == -1)
>  		return;
>  
> -	/*
> -	 * initial count for a dirty buffer is 2 vectors - the format structure
> -	 * and the first dirty region.
> -	 */
> -	*nvecs += 2;
> -	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
> +	(*nvecs)++;
> +	*nbytes += xfs_buf_log_format_size(blfp);
> +
> +	do {
> +		nbits = xfs_contig_bits(blfp->blf_data_map,
> +					blfp->blf_map_size, first_bit);
> +		ASSERT(nbits > 0);
> +
> +		/*
> +		 * Straddling a page is rare because we don't log contiguous
> +		 * chunks of unmapped buffers anywhere.
> +		 */
> +		if (nbits > 1 &&
> +		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
> +			goto slow_scan;
> +
> +		(*nvecs)++;
> +		*nbytes += nbits * XFS_BLF_CHUNK;
> +
> +		/*
> +		 * This takes the bit number to start looking from and
> +		 * returns the next set bit from there.  It returns -1
> +		 * if there are no more bits set or the start bit is
> +		 * beyond the end of the bitmap.
> +		 */
> +		first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> +					(uint)first_bit + nbits + 1);

I think the range tracking logic would be a bit more robust to not +1
here and thus not make assumptions on how the rest of the loop is
implemented, but regardless this looks Ok to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +	} while (first_bit != -1);
>  
> +	return;
> +
> +slow_scan:
> +	/* Count the first bit we jumped out of the above loop from */
> +	(*nvecs)++;
> +	*nbytes += XFS_BLF_CHUNK;
> +	last_bit = first_bit;
>  	while (last_bit != -1) {
>  		/*
>  		 * This takes the bit number to start looking from and
> @@ -115,11 +152,14 @@ xfs_buf_item_size_segment(
>  		if (next_bit == -1) {
>  			break;
>  		} else if (next_bit != last_bit + 1 ||
> -		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
> +		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
>  			last_bit = next_bit;
> +			first_bit = next_bit;
>  			(*nvecs)++;
> +			nbits = 1;
>  		} else {
>  			last_bit++;
> +			nbits++;
>  		}
>  		*nbytes += XFS_BLF_CHUNK;
>  	}
> @@ -276,6 +316,38 @@ xfs_buf_item_format_segment(
>  	/*
>  	 * Fill in an iovec for each set of contiguous chunks.
>  	 */
> +	do {
> +		ASSERT(first_bit >= 0);
> +		nbits = xfs_contig_bits(blfp->blf_data_map,
> +					blfp->blf_map_size, first_bit);
> +		ASSERT(nbits > 0);
> +
> +		/*
> +		 * Straddling a page is rare because we don't log contiguous
> +		 * chunks of unmapped buffers anywhere.
> +		 */
> +		if (nbits > 1 &&
> +		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
> +			goto slow_scan;
> +
> +		xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> +					first_bit, nbits);
> +		blfp->blf_size++;
> +
> +		/*
> +		 * This takes the bit number to start looking from and
> +		 * returns the next set bit from there.  It returns -1
> +		 * if there are no more bits set or the start bit is
> +		 * beyond the end of the bitmap.
> +		 */
> +		first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> +					(uint)first_bit + nbits + 1);
> +	} while (first_bit != -1);
> +
> +	return;
> +
> +slow_scan:
> +	ASSERT(bp->b_addr == NULL);
>  	last_bit = first_bit;
>  	nbits = 1;
>  	for (;;) {
> @@ -300,7 +372,7 @@ xfs_buf_item_format_segment(
>  			blfp->blf_size++;
>  			break;
>  		} else if (next_bit != last_bit + 1 ||
> -		           xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
> +		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
>  			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
>  						first_bit, nbits);
>  			blfp->blf_size++;
> -- 
> 2.28.0
> 



Thread overview: 12+ messages
2021-02-23  4:46 [PATCH 0/3] xfs: buffer log item optimisations Dave Chinner
2021-02-23  4:46 ` [PATCH 1/3] xfs: reduce buffer log item shadow allocations Dave Chinner
2021-02-24 21:29   ` Darrick J. Wong
2021-02-24 22:13     ` Dave Chinner
2021-03-02 14:37   ` Brian Foster
2021-02-23  4:46 ` [PATCH 2/3] xfs: xfs_buf_item_size_segment() needs to pass segment offset Dave Chinner
2021-02-24 21:34   ` Darrick J. Wong
2021-03-02 14:37   ` Brian Foster
2021-02-23  4:46 ` [PATCH 3/3] xfs: optimise xfs_buf_item_size/format for contiguous regions Dave Chinner
2021-02-24 21:39   ` Darrick J. Wong
2021-03-02 14:38   ` Brian Foster
2021-02-25  9:01 ` [PATCH 0/3] xfs: buffer log item optimisations Christoph Hellwig
